Entity and Key Phrase Recognition with AWS in the Spirit of Christmas
Table of Contents
- Scraping the Data
- Setting Up Amazon Comprehend in R
- Entity Recognition of Christmas Traditions
- Key Phrase Recognition of Christmas Traditions
This article was written as part of a university project and demonstrates how to use some features of Amazon Comprehend, one of the many machine learning tools offered by Amazon Web Services (AWS).
Christmas, my favourite holiday of the year, is approaching fast, so I decided to center this use case of Amazon Comprehend in R around this festive topic. Everyone is probably well aware of their own Christmas traditions; however, we tend to know much less about the customs of other countries and cultures. Therefore, I collected information on Christmas traditions in 90 countries around the world and analysed it with the help of Entity and Key Phrase Recognition in order to find out what the most widespread traditions and shared values are.
Now let the Christmas spirit take over!
Scraping the Data
All the data on Christmas traditions was scraped from this website, which stores information on many Christmas-related topics. Luckily, it also has a subpage titled ‘Christmas Around the World’, which features the Christmas traditions of 90 countries in total. The SelectorGadget Chrome extension made the web scraping part a lot easier, so I highly recommend using it for similar projects.
To make the scraping faster, I first extracted the relative links pointing to the subpage of every country and then created the full URLs by prepending the base part of the address to each of them.
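The link-building step could look roughly like the sketch below. It is only an illustration: the HTML snippet, the `a.country` selector and the `base_url` value are hypothetical stand-ins, since the real page structure is not reproduced here. It assumes the rvest package is installed.

```r
library(rvest)

# Hypothetical stand-in for the 'Christmas Around the World' index page.
index_html <- '<html><body>
  <a class="country" href="austria.shtml">Austria</a>
  <a class="country" href="brazil.shtml">Brazil</a>
</body></html>'

# Hypothetical base address; replace it with the real site's base URL.
base_url <- "https://example.com/world/"

# read_html() accepts a URL as well as a literal HTML string.
index_page <- read_html(index_html)

# Collect the href attribute of every link matched by the (assumed) selector.
relative_links <- html_attr(html_elements(index_page, "a.country"), "href")

# paste0() is vectorised, so the base is prepended to every link in one call.
country_urls <- paste0(base_url, relative_links)
print(country_urls)
```

On the real site, the selector would be whatever SelectorGadget reports for the country links.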
There were two attributes that I needed from every page for the analysis: the name of the country and the text of the article about its Christmas traditions. To make this process efficient, I created a function that scrapes a page, extracts the name of the country as well as the text of the article, and writes these into a named list. It is important to note that since the articles were divided into paragraphs, their text was first stored in a list with as many elements as the number of paragraphs. However, I decided to merge these into one single element to make the text easier to work with later on.
With the help of the function I could scrape all the pages with a single line of code and then create a data frame out of the resulting named list. After this step I cleaned the texts by removing superfluous whitespace and the characters that originally indicated line breaks. As a result, I ended up with a data frame containing 90 rows and 2 columns: one for the name of the country and one for the text of the article.
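A function of this kind might be sketched as follows. The name `scrape_country()` and the `h1` and `p` selectors are assumptions made for illustration, not the ones used on the actual site, and the sample page is invented so that the example runs offline.

```r
library(rvest)

# Scrape one country page into a named list.
# `page_source` may be a URL or a literal HTML string; both selectors are
# hypothetical and would come from SelectorGadget on the real site.
scrape_country <- function(page_source,
                           name_selector = "h1",
                           text_selector = "p") {
  page <- read_html(page_source)
  country <- html_text(html_element(page, name_selector), trim = TRUE)
  # One element per paragraph of the article.
  paragraphs <- html_text(html_elements(page, text_selector), trim = TRUE)
  list(
    country = country,
    # Merge the paragraphs into one single string.
    text = paste(paragraphs, collapse = " ")
  )
}

# Invented sample page, used here instead of a live URL.
sample_page <- '<html><body><h1>Austria</h1>
  <p>First paragraph.</p><p>Second paragraph.</p></body></html>'
result <- scrape_country(sample_page)
str(result)
```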
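The conversion and cleaning steps can be sketched in base R as below. The `scraped` list is a hand-made stand-in for what something like `lapply(country_urls, scrape_country)` would return, and `clean_text()` is a hypothetical helper name.

```r
# Stand-in for the scraped results: a list of named lists with `country`
# and `text` elements (sample values invented for illustration).
scraped <- list(
  list(country = "Austria", text = "  Advent wreaths\r\n are  popular. "),
  list(country = "Brazil",  text = "Papai Noel\nbrings presents.")
)

# Bind the named lists row-wise into one data frame.
traditions <- do.call(
  rbind,
  lapply(scraped, function(x) {
    data.frame(country = x$country, text = x$text, stringsAsFactors = FALSE)
  })
)

# Collapse line-break characters and runs of whitespace into a single space,
# then trim the ends of each text.
clean_text <- function(x) trimws(gsub("\\s+", " ", x))
traditions$text <- clean_text(traditions$text)

print(traditions)
```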
At this point I had all the data for the analysis so I could move on to the next step and set up AWS.
Setting Up Amazon Comprehend in R
To be able to use Amazon Comprehend in R, the first thing to do is create an AWS account, which is a fairly easy and quick task. Once this is done and you are logged in, an access key has to be generated on the Identity and Access Management (IAM) page so that R is able to communicate securely with the API. To generate the key, first click on ‘Users’ in the left side panel, then click on your username and navigate to the ‘Security credentials’ tab. Here, clicking on the ‘Create access key’ button generates your access key and lets you download it to your computer as a .csv file. It is convenient to save it to the folder of the R project you are working on, but treat it as a secret and keep it out of version control.
Now that we have the access key, we have to make it available to R and install the package needed for the analysis: aws.comprehend.
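One way to do this is sketched below. It assumes that the downloaded file has the column headers "Access key ID" and "Secret access key", that the package reads its credentials from the standard AWS environment variables, and that `eu-west-1` is a suitable region; `set_aws_credentials()` and the file name are hypothetical.

```r
# Read the downloaded access-key file and expose the credentials to R
# through the standard AWS environment variables.
set_aws_credentials <- function(key_file, region = "eu-west-1") {
  # read.csv() turns the header "Access key ID" into "Access.key.ID".
  keys <- read.csv(key_file, stringsAsFactors = FALSE)
  # Assumed column names; stop early if the file looks different.
  stopifnot(all(c("Access.key.ID", "Secret.access.key") %in% names(keys)))
  Sys.setenv(
    AWS_ACCESS_KEY_ID     = keys$Access.key.ID[1],
    AWS_SECRET_ACCESS_KEY = keys$Secret.access.key[1],
    AWS_DEFAULT_REGION    = region
  )
  invisible(TRUE)
}

# One-off installation, assuming the package is available from CRAN:
# install.packages("aws.comprehend")

# Hypothetical file name; use the name of the file you downloaded:
# set_aws_credentials("accessKeys.csv")
```

Keeping the key in environment variables rather than in the script means the secret never appears in the code itself.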
Having done this step we can move on to the actual analysis.
Entity Recognition of Christmas Traditions
To start with, Amazon Comprehend has a size limit: it can only process strings of at most 5000 characters at a time. To check whether the scraped articles were below this limit, I created a histogram of the article lengths in which a red vertical line marks exactly 5000 characters.
Since a considerable number of the articles consisted of more than 5000 characters, this had to be taken care of before proceeding with the Entity Recognition. Therefore I created a function that splits a text into smaller parts whose maximum length is set by one of its parameters.
After this, I used a for loop to go through the data frame that contained the countries in one column and the articles in another, iterating over the countries column. Inside the loop, an if statement split the text into smaller parts when necessary. From the results of the entity detection function I kept only the column with the entity text and dropped the position of the entity string, the confidence score and the category. After detecting the entities in the article about a given country, I always dropped the duplicates, because a tradition does not become more widespread by appearing several times in the text about the same country.
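A base R sketch of such a check is shown below. The small `traditions` data frame is invented so that the example runs on its own. If the limit turns out to be measured in bytes rather than characters, `nchar(x, type = "bytes")` would be the safer measure.

```r
# Invented stand-in for the scraped data frame.
traditions <- data.frame(
  country = c("A", "B", "C"),
  text = c(strrep("a", 1200), strrep("b", 5600), strrep("c", 9100)),
  stringsAsFactors = FALSE
)

char_limit <- 5000
article_length <- nchar(traditions$text)

# Histogram of article lengths with the limit marked in red.
hist(article_length,
     main = "Length of the articles",
     xlab = "Number of characters")
abline(v = char_limit, col = "red", lwd = 2)

# How many articles exceed the limit and therefore need splitting.
n_too_long <- sum(article_length > char_limit)
print(n_too_long)
```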
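A simple version of such a splitting function is sketched below. `split_text()` and `max_chars` are hypothetical names, and the fixed-width cut is an assumption about how the splitting was done.

```r
# Split a single string into consecutive chunks of at most `max_chars`
# characters. Texts within the limit are returned unchanged.
split_text <- function(text, max_chars = 5000) {
  stopifnot(length(text) == 1, max_chars >= 1)
  n <- nchar(text)
  if (n <= max_chars) {
    return(text)
  }
  starts <- seq(1, n, by = max_chars)
  ends <- pmin(starts + max_chars - 1, n)
  # substring() is vectorised over the start and end positions.
  substring(text, starts, ends)
}

chunks <- split_text(strrep("x", 12000), max_chars = 5000)
print(nchar(chunks))
```

Note that a fixed-width cut can fall in the middle of a word or an entity name. Cutting at the last sentence boundary before the limit would avoid that at the cost of a slightly longer function.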
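The loop could be organised along the lines of the sketch below. The detection function is passed in as an argument so that the example runs offline with a stand-in. With real credentials one would pass something like `function(x) aws.comprehend::detect_entities(x, language = "en")`, whose result is assumed here to be a data frame with a `Text` column. `collect_entities()` and `stub_detect()` are hypothetical names.

```r
# Loop over the countries, detect entities in each article (split into
# chunks when it is too long) and keep each entity once per country.
collect_entities <- function(traditions, detect, max_chars = 5000) {
  out <- list()
  for (country in traditions$country) {
    text <- traditions$text[traditions$country == country][1]
    n <- nchar(text)
    if (n > max_chars) {
      # Split only when the article exceeds the limit.
      starts <- seq(1, n, by = max_chars)
      chunks <- substring(text, starts, pmin(starts + max_chars - 1, n))
    } else {
      chunks <- text
    }
    # Keep only the entity text; offsets, score and type are dropped.
    found <- unique(unlist(lapply(chunks, function(chunk) detect(chunk)$Text)))
    if (length(found) > 0) {
      out[[country]] <- data.frame(entity = found, country = country,
                                   stringsAsFactors = FALSE)
    }
  }
  do.call(rbind, unname(out))
}

# Offline stand-in for the real service: treats capitalised words as entities.
stub_detect <- function(x) {
  words <- regmatches(x, gregexpr("[A-Z][a-z]+", x))[[1]]
  data.frame(Text = words, Score = rep(1, length(words)),
             stringsAsFactors = FALSE)
}

traditions <- data.frame(
  country = c("Austria", "Brazil"),
  text = c("Santa visits. Santa returns on Christmas.", "Christmas is warm."),
  stringsAsFactors = FALSE
)
entities <- collect_entities(traditions, detect = stub_detect)
print(entities)
```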
The resulting data frame contained two columns: one for the recognized entities and one for the countries.
By using the above mentioned data frame I could visualize which entities appeared the most frequently in the texts of the articles.
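Counting and plotting the most frequent entities might look like the following base R sketch. The `entities` data frame is invented sample data with the two columns described above. Since duplicates were dropped per country, each count is the number of countries in which the entity appears.

```r
# Invented sample of the entity/country data frame.
entities <- data.frame(
  entity = c("Christmas", "Santa", "Christmas", "Christmas Eve", "Christmas"),
  country = c("Austria", "Austria", "Brazil", "Brazil", "Japan"),
  stringsAsFactors = FALSE
)

# Count how often each entity occurs and keep the ten most frequent ones.
entity_counts <- sort(table(entities$entity), decreasing = TRUE)
top_entities <- head(entity_counts, 10)

# Horizontal bar chart; rev() puts the most frequent entity at the top.
barplot(rev(top_entities), horiz = TRUE, las = 1,
        xlab = "Number of countries")
```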
Based on this chart I could conclude that Christian Christmas traditions are relatively common among different countries. Furthermore, I did not encounter any custom or important day that I had not known before, so I assumed that the main concepts of celebrating Christmas are probably similar among most of the 90 countries I collected information on.
Since Santa and Santa Claus were both among the top 10 most frequent entities, I decided to check how many countries have this tradition, and it turned out that 38 out of 90 do. What was even more interesting is that these countries are spread all over the globe, so Santa Claus brings smiles to children’s faces on every continent.
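A check of this kind can be sketched as below, again on invented sample data. The pattern deliberately matches only the entities "Santa" and "Santa Claus", since a looser match on "santa" would also pick up unrelated names; whether the original analysis matched this strictly is an assumption.

```r
# Invented sample of the entity/country data frame.
entities <- data.frame(
  entity = c("Santa", "Santa Claus", "Christmas", "Santa", "Santa Maria"),
  country = c("Austria", "Austria", "Brazil", "Japan", "Brazil"),
  stringsAsFactors = FALSE
)

# Rows whose entity is exactly "Santa" or "Santa Claus", in any capitalisation.
is_santa <- grepl("^santa( claus)?$", entities$entity, ignore.case = TRUE)

# Count each country once, even if both spellings occur in its article.
santa_countries <- unique(entities$country[is_santa])
print(length(santa_countries))
```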
Having done the entity recognition let’s jump into the last part which is key phrase recognition.
Key Phrase Recognition of Christmas Traditions
Key phrase recognition works very similarly to entity recognition and is subject to the same limit of 5000 characters at a time.
The for loop that I used to get the key phrases from every article was very similar to the one above. I looped through the same data frame country by country and removed the duplicates. Furthermore, I used an if statement to split the text when it was longer than 5000 characters. From the results of the phrase detection function I kept only the detected phrases and dropped the position of the phrase string and the confidence score. The extra step in this part was that once I had my data frame with the phrases in one column and the countries in the other, I transformed all phrases to lower case so that the same phrase with different capitalization would not count as two different ones.
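The lower-casing step is sketched below on invented sample data. The phrases themselves would come from a call such as `aws.comprehend::detect_phrases()` inside the loop. As a small design choice that goes beyond the description above, this sketch removes duplicates again after lower-casing, because two spellings of the same phrase within one country only become identical at that point.

```r
# Invented sample of the phrase/country data frame.
phrases <- data.frame(
  phrase = c("Christmas Eve", "christmas eve", "Presents", "presents"),
  country = c("Austria", "Austria", "Austria", "Brazil"),
  stringsAsFactors = FALSE
)

# Normalise capitalisation so that spelling variants count as one phrase.
phrases$phrase <- tolower(phrases$phrase)

# Drop rows that became identical within a country after lower-casing.
phrases <- unique(phrases)

# Number of countries in which each phrase occurs.
phrase_counts <- sort(table(phrases$phrase), decreasing = TRUE)
print(phrase_counts)
```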
By using the resulting data frame I visualized the most common phrases.
Based on this chart I could conclude that there was some overlap between the key phrase and the entity recognition since Christmas, Christmas Eve and Christmas Day appeared in both. What was nice to see is that this holiday is truly about people and children all around the world. Oh, and let’s not forget about the presents either.
To sum up, Amazon Comprehend is a very powerful tool, and by using it we can get insights from text-based data without a lot of experience with machine learning tools. However, it is important to mention that the service is not free; in case you are interested, the pricing is available on this site. By combining web scraping, Entity Recognition and Key Phrase Recognition, we could investigate Christmas traditions around the world, which appeared to be similar to each other, with many countries sharing the same customs and values.
Enjoy the festive spirit of December!