Stack Abuse: Python for NLP: Introduction to the TextBlob Library

Introduction

This is the seventh article in my series of articles on Python for NLP. In my previous article, I explained how to perform topic modeling using Latent Dirichlet Allocation and Non-Negative Matrix Factorization with the Scikit-Learn library.

In this article, we will explore TextBlob, another extremely powerful NLP library for Python. TextBlob is built upon NLTK and provides an easy-to-use interface to the NLTK library. We will see how TextBlob can be used to perform a variety of NLP tasks, ranging from part-of-speech tagging to sentiment analysis, and from language translation to text classification.

The detailed download instructions for the library can be found at the official link. I would suggest that you install the TextBlob library as well as the sample corpora.

Here is the gist of the instructions linked above, but be sure to check the official documentation for more installation instructions if you need them:

$ pip install -U textblob

And to install the corpora:

$ python -m textblob.download_corpora

Let’s now see the different functionalities of the TextBlob library.

Tokenization

Tokenization refers to splitting a large paragraph into sentences or words. Typically, a token refers to a word in a text document. Tokenization is pretty straightforward with TextBlob. All you have to do is import the TextBlob object from the textblob library, pass it the document that you want to tokenize, and then use the sentences and words attributes to get the tokenized sentences and words. Let’s see this in action:

The first step is to import the TextBlob object:

from textblob import TextBlob   

Next, you need to define a string that contains the text of the document. We will create a string that contains the first paragraph of the Wikipedia article on artificial intelligence.

document = ("In computer science, artificial intelligence (AI), \
            sometimes called machine intelligence, is intelligence \
            demonstrated by machines, in contrast to the natural intelligence \
            displayed by humans and animals. Computer science defines AI \
            research as the study of \"intelligent agents\": any device that \
            perceives its environment and takes actions that maximize its\
            chance of successfully achieving its goals.[1] Colloquially,\
            the term \"artificial intelligence\" is used to describe machines\
            that mimic \"cognitive\" functions that humans associate with other\
            human minds, such as \"learning\" and \"problem solving\".[2]")

The next step is to pass this document as a parameter to the TextBlob class. The returned object can then be used to tokenize the document to words and sentences.

text_blob_object = TextBlob(document)   

Now to get the tokenized sentences, we can use the sentences attribute:

document_sentence = text_blob_object.sentences
print(document_sentence)
print(len(document_sentence))

In the output, you will see the tokenized sentences along with the number of sentences.

[Sentence("In computer science, artificial intelligence (AI),             sometimes called machine intelligence, is intelligence             demonstrated by machines, in contrast to the natural intelligence             displayed by humans and animals."), Sentence("Computer science defines AI             research as the study of "intelligent agents": any device that             perceives its environment and takes actions that maximize its            chance of successfully achieving its goals."), Sentence("[1] Colloquially,            the term "artificial intelligence" is used to describe machines            that mimic "cognitive" functions that humans associate with other            human minds, such as "learning" and "problem solving"."), Sentence("[2]")]
4

Similarly, the words attribute returns the tokenized words in the document.

document_words = text_blob_object.words
print(document_words)
print(len(document_words))

The output looks like this:

['In', 'computer', 'science', 'artificial', 'intelligence', 'AI', 'sometimes', 'called', 'machine', 'intelligence', 'is', 'intelligence', 'demonstrated', 'by', 'machines', 'in', 'contrast', 'to', 'the', 'natural', 'intelligence', 'displayed', 'by', 'humans', 'and', 'animals', 'Computer', 'science', 'defines', 'AI', 'research', 'as', 'the', 'study', 'of', 'intelligent', 'agents', 'any', 'device', 'that', 'perceives', 'its', 'environment', 'and', 'takes', 'actions', 'that', 'maximize', 'its', 'chance', 'of', 'successfully', 'achieving', 'its', 'goals', '1', 'Colloquially', 'the', 'term', 'artificial', 'intelligence', 'is', 'used', 'to', 'describe', 'machines', 'that', 'mimic', 'cognitive', 'functions', 'that', 'humans', 'associate', 'with', 'other', 'human', 'minds', 'such', 'as', 'learning', 'and', 'problem', 'solving', '2']
84
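
The sentence output above shows two side effects of the raw string: the indentation carried in by the backslash line continuations stays inside the sentences, and the Wikipedia citation markers [1] and [2] end up attached to the wrong sentence or form a sentence of their own. The following is a minimal sketch of a pre-processing step that could be applied before building the TextBlob. Note that clean_document is a hypothetical helper, not part of TextBlob, and the pattern assumes citation markers of the form [number].

```python
import re

def clean_document(text):
    """Remove [n] citation markers and collapse runs of whitespace."""
    # Drop Wikipedia-style citation markers such as [1] or [23].
    without_citations = re.sub(r"\[\d+\]", "", text)
    # Collapse spaces, tabs and newlines into single spaces.
    return re.sub(r"\s+", " ", without_citations).strip()

raw = "achieving its goals.[1] Colloquially,            the term is used.[2]"
print(clean_document(raw))
# achieving its goals. Colloquially, the term is used.
```

The cleaned string could then be passed to TextBlob in place of the original document.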

Lemmatization

Lemmatization refers to reducing the word to its root form as found in a dictionary.

To perform lemmatization via TextBlob, you have to use the Word object from the textblob library, pass it the word that you want to lemmatize and then call the lemmatize method.

from textblob import Word

word1 = Word("apples")
print("apples:", word1.lemmatize())

word2 = Word("media")
print("media:", word2.lemmatize())

word3 = Word("greater")
print("greater:", word3.lemmatize("a"))

In the script above, we perform lemmatization on the words “apples”, “media”, and “greater”. In the output, you will see the words “apple” (the singular of “apples”), “medium” (the singular of “media”) and “great” (the positive degree of “greater”). Notice that for the word “greater”, we pass “a” as a parameter to the lemmatize method. This tells the method that the word should be treated as an adjective. By default, the lemmatize() method treats words as nouns. The complete list of part-of-speech codes is as follows:

ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'   
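
The lemmatize method expects the one-letter codes listed above, whereas the tags attribute covered in the next section returns Penn Treebank tags such as JJ or VBD. The sketch below shows one way the two could be bridged. The helper penn_to_wordnet is hypothetical, and it assumes that falling back to the noun code for all other tags is acceptable.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a one-letter WordNet POS code."""
    if tag.startswith("JJ"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("VB"):
        return "v"  # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # everything else is treated as a noun

for tag in ["JJR", "VBD", "RB", "NNS"]:
    print(tag, "=>", penn_to_wordnet(tag))
# JJR => a
# VBD => v
# RB => r
# NNS => n
```

With TextBlob installed, the returned code could be passed to lemmatize for each (word, tag) pair, for example Word(word).lemmatize(penn_to_wordnet(tag)).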

Parts of Speech (POS) Tagging

Like the spaCy and NLTK libraries, the TextBlob library also supports POS tagging.

To find POS tags for the words in a document, all you have to do is use the tags attribute as shown below:

for word, pos in text_blob_object.tags:
    print(word + " => " + pos)

The script above prints the tags for all the words in the first paragraph of the Wikipedia article on artificial intelligence. The output looks like this:

In => IN
computer => NN
science => NN
artificial => JJ
intelligence => NN
AI => NNP
sometimes => RB
called => VBD
machine => NN
intelligence => NN
is => VBZ
intelligence => NN
demonstrated => VBN
by => IN
machines => NNS
in => IN
contrast => NN
to => TO
the => DT
natural => JJ
intelligence => NN
displayed => VBN
by => IN
humans => NNS
and => CC
animals => NNS
Computer => NNP
science => NN
defines => NNS
AI => NNP
research => NN
as => IN
the => DT
study => NN
of => IN
intelligent => JJ
agents => NNS
any => DT
device => NN
that => WDT
perceives => VBZ
its => PRP$
environment => NN
and => CC
takes => VBZ
actions => NNS
that => IN
maximize => VB
its => PRP$
chance => NN
of => IN
successfully => RB
achieving => VBG
its => PRP$
goals => NNS
[ => RB
1 => CD
] => NNP
Colloquially => NNP
the => DT
term => NN
artificial => JJ
intelligence => NN
is => VBZ
used => VBN
to => TO
describe => VB
machines => NNS
that => IN
mimic => JJ
cognitive => JJ
functions => NNS
that => WDT
humans => NNS
associate => VBP
with => IN
other => JJ
human => JJ
minds => NNS
such => JJ
as => IN
learning => VBG
and => CC
problem => NN
solving => NN
[ => RB
2 => CD
] => NNS

The POS tags have been printed in the abbreviation form. To see the full form of each abbreviation, please consult this link.
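
For quick reference, the abbreviations that appear in the output above can be expanded with a small lookup table. The sketch below covers only those tags, using their Penn Treebank descriptions. The names TAG_DESCRIPTIONS and describe are hypothetical helpers, and the sample pairs are copied by hand from the output rather than produced by TextBlob.

```python
from collections import Counter

# Descriptions for the Penn Treebank tags that appear in the output above.
TAG_DESCRIPTIONS = {
    "CC": "coordinating conjunction",
    "CD": "cardinal number",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP$": "possessive pronoun",
    "RB": "adverb",
    "TO": "to",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner",
}

def describe(tag):
    """Return the long form of a tag, or the tag itself if it is unknown."""
    return TAG_DESCRIPTIONS.get(tag, tag)

# A few (word, tag) pairs copied from the output above.
sample_tags = [("computer", "NN"), ("science", "NN"), ("artificial", "JJ"),
               ("is", "VBZ"), ("machines", "NNS")]

for word, tag in sample_tags:
    print(f"{word} => {describe(tag)}")

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in sample_tags)
print(tag_counts.most_common(1))
```

The same loop could be run over text_blob_object.tags to annotate the full paragraph.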

Convert Text to Singular and Plural

TextBlob also allows you to convert text into a plural or singular form using the pluralize and singularize methods, respectively. Look at the following example:

text = ("Football is a good game. It has many health benefit")
text_blob_object = TextBlob(text)
print(text_blob_object.words.pluralize())

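In the output, you will see the plural of all the words. Note that pluralize is applied to every word regardless of its part of speech, which is why non-nouns produce forms such as “iss” and “manies”: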

['Footballs', 'iss', 'some', 'goods', 'games', 'Its', 'hass', 'manies', 'healths', 'benefits'] 

Similarly, to singularize words you can use the singularize method as follows:

text = ("Footballs is a goods games. Its has many healths benefits")
text_blob_object = TextBlob(text)
print(text_blob_object.words.singularize())

The output of the script above looks like this:

['Football', 'is', 'a', 'good', 'game', 'It', 'ha', 'many', 'health', 'benefit'] 
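
Because pluralize and singularize act on every word, verbs and determiners get mangled as well (“iss”, “ha”). One possible workaround is to transform only the words tagged as nouns. The sketch below illustrates the idea with hand-written tags and a stand-in pluralizer so that it runs without TextBlob. Both pluralize_nouns and naive_plural are hypothetical names.

```python
def pluralize_nouns(tagged_words, pluralize):
    """Apply `pluralize` only to words tagged as singular nouns (NN)."""
    return [pluralize(word) if tag == "NN" else word
            for word, tag in tagged_words]

# Hand-written tags for illustration; with TextBlob these would come from
# TextBlob(text).tags.
tagged = [("Football", "NN"), ("is", "VBZ"), ("a", "DT"),
          ("good", "JJ"), ("game", "NN")]

# Stand-in pluralizer that simply appends "s"; it is NOT TextBlob's logic.
def naive_plural(word):
    return word + "s"

print(pluralize_nouns(tagged, naive_plural))
# ['Footballs', 'is', 'a', 'good', 'games']
```

With TextBlob installed, the second argument could instead be a function that calls Word(word).pluralize().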

Noun Phrase Extraction

Noun phrase extraction, as the name suggests, refers to extracting phrases that contain nouns. Let’s find all the noun phrases in the first paragraph of the Wikipedia article on artificial intelligence that we used earlier.

To find noun phrases, you simply have to use the noun_phrases attribute of the TextBlob object. Look at the following example:

text_blob_object = TextBlob(document)
for noun_phrase in text_blob_object.noun_phrases:
    print(noun_phrase)

The output looks like this:

computer science
artificial intelligence
ai
machine intelligence
natural intelligence
computer
science defines
ai
intelligent agents
colloquially
artificial intelligence
describe machines
human minds

You can see all the noun phrases in our document.

Getting Words and Phrase Counts

In a previous section, we used Python’s built-in len function to count the number of sentences and words returned by the TextBlob object. TextBlob also provides built-in methods for counting how often a particular word or phrase occurs.

To find the frequency of occurrence of a particular word, we pass the word as the key to the word_counts dictionary of the TextBlob object.

In the following example, we will count the number of instances of the word “intelligence” in the first paragraph of the Wikipedia article on Artificial Intelligence.

text_blob_object = TextBlob(document)
text_blob_object.word_counts['intelligence']

Another way is to simply call the count method on the words attribute, and pass the name of the word whose frequency of occurrence is to be found as shown below:

text_blob_object.words.count('intelligence')   

It is important to mention that by default the search is not case-sensitive. If you want your search to be case sensitive, you need to pass True as the value for the case_sensitive parameter, as shown below:

text_blob_object.words.count('intelligence', case_sensitive=True)   

Like word counts, noun phrases can also be counted in the same way. The following example finds the phrase “artificial intelligence” in the paragraph.

text_blob_object = TextBlob(document)
text_blob_object.noun_phrases.count('artificial intelligence')

In the output, you will see 2.
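
To make the difference between case-insensitive and case-sensitive counting concrete, here is a rough standard-library approximation of the counting described above. It is only a sketch: count_word is a hypothetical helper, and the regular expression is a simplistic stand-in for TextBlob’s tokenizer.

```python
import re
from collections import Counter

def count_word(text, word, case_sensitive=False):
    """Count occurrences of `word` among the word tokens of `text`."""
    # Very rough tokenizer: runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    if not case_sensitive:
        tokens = [token.lower() for token in tokens]
        word = word.lower()
    return Counter(tokens)[word]

sample = "Intelligence demonstrated by machines is called machine intelligence."
print(count_word(sample, "intelligence"))                       # 2
print(count_word(sample, "intelligence", case_sensitive=True))  # 1
```

The capitalized “Intelligence” at the start of the sample is counted only in the case-insensitive search.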

Converting to Upper and Lowercase

TextBlob objects are very similar to strings. You can convert them to upper case or lower case, change their values, and concatenate them together as well. In the following script, we convert the text from the TextBlob object to upper case:

text = "I love to watch football, but I have never played it"
text_blob_object = TextBlob(text)
print(text_blob_object.upper())

In the output, you will see the string in upper case:

I LOVE TO WATCH FOOTBALL, BUT I HAVE NEVER PLAYED IT   

Similarly, to convert the text to lowercase, we can use the lower() method as shown below:

text = "I LOVE TO WATCH FOOTBALL, BUT I HAVE NEVER PLAYED IT"
text_blob_object = TextBlob(text)
print(text_blob_object.lower())

Finding N-Grams

N-grams are contiguous sequences of n words in a sentence. For instance, for the sentence “I love watching football”, the 2-grams are (I love), (love watching) and (watching football). N-grams can play a crucial role in text classification.

In TextBlob, n-grams can be found by passing the value of n to the ngrams method of the TextBlob object. Look at the following example:

text = "I love to watch football, but I have never played it"
text_blob_object = TextBlob(text)
for ngram in text_blob_object.ngrams(2):
    print(ngram)

The output of the script looks like this:

['I', 'love']
['love', 'to']
['to', 'watch']
['watch', 'football']
['football', 'but']
['but', 'I']
['I', 'have']
['have', 'never']
['never', 'played']
['played', 'it']

This is especially helpful when training language models or doing any type of text prediction.
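
The sliding-window idea behind n-grams is simple enough to sketch in plain Python. The function below is an illustrative stand-in, not TextBlob’s implementation, and it assumes the text has already been split into a list of words.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a list."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = "I love watching football".split()
for gram in ngrams(words, 2):
    print(gram)
# ['I', 'love']
# ['love', 'watching']
# ['watching', 'football']
```

A sentence of k words yields k - n + 1 n-grams, and none at all if it is shorter than n.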

Spelling Corrections

Spelling correction is one of the unique functionalities of the TextBlob library. With the correct method of the TextBlob object, you can correct all the spelling mistakes in your text. Look at the following example:

text = "I love to watchf footbal, but I have neter played it"
text_blob_object = TextBlob(text)
print(text_blob_object.correct())

In the script above we made three spelling mistakes: “watchf” instead of “watch”, “footbal” instead of “football”, “neter” instead of “never”. In the output, you will see that these mistakes have been corrected by TextBlob, as shown below:

I love to watch football, but I have never played it   
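
According to the TextBlob documentation, its spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector”. The sketch below illustrates the core of that idea: generate every string one edit away from the misspelled word and keep the most frequent known candidate. It is a simplified illustration, not TextBlob’s code, and the vocabulary and frequencies in WORD_COUNTS are made up for the example.

```python
import string

# Tiny illustrative vocabulary with made-up frequencies; a real corrector
# would derive these counts from a large text corpus.
WORD_COUNTS = {"watch": 50, "football": 40, "never": 80, "love": 90, "water": 30}

def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit of `word`."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    if not candidates:
        return word  # nothing better is known, so leave the word unchanged
    return max(candidates, key=WORD_COUNTS.get)

for typo in ["watchf", "footbal", "neter"]:
    print(typo, "=>", correct(typo))
# watchf => watch
# footbal => football
# neter => never
```

Because the choice depends on word frequencies, a corrector of this kind can also “fix” a rare but valid word into a more common one, so its output is worth reviewing.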

Language Translation

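One of the most powerful capabilities of the TextBlob library is translating text from one language to another. On the backend, the TextBlob language translator uses the Google Translate API, so the examples in this section need an internet connection. Note that the translate and detect_language methods were deprecated in TextBlob 0.16 and removed in later releases, so these examples only run on older versions of the library.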

To translate from one language to another, you simply have to pass the text to the TextBlob object and then call the translate method on the object. The language code for the language that you want your text to be translated to is passed as a parameter to the method. Let’s take a look at an example:

text_blob_object_french = TextBlob(u'Salut comment allez-vous?')
print(text_blob_object_french.translate(to='en'))

In the script above, we pass a sentence in the French language to the TextBlob object. Next, we call the translate method on the object and pass language code en to the to parameter. The language code en corresponds to the English language. In the output, you will see the translation of the French sentence as shown below:

Hi, how are you?   

Let’s take another example where we will translate from Arabic to English:

text_blob_object_arabic = TextBlob(u'مرحبا كيف حالك؟')
print(text_blob_object_arabic.translate(to='en'))

Output:

Hi, how are you?   

Finally, using the detect_language method, you can also detect the language of the sentence. Look at the following script:

text_blob_object = TextBlob(u'Hola como estas?')
print(text_blob_object.detect_language())

In the output, you will see es, which stands for the Spanish language.

The language code for all the languages can be found at this link.

Text Classification

TextBlob also provides basic text classification capabilities. I would not recommend TextBlob for serious text classification owing to its limited capabilities. However, if you have very little data and want to quickly develop a very basic text classification model, TextBlob is a reasonable choice. For advanced models, I would recommend machine learning libraries such as Scikit-Learn or TensorFlow.

Let’s see how we can perform text classification with TextBlob. The first thing we need is a training dataset and test data. The classification model will be trained on the training dataset and will be evaluated on the test dataset.

Suppose we have the following training and test data:

train_data = [
    ('This is an excellent movie', 'pos'),
    ('The move was fantastic I like it', 'pos'),
    ('You should watch it, it is brilliant', 'pos'),
    ('Exceptionally good', 'pos'),
    ("Wonderfully directed and executed. I like it", 'pos'),
    ('It was very boring', 'neg'),
    ('I did not like the movie', 'neg'),
    ("The movie was horrible", 'neg'),
    ('I will not recommend', 'neg'),
    ('The acting is pathetic', 'neg')
]

test_data = [
    ('Its a fantastic series', 'pos'),
    ('Never watched such a brillent movie', 'pos'),
    ("horrible acting", 'neg'),
    ("It is a Wonderful movie", 'pos'),
    ('waste of money', 'neg'),
    ("pathetic picture", 'neg')
]

The dataset contains some dummy reviews about movies. Both the training and test datasets are lists of tuples, where the first element of each tuple is the text and the second element is its sentiment label (pos or neg).

We will train our model on train_data and evaluate it on test_data. To do so, we will use the NaiveBayesClassifier class from the textblob.classifiers module. The following script imports the class:

from textblob.classifiers import NaiveBayesClassifier   

To train the model, we simply have to pass the training data to the constructor of the NaiveBayesClassifier class. The class will return an object trained on the dataset and capable of making predictions on the test set.

classifier = NaiveBayesClassifier(train_data)   

Let’s first make a prediction on a single sentence. To do so, we need to call the classify method and pass it the sentence. Look at the following example:

print(classifier.classify("It is very boring"))   

It looks like a negative review. When you execute the above script, you will see neg in the output.

Similarly, the following script will return pos since the review is positive.

print(classifier.classify("It's a fantastic series"))   

You can also make a prediction by passing our classifier to the classifier parameter of the TextBlob object. You then have to call the classify method on the TextBlob object to view the prediction.

sentence = TextBlob("It's a fantastic series.", classifier=classifier)
print(sentence.classify())

Finally, to find the accuracy of your algorithm on the test set, call the accuracy method on your classifier and pass it the test_data that we just created. Look at the following script:

classifier.accuracy(test_data)   

In the output, you will see 0.66 which is the accuracy of the algorithm.
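
Accuracy here simply means the fraction of test examples whose predicted label matches the true label. The sketch below spells that out with a stand-in predictor so that it runs without TextBlob. Both accuracy and toy_predict are hypothetical helpers, and the toy predictor is a keyword rule rather than a Naive Bayes model.

```python
def accuracy(predict, labeled_data):
    """Fraction of (text, label) pairs for which predict(text) == label."""
    correct = sum(1 for text, label in labeled_data if predict(text) == label)
    return correct / len(labeled_data)

# Stand-in predictor: answers 'neg' only when the text contains one of a few
# hand-picked negative words, and 'pos' otherwise.
def toy_predict(text):
    negative_words = {"horrible", "pathetic", "boring"}
    return "neg" if negative_words & set(text.lower().split()) else "pos"

sample = [("horrible acting", "neg"), ("pathetic picture", "neg"),
          ("waste of money", "neg"), ("It is a Wonderful movie", "pos")]
print(accuracy(toy_predict, sample))  # 3 of 4 correct -> 0.75
```

With the trained classifier from above, passing classifier.classify as the predictor and test_data as the data should give a comparable figure to the built-in accuracy method.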

To find the most important features for the classification, the show_informative_features method can be used. The number of most important features to see is passed as a parameter.

classifier.show_informative_features(3)   

The output looks like this:

Most Informative Features
            contains(it) = False             neg : pos    =      2.2 : 1.0
            contains(is) = True              pos : neg    =      1.7 : 1.0
           contains(was) = True              neg : pos    =      1.7 : 1.0
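
The contains(...) entries in this output are the features the classifier works with: each document is represented by whether it contains each word of the training vocabulary. The sketch below is a simplified illustration of that representation. The helper contains_features is hypothetical, and it tokenizes with str.split, which differs from TextBlob’s tokenizer.

```python
def contains_features(document, vocabulary):
    """Build a {'contains(word)': bool} feature dict for one document."""
    tokens = set(document.lower().split())
    return {f"contains({word})": word in tokens for word in sorted(vocabulary)}

vocabulary = {"it", "is", "was", "boring"}
features = contains_features("It was very boring", vocabulary)
for name, value in features.items():
    print(name, value)
# contains(boring) True
# contains(is) False
# contains(it) True
# contains(was) True
```

Read this way, a line such as contains(was) = True with neg : pos = 1.7 : 1.0 says that this feature value was about 1.7 times as likely among the neg training examples as among the pos ones.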

In this section, we tried to find the sentiment of the movie review using text classification. In reality, you don’t have to perform text classification to find the sentiment of a sentence in TextBlob. The TextBlob library comes with a built-in sentiment analyzer which we will see in the next section.

Sentiment Analysis

In this section, we will analyze the sentiment of the public reviews for different foods purchased via Amazon. We will use the TextBlob sentiment analyzer to do so.

The dataset can be downloaded from this Kaggle link.

As a first step, we need to import the dataset. We will only import the first 20,000 records due to memory constraints. You can import more records if you want. The following script imports the dataset:

import pandas as pd
import numpy as np

reviews_datasets = pd.read_csv(r'E:\Datasets\Reviews.csv')
reviews_datasets = reviews_datasets.head(20000)
# dropna() returns a new data frame, so assign the result back
reviews_datasets = reviews_datasets.dropna()

To see how our dataset looks, we will use the head method of the pandas data frame:

reviews_datasets.head()   

The output looks like this:

From the output, you can see that the text of each food review is contained in the Text column. The Score column contains the user’s rating for the product, with 1 being the lowest and 5 being the highest rating.

Let’s see the distribution of rating:

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

sns.distplot(reviews_datasets['Score'])

You can see that most of the ratings are highly positive i.e. 5. Let’s plot the bar plot for the ratings to have a better look at the number of records for each rating.

sns.countplot(x='Score', data=reviews_datasets)   

The output shows that more than half of reviews have 5-star ratings.

Let’s randomly select a review and find its polarity using TextBlob. Let’s take a look at review number 350.

reviews_datasets['Text'][350]   

Output:

'These chocolate covered espresso beans are wonderful!  The chocolate is very dark and rich and the "bean" inside is a very delightful blend of flavors with just enough caffine to really give it a zing.'   

It looks like the review is positive. Let’s verify this using the TextBlob library. To find the sentiment, we have to use the sentiment attribute of the TextBlob object. The sentiment attribute returns a named tuple that contains the polarity and subjectivity of the review.

The value of polarity can be between -1 and 1 where the reviews with negative polarities have negative sentiments while the reviews with positive polarities have positive sentiments.

The subjectivity value can be between 0 and 1. Subjectivity quantifies how much of the text is personal opinion as opposed to factual information. A higher subjectivity means that the text contains more personal opinion than factual information.

Let’s find the sentiment of review number 350.

text_blob_object = TextBlob(reviews_datasets['Text'][350])
print(text_blob_object.sentiment)

The output looks like this:

Sentiment(polarity=0.39666666666666667,subjectivity=0.6616666666666667)   

The output shows that the review is positive with a high subjectivity.
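
The polarity range described above can be turned into a simple label. The sketch below shows one possible convention, assuming that any polarity above zero counts as positive and any polarity below zero counts as negative. The function polarity_label and its threshold parameter are hypothetical and not part of TextBlob.

```python
def polarity_label(polarity, threshold=0.0):
    """Turn a polarity in [-1, 1] into 'positive', 'negative' or 'neutral'."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

print(polarity_label(0.39666666666666667))  # positive
print(polarity_label(-1.0))                 # negative
print(polarity_label(0.0))                  # neutral
```

Raising the threshold, for example to 0.1, would treat weakly polar reviews as neutral instead.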

Let’s now add a column for sentiment polarity in our dataset. Execute the following script:

def find_pol(review):
    return TextBlob(review).sentiment.polarity

reviews_datasets['Sentiment_Polarity'] = reviews_datasets['Text'].apply(find_pol)
reviews_datasets.head()

Now let’s see the distribution of polarity in our dataset. Execute the following script:

sns.distplot(reviews_datasets['Sentiment_Polarity'])   

The output of the script above looks like this:

It is evident from the figure above that most of the reviews are positive and have polarity between 0 and 0.5. This is natural since most of the reviews in the dataset have 5-star ratings.

Let’s now plot the average polarity for each score rating.

sns.barplot(x='Score', y='Sentiment_Polarity', data=reviews_datasets)   

Output:

The output clearly shows that the reviews with high rating scores have high positive polarities.
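
The height of each bar in such a plot is the mean polarity of the reviews with that score. To make the computation explicit, the sketch below groups and averages values using only the standard library. The function name is hypothetical, and the (score, polarity) pairs are made up for illustration rather than taken from the dataset.

```python
from collections import defaultdict
from statistics import mean

def mean_polarity_by_score(rows):
    """Average the polarity values for each score; rows are (score, polarity)."""
    grouped = defaultdict(list)
    for score, polarity in rows:
        grouped[score].append(polarity)
    return {score: mean(values) for score, values in sorted(grouped.items())}

# Made-up (score, polarity) pairs purely for illustration.
rows = [(5, 0.5), (5, 0.25), (1, -0.5), (1, -0.25), (3, 0.0)]
print(mean_polarity_by_score(rows))
# {1: -0.375, 3: 0.0, 5: 0.375}
```

On the actual data frame, the equivalent pandas expression would be reviews_datasets.groupby('Score')['Sentiment_Polarity'].mean().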

Let’s now see some of the most negative reviews i.e. the reviews with a polarity value of -1.

most_negative = reviews_datasets[reviews_datasets.Sentiment_Polarity == -1].Text.head()
print(most_negative)

The output looks like this:

545     These chips are nasty.  I thought someone had ...
1083    All my fault. I thought this would be a carton...
1832    Pop Chips are basically a horribly over-priced...
2087    I do not consider Gingerbread, Spicy Eggnog, C...
2763    This popcorn has alot of hulls I order 4 bags ...
Name: Text, dtype: object

Let’s print the value of review number 545.

reviews_datasets['Text'][545]   

In the output, you will see the following review:

'These chips are nasty.  I thought someone had spilled a drink in the bag, no the chips were just soaked with grease.  Nasty!!'   

The output clearly shows that the review is highly negative.

Let’s now see some of the most positive reviews. Execute the following script:

most_positive = reviews_datasets[reviews_datasets.Sentiment_Polarity == 1].Text.head()
print(most_positive)

The output looks like this:

106     not what I was expecting in terms of the compa...
223     This is an excellent tea.  One of the best I h...
338     I like a lot of sesame oil and use it in salad...
796     My mother and father were the recipient of the...
1031    The Kelloggs Muselix are delicious and the del...
Name: Text, dtype: object

Let’s see review 106 in detail:

reviews_datasets['Text'][106]   

Output:

"not what I was expecting in terms of the company's reputation for excellent home delivery products" 

You can see that though the review was not very positive, it has been assigned a polarity of 1 due to the presence of words like excellent and reputation. It is important to know that the sentiment analyzer is not 100% error-proof and might predict the wrong sentiment in some cases, such as the one we just saw.

Let’s now see review number 223 which also has been marked as positive.

reviews_datasets['Text'][223]   

The output looks like this:

"This is an excellent tea.  One of the best I have ever had.  It is especially great when you prepare it with a samovar." 

The output clearly depicts that the review is highly positive.

Conclusion

Python’s TextBlob library is one of the most famous and widely used natural language processing libraries. This article explained several functionalities of the TextBlob library, such as tokenization, lemmatization, sentiment analysis, text classification and language translation, in detail.
