Stack Abuse: Brief Introduction to OpenGL in Python with PyOpenGL

Introduction

In this tutorial, we’re going to learn how to use the PyOpenGL library in Python. OpenGL is a graphics library that is supported on multiple platforms, including Windows, Linux, and macOS, and is available for use in several other languages as well; however, the scope of this post will be limited to its usage in the Python programming language.

OpenGL, as compared to other similar graphics libraries, is fairly simple. We’ll start with setting it up on our system, followed by writing a simple example demonstrating the usage of the library.

Installation

The easiest way to install OpenGL for Python is through the pip package manager. If you have pip installed on your system, run the following command to download and install OpenGL:

$   pip install PyOpenGL PyOpenGL_accelerate 

I’d recommend copying the above command to help avoid typos.

Once this command finishes execution, if the installation is successful, you should get the following output at the end:

Successfully installed PyOpenGL-3.1.0 PyOpenGL-accelerate-3.1.0   

If this doesn’t work, you can also download it manually. For that, open this link, scroll down to the ‘Downloading and Installation’ heading, and download all the files there. After that, navigate to the folder where you downloaded those files, and run the following command in the terminal or command prompt:

$   python setup.py install 

It is pertinent to mention that, on Windows, you need the Visual C++ 14.0 build tools installed on your system in order to work with OpenGL libraries in Python.

Now that we have successfully installed OpenGL on our system, let’s get our hands dirty with it.

Coding Exercise

The first thing we need to do to use OpenGL in our code is to import it. To do that, add the following import:

import OpenGL   

Before we proceed, there are a few other libraries that you need to import whenever you intend to use this library in your program. Below is the code for those imports:

from OpenGL.GL import *
from OpenGL.GLUT import *
from OpenGL.GLU import *
print("Imports successful!") # If you see this printed to the console then installation was successful

Now that we are done with the necessary imports, let’s first create a window in which our graphics will be shown. The code for that is given below, along with its explanation in the comments:

def showScreen():
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT) # Remove everything from screen (i.e. displays all white)

glutInit() # Initialize a glut instance which will allow us to customize our window
glutInitDisplayMode(GLUT_RGBA) # Set the display mode to be colored
glutInitWindowSize(500, 500)   # Set the width and height of your window
glutInitWindowPosition(0, 0)   # Set the position at which this window should appear
wind = glutCreateWindow("OpenGL Coding Practice") # Give your window a title
glutDisplayFunc(showScreen)  # Tell OpenGL to call the showScreen method continuously
glutIdleFunc(showScreen)     # Draw any graphics or shapes in the showScreen function at all times
glutMainLoop()  # Keeps the window created above displaying/running in a loop

Copy the imports above, together with this code, into a single Python (.py) file and execute it. You should see a white, square 500 x 500 window pop up. Now, if we wish to draw any shapes or make any other kind of graphics, we need to do that in our showScreen function.

Let’s now try to make a square using OpenGL, but before we do, we need to understand the coordinate system that OpenGL follows.

The (0, 0) point is the bottom left of your window. Moving up from there takes you along the y-axis, and moving right takes you along the x-axis. So the top left point of your window would be (0, 500), the top right would be (500, 500), and the bottom right would be (500, 0).

Note: We’re talking about the window we created above, which had a dimension of 500 x 500 in our example, and not your computer’s full screen.

Now that we’ve got that out of the way, let’s code a square. The explanation of the code can be found in the comments.

from OpenGL.GL import *
from OpenGL.GLUT import *
from OpenGL.GLU import *

w, h = 500, 500

# ---Section 1---
def square():
    # We have to declare the points in this sequence: bottom left, bottom right, top right, top left
    glBegin(GL_QUADS) # Begin the sketch
    glVertex2f(100, 100) # Coordinates for the bottom left point
    glVertex2f(200, 100) # Coordinates for the bottom right point
    glVertex2f(200, 200) # Coordinates for the top right point
    glVertex2f(100, 200) # Coordinates for the top left point
    glEnd() # Mark the end of drawing

# This alone isn't enough to draw our square

# ---Section 2---
def showScreen():
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT) # Remove everything from screen (i.e. displays all white)
    glLoadIdentity() # Reset all graphic/shape's position
    square() # Draw a square using our function
    glutSwapBuffers()

# ---Section 3---
glutInit()
glutInitDisplayMode(GLUT_RGBA) # Set the display mode to be colored
glutInitWindowSize(500, 500)   # Set the w and h of your window
glutInitWindowPosition(0, 0)   # Set the position at which this window should appear
wind = glutCreateWindow("OpenGL Coding Practice") # Set a window title
glutDisplayFunc(showScreen)
glutIdleFunc(showScreen) # Keeps the window open
glutMainLoop()  # Keeps the above created window displaying/running in a loop

Running the code above would draw a square, but that square would not be visible since its color would be the same as the color of our window, so we need to assign it a different color. To do that, we will make a small change in “Section 2” of the code above, i.e. the showScreen function. Add the following line below the glLoadIdentity statement and above the square() statement:

glColor3f(1.0, 0.0, 3.0) # Set the color to pink   

However, our code is still not complete. What it currently does is draw the square once and then clear the screen again. We don’t want that. In fact, we wouldn’t even be able to spot the moment when it actually draws the square, because it would appear and disappear in a split second. Let’s write another function to avoid this.

# Add this function before Section 2 of the code above (i.e. the showScreen function)
def iterate():
    glViewport(0, 0, 500, 500)
    glMatrixMode(GL_PROJECTION)
    glLoadIdentity()
    glOrtho(0.0, 500, 0.0, 500, 0.0, 1.0)
    glMatrixMode(GL_MODELVIEW)
    glLoadIdentity()

Call this iterate function in “Section 2” of the code above. Add it below the glLoadIdentity call and above the glColor3f statement in the showScreen function.
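For clarity, after both of these changes the showScreen function from “Section 2” should look like this (the full program is compiled below):

def showScreen():
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)
    glLoadIdentity()
    iterate() # Set up the projection so our coordinates map to window pixels
    glColor3f(1.0, 0.0, 3.0) # Set the color to pink
    square()
    glutSwapBuffers()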

Let’s now compile all this into a single code file so that there are no ambiguities:

from OpenGL.GL import *
from OpenGL.GLUT import *
from OpenGL.GLU import *

w, h = 500, 500

def square():
    glBegin(GL_QUADS)
    glVertex2f(100, 100)
    glVertex2f(200, 100)
    glVertex2f(200, 200)
    glVertex2f(100, 200)
    glEnd()

def iterate():
    glViewport(0, 0, 500, 500)
    glMatrixMode(GL_PROJECTION)
    glLoadIdentity()
    glOrtho(0.0, 500, 0.0, 500, 0.0, 1.0)
    glMatrixMode(GL_MODELVIEW)
    glLoadIdentity()

def showScreen():
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)
    glLoadIdentity()
    iterate()
    glColor3f(1.0, 0.0, 3.0)
    square()
    glutSwapBuffers()

glutInit()
glutInitDisplayMode(GLUT_RGBA)
glutInitWindowSize(500, 500)
glutInitWindowPosition(0, 0)
wind = glutCreateWindow("OpenGL Coding Practice")
glutDisplayFunc(showScreen)
glutIdleFunc(showScreen)
glutMainLoop()

When you run this, a window should appear with a pink colored square box in it.


Conclusion

In this tutorial, we learned about OpenGL, how to download and install it, and how to use it in a short example program. In this example we also practiced making a basic shape using OpenGL, which gave us an insight into some of the more complex function calls that are needed whenever we want to draw something with this library. To conclude, OpenGL is very resourceful and gets more and more complex as we dive deeper into it.


Stack Abuse: Python for NLP: Working with the Gensim Library (Part 2)

This is my 11th article in the series of articles on Python for NLP and the 2nd article on the Gensim library in this series. In a previous article, I provided a brief introduction to Python’s Gensim library. I explained how we can create dictionaries that map words to their corresponding numeric IDs. We further discussed how to create a bag of words corpus from dictionaries. In this article, we will study how we can perform topic modeling using the Gensim library.

In my previous article, I explained how to do topic modeling using Python’s Scikit-Learn library. In that article, I explained how Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) can be used for topic modeling.

In this article, we will use the Gensim library for topic modeling. The approaches employed for topic modeling will be LDA and LSI (Latent Semantic Indexing).

Installing Required Libraries

We will perform topic modeling on the text obtained from Wikipedia articles. To scrape Wikipedia articles, we will use the Wikipedia API. To download the Wikipedia API library, execute the following command:

$   pip install wikipedia 

Otherwise, if you use the Anaconda distribution of Python, you can use one of the following commands:

$   conda install -c conda-forge wikipedia
$   conda install -c conda-forge/label/cf201901 wikipedia

To visualize our topic model, we will use the pyLDAvis library. To download the library, execute the following pip command:

$   pip install pyLDAvis 

Again, if you use the Anaconda distribution instead you can execute one of the following commands:

$   conda install -c conda-forge pyldavis
$   conda install -c conda-forge/label/gcc7 pyldavis
$   conda install -c conda-forge/label/cf201901 pyldavis

Topic Modeling with LDA

In this section, we will perform topic modeling of the Wikipedia articles using LDA.

We will download four Wikipedia articles on the topics “Global Warming”, “Artificial Intelligence”, “Eiffel Tower”, and “Mona Lisa”. Next, we will preprocess the articles, followed by the topic modeling step. Finally, we will see how we can visualize the LDA model.

Scraping Wikipedia Articles

Execute the following script:

import wikipedia
import nltk

nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

global_warming = wikipedia.page("Global Warming")
artificial_intelligence = wikipedia.page("Artificial Intelligence")
mona_lisa = wikipedia.page("Mona Lisa")
eiffel_tower = wikipedia.page("Eiffel Tower")

corpus = [global_warming.content, artificial_intelligence.content, mona_lisa.content, eiffel_tower.content]

In the script above, we first import the wikipedia and nltk libraries. We also download the English nltk stopwords. We will use these stopwords later.

Next, we downloaded the articles from Wikipedia by passing each topic to the page method of the wikipedia library. The object returned contains information about the downloaded page.

To retrieve the contents of the webpage, we can use the content attribute. The content of all the four articles is stored in the list named corpus.
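As a quick sanity check (a hypothetical snippet, not part of the original script), you can print the beginning of one of the articles to confirm that the download worked:

print(global_warming.content[:300])  # First few hundred characters of the article text
print(len(corpus))                   # Should print 4, one entry per article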

Data Preprocessing

To perform topic modeling via LDA, we need a data dictionary and the bag of words corpus. From the last article (linked above), we know that to create a dictionary and bag of words corpus we need data in the form of tokens.

Furthermore, we need to remove things like punctuation and stop words from our dataset. For the sake of uniformity, we will convert all the tokens to lower case and will also lemmatize them. Also, we will remove all the tokens that are 5 characters long or shorter.

Look at the following script:

import re
from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

def preprocess_text(document):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(document))

    # Remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove single characters from the start
    document = re.sub(r'\^[a-zA-Z]\s+', ' ', document)

    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)

    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Converting to Lowercase
    document = document.lower()

    # Lemmatization
    tokens = document.split()
    tokens = [stemmer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 5]

    return tokens

In the above script, we create a method named preprocess_text that accepts a text document as a parameter. The method uses regex operations to perform a variety of tasks. Let’s briefly review what’s happening in the function above:

document = re.sub(r'\W', ' ', str(document))   

The above line replaces all the special characters with a space. However, when you remove punctuation, single characters with no meaning appear in the text. For instance, when you replace the punctuation in the text Eiffel's, the words Eiffel and s appear. Here the s has no meaning, so we need to replace it with a space. The following line does that:

document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)   

The above script removes single characters within the text only. To remove a single character at the beginning of the text, the following code is used.

document = re.sub(r'\^[a-zA-Z]\s+', ' ', document)   

When you remove single characters within the text, multiple empty spaces can appear. The following code replaces multiple empty spaces with a single space:

document = re.sub(r'\s+', ' ', document, flags=re.I)   

When you scrape a document online, the string b is often prefixed to the document, which signifies that the document is binary. To remove the prefixed b, the following line is used:

document = re.sub(r'^b\s+', '', document)   

The rest of the method is self-explanatory. The document is converted to lower case and then split into tokens. The tokens are lemmatized and the stop words are removed. Finally, all tokens of five characters or fewer are ignored. The remaining tokens are returned to the calling function.
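As a quick illustration (a made-up example, not part of the original article), calling the function on a short sentence shows the kind of tokens it produces:

sample = "The Eiffel Tower's designer was the engineer Gustave Eiffel."
print(preprocess_text(sample))
# Should print something like ['eiffel', 'designer', 'engineer', 'gustave', 'eiffel']
# Note that 'tower' (5 characters) is filtered out by the length check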

Modeling Topics

This section is the meat of the article. Here we will see how the Gensim library’s built-in function can be used for topic modeling. But before that, we need to create a corpus of all the tokens (words) in the four Wikipedia articles that we scraped. Look at the following script:

processed_data = []
for doc in corpus:
    tokens = preprocess_text(doc)
    processed_data.append(tokens)

The script above is straightforward. We iterate through the corpus list that contains the four Wikipedia articles in the form of strings. In each iteration, we pass the document to the preprocess_text method that we created earlier. The method returns the tokens for that particular document. The tokens are stored in the processed_data list.

At the end of the for loop all tokens from all four articles will be stored in the processed_data list. We can now use this list to create a dictionary and corresponding bag of words corpus. The following script does that:

from gensim import corpora

gensim_dictionary = corpora.Dictionary(processed_data)
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in processed_data]

Next, we will save our dictionary as well as the bag of words corpus using pickle. We will use the saved dictionary later to make predictions on the new data.

import pickle

pickle.dump(gensim_corpus, open('gensim_corpus_corpus.pkl', 'wb'))
gensim_dictionary.save('gensim_dictionary.gensim')

Now we have everything needed to create an LDA model in Gensim. We will use the LdaModel class from the gensim.models.ldamodel module to create the LDA model. We need to pass the bag of words corpus that we created earlier as the first parameter to the LdaModel constructor, followed by the number of topics, the dictionary that we created earlier, and the number of passes (number of iterations for the model).

Execute the following script:

import gensim

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=20)
lda_model.save('gensim_model.gensim')

Yes, it is that simple. In the script above we created the LDA model from our dataset and saved it.

Next, let’s print 10 words for each topic. To do so, we can use the print_topics method. Execute the following script:

topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

The output looks like this:

(0, '0.036*"painting" + 0.018*"leonardo" + 0.009*"louvre" + 0.009*"portrait" + 0.006*"museum" + 0.006*"century" + 0.006*"french" + 0.005*"giocondo" + 0.005*"original" + 0.004*"picture"')
(1, '0.016*"intelligence" + 0.014*"machine" + 0.012*"artificial" + 0.011*"problem" + 0.010*"learning" + 0.009*"system" + 0.008*"network" + 0.007*"research" + 0.007*"knowledge" + 0.007*"computer"')
(2, '0.026*"eiffel" + 0.008*"second" + 0.006*"french" + 0.006*"structure" + 0.006*"exposition" + 0.005*"tallest" + 0.005*"engineer" + 0.004*"design" + 0.004*"france" + 0.004*"restaurant"')
(3, '0.031*"climate" + 0.026*"change" + 0.024*"warming" + 0.022*"global" + 0.014*"emission" + 0.013*"effect" + 0.012*"greenhouse" + 0.011*"temperature" + 0.007*"carbon" + 0.006*"increase"')

The first topic contains words like painting, louvre, portrait, french, museum, etc. We can assume that these words belong to a topic related to a painting with a French connection.

Similarly, the second topic contains words like intelligence, machine, research, etc. We can assume that these words belong to the topic of Artificial Intelligence.

Similarly, the words from the third and fourth topics point to the fact that these words are part of the topic Eiffel Tower and Global Warming, respectively.

We can clearly see that the LDA model has successfully identified the four topics in our data set.

It is important to mention here that LDA is an unsupervised learning algorithm, and in real-world problems you will not know the topics in the dataset beforehand. You will simply be given a corpus, the topics will be created using LDA, and naming the topics is then up to you.

Let’s now create 8 topics using our dataset. We will print 5 words per topic:

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=8, id2word=gensim_dictionary, passes=15)
lda_model.save('gensim_model.gensim')

topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

The output looks like this:

(0, '0.000*"climate" + 0.000*"change" + 0.000*"eiffel" + 0.000*"warming" + 0.000*"global"')
(1, '0.018*"intelligence" + 0.016*"machine" + 0.013*"artificial" + 0.012*"problem" + 0.010*"learning"')
(2, '0.045*"painting" + 0.023*"leonardo" + 0.012*"louvre" + 0.011*"portrait" + 0.008*"museum"')
(3, '0.000*"intelligence" + 0.000*"machine" + 0.000*"problem" + 0.000*"artificial" + 0.000*"system"')
(4, '0.035*"climate" + 0.030*"change" + 0.027*"warming" + 0.026*"global" + 0.015*"emission"')
(5, '0.031*"eiffel" + 0.009*"second" + 0.007*"french" + 0.007*"structure" + 0.007*"exposition"')
(6, '0.000*"painting" + 0.000*"machine" + 0.000*"system" + 0.000*"intelligence" + 0.000*"problem"')
(7, '0.000*"climate" + 0.000*"change" + 0.000*"global" + 0.000*"machine" + 0.000*"intelligence"')

Again, the number of topics that you want to create is up to you. Keep trying different numbers until you find suitable topics. For our dataset the suitable number of topics is 4, since we already know that our corpus contains words from four different articles. Revert to four topics by executing the following script:

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=20)
lda_model.save('gensim_model.gensim')

topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

This time, you will see different results since the initial values for the LDA parameters are chosen randomly. The results this time are as follows:

(0, '0.031*"climate" + 0.027*"change" + 0.024*"warming" + 0.023*"global" + 0.014*"emission" + 0.013*"effect" + 0.012*"greenhouse" + 0.011*"temperature" + 0.007*"carbon" + 0.006*"increase"')
(1, '0.026*"eiffel" + 0.008*"second" + 0.006*"french" + 0.006*"structure" + 0.006*"exposition" + 0.005*"tallest" + 0.005*"engineer" + 0.004*"design" + 0.004*"france" + 0.004*"restaurant"')
(2, '0.037*"painting" + 0.019*"leonardo" + 0.009*"louvre" + 0.009*"portrait" + 0.006*"museum" + 0.006*"century" + 0.006*"french" + 0.005*"giocondo" + 0.005*"original" + 0.004*"subject"')
(3, '0.016*"intelligence" + 0.014*"machine" + 0.012*"artificial" + 0.011*"problem" + 0.010*"learning" + 0.009*"system" + 0.008*"network" + 0.007*"knowledge" + 0.007*"research" + 0.007*"computer"')

You can see that the words for the first topic are now mostly related to Global Warming, while the second topic contains words related to the Eiffel Tower.

Evaluating the LDA Model

As I said earlier, unsupervised learning models are hard to evaluate since there is no concrete truth against which we can test the output of our model.

Suppose we have a new text document and we want to find its topic using the LDA model we just created. We can do so with the following script:

test_doc = 'Great structures are build to remember an event happened in the history.'
test_doc = preprocess_text(test_doc)
bow_test_doc = gensim_dictionary.doc2bow(test_doc)

print(lda_model.get_document_topics(bow_test_doc))

In the script above, we created a string, preprocessed it into tokens, and then converted the tokens into a bag of words representation. The bag of words representation is then passed to the get_document_topics method. The output looks like this:

[(0, 0.08422605), (1, 0.7446843), (2, 0.087012805), (3, 0.08407689)] 

The output shows that there is an 8.4% chance that the new document belongs to topic 1 (see the words for topic 1 in the last output). Similarly, there is a 74.4% chance that this document belongs to the second topic. If we look at the second topic, it contains words related to the Eiffel Tower. Our test document also contains words related to structures and buildings, so it has been assigned the second topic.

Another way to evaluate the LDA model is via Perplexity and Coherence Score.

As a rule of thumb for a good LDA model, the perplexity score should be low while the coherence should be high. The Gensim library has a CoherenceModel class which can be used to find the coherence of an LDA model. For perplexity, the LdaModel object has a log_perplexity method which takes a bag of words corpus as a parameter and returns the corresponding perplexity.

print('\nPerplexity:', lda_model.log_perplexity(gensim_corpus))

from gensim.models import CoherenceModel

coherence_score_lda = CoherenceModel(model=lda_model, texts=processed_data, dictionary=gensim_dictionary, coherence='c_v')
coherence_score = coherence_score_lda.get_coherence()

print('\nCoherence Score:', coherence_score)

The CoherenceModel class takes the LDA model, the tokenized text, the dictionary, and the coherence measure as parameters. To get the coherence score, the get_coherence method is used. The output looks like this:

Perplexity: -7.492867099178969

Coherence Score: 0.718387005948207
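Since choosing the number of topics is largely trial and error, one practical approach (a sketch of my own, not from the original article) is to train a few models with different topic counts and compare their coherence scores:

# Compare coherence scores for a few candidate topic counts (this can take a while)
for num_topics in [2, 4, 6, 8]:
    model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=num_topics,
                                            id2word=gensim_dictionary, passes=20)
    cm = CoherenceModel(model=model, texts=processed_data,
                        dictionary=gensim_dictionary, coherence='c_v')
    print(num_topics, 'topics -> coherence:', cm.get_coherence())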

Visualizing the LDA

To visualize our data, we can use the pyLDAvis library that we downloaded at the beginning of the article. The library contains a module for the Gensim LDA model. First we need to prepare the visualization by passing the dictionary, the bag of words corpus and the LDA model to the prepare method. Next, we call the display method of the pyLDAvis library on the prepared visualization, as shown below:

gensim_dictionary = gensim.corpora.Dictionary.load('gensim_dictionary.gensim')
gensim_corpus = pickle.load(open('gensim_corpus_corpus.pkl', 'rb'))
lda_model = gensim.models.ldamodel.LdaModel.load('gensim_model.gensim')

import pyLDAvis.gensim

lda_visualization = pyLDAvis.gensim.prepare(lda_model, gensim_corpus, gensim_dictionary, sort_topics=False)
pyLDAvis.display(lda_visualization)

In the output, you will see the following visualization:

Each circle in the above image corresponds to one topic. From the output of the LDA model using 4 topics, we know that the first topic is related to Global Warming, the second topic is related to the Eiffel Tower, the third topic is related to Mona Lisa, while the fourth topic is related to Artificial Intelligence.

The distance between circles shows how different the topics are from each other. You can see that circle 2 and 3 are overlapping. This is because of the fact that topic 2 (Eiffel Tower) and topic 3 (Mona Lisa) have many words in common such as “French”, “France”, “Museum”, “Paris”, etc.

If you hover over any word on the right, you will only see the circles for the topics that contain the word. For instance, if you hover over the word “climate”, you will see that topics 2 and 4 disappear since they don’t contain the word climate. The size of topic 1 increases since most of the occurrences of the word “climate” are within the first topic. A very small percentage is in topic 3, as shown in the following image:

Similarly, if you click on any of the circles, a list of the most frequent terms for that topic will appear on the right, along with the frequency of occurrence in that very topic. For instance, if you click on circle 2, which corresponds to the topic “Eiffel Tower”, you will see the following results:

From the output, you can see that the circle for the second topic, i.e. “Eiffel Tower”, has been selected. From the list on the right, you can see the most frequently occurring terms for the topic. The term “eiffel” is at the top. Also, it is evident that the term “eiffel” occurred mostly within this topic.

On the other hand, if you look at the term “french”, you can clearly see that around half of the occurrences for the term are within this topic. This is because topic 3, i.e. “Mona Lisa” also contains the term “French” quite a few times. To verify this, click on the circle for topic 3 and hover over the term “french”.

Topic Modeling via LSI

In the previous section, we saw how to perform topic modeling via LDA. Let’s see how we can perform topic modeling via Latent Semantic Indexing (LSI).

To do so, all you have to do is use the LsiModel class. The rest of the process remains exactly the same as what we followed before with LDA.

Look at the following script:

from gensim.models import LsiModel

lsi_model = LsiModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary)

topics = lsi_model.print_topics(num_words=10)
for topic in topics:
    print(topic)

The output looks like this:

(0, '-0.337*"intelligence" + -0.297*"machine" + -0.250*"artificial" + -0.240*"problem" + -0.208*"system" + -0.200*"learning" + -0.166*"network" + -0.161*"climate" + -0.159*"research" + -0.153*"change"')
(1, '-0.453*"climate" + -0.377*"change" + -0.344*"warming" + -0.326*"global" + -0.196*"emission" + -0.177*"greenhouse" + -0.168*"effect" + 0.162*"intelligence" + -0.158*"temperature" + 0.143*"machine"')
(2, '0.688*"painting" + 0.346*"leonardo" + 0.179*"louvre" + 0.175*"eiffel" + 0.170*"portrait" + 0.147*"french" + 0.127*"museum" + 0.117*"century" + 0.109*"original" + 0.092*"giocondo"')
(3, '-0.656*"eiffel" + 0.259*"painting" + -0.184*"second" + -0.145*"exposition" + -0.145*"structure" + 0.135*"leonardo" + -0.128*"tallest" + -0.116*"engineer" + -0.112*"french" + -0.107*"design"')

Conclusion

Topic modeling is an important NLP task. A variety of approaches and libraries exist that can be used for topic modeling in Python. In this article, we saw how to do topic modeling via the Gensim library in Python using the LDA and LSI approaches. We also saw how to visualize the results of our LDA model.


Stack Abuse: Python for NLP: Working with the Gensim Library (Part 1)

This is the 10th article in my series of articles on Python for NLP. In my previous article, I explained how the StanfordCoreNLP library can be used to perform different NLP tasks.

In this article, we will explore the Gensim library, which is another extremely useful NLP library for Python. Gensim was primarily developed for topic modeling. However, it now supports a variety of other NLP tasks such as converting words to vectors (word2vec), document to vectors (doc2vec), finding text similarity, and text summarization.

In this article and the next article of the series, we will see how the Gensim library is used to perform these tasks.

Installing Gensim

If you use the pip installer to install your Python libraries, you can use the following command to download the Gensim library:

$   pip install gensim 

Alternatively, if you use the Anaconda distribution of Python, you can execute the following command to install the Gensim library:

$   conda install -c anaconda gensim 

Let’s now see how we can perform different NLP tasks using the Gensim library.

Creating Dictionaries

Statistical algorithms work with numbers; however, natural languages contain data in the form of text. Therefore, a mechanism is needed to convert words to numbers. Similarly, after applying different types of processes to the numbers, we need to convert the numbers back to text.

One way to achieve this type of functionality is to create a dictionary that assigns a numeric ID to every unique word in the document. The dictionary can then be used to find the numeric equivalent of a word and vice versa.

Creating Dictionaries using In-Memory Objects

It is super easy to create dictionaries that map words to IDs using Python’s Gensim library. Look at the following script:

import gensim
from gensim import corpora
from pprint import pprint

text = ["""In computer science, artificial intelligence (AI),
            sometimes called machine intelligence, is intelligence
            demonstrated by machines, in contrast to the natural intelligence
            displayed by humans and animals. Computer science defines
            AI research as the study of intelligent agents: any device that
            perceives its environment and takes actions that maximize its chance
            of successfully achieving its goals."""]

tokens = [[token for token in sentence.split()] for sentence in text]
gensim_dictionary = corpora.Dictionary(tokens)

print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")

for k, v in gensim_dictionary.token2id.items():
    print(f'{k:{15}} {v:{10}}')

In the script above, we first import the gensim library along with the corpora module from the library. Next, we have some text (which is the first part of the first paragraph of the Wikipedia article on Artificial Intelligence) stored in the text variable.

To create a dictionary, we need a list of words from our text (also known as tokens). In the following line, we split our document into sentences and then the sentences into words.

tokens = [[token for token in sentence.split()] for sentence in text]   

We are now ready to create our dictionary. To do so, we can use the Dictionary object of the corpora module and pass it the list of tokens.

Finally, to print the contents of the newly created dictionary, we can use the token2id attribute of the Dictionary object. The output of the script above looks like this:

The dictionary has: 46 tokens
(AI),                    0
AI                       1
Computer                 2
In                       3
achieving                4
actions                  5
agents:                  6
and                      7
animals.                 8
any                      9
artificial              10
as                      11
by                      12
called                  13
chance                  14
computer                15
contrast                16
defines                 17
demonstrated            18
device                  19
displayed               20
environment             21
goals.                  22
humans                  23
in                      24
intelligence            25
intelligence,           26
intelligent             27
is                      28
its                     29
machine                 30
machines,               31
maximize                32
natural                 33
of                      34
perceives               35
research                36
science                 37
science,                38
sometimes               39
study                   40
successfully            41
takes                   42
that                    43
the                     44
to                      45

The output shows each unique word in our text along with the numeric ID that the word has been assigned. The word or token is the key of the dictionary and the ID is the value. You can also look up the ID assigned to an individual word using the following script:

print(gensim_dictionary.token2id["study"])   

In the script above, we pass the word “study” as the key to our dictionary. In the output, you should see the corresponding value i.e. the ID of the word “study”, which is 40.

Similarly, you can use the following script to find the key or word for a specific ID.

print(list(gensim_dictionary.token2id.keys())[list(gensim_dictionary.token2id.values()).index(40)])   
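The line above works, but it is rather convoluted. A simpler alternative (not shown in the original script) is to index the dictionary directly with the ID; Gensim builds the reverse id2token mapping lazily when you do this:

# Index the dictionary with an ID to get the corresponding token
print(gensim_dictionary[40])  # should print 'study'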

To print the tokens and their corresponding IDs we used a for-loop. However, you can directly print the tokens and their IDs by printing the dictionary, as shown here:

print(gensim_dictionary.token2id)   

The output is as follows:

{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45} 

The output might not be as clear as the one printed using the loop, although it still serves its purpose.

Let’s now see how we can add more tokens to an existing dictionary using a new document. Look at the following script:

text = ["""Colloquially, the term "artificial intelligence" is used to              describe machines that mimic "cognitive" functions that humans            associate with other human minds, such as "learning" and "problem solving"""]  tokens = [[token for token in sentence.split()] for sentence in text]   gensim_dictionary.add_documents(tokens)  print("The dictionary has: " + str(len(gensim_dictionary)) + " tokens")   print(gensim_dictionary.token2id)   

In the script above we have a new document that contains the second part of the first paragraph of the Wikipedia article on Artificial Intelligence. We split the text into tokens and then simply call the add_documents method to add the tokens to our existing dictionary. Finally, we print the updated dictionary on the console.

The output of the code looks like this:

The dictionary has: 65 tokens
{'(AI),': 0, 'AI': 1, 'Computer': 2, 'In': 3, 'achieving': 4, 'actions': 5, 'agents:': 6, 'and': 7, 'animals.': 8, 'any': 9, 'artificial': 10, 'as': 11, 'by': 12, 'called': 13, 'chance': 14, 'computer': 15, 'contrast': 16, 'defines': 17, 'demonstrated': 18, 'device': 19, 'displayed': 20, 'environment': 21, 'goals.': 22, 'humans': 23, 'in': 24, 'intelligence': 25, 'intelligence,': 26, 'intelligent': 27, 'is': 28, 'its': 29, 'machine': 30, 'machines,': 31, 'maximize': 32, 'natural': 33, 'of': 34, 'perceives': 35, 'research': 36, 'science': 37, 'science,': 38, 'sometimes': 39, 'study': 40, 'successfully': 41, 'takes': 42, 'that': 43, 'the': 44, 'to': 45, '"artificial': 46, '"cognitive"': 47, '"learning"': 48, '"problem': 49, 'Colloquially,': 50, 'associate': 51, 'describe': 52, 'functions': 53, 'human': 54, 'intelligence"': 55, 'machines': 56, 'mimic': 57, 'minds,': 58, 'other': 59, 'solving': 60, 'such': 61, 'term': 62, 'used': 63, 'with': 64}

You can see that now we have 65 tokens in our dictionary, while previously we had 46 tokens.

Creating Dictionaries using Text Files

In the previous section, we had in-memory text. What if we want to create a dictionary by reading a text file from the hard drive? To do so, we can use the simple_preprocess method from the gensim.utils module. The advantage of using this method is that it reads the text file line by line and returns the tokens in each line. You don’t have to load the complete text file into memory in order to create a dictionary.

Before executing the next example, create a file “file1.txt” and add the following text to the file (this is the first half of the first paragraph of the Wikipedia article on Global Warming).

Global warming is a long-term rise in the average temperature of the Earth's climate system, an aspect of climate change shown by temperature measurements and by multiple effects of the warming. Though earlier geological periods also experienced episodes of warming, the term commonly refers to the observed and continuing increase in average air and ocean temperatures since 1900 caused mainly by emissions of greenhouse gasses in the modern industrial economy.   

Now let’s create a dictionary that will contain tokens from the text file “file1.txt”:

from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

gensim_dictionary = corpora.Dictionary(simple_preprocess(sentence, deacc=True) for sentence in open(r'E:\text files\file1.txt', encoding='utf-8'))

print(gensim_dictionary.token2id)

In the script above we read the text file “file1.txt” line-by-line using the simple_preprocess method. The method returns tokens in each line of the document. The tokens are then used to create the dictionary. In the output, you should see the tokens and their corresponding IDs, as shown below:

{'average': 0, 'climate': 1, 'earth': 2, 'global': 3, 'in': 4, 'is': 5, 'long': 6, 'of': 7, 'rise': 8, 'system': 9, 'temperature': 10, 'term': 11, 'the': 12, 'warming': 13, 'an': 14, 'and': 15, 'aspect': 16, 'by': 17, 'change': 18, 'effects': 19, 'measurements': 20, 'multiple': 21, 'shown': 22, 'also': 23, 'earlier': 24, 'episodes': 25, 'experienced': 26, 'geological': 27, 'periods': 28, 'though': 29, 'air': 30, 'commonly': 31, 'continuing': 32, 'increase': 33, 'observed': 34, 'ocean': 35, 'refers': 36, 'temperatures': 37, 'to': 38, 'caused': 39, 'economy': 40, 'emissions': 41, 'gasses': 42, 'greenhouse': 43, 'industrial': 44, 'mainly': 45, 'modern': 46, 'since': 47} 

Similarly, we can create a dictionary by reading multiple text files. Create another file “file2.txt” and add the following text to the file (the second part of the first paragraph of the Wikipedia article on Global Warming):

In the modern context the terms global warming and climate change are commonly used interchangeably, but climate change includes both global warming and its effects, such as changes to precipitation and impacts that differ by region.[7][8] Many of the observed warming changes since the 1950s are unprecedented in the instrumental temperature record, and in historical and paleoclimate proxy records of climate change over thousands to millions of years.   

Save the “file2.txt” in the same directory as the “file1.txt”.

The following script reads both the files and then creates a dictionary based on the text in the two files:

from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

class ReturnTokens(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_name in os.listdir(self.dir_path):
            for sentence in open(os.path.join(self.dir_path, file_name), encoding='utf-8'):
                yield simple_preprocess(sentence)

path_to_text_directory = r"E:\text files"
gensim_dictionary = corpora.Dictionary(ReturnTokens(path_to_text_directory))

print(gensim_dictionary.token2id)

In the script above we have a class ReturnTokens, which takes the directory path that contains “file1.txt” and “file2.txt” as its only parameter. Inside the class’s __iter__ method we iterate through all the files in the directory and then read each file line by line. The simple_preprocess method creates tokens for each line. The tokens for each line are returned to the calling function using the “yield” keyword.

In the output, you should see the following tokens along with their IDs:

{'average': 0, 'climate': 1, 'earth': 2, 'global': 3, 'in': 4, 'is': 5, 'long': 6, 'of': 7, 'rise': 8, 'system': 9, 'temperature': 10, 'term': 11, 'the': 12, 'warming': 13, 'an': 14, 'and': 15, 'aspect': 16, 'by': 17, 'change': 18, 'effects': 19, 'measurements': 20, 'multiple': 21, 'shown': 22, 'also': 23, 'earlier': 24, 'episodes': 25, 'experienced': 26, 'geological': 27, 'periods': 28, 'though': 29, 'air': 30, 'commonly': 31, 'continuing': 32, 'increase': 33, 'observed': 34, 'ocean': 35, 'refers': 36, 'temperatures': 37, 'to': 38, 'caused': 39, 'economy': 40, 'emissions': 41, 'gasses': 42, 'greenhouse': 43, 'industrial': 44, 'mainly': 45, 'modern': 46, 'since': 47, 'are': 48, 'context': 49, 'interchangeably': 50, 'terms': 51, 'used': 52, 'as': 53, 'both': 54, 'but': 55, 'changes': 56, 'includes': 57, 'its': 58, 'precipitation': 59, 'such': 60, 'differ': 61, 'impacts': 62, 'instrumental': 63, 'many': 64, 'record': 65, 'region': 66, 'that': 67, 'unprecedented': 68, 'historical': 69, 'millions': 70, 'over': 71, 'paleoclimate': 72, 'proxy': 73, 'records': 74, 'thousands': 75, 'years': 76} 

Creating Bag of Words Corpus

Dictionaries contain mappings between words and their corresponding numeric values. Bag of words corpora in the Gensim library are based on dictionaries and contain the ID of each word along with the frequency of occurrence of the word.

Creating Bag of Words Corpus from In-Memory Objects

Look at the following script:

import gensim
from gensim import corpora
from pprint import pprint

text = ["""In computer science, artificial intelligence (AI),
           sometimes called machine intelligence, is intelligence
           demonstrated by machines, in contrast to the natural intelligence
           displayed by humans and animals. Computer science defines
           AI research as the study of intelligent agents: any device that
           perceives its environment and takes actions that maximize its chance
           of successfully achieving its goals."""]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

print(gensim_corpus)

In the script above, we have text which we split into tokens. Next, we initialize a Dictionary object from the corpora module. The object contains a method doc2bow, which basically performs two tasks:

  • It iterates through all the words in the text, if the word already exists in the corpus, it increments the frequency count for the word
  • Otherwise it inserts the word into the corpus and sets its frequency count to 1

The output of the above script looks like this:

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 2), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 3), (26, 1), (27, 1), (28, 1), (29, 3), (30, 1), (31, 1), (32, 1), (33, 1), (34, 2), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 2), (44, 2), (45, 1)]] 

The output might not make sense to you. Let me explain it. The first tuple (0,1) basically means that the word with ID 0 occurred 1 time in the text. Similarly, (25, 3) means that the word with ID 25 occurred three times in the document.

Let’s now print the word and the frequency count to make things clear. Add the following lines of code at the end of the previous script:

word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]

print(word_frequencies)

The output looks like this:

[[('(AI),', 1), ('AI', 1), ('Computer', 1), ('In', 1), ('achieving', 1), ('actions', 1), ('agents:', 1), ('and', 2), ('animals.', 1), ('any', 1), ('artificial', 1), ('as', 1), ('by', 2), ('called', 1), ('chance', 1), ('computer', 1), ('contrast', 1), ('defines', 1), ('demonstrated', 1), ('device', 1), ('displayed', 1), ('environment', 1), ('goals.', 1), ('humans', 1), ('in', 1), ('intelligence', 3), ('intelligence,', 1), ('intelligent', 1), ('is', 1), ('its', 3), ('machine', 1), ('machines,', 1), ('maximize', 1), ('natural', 1), ('of', 2), ('perceives', 1), ('research', 1), ('science', 1), ('science,', 1), ('sometimes', 1), ('study', 1), ('successfully', 1), ('takes', 1), ('that', 2), ('the', 2), ('to', 1)]] 

From the output, you can see that the word “intelligence” appears three times. Similarly, the word “that” appears twice.

Creating Bag of Words Corpus from Text Files

Like dictionaries, we can also create a bag of words corpus by reading a text file. Look at the following code:

from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

tokens = [simple_preprocess(sentence, deacc=True) for sentence in open(r'E:\text files\file1.txt', encoding='utf-8')]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]
word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]

print(word_frequencies)

In the script above, we created a bag of words corpus using “file1.txt”. In the output, you should see the words from the first paragraph of the Global Warming article on Wikipedia.

[[('average', 1), ('climate', 1), ('earth', 1), ('global', 1), ('in', 1), ('is', 1), ('long', 1), ('of', 1), ('rise', 1), ('system', 1), ('temperature', 1), ('term', 1), ('the', 2), ('warming', 1)], [('climate', 1), ('of', 2), ('temperature', 1), ('the', 1), ('warming', 1), ('an', 1), ('and', 1), ('aspect', 1), ('by', 2), ('change', 1), ('effects', 1), ('measurements', 1), ('multiple', 1), ('shown', 1)], [('of', 1), ('warming', 1), ('also', 1), ('earlier', 1), ('episodes', 1), ('experienced', 1), ('geological', 1), ('periods', 1), ('though', 1)], [('average', 1), ('in', 1), ('term', 1), ('the', 2), ('and', 2), ('air', 1), ('commonly', 1), ('continuing', 1), ('increase', 1), ('observed', 1), ('ocean', 1), ('refers', 1), ('temperatures', 1), ('to', 1)], [('in', 1), ('of', 1), ('the', 1), ('by', 1), ('caused', 1), ('economy', 1), ('emissions', 1), ('gasses', 1), ('greenhouse', 1), ('industrial', 1), ('mainly', 1), ('modern', 1), ('since', 1)]] 

The output shows that words like “of”, “the”, “by”, and “and” occur twice.

Similarly, you can create a bag of words corpus using multiple text files, as shown below:

from gensim.utils import simple_preprocess
from smart_open import smart_open
import os

class ReturnTokens(object):
    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        for file_name in os.listdir(self.dir_path):
            for sentence in open(os.path.join(self.dir_path, file_name), encoding='utf-8'):
                yield simple_preprocess(sentence)

path_to_text_directory = r"E:\text files"

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in ReturnTokens(path_to_text_directory)]
word_frequencies = [[(gensim_dictionary[id], frequence) for id, frequence in couple] for couple in gensim_corpus]

print(word_frequencies)

The output of the script above looks like this:

[[('average', 1), ('climate', 1), ('earth', 1), ('global', 1), ('in', 1), ('is', 1), ('long', 1), ('of', 1), ('rise', 1), ('system', 1), ('temperature', 1), ('term', 1), ('the', 2), ('warming', 1)], [('climate', 1), ('of', 2), ('temperature', 1), ('the', 1), ('warming', 1), ('an', 1), ('and', 1), ('aspect', 1), ('by', 2), ('change', 1), ('effects', 1), ('measurements', 1), ('multiple', 1), ('shown', 1)], [('of', 1), ('warming', 1), ('also', 1), ('earlier', 1), ('episodes', 1), ('experienced', 1), ('geological', 1), ('periods', 1), ('though', 1)], [('average', 1), ('in', 1), ('term', 1), ('the', 2), ('and', 2), ('air', 1), ('commonly', 1), ('continuing', 1), ('increase', 1), ('observed', 1), ('ocean', 1), ('refers', 1), ('temperatures', 1), ('to', 1)], [('in', 1), ('of', 1), ('the', 1), ('by', 1), ('caused', 1), ('economy', 1), ('emissions', 1), ('gasses', 1), ('greenhouse', 1), ('industrial', 1), ('mainly', 1), ('modern', 1), ('since', 1)], [('climate', 1), ('global', 1), ('in', 1), ('the', 2), ('warming', 1), ('and', 1), ('change', 1), ('commonly', 1), ('modern', 1), ('are', 1), ('context', 1), ('interchangeably', 1), ('terms', 1), ('used', 1)], [('climate', 1), ('global', 1), ('warming', 1), ('and', 2), ('change', 1), ('effects', 1), ('to', 1), ('as', 1), ('both', 1), ('but', 1), ('changes', 1), ('includes', 1), ('its', 1), ('precipitation', 1), ('such', 1)], [('in', 1), ('of', 1), ('temperature', 1), ('the', 3), ('warming', 1), ('by', 1), ('observed', 1), ('since', 1), ('are', 1), ('changes', 1), ('differ', 1), ('impacts', 1), ('instrumental', 1), ('many', 1), ('record', 1), ('region', 1), ('that', 1), ('unprecedented', 1)], [('climate', 1), ('in', 1), ('of', 2), ('and', 2), ('change', 1), ('to', 1), ('historical', 1), ('millions', 1), ('over', 1), ('paleoclimate', 1), ('proxy', 1), ('records', 1), ('thousands', 1), ('years', 1)]] 

Creating TF-IDF Corpus

The bag of words approach works fine for converting text to numbers. However, it has one drawback. It assigns a score to a word based on its occurrence in a particular document. It doesn’t take into account the fact that the word might also have a high frequency of occurrences in other documents as well. TF-IDF resolves this issue.

The term frequency is calculated as:

Term frequency = (Frequency of the word in a document)/(Total words in the document)   

And the Inverse Document Frequency is calculated as:

IDF(word) = Log((Total number of documents)/(Number of documents containing the word))   
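To make the two formulas concrete, here is a small hand-computed sketch with made-up numbers; note that Gensim's TfidfModel applies its own weighting and normalization (controlled by the smartirs parameter used below), so its scores will not match these raw values exactly:

import math

# Suppose the word "game" occurs once in a 5-word document,
# and appears in 2 out of the 3 documents in the corpus
tf = 1 / 5             # term frequency
idf = math.log(3 / 2)  # inverse document frequency
print(tf * idf)        # raw TF-IDF score, roughly 0.08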

Using the Gensim library, we can easily create a TF-IDF corpus:

import gensim
from gensim import corpora
from pprint import pprint

text = ["I like to play Football",
        "Football is the best game",
        "Which game do you like to play ?"]

tokens = [[token for token in sentence.split()] for sentence in text]

gensim_dictionary = corpora.Dictionary()
gensim_corpus = [gensim_dictionary.doc2bow(token, allow_update=True) for token in tokens]

from gensim import models
import numpy as np

tfidf = models.TfidfModel(gensim_corpus, smartirs='ntc')

for sent in tfidf[gensim_corpus]:
    print([[gensim_dictionary[id], np.around(frequency, decimals=2)] for id, frequency in sent])

To find the TF-IDF values, we can use the TfidfModel class from the models module of the Gensim library. We simply have to pass the bag of words corpus as a parameter to the constructor of the TfidfModel class. In the output, you will see all of the words in the three sentences, along with their TF-IDF values:

[['Football', 0.3], ['I', 0.8], ['like', 0.3], ['play', 0.3], ['to', 0.3]]
[['Football', 0.2], ['best', 0.55], ['game', 0.2], ['is', 0.55], ['the', 0.55]]
[['like', 0.17], ['play', 0.17], ['to', 0.17], ['game', 0.17], ['?', 0.47], ['Which', 0.47], ['do', 0.47], ['you', 0.47]]

Downloading Built-In Gensim Models and Datasets

Gensim comes with a variety of built-in datasets and word embedding models that can be directly used.

To download a built-in model or dataset, we can use the downloader module from the gensim library. We can then call the load method of the downloader module to download the desired package. Look at the following code:

import gensim.downloader as api

w2v_embedding = api.load("glove-wiki-gigaword-100")

With the commands above, we download the “glove-wiki-gigaword-100” word embedding model, which is based on Wikipedia text and is 100-dimensional. Let’s try to find the words similar to “toyota” using our word embedding model. Use the following code to do so:

w2v_embedding.most_similar('toyota')   

In the output, you should see the following results:

[('honda', 0.8739858865737915),
 ('nissan', 0.8108116984367371),
 ('automaker', 0.7918163537979126),
 ('mazda', 0.7687169313430786),
 ('bmw', 0.7616022825241089),
 ('ford', 0.7547588348388672),
 ('motors', 0.7539199590682983),
 ('volkswagen', 0.7176680564880371),
 ('prius', 0.7156582474708557),
 ('chrysler', 0.7085398435592651)]

You can see that all the results are very relevant to the word “toyota”. The number next to each word is the similarity score; a higher score means the word is more relevant.
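If all you need is the similarity between two specific words, you can query it directly; the doesnt_match helper is another handy KeyedVectors method (this is an extra example of mine, not from the original article):

print(w2v_embedding.similarity('toyota', 'honda'))  # ~0.87, matching the list above

# Find the word that doesn't belong with the others
print(w2v_embedding.doesnt_match(['toyota', 'honda', 'ford', 'paris']))  # should print 'paris'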

Conclusion

The Gensim library is one of the most popular Python libraries for NLP. In this article, we briefly explored how the Gensim library can be used to perform tasks like dictionary and corpus creation. We also saw how to download built-in Gensim models and datasets. In our next article, we will see how to perform topic modeling via the Gensim library.


Stack Abuse: Introduction to Reinforcement Learning with Python

Introduction

Reinforcement Learning is definitely one of the most active and stimulating areas of research in AI.

The interest in this field grew exponentially over the last couple of years, following great (and greatly publicized) advances, such as DeepMind’s AlphaGo beating the world champion of Go, and OpenAI AI models beating professional DOTA players.

Thanks to all of these advances, Reinforcement Learning is now being applied in a variety of different fields, from healthcare to finance, from chemistry to resource management.

In this article, we will introduce the fundamental concepts and terminology of Reinforcement Learning, and we will apply them in a practical example.

What is Reinforcement Learning?

Reinforcement Learning (RL) is a branch of machine learning concerned with actors, or agents, taking actions in some kind of environment in order to maximize some type of reward that they collect along the way.

This is deliberately a very loose definition, which is why reinforcement learning techniques can be applied to a very wide range of real-world problems.

Imagine someone playing a video game. The player is the agent, and the game is the environment. The rewards the player gets (i.e. beat an enemy, complete a level), or doesn’t get (i.e. step into a trap, lose a fight), will teach them how to be a better player.

As you’ve probably noticed, reinforcement learning doesn’t really fit into the categories of supervised/unsupervised/semi-supervised learning.

In supervised learning, for example, each decision taken by the model is independent, and doesn’t affect what we see in the future.

In reinforcement learning, instead, we are interested in a long term strategy for our agent, which might include sub-optimal decisions at intermediate steps, and a trade-off between exploration (of unknown paths), and exploitation of what we already know about the environment.
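To make the exploration/exploitation trade-off concrete, here is a minimal epsilon-greedy action selection sketch (an illustrative example of mine, not something this article builds on later): with probability epsilon the agent tries a random action, otherwise it picks the action it currently believes is best.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

# q_values holds the agent's current estimate of each action's value
print(epsilon_greedy([0.2, 0.5, 0.1]))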

Brief History of Reinforcement Learning

For several decades (since the 1950s!), reinforcement learning followed two separate threads of research, one focusing on trial and error approaches, and one based on optimal control.

Optimal control methods are aimed at designing a controller to minimize a measure of a dynamical system’s behaviour over time. To achieve this, they mainly used dynamic programming algorithms, which we will see are the foundations of modern reinforcement learning techniques.

Trial-and-error approaches, instead, have deep roots in the psychology of animal learning and neuroscience, and this is where the term reinforcement comes from: actions followed (reinforced) by good or bad outcomes have the tendency to be reselected accordingly.

Arising from the interdisciplinary study of these two fields came a field called Temporal Difference (TD) Learning.

The modern machine learning approaches to RL are mainly based on TD-Learning, which deals with reward signals and a value function (we’ll see more in detail what these are in the following paragraphs).

Terminology

We will now take a look at the main concepts and terminology of Reinforcement Learning.

Agent

A system that is embedded in an environment, and takes actions to change the state of the environment. Examples include mobile robots, software agents, or industrial controllers.

Environment

The external system that the agent can “perceive” and act on.

Environments in RL are defined as Markov Decision Processes (MDPs). An MDP is a tuple:

$$ (S, A, P, R, \gamma) $$

where:

  • S is a finite set of states
  • A is a finite set of actions
  • P is a state transition probability matrix
$$ P_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a] $$
  • R is a reward function
$$ R_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] $$
  • γ is a discount factor, γ ∈ [0,1]

[Image: Markov decision process]

A lot of real-world scenarios can be represented as Markov Decision Processes, from a simple chess board to a much more complex video game.

In a chess environment, the states are all the possible configurations of the board (there are a lot). The actions refer to moving the pieces, surrendering, etc.

The rewards are based on whether we win or lose the game, so that winning actions have higher return than losing ones.

State transition probabilities enforce the game rules. For example, an illegal action (moving a rook diagonally) will have zero probability.
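To make the (S, A, P, R, γ) tuple a bit more concrete, here is a minimal sketch, with made-up states, actions, and numbers (not part of the original article), of how a tiny MDP could be written down with plain Python dictionaries:

# A toy two-state MDP, written as plain dictionaries (illustrative only)
states = ["standing", "fallen"]
actions = ["walk", "rest"]
gamma = 0.9  # discount factor

# P[state][action] -> {next_state: probability}
P = {
    "standing": {"walk": {"standing": 0.8, "fallen": 0.2},
                 "rest": {"standing": 1.0}},
    "fallen":   {"walk": {"fallen": 1.0},
                 "rest": {"standing": 0.4, "fallen": 0.6}},
}

# R[state][action] -> expected immediate reward
R = {
    "standing": {"walk": 1.0, "rest": 0.0},
    "fallen":   {"walk": -1.0, "rest": 0.0},
}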

Reward Function

The reward function maps states to their rewards. This is the information that the agent uses to learn how to navigate the environment.

A lot of research goes into designing a good reward function and into overcoming the problem of sparse rewards, where the infrequent rewards in the environment don’t give the agent enough signal to learn properly.

The return G_t is defined as the discounted sum of rewards from timestep t:

$$ G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$

γ is called the discount factor, and it works by reducing the weight of rewards as we move further into the future.

Discounting rewards allows us to represent uncertainty about the future, but it also helps us model human behavior better, since it has been shown that humans/animals have a preference for immediate rewards.
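As a quick illustration of the return formula above (using a made-up reward sequence, not from the article), each future reward is simply weighted by an extra factor of γ:

# Discounted return G_t for a made-up sequence of future rewards
gamma = 0.9
rewards = [1, 0, 0, 5]  # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(G_t)  # 1 + 0 + 0 + 0.9**3 * 5 = 4.645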

Value Function

The value function is probably the most important piece of information we can hold about a RL problem.

Formally, the value function is the expected return starting from state s. In practice, the value function tells us how good it is for the agent to be in a certain state. The higher the value of a state, the higher the amount of reward we can expect:

$$ v_\pi (s) = \mathbb{E}_\pi [G_t \mid S_t = s] $$

The actual name for this function is state-value function, to distinguish it from another important element in RL: the action-value function.

The action-value function gives us the value, i.e. the expected return, for using action a in a certain state s:

$$ q_\pi (s, a) = \mathbb{E}_\pi [G_t \mid S_t = s, A_t = a] $$

Policy

The policy defines the behaviour of our agent in the MDP.

Formally, policies are distributions over actions given states. A policy maps states to the probability of taking each action from that state:

$$ \pi (a \mid s) = \mathbb{P}[A_t = a \mid S_t = s] $$

The ultimate goal of RL is to find an optimal (or a good enough) policy for our agent. In the video game example, you can think of the policy as the strategy that the player follows, i.e., the actions the player takes when presented with certain scenarios.
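For instance, a hypothetical stochastic policy for the tiny MDP sketched earlier can be represented as a dictionary mapping each state to a probability distribution over actions, from which an action can be sampled:

import random

# A made-up stochastic policy: pi(a|s) as nested dictionaries
policy = {
    "standing": {"walk": 0.7, "rest": 0.3},
    "fallen":   {"walk": 0.1, "rest": 0.9},
}

def sample_action(state):
    """Sample an action according to pi(.|state)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("standing"))  # "walk" about 70% of the time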

Main approaches

A lot of different models and algorithms are being applied to RL problems.

Really, a lot.

However, all of them more or less fall into the same two categories: policy-based, and value-based.

Policy-Based Approach

In policy-based approaches to RL, our goal is to learn the best possible policy. Policy models will directly output the best possible move from the current state, or a distribution over the possible actions.

Value-Based Approach

In value-based approaches, we want to find the optimal value function, which is the maximum value function over all policies.

We can then choose which actions to take (i.e. which policy to use) based on the values we get from the model.

Exploration vs Exploitation

The trade-off between exploration and exploitation has been widely studied in the RL literature.

Exploration refers to the act of visiting and collecting information about states in the environment that we have not yet visited, or about which we still don’t have much information. The idea is that exploring our MDP might lead us to better decisions in the future.

On the other side, exploitation consists of making the best decision given current knowledge, staying comfortable in the bubble of the already known.

We will see in the following example how these concepts apply to a real problem.

A Multi-Armed Bandit

We will now look at a practical example of a Reinforcement Learning problem – the multi-armed bandit problem.

The multi-armed bandit is one of the most popular problems in RL:

You are faced repeatedly with a choice among k different options, or actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected. Your objective is to maximize the expected total reward over some time period, for example, over 1000 action selections, or time steps.

You can think of it in analogy to a slot machine (a one-armed bandit). Each action selection is like a play of one of the slot machine’s levers, and the rewards are the payoffs for hitting the jackpot.

[Image: bandit machine]

Solving this problem means that we can come up with an optimal policy: a strategy that allows us to select the best possible action (the one with the highest expected return) at each time step.

Action-Value Methods

A very simple solution is based on the action value function. Remember that an action value is the mean reward when that action is selected:

$$ q(a) = \mathbb{E}[R_t \mid A = a] $$

We can easily estimate q using the sample average:

$$ Q_t(a) = \frac{\text{sum of rewards when } a \text{ taken prior to } t}{\text{number of times } a \text{ taken prior to } t} $$

If we collect enough observations, our estimate gets close enough to the real function. We can then act greedily at each timestep, i.e. select the action with the highest value, to collect the highest possible rewards.

Don’t be too Greedy

Remember when we talked about the trade-off between exploration and exploitation? This is one example of why we should care about it.

As a matter of fact, if we always act greedily as proposed in the previous paragraph, we never try out sub-optimal actions which might eventually lead to better results.

To introduce some degree of exploration in our solution, we can use an ε-greedy strategy: we select actions greedily most of the time, but every once in a while, with probability ε, we select a random action, regardless of the action values.

It turns out that this simple exploration method works very well, and it can significantly increase the rewards we get.

One final caveat – to avoid making our solution too computationally expensive, we compute the average incrementally according to this formula:

$$ Q_{n+1} = Q_n + \frac{1}{n}[R_n - Q_n] $$
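As a quick sanity check (not part of the original walkthrough), the incremental update produces exactly the same value as recomputing the full sample average at every step:

import numpy as np

rewards = [1, 0, 1, 1, 0]  # made-up reward sequence for one action

Q, n = 0.0, 0
for r in rewards:
    n += 1
    Q += (1 / n) * (r - Q)  # incremental average

print(Q, np.mean(rewards))  # both print 0.6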

Python Solution Walkthrough

import numpy as np

# Number of bandits
k = 3

# Our action values
Q = [0 for _ in range(k)]

# This is to keep track of the number of times we take each action
N = [0 for _ in range(k)]

# Epsilon value for exploration
eps = 0.1

# True probability of winning for each bandit
p_bandits = [0.45, 0.40, 0.80]

def pull(a):
    """Pull arm of bandit with index `a` and return 1 if win,
    else return 0."""
    if np.random.rand() < p_bandits[a]:
        return 1
    else:
        return 0

while True:
    if np.random.rand() > eps:
        # Take greedy action most of the time
        a = np.argmax(Q)
    else:
        # Take random action with probability eps
        a = np.random.randint(0, k)

    # Collect reward
    reward = pull(a)

    # Incremental average
    N[a] += 1
    Q[a] += 1/N[a] * (reward - Q[a])

Et voilà! If we run this script for a couple of seconds, we already see that our action values are proportional to the probability of hitting the jackpots for our bandits:

0.4406301434281669,   0.39131455399060977,   0.8008844354479673   

This means that our greedy policy will correctly favour actions from which we can expect higher rewards.

Conclusion

Reinforcement Learning is a growing field, and there is a lot more to cover. In fact, we still haven’t looked at general-purpose algorithms and models (e.g. dynamic programming, Monte Carlo, Temporal Difference).

The most important thing right now is to get familiar with concepts such as value functions, policies, and MDPs. In the Resources section of this article, you’ll find some awesome resources to gain a deeper understanding of this kind of material.

Resources


Stack Abuse: Overview of Classification Methods in Python with Scikit-Learn

Introduction

Are you a Python programmer looking to get into machine learning? An excellent place to start your journey is by getting acquainted with Scikit-Learn.

Doing some classification with Scikit-Learn is a straightforward and simple way to start applying what you’ve learned, to make machine learning concepts concrete by implementing them with a user-friendly, well-documented, and robust library.

What is Scikit-Learn?

Scikit-Learn is a library for Python that was first developed by David Cournapeau in 2007. It contains a range of useful algorithms that can easily be implemented and tweaked for the purposes of classification and other machine learning tasks.

Scikit-Learn uses SciPy as a foundation, so this base stack of libraries must be installed before Scikit-Learn can be utilized.

Defining our Terms

Before we go any further into our exploration of Scikit-Learn, let’s take a minute to define our terms. It is important to have an understanding of the vocabulary that will be used when describing Scikit-Learn’s functions.

To begin with, a machine learning system or network takes inputs and produces outputs. The inputs into the machine learning framework are often referred to as “features”.

Features are essentially the same as variables in a scientific experiment; they are characteristics of the phenomenon under observation that can be quantified or measured in some fashion.

When these features are fed into a machine learning framework the network tries to discern relevant patterns between the features. These patterns are then used to generate the outputs of the framework/network.

The outputs of the framework are often called “labels”, as the outputs are given some label by the network, some assumption about what category the output falls into.


[Image credit: Siyavula Education]

In a machine learning context, classification is a type of supervised learning. Supervised learning means that the data fed to the network is already labeled, with the important features/attributes already separated into distinct categories beforehand.

This means that the network knows which parts of the input are important, and there is also a target or ground truth that the network can check itself against. An example of classification is sorting a bunch of different plants into different categories like ferns or angiosperms. That task could be accomplished with a Decision Tree, a type of classifier in Scikit-Learn.

In contrast, unsupervised learning is where the data fed to the network is unlabeled and the network must try to learn for itself what features are most important. As mentioned, classification is a type of supervised learning, and therefore we won’t be covering unsupervised learning methods in this article.

The process of training a model is the process of feeding data into a neural network and letting it learn the patterns of the data. The training process takes in the data and pulls out the features of the dataset. During the training process for a supervised classification task the network is passed both the features and the labels of the training data. However, during testing, the network is only fed features.

The testing process is where the patterns that the network has learned are tested. The features are given to the network, and the network must predict the labels. The data for the network is divided into training and testing sets, two different sets of inputs. You do not test the classifier on the same dataset you train it on, as the model has already learned the patterns of this set of data and it would be extreme bias.

Instead, the dataset is split up into training and testing sets, a set the classifier trains on and a set the classifier has never seen before.

Different Types of Classifiers


[Image credit: CreativeMagic]

Scikit-Learn provides easy access to numerous different classification algorithms. Among these classifiers are:

  • K-Nearest Neighbors
  • Support Vector Machines
  • Decision Tree Classifiers / Random Forests
  • Naive Bayes
  • Linear Discriminant Analysis
  • Logistic Regression

There is a lot of literature on how these various classifiers work, and brief explanations of them can be found at Scikit-Learn’s website.

For this reason, we won’t delve too deeply into how they work here, but there will be a brief explanation of how each classifier operates.

K-Nearest Neighbors


[Image credit: Antti Ajanki AnAj]

K-Nearest Neighbors operates by checking the distance from a test example to the known values of the training examples. The class most common among the k training points closest to the test point is the class that is selected, as sketched below.
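Here is a rough sketch of that idea, with toy points invented for illustration: compute the distance from the test point to every training point, and take a majority vote among the k closest labels:

import numpy as np
from collections import Counter

# Made-up 2D training points and their class labels
X_train = np.array([[1, 1], [1, 2], [5, 5], [6, 5]])
y_train = np.array(["A", "A", "B", "B"])

def knn_predict(x, k=3):
    """Classify point x by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x, axis=1)
    nearest = y_train[np.argsort(distances)[:k]]
    return Counter(nearest).most_common(1)[0][0]

print(knn_predict(np.array([1.5, 1.5])))  # "A"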

Decision Trees

A Decision Tree Classifier functions by breaking down a dataset into smaller and smaller subsets based on different criteria. Different sorting criteria will be used to divide the dataset, with the number of examples getting smaller with every division.

Once the tree has divided the data down to one example, the example will be put into a class that corresponds to a leaf. When multiple decision tree classifiers are combined into an ensemble they are called a Random Forest Classifier.
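Both of these are available in Scikit-Learn; a minimal sketch of instantiating them (with illustrative parameters, not values from this article) looks like this:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# A single tree with a limited depth, and an ensemble of 100 trees
tree_clf = DecisionTreeClassifier(max_depth=3)
forest_clf = RandomForestClassifier(n_estimators=100)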

Naive Bayes

A Naive Bayes Classifier determines the probability that an example belongs to some class, calculating the probability that an event will occur given that some input event has occurred.

When it does this calculation it is assumed that all the predictors of a class have the same effect on the outcome, that the predictors are independent.

Linear Discriminant Analysis

Linear Discriminant Analysis works by reducing the dimensionality of the dataset, projecting all of the data points onto a line. Then it combines these points into classes based on their distance from a chosen point or centroid.

Linear discriminant analysis, as you may be able to guess, is a linear classification algorithm and best used when the data has a linear relationship.

Support Vector Machines


[Image credit: Qluong2016]

Support Vector Machines work by drawing a line between the different clusters of data points to group them into classes. Points on one side of the line will be one class and points on the other side belong to another class.

The classifier will try to maximize the distance between the line it draws and the points on either side of it, to increase its confidence in which points belong to which class. When the testing points are plotted, the side of the line they fall on is the class they are put in.
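To illustrate that boundary idea with a quick, made-up example (toy points, not from the article), a linear-kernel SVC can be fit on two separable clusters and asked which side of the line new points fall on:

import numpy as np
from sklearn.svm import SVC

# Two made-up, linearly separable clusters
X = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)

# Points far on either side of the separating line get the matching class
print(clf.predict([[0, 0], [10, 10]]))  # [0 1]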

Logistic Regression

Logistic Regression outputs predictions about test data points on a binary scale, zero or one. If the value of something is 0.5 or above, it is classified as belonging to class 1, while below 0.5 it is classified as belonging to class 0.

Each of the target labels likewise takes a value of only 0 or 1. Logistic regression is a linear classifier and is therefore used when there is some sort of linear relationship in the data.
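As a small, hypothetical illustration of that 0.5 cutoff (invented one-dimensional data): predict_proba() returns the estimated probability of class 1, and predict() effectively applies the threshold to it:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up 1D feature where larger values tend to be class 1
X = np.array([[0], [1], [2], [3], [4], [5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

probs = clf.predict_proba([[2.5]])[:, 1]  # probability of class 1
print(probs, clf.predict([[2.5]]))        # roughly 0.5 probability near the boundary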

Examples of Classification Tasks

Classification tasks are any tasks that have you putting examples into two or more classes. Determining if an image is a cat or dog is a classification task, as is determining what the quality of a bottle of wine is based on features like acidity and alcohol content.

Depending on the classification task at hand, you will want to use different classifiers. For instance, a logistic regression model is best suited for binary classification tasks, even though multiple variable logistic regression models exist.

As you gain more experience with classifiers you will develop a better sense for when to use which classifier. However, a common practice is to instantiate multiple classifiers and compare their performance against one another, then select the classifier which performs the best.
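As a rough sketch of that practice (using the built-in iris data that also appears later in this article), one could loop over a few instantiated classifiers and compare their accuracy on the same held-out test set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit several classifiers on the same split and compare their test accuracy
for clf in [LogisticRegression(max_iter=1000), KNeighborsClassifier(), SVC()]:
    clf.fit(X_train, y_train)
    print(type(clf).__name__, accuracy_score(y_test, clf.predict(X_test)))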

Implementing a Classifier

Now that we’ve discussed the various classifiers that Scikit-Learn provides access to, let’s see how to implement a classifier.

The first step in implementing a classifier is to import the classifier you need into Python. Let’s look at the import statement for logistic regression:

from sklearn.linear_model import LogisticRegression   

Here are the import statements for the other classifiers discussed in this article:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

Scikit-Learn has other classifiers as well, and their respective documentation pages will show how to import them.

After this, the classifier must be instantiated. Instantiation is the process of bringing the classifier into existence within your Python program – to create an instance of the classifier/object.

This is typically done just by making a variable and calling the function associated with the classifier:

logreg_clf = LogisticRegression()   

Now the classifier needs to be trained. In order to accomplish this, the classifier must be fit with the training data.

The training features and the training labels are passed into the classifier with the fit command:

logreg_clf.fit(features, labels)   

After the classifier model has been trained on the training data, it can make predictions on the testing data.

This is easily done by calling the predict command on the classifier and providing it with the parameters it needs to make predictions about, which are the features in your testing dataset:

logreg_clf.predict(test_features)   

These steps: instantiation, fitting/training, and predicting are the basic workflow for classifiers in Scikit-Learn.

However, the handling of classifiers is only one part of doing classification with Scikit-Learn. The other half is handling the data.

To understand how handling the classifier and handling data come together as a whole classification task, let’s take a moment to understand the machine learning pipeline.

The Machine Learning Pipeline

The machine learning pipeline has the following steps: preparing data, creating training/testing sets, instantiating the classifier, training the classifier, making predictions, evaluating performance, tweaking parameters.

The first step to training a classifier on a dataset is to prepare the dataset – to get the data into the correct form for the classifier and handle any anomalies in the data. If there are missing values in the data, outliers in the data, or any other anomalies these data points should be handled, as they can negatively impact the performance of the classifier. This step is referred to as data preprocessing.

Once the data has been preprocessed, the data must be split into training and testing sets. We have previously discussed the rationale for creating training and testing sets, and this can easily be done in Scikit-Learn with a helpful function called train_test_split().

As previously discussed the classifier has to be instantiated and trained on the training data. After this, predictions can be made with the classifier. By comparing the predictions made by the classifier to the actual known values of the labels in your test data, you can get a measurement of how accurate the classifier is.

There are various methods of comparing the predicted labels to the actual labels and evaluating the classifier. We’ll go over these different evaluation metrics later. For now, know that after you’ve measured the classifier’s accuracy, you will probably go back and tweak the parameters of your model until you have hit an accuracy you are satisfied with (as it is unlikely your classifier will meet your expectations on the first run).

Let’s look at an example of the machine learning pipeline, going from data handling to evaluation.

Sample Classification Implementation

# Begin by importing all necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

Because the iris dataset is so common, Scikit-Learn actually already has it, available for loading in with the following command:

sklearn.datasets.load_iris   

However, we’ll be loading the CSV file here, so that you get a look at how to load and preprocess data. You can download the csv file here.

Just put the data file in the same directory as your Python file. The Pandas library has an easy way to load in data, read_csv():

data = pd.read_csv('iris.csv')

# It is a good idea to check and make sure the data is loaded as expected.
print(data.head(5))

Because the dataset has been prepared so well, we don’t need to do a lot of preprocessing. One thing we may want to do, though, is drop the “Id” column, as it is just a representation of the row the example is found on.

As this isn’t helpful we could drop it from the dataset using the drop() function:

data.drop('Id', axis=1, inplace=True)   

We now need to define the features and labels. We can do this easily with Pandas by slicing the data table and choosing certain rows/columns with iloc[]:

# Pandas ".iloc" expects row_indexer, column_indexer   X = data.iloc[:,:-1].values   # Now let's tell the dataframe which column we want for the target/labels.   y = data['Species']   

The slicing notation above selects every row and every column except the last column (which is our label, the species).

Alternatively, you could select certain features of the dataset you were interested in by using the bracket notation and passing in column headers:

# Alternate way of selecting columns by passing in a list of column headers:
X = data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm']]

Now that we have the features and labels we want, we can split the data into training and testing sets using sklearn’s handy feature train_test_split():

# Test size specifies how much of the data you want to set aside for the testing set.
# Random_state parameter is just a random seed we can use.
# You can use it if you'd like to reproduce these specific results.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=27)

You may want to print the results to be sure your data is being parsed as you expect:

print(X_train)
print(y_train)

Now we can instantiate the models. Let’s try using two classifiers, a Support Vector Classifier and a K-Nearest Neighbors Classifier:

SVC_model = SVC()

# KNN model requires you to specify n_neighbors,
# the number of points the classifier will look at to determine what class a new point belongs to
KNN_model = KNeighborsClassifier(n_neighbors=5)

Now let’s fit the classifiers:

SVC_model.fit(X_train, y_train)
KNN_model.fit(X_train, y_train)

The call has trained the model, so now we can predict and store the prediction in a variable:

SVC_prediction = SVC_model.predict(X_test)
KNN_prediction = KNN_model.predict(X_test)

We should now evaluate how the classifiers performed. There are multiple methods of evaluating a classifier’s performance, and you can read more about these different methods below.

In Scikit-Learn you just pass in the predictions against the ground truth labels which were stored in your test labels:

# Accuracy score is the simplest way to evaluate
print(accuracy_score(SVC_prediction, y_test))
print(accuracy_score(KNN_prediction, y_test))

# But Confusion Matrix and Classification Report give more details about performance
print(confusion_matrix(SVC_prediction, y_test))
print(classification_report(KNN_prediction, y_test))

For reference, here’s the output we got on the metrics:

SVC accuracy: 0.9333333333333333   KNN accuracy: 0.9666666666666667   

At first glance, it seems KNN performed better. Here’s the confusion matrix for SVC:

[[ 7  0  0]
 [ 0 10  1]
 [ 0  1 11]]

This can be a bit hard to interpret, but the numbers of correct predictions for each class run along the diagonal from top-left to bottom-right. Check below for more info on this.

Finally, here’s the output for the classification report for KNN:

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.91      0.91      0.91        11
 Iris-virginica       0.92      0.92      0.92        12

      micro avg       0.93      0.93      0.93        30
      macro avg       0.94      0.94      0.94        30
   weighted avg       0.93      0.93      0.93        30

Evaluating the Classifier

When it comes to the evaluation of your classifier, there are several different ways you can measure its performance.

Classification Accuracy

Classification Accuracy is the simplest of all the evaluation methods, and the most commonly used. Classification accuracy is simply the number of correct predictions divided by all predictions, or a ratio of correct predictions to total predictions.

While it can give you a quick idea of how your classifier is performing, it is best used when the number of observations/examples in each class is roughly equivalent. Because this doesn’t happen very often, you’re probably better off using another metric.
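In code, classification accuracy is just the fraction of matching labels, which is what Scikit-Learn's accuracy_score() computes; here is a small sketch with invented label arrays:

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

print((y_true == y_pred).mean())       # 0.8
print(accuracy_score(y_true, y_pred))  # 0.8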

Logarithmic Loss

Logarithmic Loss, or LogLoss, essentially evaluates how confident the classifier is about its predictions. LogLoss takes the probabilities the classifier assigns to an example's membership in each class and aggregates their logarithms to give a representation of the classifier’s general confidence.

The predicted probabilities run from 1 to 0, with 1 being completely confident and 0 being no confidence. The loss, or overall lack of confidence, is returned as a non-negative number, with 0 representing a perfect classifier, so smaller values are better.
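Scikit-Learn exposes this metric as log_loss(), which takes the true labels and the predicted class probabilities. A quick sketch with made-up probabilities shows that confident, correct predictions give a smaller (better) loss:

from sklearn.metrics import log_loss

y_true = [0, 1, 1]

# Confident, correct probability estimates vs. hesitant ones
confident = [[0.9, 0.1], [0.1, 0.9], [0.2, 0.8]]
hesitant  = [[0.6, 0.4], [0.4, 0.6], [0.5, 0.5]]

print(log_loss(y_true, confident))  # smaller loss
print(log_loss(y_true, hesitant))   # larger loss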

Area Under ROC Curve (AUC)

This is a metric used only for binary classification problems. The area under the curve represents the model’s ability to properly discriminate between negative and positive examples, between one class or another.

An AUC of 1.0, meaning all of the area falls under the curve, represents a perfect classifier, while an AUC of 0.5 is basically as good as random guessing. The ROC curve is calculated with regards to sensitivity (true positive rate/recall) and specificity (true negative rate). You can read more about these calculations at this ROC curve article.
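In Scikit-Learn this metric is available as roc_auc_score(), which takes the true binary labels and a score (such as a predicted probability) for the positive class. A tiny sketch with made-up values:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]  # e.g. predicted probabilities of class 1

print(roc_auc_score(y_true, y_scores))  # 0.75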

Confusion Matrix

A confusion matrix is a table or chart representing the accuracy of a model with regards to two or more classes. The predictions of the model will be on the X-axis, while the actual outcomes/ground-truth labels are located on the Y-axis.

The cells are filled with the number of predictions the model makes. Correct predictions can be found on a diagonal line moving from the top left to the bottom right. You can read more about interpreting a confusion matrix here.

Classification Report

The classification report is a Scikit-Learn built-in metric created especially for classification problems. Using the classification report can give you a quick intuition of how your model is performing. Recall pits the number of examples your model labeled as Class A (some given class) against the total number of examples of Class A, and this is represented in the report.

The report also returns precision and the f1-score. Precision is the percentage of examples your model labeled as Class A which actually belonged to Class A (true positives against false positives), and the f1-score is the harmonic mean of precision and recall.
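Each of these metrics can also be computed on its own in Scikit-Learn; the following small sketch, with invented labels, also verifies that the f1-score matches the harmonic-mean formula:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)  # 2 true positives / 3 predicted positives
r = recall_score(y_true, y_pred)     # 2 true positives / 3 actual positives
print(p, r, f1_score(y_true, y_pred), 2 * p * r / (p + r))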

Conclusion

To take your understanding of Scikit-Learn farther, it would be a good idea to learn more about the different classification algorithms available. Once you have an understanding of these algorithms, read more about how to evaluate classifiers.

Many of the nuances of classification will only come with time and practice, but if you follow the steps in this guide you’ll be well on your way to becoming an expert in classification tasks with Scikit-Learn.
