## Stack Abuse: Brief Introduction to OpenGL in Python with PyOpenGL

### Introduction

In this tutorial, we're going to learn how to use the PyOpenGL library in Python. OpenGL is a graphics library which is supported by multiple platforms including Windows, Linux, and macOS, and is available for use in multiple other languages as well; however, the scope of this post will be limited to its usage in the Python programming language.

OpenGL, as compared to other similar graphics libraries, is fairly simple. We’ll start with setting it up on our system, followed by writing a simple example demonstrating the usage of the library.

### Installation

The easiest way to install OpenGL for Python is through the pip package manager. If you have pip installed on your system, run the following command to download and install OpenGL:

```shell
$ pip install PyOpenGL PyOpenGL_accelerate
```

I'd recommend copying the above command to help avoid typos. Once this command finishes execution, if the installation is successful, you should get the following output at the end:

```
Successfully installed PyOpenGL-3.1.0 PyOpenGL-accelerate-3.1.0
```

If this doesn't work, you can also download the package manually. For that, follow this link, scroll down to the 'Downloading and Installation' heading, and download all the files there. After that, navigate to the folder where you downloaded those files, and run the following command in the terminal or command prompt:

```shell
$ python setup.py install
```

It is pertinent to mention that, on Windows, you require the Visual C++ 14.0 build tools installed on your system in order to work with OpenGL libraries in Python.

Now that we have successfully installed OpenGL on our system, let’s get our hands dirty with it.

### Coding Exercise

The first thing we need to do to use OpenGL in our code is to import it:

```python
import OpenGL
```

Before we proceed, there are a few other libraries that you need to import whenever you intend to use this library in your program. Below is the code for those imports:

```python
import OpenGL.GL
import OpenGL.GLUT
import OpenGL.GLU
print("Imports successful!") # If you see this printed to the console then installation was successful
```

Now that we are done with the necessary imports, let’s first create a window in which our graphics will be shown. The code for that is given below, along with its explanation in the comments:

```python
from OpenGL.GL import *
from OpenGL.GLUT import *

def showScreen():
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT) # Remove everything from screen (i.e. displays all white)

glutInit() # Initialize a glut instance which will allow us to customize our window
glutInitDisplayMode(GLUT_RGBA) # Set the display mode to be colored
glutInitWindowSize(500, 500)   # Set the width and height of your window
glutInitWindowPosition(0, 0)   # Set the position at which this window should appear
wind = glutCreateWindow("OpenGL Coding Practice") # Give your window a title
glutDisplayFunc(showScreen)  # Tell OpenGL to call the showScreen method continuously
glutIdleFunc(showScreen)     # Draw any graphics or shapes in the showScreen function at all times
glutMainLoop()  # Keeps the window created above displaying/running in a loop
```

Copy this code into a single Python (.py) file and execute it. You should see a blank square window pop up. Now, if we wish to draw any shapes or make any other kind of graphics, we need to do that in our showScreen function.

Let's now try to make a square using OpenGL, but before we do, we need to understand the coordinate system that OpenGL follows.

The (0, 0) point is the bottom left of your window. Moving up from there takes you along the y-axis, and moving right takes you along the x-axis. So the top left point of your window is (0, 500), the top right is (500, 500), and the bottom right is (500, 0).

Note: We’re talking about the window we created above, which had a dimension of 500 x 500 in our example, and not your computer’s full screen.
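To make this mapping concrete, here is a small pure-Python helper (not part of the tutorial's code, just an illustration) that converts a point in this window coordinate system to OpenGL's normalized device coordinates, where the window spans -1.0 to 1.0 on both axes:

```python
def window_to_ndc(x, y, width=500, height=500):
    """Map window coordinates (origin at the bottom left) to
    normalized device coordinates in the range [-1.0, 1.0]."""
    ndc_x = 2.0 * x / width - 1.0
    ndc_y = 2.0 * y / height - 1.0
    return ndc_x, ndc_y

# The corners and centre of a 500 x 500 window:
print(window_to_ndc(0, 0))      # (-1.0, -1.0) -> bottom left
print(window_to_ndc(500, 500))  # (1.0, 1.0)   -> top right
print(window_to_ndc(250, 250))  # (0.0, 0.0)   -> centre
```

This is exactly the mapping that the glOrtho call later in this tutorial sets up for us, so we can keep working in comfortable pixel-style coordinates.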

Now that we've got that out of the way, let's code a square. The explanation of the code can be found in the comments.

```python
from OpenGL.GL import *
from OpenGL.GLUT import *
from OpenGL.GLU import *

w, h = 500, 500

# ---Section 1---
def square():
    # We have to declare the points in this sequence: bottom left, bottom right, top right, top left
    glBegin(GL_QUADS) # Begin the sketch
    glVertex2f(100, 100) # Coordinates for the bottom left point
    glVertex2f(200, 100) # Coordinates for the bottom right point
    glVertex2f(200, 200) # Coordinates for the top right point
    glVertex2f(100, 200) # Coordinates for the top left point
    glEnd() # Mark the end of drawing

# This alone isn't enough to draw our square

# ---Section 2---
def showScreen():
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT) # Remove everything from screen (i.e. displays all white)
    glLoadIdentity() # Reset all graphic/shape's positions
    square() # Draw a square using our function
    glutSwapBuffers()

# ---Section 3---
glutInit()
glutInitDisplayMode(GLUT_RGBA) # Set the display mode to be colored
glutInitWindowSize(500, 500)   # Set the w and h of your window
glutInitWindowPosition(0, 0)   # Set the position at which this window should appear
wind = glutCreateWindow("OpenGL Coding Practice") # Set a window title
glutDisplayFunc(showScreen)
glutIdleFunc(showScreen) # Keeps the window open
glutMainLoop()  # Keeps the above created window displaying/running in a loop
```

Running the code above draws a square, but that square won't be visible, since its color is the same as the color of our window, so we need to assign it a different color. For that we will make a change in "Section 2" of the code above, i.e. the showScreen function: add the following line below the glLoadIdentity statement and above the square() statement:

```python
glColor3f(1.0, 0.0, 1.0) # Set the color to pink (each component should be in the 0.0-1.0 range)
```
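As an aside, OpenGL clamps each color component to the range 0.0 to 1.0 before use, so an out-of-range value like 3.0 behaves the same as 1.0. A stdlib-only sketch of that clamping rule:

```python
def clamp_component(c):
    """Clamp a color component to the [0.0, 1.0] range,
    mirroring how OpenGL treats out-of-range color values."""
    return max(0.0, min(1.0, c))

# (1.0, 0.0, 3.0) is treated the same as (1.0, 0.0, 1.0)
color = tuple(clamp_component(c) for c in (1.0, 0.0, 3.0))
print(color)  # (1.0, 0.0, 1.0)
```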

However, our code is still not complete. What it currently does is draw the square once and then clear the screen again. We don't want that; we wouldn't even be able to spot the moment it draws the square, because it would appear and disappear in a split second. Let's write another function to avoid this.

```python
# Add this function before Section 2 of the code above, i.e. the showScreen function
def iterate():
    glViewport(0, 0, 500, 500)
    glMatrixMode(GL_PROJECTION)
    glLoadIdentity()
    glOrtho(0.0, 500, 0.0, 500, 0.0, 1.0)
    glMatrixMode(GL_MODELVIEW)
    glLoadIdentity()
```

Call this iterate function in "Section 2" of the code above: add it below glLoadIdentity and above the glColor3f statement in the showScreen function.

Let’s now compile all this into a single code file so that there are no ambiguities:

```python
from OpenGL.GL import *
from OpenGL.GLUT import *
from OpenGL.GLU import *

w, h = 500, 500

def square():
    glBegin(GL_QUADS)
    glVertex2f(100, 100)
    glVertex2f(200, 100)
    glVertex2f(200, 200)
    glVertex2f(100, 200)
    glEnd()

def iterate():
    glViewport(0, 0, 500, 500)
    glMatrixMode(GL_PROJECTION)
    glLoadIdentity()
    glOrtho(0.0, 500, 0.0, 500, 0.0, 1.0)
    glMatrixMode(GL_MODELVIEW)
    glLoadIdentity()

def showScreen():
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)
    glLoadIdentity()
    iterate()
    glColor3f(1.0, 0.0, 1.0)
    square()
    glutSwapBuffers()

glutInit()
glutInitDisplayMode(GLUT_RGBA)
glutInitWindowSize(500, 500)
glutInitWindowPosition(0, 0)
wind = glutCreateWindow("OpenGL Coding Practice")
glutDisplayFunc(showScreen)
glutIdleFunc(showScreen)
glutMainLoop()
```
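If you want to move or resize the square, the four corner coordinates can be computed rather than hard-coded. A small helper (hypothetical, not from the original tutorial) that returns the corners in the same bottom-left, bottom-right, top-right, top-left order that square() declares them in:

```python
def square_corners(x, y, size):
    """Return the corners of a square whose bottom-left corner is at (x, y),
    in the order: bottom left, bottom right, top right, top left."""
    return [(x, y), (x + size, y), (x + size, y + size), (x, y + size)]

print(square_corners(100, 100, 100))
# [(100, 100), (200, 100), (200, 200), (100, 200)]

# Inside square() you could then draw with:
#   glBegin(GL_QUADS)
#   for cx, cy in square_corners(100, 100, 100):
#       glVertex2f(cx, cy)
#   glEnd()
```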

When you run this, a window should appear with a pink colored square box in it.

Output:

### Conclusion

In this tutorial, we learned about OpenGL, how to download and install it, followed by using it in a short example program. In this example we also practiced making a basic shape using OpenGL, which gave us an insight into some of the complex function calls that need to be made whenever we need to draw something using this library. To conclude, OpenGL is very resourceful and gets more and more complex as we dive deeper into it.

Planet Python

## Stack Abuse: Python for NLP: Working with the Gensim Library (Part 2)

This is my 11th article in the series of articles on Python for NLP and the 2nd article on the Gensim library in this series. In a previous article, I provided a brief introduction to Python's Gensim library. I explained how we can create dictionaries that map words to their corresponding numeric IDs. We further discussed how to create a bag of words corpus from dictionaries. In this article, we will study how we can perform topic modeling using the Gensim library.

I have explained how to do topic modeling using Python’s Scikit-Learn library, in my previous article. In that article, I explained how Latent Dirichlet Allocation (LDA) and Non-Negative Matrix factorization (NMF) can be used for topic modeling.

In this article, we will use the Gensim library for topic modeling. The approaches employed for topic modeling will be LDA and LSI (Latent Semantic Indexing).

### Installing Required Libraries

We will perform topic modeling on the text obtained from Wikipedia articles. To scrape Wikipedia articles, we will use the Wikipedia API. To download the Wikipedia API library, execute the following command:

```shell
$ pip install wikipedia
```

Otherwise, if you use the Anaconda distribution of Python, you can use one of the following commands:

```shell
$ conda install -c conda-forge wikipedia
$ conda install -c conda-forge/label/cf201901 wikipedia
```

To visualize our topic model, we will use the pyLDAvis library. To download the library, execute the following pip command:

```shell
$ pip install pyLDAvis
```

Again, if you use the Anaconda distribution instead you can execute one of the following commands:

```shell
$ conda install -c conda-forge pyldavis
$ conda install -c conda-forge/label/gcc7 pyldavis
$ conda install -c conda-forge/label/cf201901 pyldavis
```

### Topic Modeling with LDA

In this section, we will perform topic modeling of the Wikipedia articles using LDA. We will download four Wikipedia articles on the topics "Global Warming", "Artificial Intelligence", "Eiffel Tower", and "Mona Lisa". Next, we will preprocess the articles, followed by the topic modeling step. Finally, we will see how we can visualize the LDA model.

#### Scraping Wikipedia Articles

Execute the following script:

```python
import wikipedia
import nltk

nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))

global_warming = wikipedia.page("Global Warming")
artificial_intelligence = wikipedia.page("Artificial Intelligence")
mona_lisa = wikipedia.page("Mona Lisa")
eiffel_tower = wikipedia.page("Eiffel Tower")

corpus = [global_warming.content, artificial_intelligence.content, mona_lisa.content, eiffel_tower.content]
```

In the script above, we first import the wikipedia and nltk libraries. We also download the English nltk stopwords, which we will use later. Next, we download each article from Wikipedia by passing its topic to the page function of the wikipedia library. The object returned contains information about the downloaded page. To retrieve the contents of the page, we can use the content attribute. The content of all four articles is stored in the list named corpus.

#### Data Preprocessing

To perform topic modeling via LDA, we need a data dictionary and a bag of words corpus. From the last article (linked above), we know that to create a dictionary and bag of words corpus we need data in the form of tokens. Furthermore, we need to remove things like punctuation and stop words from our dataset. For the sake of uniformity, we will convert all the tokens to lower case and will also lemmatize them. We will also remove all tokens of 5 characters or fewer.
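Before using Gensim's Dictionary and doc2bow later on, it may help to see the idea in plain Python: a dictionary assigns each unique token an integer id, and a bag of words representation counts how often each id occurs in a document. This stdlib-only sketch is an illustration of the concept, not Gensim's actual implementation:

```python
from collections import Counter

def build_dictionary(documents):
    """Assign an integer id to every unique token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc_to_bow(token2id, doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [["climate", "change", "climate"], ["museum", "painting"]]
token2id = build_dictionary(docs)
print(doc_to_bow(token2id, docs[0]))  # [(0, 2), (1, 1)]
```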
Look at the following script:

```python
import re
from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

def preprocess_text(document):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(document))
    # Remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    # Remove single characters from the start
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
    # Substitute multiple spaces with a single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)
    # Remove a prefixed 'b'
    document = re.sub(r'^b\s+', '', document)
    # Convert to lowercase
    document = document.lower()
    # Lemmatization
    tokens = document.split()
    tokens = [stemmer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 5]
    return tokens
```

In the above script, we create a method named preprocess_text that accepts a text document as a parameter. The method uses regex operations to perform a variety of tasks. Let's briefly review what's happening in the function above:

```python
document = re.sub(r'\W', ' ', str(document))
```

The above line replaces all the special characters and numbers with a space. However, when you remove punctuation, single characters with no meaning appear in the text. For instance, when you remove the punctuation in the text Eiffel's, the words Eiffel and s appear. Here the s has no meaning, so we need to replace it with a space. The following line does that:

```python
document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
```

The above line removes single characters within the text only. To remove a single character at the beginning of the text, the following code is used:

```python
document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
```

When you remove single characters within the text, multiple empty spaces can appear.
The following code replaces multiple empty spaces with a single space:

```python
document = re.sub(r'\s+', ' ', document, flags=re.I)
```

When you scrape a document online, a string b is often prefixed to the document, which signifies that the document is binary. To remove the prefixed b, the following line is used:

```python
document = re.sub(r'^b\s+', '', document)
```

The rest of the method is self-explanatory. The document is converted into lower case and then split into tokens. The tokens are lemmatized and the stop words are removed. Finally, all tokens of five characters or fewer are ignored. The rest of the tokens are returned to the calling function.

#### Modeling Topics

This section is the meat of the article. Here we will see how the Gensim library's built-in functions can be used for topic modeling. But before that, we need to create a corpus of all the tokens (words) in the four Wikipedia articles that we scraped. Look at the following script:

```python
processed_data = []
for doc in corpus:
    tokens = preprocess_text(doc)
    processed_data.append(tokens)
```

The script above is straightforward. We iterate through the corpus list that contains the four Wikipedia articles in the form of strings. In each iteration, we pass the document to the preprocess_text method that we created earlier. The method returns the tokens for that particular document, which are appended to the processed_data list. At the end of the for loop, all tokens from all four articles are stored in processed_data. We can now use this list to create a dictionary and corresponding bag of words corpus, which the following script does:

```python
from gensim import corpora

gensim_dictionary = corpora.Dictionary(processed_data)
gensim_corpus = [gensim_dictionary.doc2bow(tokens, allow_update=True) for tokens in processed_data]
```

Next, we will save our dictionary as well as the bag of words corpus using pickle. We will use the saved dictionary later to make predictions on new data.
```python
import pickle

pickle.dump(gensim_corpus, open('gensim_corpus_corpus.pkl', 'wb'))
gensim_dictionary.save('gensim_dictionary.gensim')
```

Now we have everything needed to create an LDA model in Gensim. We will use the LdaModel class from the gensim.models.ldamodel module. We need to pass the bag of words corpus that we created earlier as the first parameter to the LdaModel constructor, followed by the number of topics, the dictionary that we created earlier, and the number of passes (iterations of the model over the corpus). Execute the following script:

```python
import gensim

lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=20)
lda_model.save('gensim_model.gensim')
```

Yes, it is that simple. In the script above we created the LDA model from our dataset and saved it.

Next, let's print 10 words for each topic. To do so, we can use the print_topics method. Execute the following script:

```python
topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
```

The output looks like this:

```
(0, '0.036*"painting" + 0.018*"leonardo" + 0.009*"louvre" + 0.009*"portrait" + 0.006*"museum" + 0.006*"century" + 0.006*"french" + 0.005*"giocondo" + 0.005*"original" + 0.004*"picture"')
(1, '0.016*"intelligence" + 0.014*"machine" + 0.012*"artificial" + 0.011*"problem" + 0.010*"learning" + 0.009*"system" + 0.008*"network" + 0.007*"research" + 0.007*"knowledge" + 0.007*"computer"')
(2, '0.026*"eiffel" + 0.008*"second" + 0.006*"french" + 0.006*"structure" + 0.006*"exposition" + 0.005*"tallest" + 0.005*"engineer" + 0.004*"design" + 0.004*"france" + 0.004*"restaurant"')
(3, '0.031*"climate" + 0.026*"change" + 0.024*"warming" + 0.022*"global" + 0.014*"emission" + 0.013*"effect" + 0.012*"greenhouse" + 0.011*"temperature" + 0.007*"carbon" + 0.006*"increase"')
```

The first topic contains words like painting, louvre, portrait, french, museum, etc.
We can assume that these words belong to a topic related to a picture with a French connection. Similarly, the second topic contains words like intelligence, machine, research, etc. We can assume that these words belong to the topic of Artificial Intelligence. Similarly, the words from the third and fourth topics point to the fact that these words are part of the topics Eiffel Tower and Global Warming, respectively.

We can clearly see that the LDA model has successfully identified the four topics in our dataset. It is important to mention here that LDA is an unsupervised learning algorithm, and in real-world problems you will not know about the topics in the dataset beforehand. You will simply be given a corpus, the topics will be created using LDA, and the names of the topics are then up to you.

Let's now create 8 topics using our dataset, printing 5 words per topic:

```python
lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=8, id2word=gensim_dictionary, passes=15)
lda_model.save('gensim_model.gensim')

topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)
```

The output looks like this:

```
(0, '0.000*"climate" + 0.000*"change" + 0.000*"eiffel" + 0.000*"warming" + 0.000*"global"')
(1, '0.018*"intelligence" + 0.016*"machine" + 0.013*"artificial" + 0.012*"problem" + 0.010*"learning"')
(2, '0.045*"painting" + 0.023*"leonardo" + 0.012*"louvre" + 0.011*"portrait" + 0.008*"museum"')
(3, '0.000*"intelligence" + 0.000*"machine" + 0.000*"problem" + 0.000*"artificial" + 0.000*"system"')
(4, '0.035*"climate" + 0.030*"change" + 0.027*"warming" + 0.026*"global" + 0.015*"emission"')
(5, '0.031*"eiffel" + 0.009*"second" + 0.007*"french" + 0.007*"structure" + 0.007*"exposition"')
(6, '0.000*"painting" + 0.000*"machine" + 0.000*"system" + 0.000*"intelligence" + 0.000*"problem"')
(7, '0.000*"climate" + 0.000*"change" + 0.000*"global" + 0.000*"machine" + 0.000*"intelligence"')
```

Again, the number of topics that you want to create is up to you.
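Each topic that print_topics returns is a formatted string of weighted words. If you want the weights programmatically, a small helper (hypothetical, not part of the article's code) can parse such a string into (weight, word) pairs:

```python
import re

def parse_topic(topic_string):
    """Parse a Gensim topic string like '0.036*"painting" + 0.018*"leonardo"'
    into a list of (weight, word) tuples."""
    pairs = re.findall(r'(-?\d+\.\d+)\*"([^"]+)"', topic_string)
    return [(float(weight), word) for weight, word in pairs]

print(parse_topic('0.036*"painting" + 0.018*"leonardo" + 0.009*"louvre"'))
# [(0.036, 'painting'), (0.018, 'leonardo'), (0.009, 'louvre')]
```

In practice, Gensim's show_topics method with formatted=False gives you the word/probability pairs directly, so parsing is only needed when all you have is the printed string.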
Keep trying different numbers until you find suitable topics. For our dataset the suitable number of topics is 4, since we already know that our corpus contains words from four different articles. Revert to four topics by executing the following script:

```python
lda_model = gensim.models.ldamodel.LdaModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary, passes=20)
lda_model.save('gensim_model.gensim')

topics = lda_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
```

This time you will see different results, since the initial values for the LDA parameters are chosen randomly. The results this time are as follows:

```
(0, '0.031*"climate" + 0.027*"change" + 0.024*"warming" + 0.023*"global" + 0.014*"emission" + 0.013*"effect" + 0.012*"greenhouse" + 0.011*"temperature" + 0.007*"carbon" + 0.006*"increase"')
(1, '0.026*"eiffel" + 0.008*"second" + 0.006*"french" + 0.006*"structure" + 0.006*"exposition" + 0.005*"tallest" + 0.005*"engineer" + 0.004*"design" + 0.004*"france" + 0.004*"restaurant"')
(2, '0.037*"painting" + 0.019*"leonardo" + 0.009*"louvre" + 0.009*"portrait" + 0.006*"museum" + 0.006*"century" + 0.006*"french" + 0.005*"giocondo" + 0.005*"original" + 0.004*"subject"')
(3, '0.016*"intelligence" + 0.014*"machine" + 0.012*"artificial" + 0.011*"problem" + 0.010*"learning" + 0.009*"system" + 0.008*"network" + 0.007*"knowledge" + 0.007*"research" + 0.007*"computer"')
```

You can see that the words for the first topic are now mostly related to Global Warming, while the second topic contains words related to the Eiffel Tower.

#### Evaluating the LDA Model

As I said earlier, unsupervised learning models are hard to evaluate since there is no concrete ground truth against which we can test the output of our model. Suppose we have a new text document and we want to find its topic using the LDA model we just created. We can do so with the following script:

```python
test_doc = 'Great structures are build to remember an event happened in the history.'
```
```python
test_doc = preprocess_text(test_doc)
bow_test_doc = gensim_dictionary.doc2bow(test_doc)

print(lda_model.get_document_topics(bow_test_doc))
```

In the script above, we preprocessed the string and then converted it into its bag of words representation, which is passed to the get_document_topics method. The output looks like this:

```
[(0, 0.08422605), (1, 0.7446843), (2, 0.087012805), (3, 0.08407689)]
```

The output shows that there is an 8.4% chance that the new document belongs to the first topic (see the words for the first topic in the last output). Similarly, there is a 74.4% chance that this document belongs to the second topic. If we look at the second topic, it contains words related to the Eiffel Tower. Our test document also contains words related to structures and buildings; therefore, it has been assigned the second topic.

Another way to evaluate the LDA model is via perplexity and coherence score. As a rule of thumb for a good LDA model, the perplexity score should be low while coherence should be high. The Gensim library has a CoherenceModel class which can be used to find the coherence of the LDA model. For perplexity, the LdaModel object has a log_perplexity method which takes a bag of words corpus as a parameter and returns the corresponding perplexity:

```python
print('\nPerplexity:', lda_model.log_perplexity(gensim_corpus))

from gensim.models import CoherenceModel

coherence_score_lda = CoherenceModel(model=lda_model, texts=processed_data, dictionary=gensim_dictionary, coherence='c_v')
coherence_score = coherence_score_lda.get_coherence()

print('\nCoherence Score:', coherence_score)
```

The CoherenceModel class takes the LDA model, the tokenized texts, the dictionary, and the coherence measure as parameters. To get the coherence score, the get_coherence method is used.
The output looks like this:

```
Perplexity: -7.492867099178969

Coherence Score: 0.718387005948207
```

#### Visualizing the LDA

To visualize our data, we can use the pyLDAvis library that we downloaded at the beginning of the article. The library contains a module for the Gensim LDA model. First we need to prepare the visualization by passing the dictionary, the bag of words corpus and the LDA model to the prepare method. Next, we need to call display on the gensim module of the pyLDAvis library, as shown below:

```python
gensim_dictionary = gensim.corpora.Dictionary.load('gensim_dictionary.gensim')
gensim_corpus = pickle.load(open('gensim_corpus_corpus.pkl', 'rb'))
lda_model = gensim.models.ldamodel.LdaModel.load('gensim_model.gensim')

import pyLDAvis.gensim

lda_visualization = pyLDAvis.gensim.prepare(lda_model, gensim_corpus, gensim_dictionary, sort_topics=False)
pyLDAvis.display(lda_visualization)
```

In the output, you will see the following visualization:

Each circle in the above image corresponds to one topic. From the output of the LDA model using 4 topics, we know that the first topic is related to Global Warming, the second to the Eiffel Tower, the third to the Mona Lisa, and the fourth to Artificial Intelligence.

The distance between circles shows how different the topics are from each other. You can see that circles 2 and 3 are overlapping. This is because topic 2 (Eiffel Tower) and topic 3 (Mona Lisa) have many words in common, such as "French", "France", "Museum", "Paris", etc.

If you hover over any word on the right, you will only see the circles for the topics that contain the word. For instance, if you hover over the word "climate", you will see that topics 2 and 4 disappear since they don't contain the word climate. The size of topic 1 will increase since most of the occurrences of the word "climate" are within the first topic.
A very small percentage is in topic 3, as shown in the following image:

Similarly, if you click any of the circles, a list of the most frequent terms for that topic will appear on the right, along with the frequency of occurrence in that very topic. For instance, if you click circle 2, which corresponds to the topic "Eiffel Tower", you will see the following results:

From the output, you can see that the circle for the second topic, i.e. "Eiffel Tower", has been selected. From the list on the right, you can see the most occurring terms for the topic. The term "eiffel" is at the top, and it is evident that it occurred mostly within this topic. On the other hand, if you look at the term "french", you can clearly see that around half of its occurrences are within this topic. This is because topic 3, i.e. "Mona Lisa", also contains the term "French" quite a few times. To verify this, click on the circle for topic 3 and hover over the term "french".

### Topic Modeling via LSI

In the previous section we saw how to perform topic modeling via LDA. Let's now see how to perform topic modeling via Latent Semantic Indexing (LSI). To do so, all you have to do is use the LsiModel class. The rest of the process remains exactly the same as with LDA.
Look at the following script:

```python
from gensim.models import LsiModel

lsi_model = LsiModel(gensim_corpus, num_topics=4, id2word=gensim_dictionary)

topics = lsi_model.print_topics(num_words=10)
for topic in topics:
    print(topic)
```

The output looks like this:

```
(0, '-0.337*"intelligence" + -0.297*"machine" + -0.250*"artificial" + -0.240*"problem" + -0.208*"system" + -0.200*"learning" + -0.166*"network" + -0.161*"climate" + -0.159*"research" + -0.153*"change"')
(1, '-0.453*"climate" + -0.377*"change" + -0.344*"warming" + -0.326*"global" + -0.196*"emission" + -0.177*"greenhouse" + -0.168*"effect" + 0.162*"intelligence" + -0.158*"temperature" + 0.143*"machine"')
(2, '0.688*"painting" + 0.346*"leonardo" + 0.179*"louvre" + 0.175*"eiffel" + 0.170*"portrait" + 0.147*"french" + 0.127*"museum" + 0.117*"century" + 0.109*"original" + 0.092*"giocondo"')
(3, '-0.656*"eiffel" + 0.259*"painting" + -0.184*"second" + -0.145*"exposition" + -0.145*"structure" + 0.135*"leonardo" + -0.128*"tallest" + -0.116*"engineer" + -0.112*"french" + -0.107*"design"')
```

### Conclusion

Topic modeling is an important NLP task. A variety of approaches and libraries can be used for topic modeling in Python. In this article, we saw how to do topic modeling via the Gensim library in Python using the LDA and LSI approaches. We also saw how to visualize the results of our LDA model.

## Stack Abuse: Python for NLP: Working with the Gensim Library (Part 1)

This is the 10th article in my series of articles on Python for NLP. In my previous article, I explained how the StanfordCoreNLP library can be used to perform different NLP tasks. In this article, we will explore the Gensim library, which is another extremely useful NLP library for Python. Gensim was primarily developed for topic modeling. However, it now supports a variety of other NLP tasks such as converting words to vectors (word2vec), documents to vectors (doc2vec), finding text similarity, and text summarization.
In this article and the next article of the series, we will see how the Gensim library is used to perform these tasks.

### Installing Gensim

If you use the pip installer to install your Python libraries, you can use the following command to download the Gensim library:

```shell
$ pip install gensim
```

Alternatively, if you use the Anaconda distribution of Python, you can execute the following command to install the Gensim library:

#### Python Solution Walkthrough

```python
import numpy as np

# Number of bandits
k = 3

# Our action values
Q = [0 for _ in range(k)]

# This is to keep track of the number of times we take each action
N = [0 for _ in range(k)]

# Epsilon value for exploration
eps = 0.1

# True probability of winning for each bandit
p_bandits = [0.45, 0.40, 0.80]

def pull(a):
    """Pull arm of bandit with index `a` and return 1 if win,
    else return 0."""
    if np.random.rand() < p_bandits[a]:
        return 1
    else:
        return 0

while True:
    if np.random.rand() > eps:
        # Take greedy action most of the time
        a = np.argmax(Q)
    else:
        # Take random action with probability eps
        a = np.random.randint(0, k)

    # Collect reward
    reward = pull(a)

    # Incremental average
    N[a] += 1
    Q[a] += 1/N[a] * (reward - Q[a])
```

Et voilà! If we run this script for a couple of seconds, we already see that our action values are proportional to the probability of hitting the jackpots for our bandits:

```
0.4406301434281669,
0.39131455399060977,
0.8008844354479673
```

This means that our greedy policy will correctly favour actions from which we can expect higher rewards.
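The update `Q[a] += 1/N[a] * (reward - Q[a])` is the incremental form of the arithmetic mean, so each Q value always equals the average reward observed for that arm. A quick stdlib-only check of that identity:

```python
def incremental_mean(rewards):
    """Compute the mean of a reward sequence using the same
    incremental update as the bandit loop above."""
    q, n = 0.0, 0
    for reward in rewards:
        n += 1
        q += 1 / n * (reward - q)
    return q

rewards = [1, 0, 1, 1, 0]
print(round(incremental_mean(rewards), 10))    # 0.6
print(round(sum(rewards) / len(rewards), 10))  # 0.6 -- same as the plain mean
```

The incremental form is preferred in the bandit loop because it needs constant memory per arm: we never have to store the full reward history.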

### Conclusion

Reinforcement Learning is a growing field, and there is a lot more to cover. In fact, we still haven’t looked at general-purpose algorithms and models (e.g. dynamic programming, Monte Carlo, Temporal Difference).

The most important thing right now is to get familiar with concepts such as value functions, policies, and MDPs. In the Resources section of this article, you’ll find some awesome resources to gain a deeper understanding of this kind of material.


## Stack Abuse: Overview of Classification Methods in Python with Scikit-Learn

### Introduction

Are you a Python programmer looking to get into machine learning? An excellent place to start your journey is by getting acquainted with Scikit-Learn.

Doing some classification with Scikit-Learn is a straightforward and simple way to start applying what you’ve learned, to make machine learning concepts concrete by implementing them with a user-friendly, well-documented, and robust library.

### What is Scikit-Learn?

Scikit-Learn is a library for Python that was first developed by David Cournapeau in 2007. It contains a range of useful algorithms that can easily be implemented and tweaked for the purposes of classification and other machine learning tasks.

Scikit-Learn uses SciPy as a foundation, so this base stack of libraries must be installed before Scikit-Learn can be utilized.

### Defining our Terms

Before we go any further into our exploration of Scikit-Learn, let’s take a minute to define our terms. It is important to have an understanding of the vocabulary that will be used when describing Scikit-Learn’s functions.

To begin with, a machine learning system or network takes inputs and outputs. The inputs into the machine learning framework are often referred to as "features".

Features are essentially the same as variables in a scientific experiment, they are characteristics of the phenomenon under observation that can be quantified or measured in some fashion.

When these features are fed into a machine learning framework the network tries to discern relevant patterns between the features. These patterns are then used to generate the outputs of the framework/network.

The outputs of the framework are often called “labels”, as the output features have some label given to them by the network, some assumption about what category the output falls into.


In a machine learning context, classification is a type of supervised learning. Supervised learning means that the data fed to the network is already labeled, with the important features/attributes already separated into distinct categories beforehand.

This means that the network knows which parts of the input are important, and there is also a target or ground truth that the network can check itself against. An example of classification is sorting a bunch of different plants into different categories like ferns or angiosperms. That task could be accomplished with a Decision Tree, a type of classifier in Scikit-Learn.

In contrast, unsupervised learning is where the data fed to the network is unlabeled and the network must try to learn for itself what features are most important. As mentioned, classification is a type of supervised learning, and therefore we won’t be covering unsupervised learning methods in this article.

The process of training a model is the process of feeding data into a neural network and letting it learn the patterns of the data. The training process takes in the data and pulls out the features of the dataset. During the training process for a supervised classification task the network is passed both the features and the labels of the training data. However, during testing, the network is only fed features.

The testing process is where the patterns that the network has learned are tested. The features are given to the network, and the network must predict the labels. The data for the network is divided into training and testing sets, two different sets of inputs. You do not test the classifier on the same dataset you train it on, as the model has already learned the patterns of this set of data and evaluating it there would give an extremely biased estimate of its performance.

Instead, the dataset is split up into training and testing sets, a set the classifier trains on and a set the classifier has never seen before.

### Different Types of Classifiers


Scikit-Learn provides easy access to numerous different classification algorithms. Among these classifiers are:

- K-Nearest Neighbors
- Decision Trees and Random Forests
- Naive Bayes
- Linear Discriminant Analysis
- Support Vector Machines
- Logistic Regression

There is a lot of literature on how these various classifiers work, and brief explanations of them can be found at Scikit-Learn’s website.

For this reason, we won’t delve too deeply into how they work here, but there will be a brief explanation of how each classifier operates.

#### K-Nearest Neighbors


K-Nearest Neighbors operates by checking the distance from some test example to the known values of some training example. The group of data points/class that would give the smallest distance between the training points and the testing point is the class that is selected.
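To make the distance-based voting concrete, here is a minimal sketch of the nearest-neighbor idea in plain Python, using a tiny made-up 1-D dataset (the points and labels are purely illustrative, not from any real data):

```python
# Tiny made-up training set: (feature value, class label)
train_points = [(1.0, "A"), (1.5, "A"), (5.0, "B"), (5.5, "B")]

def knn_predict(x, train, k=3):
    # Sort training points by their distance to the query point x
    by_distance = sorted(train, key=lambda p: abs(p[0] - x))
    # Take the labels of the k closest points
    nearest_labels = [label for _, label in by_distance[:k]]
    # Majority vote among the k neighbors decides the class
    return max(set(nearest_labels), key=nearest_labels.count)

print(knn_predict(1.2, train_points))  # nearest neighbors are mostly "A"
```

Scikit-Learn’s KNeighborsClassifier implements the same idea (with efficient data structures and several distance metrics) behind the fit/predict interface shown later in this article.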

#### Decision Trees

A Decision Tree Classifier functions by breaking down a dataset into smaller and smaller subsets based on different criteria. Different sorting criteria will be used to divide the dataset, with the number of examples getting smaller with every division.

Once the network has divided the data down to one example, the example will be put into a class that corresponds to a key. When multiple decision tree classifiers are combined into an ensemble, the result is called a Random Forest Classifier.
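As a quick sketch of how this looks in practice, here is a hypothetical example fitting both a single decision tree and a random forest on the built-in iris data (the parameter values are illustrative, not tuned):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# A single tree repeatedly splits the data on one criterion at a time
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
# A random forest is an ensemble of many such trees, each trained
# on a random subset of the data and features
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# The first iris sample belongs to class 0 (setosa)
print(tree.predict(X[:1]), forest.predict(X[:1]))
```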

#### Naive Bayes

A Naive Bayes Classifier determines the probability that an example belongs to some class, calculating the probability that an event will occur given that some input event has occurred.

When it does this calculation it is assumed that all the predictors of a class have the same effect on the outcome, that the predictors are independent.
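To illustrate the independence assumption, here is a toy hand calculation of Bayes’ rule with made-up “spam filter” probabilities (every number below is hypothetical):

```python
# Toy Naive Bayes sketch: P(class | words) is proportional to
# P(class) * product of P(word | class) over the observed words.
# All probabilities here are made up for illustration.
p_spam, p_ham = 0.4, 0.6
p_word_given_spam = {"free": 0.7, "meeting": 0.1}
p_word_given_ham = {"free": 0.1, "meeting": 0.5}

def spam_posterior(words):
    # Multiply the per-word likelihoods, treating words as independent
    spam_score, ham_score = p_spam, p_ham
    for w in words:
        spam_score *= p_word_given_spam[w]
        ham_score *= p_word_given_ham[w]
    # Normalize so the two posteriors sum to one
    return spam_score / (spam_score + ham_score)

print(round(spam_posterior(["free"]), 3))  # the word "free" points strongly to spam
```

Scikit-Learn’s GaussianNB applies the same rule with Gaussian likelihoods for continuous features.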

#### Linear Discriminant Analysis

Linear Discriminant Analysis works by reducing the dimensionality of the dataset, projecting all of the data points onto a line (or lower-dimensional space) chosen to maximize the separation between classes. Points are then assigned to classes based on their distance from each class’s centroid in that projected space.

Linear discriminant analysis, as you may be able to guess, is a linear classification algorithm and best used when the data has a linear relationship.

#### Support Vector Machines


Support Vector Machines work by drawing a line between the different clusters of data points to group them into classes. Points on one side of the line will be one class and points on the other side belong to another class.

The classifier will try to maximize the distance between the line it draws and the points on either side of it, to increase its confidence in which points belong to which class. When the testing points are plotted, the side of the line they fall on is the class they are put in.

#### Logistic Regression

Logistic Regression outputs predictions about test data points on a binary scale, zero or one. If the predicted probability is 0.5 or above, the example is classified as belonging to class 1, while below 0.5 it is classified as belonging to class 0.

Each example’s label is either 0 or 1. Logistic regression is a linear classifier and is therefore used when there is some sort of linear relationship in the data.
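The 0.5 threshold rule can be sketched with the logistic (sigmoid) function in plain Python (the inputs here are made-up scores, not outputs of a fitted model):

```python
import math

# The sigmoid squashes any real-valued score into the (0, 1) range
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Classify as 1 when the predicted probability is at or above the threshold
def classify(z, threshold=0.5):
    return 1 if sigmoid(z) >= threshold else 0

print(classify(2.0), classify(-2.0))  # prints: 1 0
```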

Classification tasks are any tasks that have you putting examples into two or more classes. Determining if an image is a cat or dog is a classification task, as is determining what the quality of a bottle of wine is based on features like acidity and alcohol content.

Depending on the classification task at hand, you will want to use different classifiers. For instance, a logistic regression model is best suited for binary classification tasks, although multiclass logistic regression models exist.

As you gain more experience with classifiers you will develop a better sense for when to use which classifier. However, a common practice is to instantiate multiple classifiers and compare their performance against one another, then select the classifier which performs the best.

### Implementing a Classifier

Now that we’ve discussed the various classifiers that Scikit-Learn provides access to, let’s see how to implement a classifier.

The first step in implementing a classifier is to import the classifier you need into Python. Let’s look at the import statement for logistic regression:

from sklearn.linear_model import LogisticRegression   

Here are the import statements for the other classifiers discussed in this article:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

Scikit-Learn has other classifiers as well, and their respective documentation pages will show how to import them.

After this, the classifier must be instantiated. Instantiation is the process of bringing the classifier into existence within your Python program – to create an instance of the classifier/object.

This is typically done just by making a variable and calling the function associated with the classifier:

logreg_clf = LogisticRegression()   

Now the classifier needs to be trained. In order to accomplish this, the classifier must be fit with the training data.

The training features and the training labels are passed into the classifier with the fit command:

logreg_clf.fit(features, labels)   

After the classifier model has been trained on the training data, it can make predictions on the testing data.

This is easily done by calling the predict command on the classifier and providing it with the parameters it needs to make predictions about, which are the features in your testing dataset:

logreg_clf.predict(test_features)   

These steps: instantiation, fitting/training, and predicting are the basic workflow for classifiers in Scikit-Learn.
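Putting the three steps together, a minimal end-to-end sketch might look like this (the tiny one-feature dataset is made up purely for illustration):

```python
from sklearn.linear_model import LogisticRegression

# Made-up one-feature dataset with binary labels
features = [[0.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 1, 1]

clf = LogisticRegression()   # 1. instantiate
clf.fit(features, labels)    # 2. fit/train on features and labels
print(clf.predict([[2.5]]))  # 3. predict on unseen feature values
```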

However, the handling of classifiers is only one part of doing classifying with Scikit-Learn. The other half of the classification in Scikit-Learn is handling data.

To understand how handling the classifier and handling data come together as a whole classification task, let’s take a moment to understand the machine learning pipeline.

### The Machine Learning Pipeline

The machine learning pipeline has the following steps: preparing data, creating training/testing sets, instantiating the classifier, training the classifier, making predictions, evaluating performance, tweaking parameters.

The first step to training a classifier on a dataset is to prepare the dataset – to get the data into the correct form for the classifier and handle any anomalies in the data. If there are missing values in the data, outliers in the data, or any other anomalies these data points should be handled, as they can negatively impact the performance of the classifier. This step is referred to as data preprocessing.

Once the data has been preprocessed, the data must be split into training and testing sets. We have previously discussed the rationale for creating training and testing sets, and this can easily be done in Scikit-Learn with a helpful function called train_test_split().

As previously discussed the classifier has to be instantiated and trained on the training data. After this, predictions can be made with the classifier. By comparing the predictions made by the classifier to the actual known values of the labels in your test data, you can get a measurement of how accurate the classifier is.

There are various methods comparing the hypothetical labels to the actual labels and evaluating the classifier. We’ll go over these different evaluation metrics later. For now, know that after you’ve measured the classifier’s accuracy, you will probably go back and tweak the parameters of your model until you have hit an accuracy you are satisfied with (as it is unlikely your classifier will meet your expectations on the first run).

Let’s look at an example of the machine learning pipeline, going from data handling to evaluation.

#### Sample Classification Implementation

# Begin by importing all necessary libraries
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

Because the iris dataset is so common, Scikit-Learn actually already has it, available for loading in with the following command:

sklearn.datasets.load_iris()

However, we’ll be loading the CSV file here, so that you get a look at how to load and preprocess data. You can download the csv file here.

Just put the data file in the same directory as your Python file. The Pandas library has an easy way to load in data, read_csv():

data = pd.read_csv('iris.csv')

# It is a good idea to check and make sure the data is loaded as expected.
print(data.head(5))

Because the dataset has been prepared so well, we don’t need to do a lot of preprocessing. One thing we may want to do, though, is drop the “Id” column, as it is just a representation of the row the example is found on.

As this isn’t helpful we could drop it from the dataset using the drop() function:

data.drop('Id', axis=1, inplace=True)   

We now need to define the features and labels. We can do this easily with Pandas by slicing the data table and choosing certain rows/columns with the iloc indexer:

# Pandas ".iloc" expects row_indexer, column_indexer
X = data.iloc[:,:-1].values
# Now let's tell the dataframe which column we want for the target/labels.
y = data['Species']

The slicing notation above selects every row and every column except the last column (which is our label, the species).

Alternatively, you could select certain features of the dataset you were interested in by using the bracket notation and passing in column headers:

# Alternate way of selecting columns:
X = data[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm']]

Now that we have the features and labels we want, we can split the data into training and testing sets using sklearn’s handy feature train_test_split():

# Test size specifies how much of the data you want to set aside for the testing set.
# Random_state parameter is just a random seed we can use.
# You can use it if you'd like to reproduce these specific results.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=27)

You may want to print the results to be sure your data is being parsed as you expect:

print(X_train)
print(y_train)

Now we can instantiate the models. Let’s try using two classifiers, a Support Vector Classifier and a K-Nearest Neighbors Classifier:

SVC_model = SVC()
# KNN model requires you to specify n_neighbors,
# the number of points the classifier will look at to determine what class a new point belongs to
KNN_model = KNeighborsClassifier(n_neighbors=5)

Now let’s fit the classifiers:

SVC_model.fit(X_train, y_train)
KNN_model.fit(X_train, y_train)

The call has trained the model, so now we can predict and store the prediction in a variable:

SVC_prediction = SVC_model.predict(X_test)
KNN_prediction = KNN_model.predict(X_test)

We should now evaluate how the classifier performed. There are multiple methods of evaluating a classifier’s performance, and you can read more about these different methods below.

In Scikit-Learn you just pass in the ground truth labels from your test set along with the predictions:

# Accuracy score is the simplest way to evaluate
print(accuracy_score(y_test, SVC_prediction))
print(accuracy_score(y_test, KNN_prediction))
# But Confusion Matrix and Classification Report give more details about performance
print(confusion_matrix(y_test, SVC_prediction))
print(classification_report(y_test, KNN_prediction))

For reference, here’s the output we got on the metrics:

SVC accuracy: 0.9333333333333333   KNN accuracy: 0.9666666666666667   

At first glance, it seems KNN performed better. Here’s the confusion matrix for SVC:

[[ 7  0  0]
 [ 0 10  1]
 [ 0  1 11]]

This can be a bit hard to interpret, but the number of correct predictions for each class runs along the diagonal from top-left to bottom-right. Check below for more info on this.

Finally, here’s the output for the classification report for KNN:

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       0.91      0.91      0.91        11
 Iris-virginica       0.92      0.92      0.92        12

      micro avg       0.93      0.93      0.93        30
      macro avg       0.94      0.94      0.94        30
   weighted avg       0.93      0.93      0.93        30

### Evaluating the Classifier

When it comes to the evaluation of your classifier, there are several different ways you can measure its performance.

#### Classification Accuracy

Classification accuracy is the simplest of all the evaluation methods, and the most commonly used. It is simply the number of correct predictions divided by the total number of predictions, or the ratio of correct predictions to total predictions.

While it can give you a quick idea of how your classifier is performing, it is best used when the number of observations/examples in each class is roughly equivalent. Because this doesn’t happen very often, you’re probably better off using another metric.
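As a quick sketch, accuracy can be computed by hand (the labels below are made up):

```python
# Made-up ground truth and predictions
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy = correct predictions / total predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 4 of the 5 predictions match, so 0.8
```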

#### Logarithmic Loss

Logarithmic Loss, or LogLoss, essentially evaluates how confident the classifier is about its predictions. It takes the probability the classifier assigns to the true class of each example and penalizes confident but wrong predictions heavily.

Predicted probabilities run from 0 to 1, with 1 being completely confident. The loss is returned as a non-negative number, with 0 representing a perfect classifier, so smaller values are better.
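For a concrete picture, here is binary log loss computed by hand from its formula (the labels and probabilities are made up):

```python
import math

# Made-up binary labels and predicted probabilities of class 1
y_true = [1, 0, 1]
y_prob = [0.9, 0.2, 0.6]

# Binary log loss: -mean( y*log(p) + (1-y)*log(1-p) )
loss = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
            for y, p in zip(y_true, y_prob)) / len(y_true)
print(round(loss, 4))  # confident, correct predictions keep the loss small
```

Scikit-Learn provides the same computation as sklearn.metrics.log_loss.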

#### Area Under ROC Curve (AUC)

This is a metric used only for binary classification problems. The area under the curve represents the model’s ability to properly discriminate between negative and positive examples, between one class or another.

An AUC of 1.0, meaning all of the area falls under the curve, represents a perfect classifier, while an AUC of 0.5 is basically as good as randomly guessing. The ROC curve is calculated with regards to sensitivity (true positive rate/recall) and specificity (true negative rate). You can read more about these calculations at this ROC curve article.
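Scikit-Learn exposes this metric as roc_auc_score; here is a small sketch with made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Made-up binary labels and classifier scores for four examples
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# 1.0 would be a perfect ranking of positives above negatives;
# 0.5 would be no better than chance
print(roc_auc_score(y_true, y_scores))
```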

#### Confusion Matrix

A confusion matrix is a table or chart representing the accuracy of a model with regards to two or more classes. The predictions of the model run along one axis while the actual labels are located on the other.

The cells are filled with the number of predictions the model makes. Correct predictions can be found on a diagonal line moving from the top left to the bottom right. You can read more about interpreting a confusion matrix here.
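A small sketch with made-up labels shows that diagonal structure, using Scikit-Learn’s confusion_matrix:

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Rows correspond to actual classes, columns to predicted classes;
# correct predictions sit on the main diagonal
cm = confusion_matrix(y_true, y_pred)
print(cm)
```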

#### Classification Report

The classification report is a Scikit-Learn built-in metric created especially for classification problems. Using the classification report can give you a quick intuition of how your model is performing. Recall pits the number of examples your model correctly labeled as Class A (some given class) against the total number of examples of Class A, and this is represented in the report.

The report also returns precision and f1-score. Precision is the percentage of examples your model labeled as Class A which actually belonged to Class A (true positives against false positives), and f1-score is the harmonic mean of precision and recall.
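As a sketch, precision, recall, and F1 can be computed by hand from hypothetical true/false positive counts:

```python
# Hypothetical counts for one positive class
tp, fp, fn = 8, 2, 4  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of everything labeled positive, how much was right
recall = tp / (tp + fn)     # of all actual positives, how much was found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(round(precision, 3), round(recall, 3), round(f1, 3))
```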

### Conclusion

To take your understanding of Scikit-Learn farther, it would be a good idea to learn more about the different classification algorithms available. Once you have an understanding of these algorithms, read more about how to evaluate classifiers.

Many of the nuances of classification will only come with time and practice, but if you follow the steps in this guide you’ll be well on your way to becoming an expert in classification tasks with Scikit-Learn.
