Stack Abuse: Getting Started with Python’s Wikipedia API

Introduction

In this article, we will be using the Wikipedia API to retrieve data from Wikipedia. Data scraping has seen a rapid surge owing to the increasing use of data analytics and machine learning tools. The Internet is the single largest source of information, and therefore it is important to know how to fetch data from various sources. And with Wikipedia being one of the largest and most popular sources for information on the Internet, this is a natural place to start.

In this article, we will see how to use Python’s Wikipedia API to fetch a variety of information from the Wikipedia website.

Installation

In order to extract data from Wikipedia, we must first install the Python Wikipedia library, which wraps the official Wikipedia API. This can be done by entering the command below in your command prompt or terminal:

$   pip install wikipedia 

Once the installation is done, we can use the Wikipedia API in Python to extract information from Wikipedia. In order to call the methods of the Wikipedia module in Python, we need to import it using the following command.

import wikipedia   

Searching Titles and Suggestions

The search() method runs a Wikipedia search for the query passed as its argument and returns a list of article titles that match the query. For example:

import wikipedia
print(wikipedia.search("Bill"))

Output:

['Bill', 'The Bill', 'Bill Nye', 'Bill Gates', 'Bills, Bills, Bills', 'Heartbeat bill', 'Bill Clinton', 'Buffalo Bill', 'Bill & Ted', 'Kill Bill: Volume 1'] 

As you see in the output, the searched title along with the related search suggestions are displayed. You can configure the number of search titles returned by passing a value for the results parameter, as shown here:

import wikipedia
print(wikipedia.search("Bill", results=2))

Output:

['Bill', 'The Bill'] 

The above code prints only 2 search results of the query since that is how many we requested to be returned.

Let’s say we need the Wikipedia search suggestion for a title that was entered with a typo, such as “Bill cliton”. The suggest() method returns a suggested correction for the query passed to it, or None if no suggestion is found.

Let’s try it out here:

import wikipedia
print(wikipedia.suggest("Bill cliton"))

Output:

bill clinton   

You can see that it took our incorrect entry, “Bill cliton”, and returned the correct suggestion of “bill clinton”.
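The two calls can be combined: ask suggest() for a corrected spelling first, then search with whichever string survives. The sketch below is only an illustration. resolve_title is a hypothetical helper, not part of the library, and the search and suggest functions are passed in as arguments so that the logic can be tried without a network connection.

```python
def resolve_title(query, search_fn, suggest_fn):
    """Return the best-matching article title for a possibly misspelled query.

    resolve_title is a hypothetical helper. search_fn and suggest_fn are
    expected to behave like wikipedia.search and wikipedia.suggest.
    """
    # suggest() returns None when it has no correction to offer.
    corrected = suggest_fn(query) or query
    # Ask for a single result; an empty list means nothing matched.
    titles = search_fn(corrected, results=1)
    return titles[0] if titles else None


# Usage with the real library (needs network access):
#   import wikipedia
#   resolve_title("Bill cliton", wikipedia.search, wikipedia.suggest)
```

Passing the functions in keeps the helper independent of the library, which also makes it easy to test with stand-ins.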

Extracting Wikipedia Article Summary

We can extract the summary of a Wikipedia article using the summary() method. The article for which the summary needs to be extracted is passed as a parameter to this method.

Let’s extract the summary for “Ubuntu”:

print(wikipedia.summary("Ubuntu"))   

Output:

Ubuntu ( (listen)) is a free and open-source Linux distribution based on Debian. Ubuntu is officially released in three editions: Desktop, Server, and Core (for the internet of things devices and robots). Ubuntu is a popular operating system for cloud computing, with support for OpenStack.Ubuntu is released every six months, with long-term support (LTS) releases every two years. The latest release is 19.04 ("Disco Dingo"), and the most recent long-term support release is 18.04 LTS ("Bionic Beaver"), which is supported until 2028. Ubuntu is developed by Canonical and the community under a meritocratic governance model. Canonical provides security updates and support for each Ubuntu release, starting from the release date and until the release reaches its designated end-of-life (EOL) date. Canonical generates revenue through the sale of premium services related to Ubuntu. Ubuntu is named after the African philosophy of Ubuntu, which Canonical translates as "humanity to others" or "I am what I am because of who we all are".   

The whole summary is printed in the output. We can customize the number of sentences in the summary text to be displayed by configuring the sentences argument of the method.

print(wikipedia.summary("Ubuntu", sentences=2))   

Output:

Ubuntu ( (listen)) is a free and open-source Linux distribution based on Debian. Ubuntu is officially released in three editions: Desktop, Server, and Core (for the internet of things devices and robots).   

As you can see, only 2 sentences of Ubuntu’s text summary are printed.

However, keep in mind that wikipedia.summary will raise a PageError if the page does not exist, and a DisambiguationError if the title is ambiguous. Let’s see an example.

print(wikipedia.summary("key"))   

The above code throws a DisambiguationError since there are many articles that would match “key”.

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/wikipedia/util.py", line 28, in __call__
    ret = self._cache[key] = self.fn(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 231, in summary
    page_info = page(title, auto_suggest=auto_suggest, redirect=redirect)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Key" may refer to:
Key (cryptography)
Key (lock)
Key (map)
...

If you had wanted the summary on a “cryptography key”, for example, then you’d have to enter it as the following:

print(wikipedia.summary("Key (cryptography)"))   

With the more specific query we now get the correct summary in the output.
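In a script, it is usually more convenient to catch these errors than to let the traceback end the program. The following is a minimal sketch, assuming the exception classes live in wikipedia.exceptions (as the traceback above shows for DisambiguationError) and that the DisambiguationError instance carries the candidate titles in an options attribute. safe_summary is a hypothetical helper, and the module is passed in as the wiki argument.

```python
def safe_summary(wiki, title, sentences=2):
    """Return (status, payload) instead of letting lookup errors propagate.

    safe_summary is a hypothetical helper; `wiki` is the imported
    wikipedia module (or any object exposing the same attributes).
    """
    try:
        return "ok", wiki.summary(title, sentences=sentences)
    except wiki.exceptions.DisambiguationError as err:
        # err.options lists the candidate titles, e.g. "Key (cryptography)".
        return "ambiguous", list(err.options)
    except wiki.exceptions.PageError:
        # Raised when no article matches the title.
        return "missing", None


# Usage with the real library (needs network access):
#   import wikipedia
#   status, payload = safe_summary(wikipedia, "key")
```

When the status is "ambiguous", the payload can be shown to the user so that they can pick a more specific title.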

Retrieving Full Wikipedia Page Data

In order to get the contents, categories, coordinates, images, links and other metadata of a Wikipedia page, we must first get the Wikipedia page object for the page. To do this, the page() method is used with the page title passed as an argument to the method.

Look at the following example:

wikipedia.page("Ubuntu")   

This method call will return a WikipediaPage object, which we’ll explore more in the next few sections.

Extracting Metadata of a Page

To get the complete plain text content of a Wikipedia page (excluding images, tables, etc.), we can use the content attribute of the page object.

print(wikipedia.page("Python").content)   

Output:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and  functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles.   ... 

Similarly, we can get the URL of the page using the url attribute:

print(wikipedia.page("Python").url)   

Output:

https://en.wikipedia.org/wiki/Python_(programming_language)   

We can get the URLs of external links on a Wikipedia page by using the references property of the WikipediaPage object.

print(wikipedia.page("Python").references)   

Output:

[u'http://www.computerworld.com.au/index.php/id;66665771', u'http://neopythonic.blogspot.be/2009/04/tail-recursion-elimination.html', u'http://www.amk.ca/python/writing/gvr-interview', u'http://cdsweb.cern.ch/journal/CERNBulletin/2006/31/News%20Articles/974627?ln=en', u'http://www.2ality.com/2013/02/javascript-influences.html', ...] 

The title property of the WikipediaPage object can be used to extract the title of the page.

print(wikipedia.page("Python").title)   

Output:

Python (programming language)   

Similarly, the categories attribute can be used to get the list of categories of a Wikipedia page:

print(wikipedia.page("Python").categories)   

Output

['All articles containing potentially dated statements', 'Articles containing potentially dated statements from August 2016', 'Articles containing potentially dated statements from December 2018', 'Articles containing potentially dated statements from March 2018', 'Articles with Curlie links', 'Articles with short description', 'Class-based programming languages', 'Computational notebook', 'Computer science in the Netherlands', 'Cross-platform free software', 'Cross-platform software', 'Dutch inventions', 'Dynamically typed programming languages', 'Educational programming languages', 'Good articles', 'High-level programming languages', 'Information technology in the Netherlands', 'Object-oriented programming languages', 'Programming languages', 'Programming languages created in 1991', 'Python (programming language)', 'Scripting languages', 'Text-oriented programming languages', 'Use dmy dates from August 2015', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with SUDOC identifiers'] 

The links element of the WikipediaPage object can be used to get the list of titles of the pages whose links are present in the page.

print(wikipedia.page("Ubuntu").links)   

Output

[u'/e/ (operating system)', u'32-bit', u'4MLinux', u'ALT Linux', u'AMD64', u'AOL', u'APT (Debian)', u'ARM64', u'ARM architecture', u'ARM v7', ...] 
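The examples in this section call wikipedia.page() once for every attribute. In a real script, we would typically fetch the page object once and read all the attributes we need from it. The sketch below shows the idea; page_overview is a hypothetical helper that only reads the attributes described above.

```python
def page_overview(page, max_links=5):
    """Collect a few attributes of a WikipediaPage-like object into a dict.

    page_overview is a hypothetical helper; it only reads the title, url,
    categories, references and links attributes.
    """
    return {
        "title": page.title,
        "url": page.url,
        "category_count": len(page.categories),
        "reference_count": len(page.references),
        "first_links": list(page.links[:max_links]),
    }


# Usage with the real library (needs network access):
#   import wikipedia
#   ubuntu = wikipedia.page("Ubuntu")   # fetch the page object once
#   print(page_overview(ubuntu))
```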

Finding Pages Based on Coordinates

The geosearch() method is used to do a Wikipedia geo search using latitude and longitude arguments supplied as float or decimal numbers to the method.

print(wikipedia.geosearch(37.787, -122.4))   

Output:

['140 New Montgomery', 'New Montgomery Street', 'Cartoon Art Museum', 'San Francisco Bay Area Planning and Urban Research Association', 'Academy of Art University', 'The Montgomery (San Francisco)', 'California Historical Society', 'Palace Hotel Residential Tower', 'St. Regis Museum Tower', 'Museum of the African Diaspora'] 

As you see, the above method returns articles based on the coordinates provided.

A WikipediaPage object also exposes a coordinates property. For articles that have geographic coordinates, it returns the latitude and longitude of the page as a tuple. For example:

print(wikipedia.page("140 New Montgomery").coordinates)

Language Settings

You can customize the language of a Wikipedia page to your native language, provided the page exists in your native language. To do so, you can use the set_lang() method. Each language has a standard prefix code which is passed as an argument to the method. For example, let’s get the first 2 sentences of the summary text of “Ubuntu” wiki page in the German language.

wikipedia.set_lang("de")
print(wikipedia.summary("ubuntu", sentences=2))

Output

Ubuntu (auch Ubuntu Linux) ist eine Linux-Distribution, die auf Debian basiert. Der Name Ubuntu bedeutet auf Zulu etwa „Menschlichkeit“ und bezeichnet eine afrikanische Philosophie.   

You can check the list of currently supported languages along with their prefixes, as follows:

print(wikipedia.languages())   

Retrieving Images in a Wikipedia Page

The images list of the WikipediaPage object can be used to fetch images from a Wikipedia page. For instance, the following script returns the first image from Wikipedia’s Ubuntu page:

print(wikipedia.page("ubuntu").images[0])   

Output

https://upload.wikimedia.org/wikipedia/commons/1/1d/Bildschirmfoto_zu_ubuntu_704.png   

The above code returns the URL of the image present at index 0 in the Wikipedia page.

To see the image, you can copy and paste the above URL into your browser.
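The images list can include icons and logos in several file formats alongside photographs. If we only want certain formats, we can filter the URLs by file extension. This is a small sketch that works on any list of URL strings; filter_images and its default extensions are our own choices, not part of the library.

```python
from urllib.parse import urlparse


def filter_images(urls, extensions=(".png", ".jpg", ".jpeg")):
    """Keep only the URLs whose path ends with one of the given extensions."""
    wanted = tuple(ext.lower() for ext in extensions)
    # urlparse(...).path drops any query string before the suffix check.
    return [u for u in urls if urlparse(u).path.lower().endswith(wanted)]


# Usage with the real library (needs network access):
#   import wikipedia
#   print(filter_images(wikipedia.page("ubuntu").images))
```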

Retrieving Full HTML Page Content

To get the full Wikipedia page in HTML format, you can use the following script:

print(wikipedia.page("Ubuntu").html())   

Output

<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">For the African philosophy, see <a href="/wiki/Ubuntu_philosophy" title="Ubuntu philosophy">Ubuntu philosophy</a>. For other uses, see <a href="/wiki/Ubuntu_(disambiguation)" class="mw-disambig" title="Ubuntu (disambiguation)">Ubuntu (disambiguation)</a>.</div>   <div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Linux distribution based on Debian</div>   ... 

As seen in the output, the entire page in HTML format is displayed. This can take a bit longer to load if the page size is large, so keep in mind that it can raise an HTTPTimeoutError when a request to the server times out.

Conclusion

In this tutorial, we had a glimpse of using the Wikipedia API for extracting data from the web. We saw how to get a variety of information such as a page’s title, category, links, images, and retrieve articles based on geo-locations.

Planet Python

Stack Abuse: Python for NLP: Getting Started with the StanfordCoreNLP Library

This is the ninth article in my series of articles on Python for NLP. In the previous article, we saw how Python’s Pattern library can be used to perform a variety of NLP tasks ranging from tokenization to POS tagging, and text classification to sentiment analysis. Before that we explored the TextBlob library for performing similar natural language processing tasks.

In this article, we will explore the StanfordCoreNLP library, which is another extremely handy library for natural language processing. We will see different features of StanfordCoreNLP with the help of examples. So before wasting any further time, let’s get started.

Setting up the Environment

The installation process for StanfordCoreNLP is not as straightforward as for other Python libraries. As a matter of fact, StanfordCoreNLP is a library that’s actually written in Java. Therefore, make sure you have Java installed on your system. You can download the latest version of Java freely.

Once you have Java installed, you need to download the JAR files for the StanfordCoreNLP libraries. The JAR file contains models that are used to perform different NLP tasks. To download the JAR files for the English models, download and unzip the folder located at the official StanfordCoreNLP website.

The next thing you have to do is run the server that will serve the requests sent by the Python wrapper to the StanfordCoreNLP library. Navigate to the folder where you unzipped the JAR files and execute the following command on the command prompt:

$   java -mx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -timeout 10000 

The above command starts the StanfordCoreNLP server. The parameter -mx6g specifies that the memory used by the server should not exceed 6 gigabytes. It is important to mention that you should be running a 64-bit system in order to have a heap as big as 6GB. If you are running a 32-bit system, you might have to reduce the memory size dedicated to the server.

Once you run the above command, you should see the following output:

[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - setting default constituency parser
[main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
[main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
[main] INFO CoreNLP - to use shift reduce parser download English models jar from:
[main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
[main] INFO CoreNLP -     Threads: 8
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000

The server is running at port 9000.

Now the final step is to install the Python wrapper for the StanfordCoreNLP library. The wrapper we will be using is pycorenlp. The following script downloads the wrapper library:

$   pip install pycorenlp 

Now we are all set to connect to the StanfordCoreNLP server and perform the desired NLP tasks.

To connect to the server, we have to pass the address of the StanfordCoreNLP server that we initialized earlier to the StanfordCoreNLP class of the pycorenlp module. The object returned can then be used to perform NLP tasks. Look at the following script:

from pycorenlp import StanfordCoreNLP

nlp_wrapper = StanfordCoreNLP('http://localhost:9000')

Performing NLP Tasks

In this section, we will briefly explore the use of StanfordCoreNLP library for performing common NLP tasks.

Lemmatization, POS Tagging and Named Entity Recognition

Lemmatization, parts of speech tagging, and named entity recognition are the most basic NLP tasks. The StanfordCoreNLP library supports pipeline functionality that can be used to perform these tasks in a structured way.

In the following script, we will create an annotator which first splits a document into sentences and then further splits the sentences into words or tokens. The words are then annotated with the POS and named entity recognition tags.

doc = "Ronaldo has moved from Real Madrid to Juventus. While messi still plays for Barcelona"

annot_doc = nlp_wrapper.annotate(doc,
    properties={
        'annotators': 'ner, pos',
        'outputFormat': 'json',
        'timeout': 1000,
    })

In the script above we have a document with two sentences. We use the annotate method of the StanfordCoreNLP wrapper object that we initialized earlier. The method takes the text and a properties dictionary, which here contains three entries. The annotators property takes the type of annotation we want to perform on the text. We pass 'ner, pos' as its value, which specifies that we want to annotate our document for POS tags and named entities.

The outputFormat variable defines the format in which you want the annotated text. The possible values are json for JSON objects, xml for XML format, text for plain text, and serialize for serialized data.

The final parameter is the timeout in milliseconds which defines the time that the wrapper should wait for the response from the server before timing out.
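Because the same three properties recur in every call in this article, it can help to build the dictionary in one place. The following is a minimal sketch; build_properties is a hypothetical helper and not part of pycorenlp, and its default values simply mirror the ones used above.

```python
def build_properties(annotators, output_format="json", timeout_ms=1000):
    """Build the properties dict that is passed to annotate().

    build_properties is a hypothetical helper, not part of pycorenlp.
    """
    if not annotators:
        raise ValueError("at least one annotator is required")
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    return {
        # The server expects a single comma-separated string.
        "annotators": ", ".join(annotators),
        "outputFormat": output_format,
        "timeout": timeout_ms,
    }


print(build_properties(["ner", "pos"]))
# {'annotators': 'ner, pos', 'outputFormat': 'json', 'timeout': 1000}
```

The returned dictionary can be passed as the properties argument of nlp_wrapper.annotate().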

In the output, you should see a JSON object as follows:

 
{'sentences': [{'index': 0, 'entitymentions': [{'docTokenBegin': 0, 'docTokenEnd': 1, 'tokenBegin': 0, 'tokenEnd': 1, 'text': 'Ronaldo', 'characterOffsetBegin': 0, 'characterOffsetEnd': 7, 'ner': 'PERSON'}, {'docTokenBegin': 4, 'docTokenEnd': 6, 'tokenBegin': 4, 'tokenEnd': 6, 'text': 'Real Madrid', 'characterOffsetBegin': 23, 'characterOffsetEnd': 34, 'ner': 'ORGANIZATION'}, {'docTokenBegin': 7, 'docTokenEnd': 8, 'tokenBegin': 7, 'tokenEnd': 8, 'text': 'Juventus', 'characterOffsetBegin': 38, 'characterOffsetEnd': 46, 'ner': 'ORGANIZATION'}], 'tokens': [{'index': 1, 'word': 'Ronaldo', 'originalText': 'Ronaldo', 'lemma': 'Ronaldo', 'characterOffsetBegin': 0, 'characterOffsetEnd': 7, 'pos': 'NNP', 'ner': 'PERSON', 'before': '', 'after': ' '}, {'index': 2, 'word': 'has', 'originalText': 'has', 'lemma': 'have', 'characterOffsetBegin': 8, 'characterOffsetEnd': 11, 'pos': 'VBZ', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 3, 'word': 'moved', 'originalText': 'moved', 'lemma': 'move', 'characterOffsetBegin': 12, 'characterOffsetEnd': 17, 'pos': 'VBN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 4, 'word': 'from', 'originalText': 'from', 'lemma': 'from', 'characterOffsetBegin': 18, 'characterOffsetEnd': 22, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 5, 'word': 'Real', 'originalText': 'Real', 'lemma': 'real', 'characterOffsetBegin': 23, 'characterOffsetEnd': 27, 'pos': 'JJ', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ' '}, {'index': 6, 'word': 'Madrid', 'originalText': 'Madrid', 'lemma': 'Madrid', 'characterOffsetBegin': 28, 'characterOffsetEnd': 34, 'pos': 'NNP', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ' '}, {'index': 7, 'word': 'to', 'originalText': 'to', 'lemma': 'to', 'characterOffsetBegin': 35, 'characterOffsetEnd': 37, 'pos': 'TO', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 8, 'word': 'Juventus', 'originalText': 'Juventus', 'lemma': 'Juventus', 'characterOffsetBegin': 38, 'characterOffsetEnd': 46, 'pos': 'NNP', 
'ner': 'ORGANIZATION', 'before': ' ', 'after': ''}, {'index': 9, 'word': '.', 'originalText': '.', 'lemma': '.', 'characterOffsetBegin': 46, 'characterOffsetEnd': 47, 'pos': '.', 'ner': 'O', 'before': '', 'after': ' '}]}, {'index': 1, 'entitymentions': [{'docTokenBegin': 14, 'docTokenEnd': 15, 'tokenBegin': 5, 'tokenEnd': 6, 'text': 'Barcelona', 'characterOffsetBegin': 76, 'characterOffsetEnd': 85, 'ner': 'ORGANIZATION'}], 'tokens': [{'index': 1, 'word': 'While', 'originalText': 'While', 'lemma': 'while', 'characterOffsetBegin': 48, 'characterOffsetEnd': 53, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 2, 'word': 'messi', 'originalText': 'messi', 'lemma': 'messus', 'characterOffsetBegin': 54, 'characterOffsetEnd': 59, 'pos': 'NNS', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 3, 'word': 'still', 'originalText': 'still', 'lemma': 'still', 'characterOffsetBegin': 60, 'characterOffsetEnd': 65, 'pos': 'RB', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 4, 'word': 'plays', 'originalText': 'plays', 'lemma': 'play', 'characterOffsetBegin': 66, 'characterOffsetEnd': 71, 'pos': 'VBZ', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 5, 'word': 'for', 'originalText': 'for', 'lemma': 'for', 'characterOffsetBegin': 72, 'characterOffsetEnd': 75, 'pos': 'IN', 'ner': 'O', 'before': ' ', 'after': ' '}, {'index': 6, 'word': 'Barcelona', 'originalText': 'Barcelona', 'lemma': 'Barcelona', 'characterOffsetBegin': 76, 'characterOffsetEnd': 85, 'pos': 'NNP', 'ner': 'ORGANIZATION', 'before': ' ', 'after': ''}]}]}

If you look at the above output carefully, you can find the POS tags, named entities and lemmatized version of each word.

Lemmatization

Let’s now explore the annotated results. We’ll first print the lemmatizations for the words in the two sentences in our dataset:

for sentence in annot_doc["sentences"]:
    for word in sentence["tokens"]:
        print(word["word"] + "=>" + word["lemma"])

In the script above, the outer loop iterates through each sentence in the document and the inner loop iterates through each word in the sentence. Inside the inner loop, the word and its corresponding lemmatized form are printed on the console. The output looks like this:

Ronaldo=>Ronaldo
has=>have
moved=>move
from=>from
Real=>real
Madrid=>Madrid
to=>to
Juventus=>Juventus
.=>.
While=>while
messi=>messus
still=>still
plays=>play
for=>for
Barcelona=>Barcelona

For example, you can see that the word moved has been lemmatized to move, and the word plays has been lemmatized to play.

POS Tagging

In the same way, we can find the POS tags for each word. Look at the following script:

for sentence in annot_doc["sentences"]:
    for word in sentence["tokens"]:
        print(word["word"] + "=>" + word["pos"])

In the output, you should see the following results:

Ronaldo=>NNP
has=>VBZ
moved=>VBN
from=>IN
Real=>JJ
Madrid=>NNP
to=>TO
Juventus=>NNP
.=>.
While=>IN
messi=>NNS
still=>RB
plays=>VBZ
for=>IN
Barcelona=>NNP

The tag set used for POS tags is the Penn Treebank tag set.
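Since every token dictionary in the JSON output carries its word, lemma, pos and ner values, we can also collect several annotations in a single pass instead of writing one loop per field. The sketch below uses a small hand-written sample in the same shape as the server output; token_rows is our own helper name.

```python
def token_rows(annot_doc, fields=("word", "lemma", "pos", "ner")):
    """Yield one tuple per token with the requested annotation fields."""
    for sentence in annot_doc["sentences"]:
        for token in sentence["tokens"]:
            yield tuple(token[field] for field in fields)


# A tiny sample in the same shape as the JSON returned by the server.
sample = {"sentences": [{"tokens": [
    {"word": "moved", "lemma": "move", "pos": "VBN", "ner": "O"},
    {"word": "plays", "lemma": "play", "pos": "VBZ", "ner": "O"},
]}]}

for row in token_rows(sample):
    print("=>".join(row))
# moved=>move=>VBN=>O
# plays=>play=>VBZ=>O
```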

Named Entity Recognition

To find named entities in our document, we can use the following script:

for sentence in annot_doc["sentences"]:
    for word in sentence["tokens"]:
        print(word["word"] + "=>" + word["ner"])

The output looks like this:

Ronaldo=>PERSON
has=>O
moved=>O
from=>O
Real=>ORGANIZATION
Madrid=>ORGANIZATION
to=>O
Juventus=>ORGANIZATION
.=>O
While=>O
messi=>O
still=>O
plays=>O
for=>O
Barcelona=>ORGANIZATION

We can see that Ronaldo has been identified as a PERSON while Barcelona has been identified as an ORGANIZATION, which in this case is correct. Note that the lowercase messi was not recognized as a named entity.
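The token loop reports multi-word names such as Real Madrid as two separate tokens. The JSON output shown earlier also contains an entitymentions list for each sentence, in which such names appear as a single mention. A short sketch of reading that list, using a sample trimmed down from the output above (entity_mentions is our own helper name):

```python
def entity_mentions(annot_doc):
    """Return (text, ner_label) pairs from each sentence's entitymentions list."""
    return [
        (mention["text"], mention["ner"])
        for sentence in annot_doc["sentences"]
        # .get() guards against sentences that carry no mentions key.
        for mention in sentence.get("entitymentions", [])
    ]


# Trimmed-down sample in the same shape as the server output.
sample = {"sentences": [
    {"entitymentions": [
        {"text": "Ronaldo", "ner": "PERSON"},
        {"text": "Real Madrid", "ner": "ORGANIZATION"},
        {"text": "Juventus", "ner": "ORGANIZATION"},
    ]},
    {"entitymentions": [
        {"text": "Barcelona", "ner": "ORGANIZATION"},
    ]},
]}

for text, label in entity_mentions(sample):
    print(text + "=>" + label)
# Ronaldo=>PERSON
# Real Madrid=>ORGANIZATION
# Juventus=>ORGANIZATION
# Barcelona=>ORGANIZATION
```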

Sentiment Analysis

To find the sentiment of a sentence, all you have to do is pass sentiment as the value for the annotators property. Look at the following script:

doc = "I like this chocolate. This chocolate is not good. The chocolate is delicious. Its a very tasty chocolate. This is so bad"

annot_doc = nlp_wrapper.annotate(doc,
    properties={
        'annotators': 'sentiment',
        'outputFormat': 'json',
        'timeout': 1000,
    })

To find the sentiment, we can iterate over each sentence and then use the sentimentValue property. The sentimentValue property holds a value from 0 to 4, where 0 corresponds to very negative sentiment and 4 corresponds to very positive sentiment. The sentiment property can be used to get the sentiment in verbal form, e.g. Positive, Negative or Neutral.

The following script finds the sentiment for each sentence in the document we defined above.

for sentence in annot_doc["sentences"]:
    print(" ".join([word["word"] for word in sentence["tokens"]]) + " => "
          + str(sentence["sentimentValue"]) + " = " + sentence["sentiment"])

Output:

I like this chocolate . => 2 = Neutral
This chocolate is not good . => 1 = Negative
The chocolate is delicious . => 3 = Positive
Its a very tasty chocolate . => 3 = Positive
This is so bad => 1 = Negative
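For a longer document, a per-sentence listing quickly becomes hard to read, so we may want a summary instead. The sketch below counts the sentiment labels and averages the numeric values. It uses a hand-written sample that mirrors the output above, and applies int() in case the server delivers sentimentValue as a string. sentiment_summary is our own helper name.

```python
from collections import Counter


def sentiment_summary(annot_doc):
    """Return (label_counts, mean_value) over all sentences in the document."""
    sentences = annot_doc["sentences"]
    labels = Counter(s["sentiment"] for s in sentences)
    # int() makes the arithmetic work for both "2" and 2.
    values = [int(s["sentimentValue"]) for s in sentences]
    mean = sum(values) / len(values) if values else None
    return labels, mean


# Sample mirroring the five sentences in the output above.
sample = {"sentences": [
    {"sentimentValue": "2", "sentiment": "Neutral"},
    {"sentimentValue": "1", "sentiment": "Negative"},
    {"sentimentValue": "3", "sentiment": "Positive"},
    {"sentimentValue": "3", "sentiment": "Positive"},
    {"sentimentValue": "1", "sentiment": "Negative"},
]}

counts, mean = sentiment_summary(sample)
print(dict(counts), mean)
# {'Neutral': 1, 'Negative': 2, 'Positive': 2} 2.0
```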

Conclusion

StanfordCoreNLP is another extremely handy library for natural language processing. In this article, we studied how to set up the environment to run StanfordCoreNLP. We then explored the use of the StanfordCoreNLP library for common NLP tasks such as lemmatization, POS tagging and named entity recognition, and finally we rounded off the article with sentiment analysis using StanfordCoreNLP.


Mike Driscoll: Getting Started with JupyterLab

JupyterLab is the latest package from Project Jupyter. In some ways, it is kind of a replacement for Jupyter Notebook. However, the Jupyter Notebook is a separate project from JupyterLab. I like to think of JupyterLab as a kind of web-based Integrated Development Environment that you can use to work with Jupyter Notebooks as well as terminals, text editors and code consoles. You might say JupyterLab is a more powerful version of Jupyter Notebook.

Anyway, here are a few of the things that JupyterLab is capable of:

  • Code Consoles – These are coding scratchpads that you can use for running code interactively, kind of like Python’s IDLE
  • Kernel-backed documents – These allow you to enable code in any text file (Markdown, Python, R, etc) that can then be run in the Jupyter kernel
  • Mirrored Notebook cell outputs – These let you create simple dashboards
  • Multiple views of the same document – Gives you the ability to live edit documents and see the results in real-time

JupyterLab will allow you to view and handle multiple types of data. You can also display rich output from these formats using various visualizations or Markdown.

For navigation, you can use customizable keyboard shortcuts or key maps from vim, emacs and even SublimeText.

You can add new behavior to your JupyterLab instance via extensions. This includes theming support, file editors and more.


Installation

You can use conda, pip or pipenv to install JupyterLab.

conda

If you are an Anaconda user, then you can use conda for installation purposes by using the following command:

conda install -c conda-forge jupyterlab

pip

If you prefer using Python’s native installer, pip, then this is the command you want:

pip install jupyterlab

Note: If you are using pip install --user, then you will need to add the user-level “bin” directory to your PATH environment variable to be able to launch jupyterlab.

pipenv

The pipenv tool is a new package that can be used to create a Python virtual environment and download a package into it. If you happen to have it installed, then you can use the following two commands to get JupyterLab:

pipenv install jupyterlab
pipenv shell

Note that calling the shell command is required if you want to launch JupyterLab from within the virtualenv that you installed it into.


Running JupyterLab

Now that we have JupyterLab installed, we should try running it. You can use either jupyter-lab or jupyter lab to run it. When I ran either of these commands, I got the following initial web application:

Initial Landing Page for a New JupyterLab

The tab on the right is called the Launcher. This is one place you can go to start a new Notebook, Code Console, Terminal or Text File. New documents are opened as new tabs. You will note that when you create a new Notebook or other item that the Launcher disappears. If you would like to open a second document, just click the “+” button on the left, which I have circled below:

Adding a new item in JupyterLab

Let’s open a Notebook and then click the plus button. If you do that, your screen should look something like this:

Multiple Tabs in JupyterLab

You can also create new items by using the **Menu** that runs along the top of the screen. Just go to **File** –> **New** and then choose the type of item you would like to create. Most of the menu items should be familiar to you if you have used Jupyter Notebook. There are some new entries here that are specific to JupyterLab however. For example:

  • New Launcher – Launches a new launcher
  • Open From Path – Open a document from a path other than the one you started in
  • Save Notebook As… – Lets you save the currently selected Notebook with a new filename
  • Export Notebook As… – Lets you export your Notebook to a different format, such as PDF, Markdown, etc

Explore the menu and see what else you can find. It’s pretty self-explanatory.


The File Browser

The tree on the left is known as the File Browser. It shows you the files that are available to you from the location that you launched JupyterLab from. Just click the folder icon to make the tree collapse so that the tab can fill the browser:

File Browser Minimized

You will note that you can create new folders in the File Browser as well by clicking the folder+ icon (circled below):

Creating a New Folder

If you need to add a file to JupyterLab from another location on your computer, then you will want to click on the Upload button:

Uploading / Saving a File to the Workspace

When you do, it will pop open a File Open dialog:

Upload Dialog

Just use this as you would if you were opening a file in another program. Just remember that instead of opening a file, you are “uploading” or “copying” it to your JupyterLab workspace.

Finally there is a refresh button that you can use to refresh the workspace if you happened to copy a file into the workspace by using a method other than the Upload button:

Refresh File Browser Button


Special URLs

As with Jupyter Notebook, JupyterLab allows users to copy URLs into the browser to open a specific Notebook or file. However, JupyterLab has also added the ability to manage workspaces and file navigation via URLs.

For example, if you want to use file navigation, you can use the special keyword tree to do so. Here is an example URL using an untitled Notebook:

http://localhost:8888/lab/tree/Untitled.ipynb

If you were to try this out, you would see the normal Jupyter Notebook interface instead of seeing the Notebook inside of JupyterLab.

Workspaces

The default workspace doesn’t have a name, but it can be found at /lab. If you would like to clone your workspace, you can use the following format:

http://localhost:8888/lab/workspaces/test?clone

This will copy your current workspace into a workspace named test. If you want to copy the test workspace into your default workspace, the URL would look like this:

http://localhost:8888/lab?clone=test

You can also reset a workspace using the reset URL parameter. When you reset a workspace, you are clearing it of its contents. Here is an example of resetting the default workspace:

http://localhost:8888/lab/workspaces/lab?reset
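
The URL patterns above are regular enough to generate. Here is a minimal sketch that rebuilds the three example URLs; the workspace_url helper and its parameters are hypothetical names made up for illustration and are not part of JupyterLab:

```python
from urllib.parse import urlencode, urlunsplit


def workspace_url(name=None, clone=None, reset=False, host="localhost:8888"):
    """Build a JupyterLab workspace URL (hypothetical helper, not a JupyterLab API).

    name  -- workspace name, or None for the default workspace at /lab
    clone -- True to clone the current workspace into `name`,
             or a workspace name to copy that workspace into the target
    reset -- True to clear the target workspace
    """
    path = "/lab" if name is None else f"/lab/workspaces/{name}"
    if reset:
        query = "reset"
    elif clone is True:
        query = "clone"
    elif clone:
        query = urlencode({"clone": clone})
    else:
        query = ""
    return urlunsplit(("http", host, path, query, ""))


# The three URLs shown in this section
print(workspace_url("test", clone=True))
print(workspace_url(clone="test"))
print(workspace_url("lab", reset=True))
```

The host defaults to localhost:8888 only because that is the address used in the examples above; adjust it to wherever your server is running.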

The Cell Inspector

Let’s create a Notebook inside of our JupyterLab instance. Go to the Launcher and choose a kernel. You will have Python 2 or Python 3 by default. Once you have created it, you should see a new tab named “Untitled.ipynb” like this:

An Empty Notebook in JupyterLab

As you can see, we have a Notebook with a single cell. Let’s add the following code to the cell:

def adder(a, b):
    return a + b

adder(2, 3)

Now let’s click the little wrench that is in the toolbar on the left. Here’s a screenshot with the button circled:

The Cell Inspector

When you click that wrench, your screen should look like the above. This is called the Cell Inspector. This is where you can set up your Notebook for presentation purposes. You can set which cells are slides or sub-slides. Pretty much anything that we talked about in Chapter 9 relating to the Notebook itself can also be done here. You will also note that the Cell Inspector will display any metadata that JupyterLab / Notebook is adding to the cells.

If you would like to see that in action, then try setting your first cell to a Slide. Now you should see the Metadata field populated like this:

The Cell’s Metadata


Using Files

You can use JupyterLab’s File Browser and File menu to work with files and directories on your system. This allows you to open, create, delete, rename, download / upload, copy and share files and directories. You can find the File Browser in the left sidebar:

The File Browser

If you have files in your File Browser, you can open one by double-clicking it, just as you would in your system’s file browser. You may also drag a file from the File Browser into the work area (on the right), which will cause it to open.

Many of the file types that JupyterLab supports also have multiple viewers and editors. You can open a Markdown file in an editor or view it as HTML, for example. If you want to open the file in a non-default viewer/editor, just right-click the file and choose “Open With…” from the context menu:

Context Menu for Files

Note that you can open a single file into multiple viewers/editors and they will remain in sync.


The Text Editor

JupyterLab comes with a built-in text editor that you can use to create or open text files. Open up the Launcher and instead of creating a Notebook, go to the bottom of the Launcher and create a text file.

Launch the Text Editor

It will create an untitled.txt file by default. But you can go to the File menu and use “Save As…” to save it as something else. This allows you to create Python files, Markdown and pretty much anything else you would like to. It even provides syntax highlighting for some file types, although code completion is not supported. You may also create files via the File menu.

The text editor also supports configurable indentation (tabs vs. spaces), key maps and basic theming. Just go to the Settings menu to view or edit them:

File Settings

If you would like to edit an existing text file, all you need to do is double-click it in the File Browser.


Interactive Code Consoles

One of the newer features to JupyterLab is the Code Console, which is basically a REPL in your browser. It will let you run code interactively in the currently selected kernel. The “cells” of a code console show the order in which the code was run. To create a new Code Console, click the “+” button in the File Browser and select the kernel of your choice:

Code Console Launcher

Enter some code. Here’s an example if you are having some trouble thinking of any on your own:

print('Hello Console!')

Now press Shift+Enter to run the code. You should see the following output if everything worked correctly:

Code Console

Code completion works via the Tab key. You can also bring up tooltips by pressing Shift+Tab.

If you need to clear the code console without restarting the kernel, you can right click on the console itself and select “Clear Console Cells”.


Terminals

The JupyterLab project continues its support of system shells in the browser. On Mac / Linux, it supports bash, tcsh, etc., while on Windows it supports PowerShell. These terminals can run anything that you would normally run from your system’s terminal, including other programs like vim or emacs. Do note that the JupyterLab terminals run on the system where JupyterLab is installed, so they will be using your user’s privileges as well.

Anyway, if you would like to see a terminal in action, just start up the Launcher by pressing the “+” button in the File Browser. Then select the Terminal:

Terminal Launcher

If you close the terminal tab, JupyterLab will leave it running in the background. Here is a terminal running:

A Running Terminal

If you would like to re-open your terminal, just go to the Running tab:

Running Apps

Then select the terminal from the list of running applications.


The Command Palette

The user actions in JupyterLab all go through a central command system. This includes the commands used by the menu bar, context menus, keyboard shortcuts and more. You can access the available commands via the Command Palette, which you will find under the Commands tab:

The Command Palette

Here you can search for commands and execute them directly instead of hunting for them in the menu system. You can also bring up the Command Palette with the following keyboard shortcut: Command/Ctrl Shift C


Supported File Types

JupyterLab supports quite a few file types that it can display or allow you to edit. This allows you to lay out rich cell output in a Notebook or Code Console. For files, JupyterLab will detect the data format by looking at the extension of the file, or at the entire filename if the extension does not exist. Note that multiple editors / viewers can be associated with a single file type. For example, you can edit a Markdown file and view it as HTML. Just right-click a file and go to the Open With context menu item to view the editors and viewers that are available to you for that file type:

Using Open With

You can use Python code to display different data formats in your Notebook or Code Console. Here is an example:

from IPython.display import display, HTML

display(HTML('<h1>Hello from JupyterLab</h1>'))

When you run this code in a Notebook, it should look like this:

Running HTML in a Notebook

For a full list of file types that are supported by JupyterLab, I recommend checking out the documentation. It should always be up-to-date and more useful than if I were to list out the items myself.


A Word on Extensions

As you might expect, JupyterLab supports extensions and was designed with extensibility in mind. Extensions can customize the user’s experience or enhance one or more parts of JupyterLab. For example, you could add new items to the menu or command palette or add some new keyboard shortcuts. JupyterLab itself is actually a collection of extensions.

If you want to create an extension for JupyterLab, then you will need to be familiar with JavaScript or be willing to learn. The extensions need to be in the npm packaging format. To install pre-made extensions, you are required to have Node.js installed on your machine. Be sure to check out the Node.js website for proper installation instructions for your operating system.

Installing / Uninstalling Extensions

Once you have Node.js installed, then you can install an extension to JupyterLab by running the following command:

jupyter labextension install the-extension-name

If you require a specific version of the extension, then the command would look like this:

jupyter labextension install the-extension-name@1.2

Here, “1.2” is the version you require. You can also install extensions that are gzipped tarballs, or give a URL to a gzipped tarball.

To get a list of currently installed JupyterLab extensions, just run

jupyter labextension list

In the event that you want to uninstall an extension, you can easily do so like this:

jupyter labextension uninstall the-extension-name

You may also install or uninstall multiple extensions by listing the names of the packages after the install or uninstall command. Since JupyterLab rebuilds after each installation, this can take quite a while. To speed things up a bit when installing or uninstalling multiple extensions, you can include the --no-build flag. Then once the installation or uninstall is complete, you will need to run the build command yourself, like this:

jupyter lab build
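
If you script your environment setup in Python, the batch workflow described above could be sketched as follows. The labextension_commands helper and the extension names are hypothetical; the sketch only assembles the command lines and does not run them, so it works even without Node.js or JupyterLab installed:

```python
def labextension_commands(action, names, build=True):
    """Return the command lines for installing or uninstalling extensions.

    Hypothetical helper: it only assembles argument lists. To actually run
    them, pass each list to subprocess.run(cmd, check=True).
    """
    if action not in ("install", "uninstall"):
        raise ValueError("action must be 'install' or 'uninstall'")
    command = ["jupyter", "labextension", action, *names]
    if build:
        # JupyterLab rebuilds on its own after the command finishes
        return [command]
    # Skip the automatic rebuild, then run a single build at the end
    return [command + ["--no-build"], ["jupyter", "lab", "build"]]


# 'extension-one' and 'extension-two' are placeholder names
for cmd in labextension_commands("install", ["extension-one", "extension-two"], build=False):
    print(" ".join(cmd))
```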

Disabling Extensions

If you don’t want to uninstall an extension but you would like to disable it, that is easy to do too. Just run the disable command:

jupyter labextension disable the-extension-name

Then when you want to re-enable it, you can run the enable command:

jupyter labextension enable the-extension-name

Wrapping Up

The JupyterLab package is really amazing. You can do a lot more with it than you could with just a Jupyter Notebook. However, the user interface is also more complex, so the learning curve will be a bit steeper. Still, I think it is worth learning, as the ability to edit documents and view them live is really helpful when creating a presentation or doing other types of work. At the very least, I would give it a try in a virtual environment to see whether or not it will fit your workflow.


Planet Python

Getting Active Directory User Information

Whether starting recon on a new client’s environment or migrating a long-term client’s data to new servers, it sometimes is necessary to know some basic information on the user accounts listed in Active Directory.

Recently, I was installing new servers for a public school, and part of the project included the clean-up of unused login scripts that existed on the domain controllers. Additionally, we wanted to add new home drives to all users, but we didn’t know which already existed or where they were already pointed without painstakingly combing through AD account by account.

Utilizing PowerShell for User Info

Around here, we like to stand on the shoulders of giants, so building on Lindsey’s recent post on AD user counts, I turned to my old friend PowerShell. Running the following script on my client’s domain controller gave me just what I needed. I was able to immediately see all my users, which ones were enabled and which were disabled, their login script, their home drive information and the last time they logged on:

Get-ADUser -filter * -properties * | ft -AutoSize enabled, null, Name, null, scriptpath, null, homedrive, null, homedirectory, null, lastlogondate > C:\ADusers.txt

 

While it may look a bit daunting at first glance, it is actually quite simple when you break it down:

Get-ADUser -filter * -properties *
  • This part tells PowerShell to grab all user accounts that exist in AD.
| ft -AutoSize
  • Formats output as a table and sets column width automatically based on the amount of information returned.
enabled, null, Name, null, scriptpath, null, homedrive, null, homedirectory, null, lastlogondate
  • Pulls information from each of the listed properties for each account. Null isn’t necessary here, but it does add a handy double bracket between each column. This makes the output easier to read and enables you to use the column as a delimiter if you decide to import the output file into Excel.
> C:\ADusers.txt
  • This redirects the output into a text file at the listed location.

By adding other available fields to the table column list, I can easily tweak the information I get however I want. For example, I can change this to show all locked-out accounts, the last bad password attempt for each account and when the accounts were locked.

Adjusting the PowerShell Window

If you run into issues with columns having their content truncated in those annoying fiena… formats, you can adjust the width of your PowerShell window. Just click the icon in the upper-right corner of your PowerShell window and go to Properties > Layout > Window Size. Adjusting it to a higher number will widen the window and allow the columns to fill out:

adjusting window width in PowerShell

The post Getting Active Directory User Information appeared first on InterWorks.

InterWorks

Stack Abuse: Getting Started with MySQL and Python

Introduction

For any fully functional deployable application, the persistence of data is indispensable. A trivial way of storing data would be to write it to a file in the hard disk, but one would prefer writing the application specific data to a database for obvious reasons. Python provides language support for writing data to a wide range of databases.

Python DB API

At the heart of Python’s support for database programming is the Python DB API (PEP 249), which does not depend on any specific database engine. Depending on the database we use at the persistence layer, an appropriate implementation of the Python DB API should be imported and used in our program. In this tutorial, we will demonstrate how to use Python to connect to a MySQL database and perform transactions with it. For this, we will be using the MySQLdb Python package.

Before we proceed with connecting to the database using Python, we need to install MySQL connector for Python. This can be done in two ways:

  • One way is to download the appropriate installer for the OS and bit version directly from the MySQL official site.
  • Another way is to use pip to install it.
$   pip install mysql-connector-python 

If there is a specific MySQL version installed in the local machine, then you may need a specific MySQL connector version so that no compatibility issues arise, which we can get using the following command:

$   pip install mysql-connector-python==<insert_version_number_here> 

Finally, we need to install the MySQL client module, mysqlclient. It provides the MySQLdb package that the scripts in this tutorial import, and it is what lets our Python application, acting as the client, connect to MySQL databases:

$   pip install mysqlclient 

Connecting to the Database

Once we have the connector installed in place, the import MySQLdb statement should not throw any error on executing the Python file.

Prerequisites

Note: It is assumed that readers have a basic understanding of databases in general and of MySQL in particular, along with knowledge of Structured Query Language (SQL). However, the basic process to create a database and a user is explained in this section. Follow these steps:

  • Ensure that your MySQL server is running. This can be checked via MySQL WorkBench -> Server Status.
  • Open MySQL WorkBench or MySQL CLI. Create a new database. Let us call it pythondb.
CREATE DATABASE pythondb;
USE pythondb;
  • Create a new user pythonuser with password pythonpwd123 and grant access to pythondb
CREATE USER 'pythonuser'@'localhost' IDENTIFIED BY 'pythonpwd123';
GRANT ALL PRIVILEGES ON pythondb.* TO 'pythonuser'@'localhost';
FLUSH PRIVILEGES;

Checking your Connection to pythondb

Here is a simple script that can be used to programmatically test connection to the newly created database:

#!/usr/bin/python
import MySQLdb

# Arguments: hostname, username, password, database name
dbconnect = MySQLdb.connect("localhost", "pythonuser", "pythonpwd123", "pythondb")
cursor = dbconnect.cursor()

cursor.execute("SELECT VERSION()")
data = cursor.fetchone()  # a one-element tuple holding the version string, or None

if data:
    print('Version retrieved:', data[0])
else:
    print('Version not retrieved.')

dbconnect.close()

Output

Version retrieved: 5.7.19   

The version number shown above is only an example. Your output should show the version of the MySQL server you have installed.

Let us take a closer look at the sample program above to learn how it works. First off, import MySQLdb is used to import the required python module.

MySQLdb.connect() method takes hostname, username, password, and database schema name to create a database connection. On successfully connecting with the database, it will return a connection object (which is referred to as dbconnect here).

Using the connection object, we can execute queries, commit transactions and rollback transactions before closing the connection.

Once we get the connection object, we need to get a cursor object from it in order to execute queries using the cursor’s execute method. The result set of a query can be retrieved using the fetchall, fetchone, or fetchmany methods, which will be discussed later in this tutorial.

There are three important methods related to database transactions apart from the execute method. We will learn briefly about these methods now.

The dbconnect.commit() method informs the database that the changes executed before calling this function shall be finalized and there is no scope for rolling back to the previous state if the transaction is successful.

Sometimes, if a transaction fails, we will need to return the database to the state it was in before the failure so that the data is not lost or corrupted. In such a case, we roll back the database to the previous state using dbconnect.rollback().

Finally, the dbconnect.close() method is used to close the connection to database. To perform further transactions, we need to create a new connection.
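
To see commit and rollback working together, here is a minimal sketch that swaps in Python’s built-in sqlite3 module for MySQLdb. That swap is an assumption made so the example runs without a MySQL server; both modules follow the Python DB API, so the connection methods carry the same names. The account table is made up for illustration:

```python
import sqlite3

# sqlite3 stands in for MySQLdb here; both follow the Python DB API (PEP 249)
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

cursor.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
cursor.execute("INSERT INTO account VALUES (1, 100)")
conn.commit()  # the starting balance is now permanent

try:
    cursor.execute("UPDATE account SET balance = balance - 30 WHERE id = 1")
    # This insert fails because id 1 already exists (primary key violation)
    cursor.execute("INSERT INTO account VALUES (1, 999)")
    conn.commit()
except sqlite3.Error:
    # Undo everything since the last commit, including the UPDATE
    conn.rollback()

cursor.execute("SELECT balance FROM account WHERE id = 1")
balance = cursor.fetchone()[0]
print(balance)  # 100: the UPDATE was rolled back along with the failed INSERT
conn.close()
```

With MySQLdb the pattern is the same: commit when every statement in the unit of work succeeded, otherwise roll back, and close the connection when you are done.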

Create a New Table

Once the connection with pythondb is established successfully, we are ready to go to the next step. Let us create a new table in it:

import MySQLdb

dbconnect = MySQLdb.connect("localhost", "pythonuser", "pythonpwd123", "pythondb")
cursor = dbconnect.cursor()

# The table name is lowercase to match the later queries;
# MySQL table names can be case-sensitive depending on the platform.
cursor.execute("DROP TABLE IF EXISTS movie")

query = """CREATE TABLE movie(
    id int(11) NOT NULL,
    name varchar(20),
    year int(11),
    director varchar(20),
    genre varchar(20),
    PRIMARY KEY (id))"""

cursor.execute(query)
dbconnect.close()

After executing the above script, you should be able to see a new table movie created for the schema pythondb. This can be viewed using MySQL WorkBench.

Performing CRUD Operations

Now we’ll perform some insert, read, modify, and delete operations in the newly created database table via the Python script.

Creating a New Record

The following script demonstrates how to insert a new record into MySQL database using a Python script:

#!/usr/bin/python
import MySQLdb

dbconnect = MySQLdb.connect("localhost", "pythonuser", "pythonpwd123", "pythondb")
cursor = dbconnect.cursor()

query = ('insert into movie(id, name, year, director, genre) '
         'values (1, "Bruce Almighty", 2003, "Tom Shaydac", "Comedy")')
try:
    cursor.execute(query)
    dbconnect.commit()    # make the insert permanent
except MySQLdb.Error:
    dbconnect.rollback()  # undo the failed insert
finally:
    dbconnect.close()
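
The script above hard-codes the values into the SQL string. When the values come from variables or user input, the DB API lets you pass them separately as parameters instead. Here is a minimal sketch of that approach, again using the built-in sqlite3 module as a stand-in so it runs without a MySQL server. Note that the placeholder syntax is driver-specific: sqlite3 uses ?, while MySQLdb uses %s.

```python
import sqlite3

# sqlite3 stands in for MySQLdb; the table mirrors the movie table above
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE movie (id INTEGER PRIMARY KEY, name TEXT, year INTEGER, "
    "director TEXT, genre TEXT)"
)

new_movie = (1, "Bruce Almighty", 2003, "Tom Shaydac", "Comedy")
try:
    # The driver substitutes the values safely; no string formatting needed
    cursor.execute(
        "INSERT INTO movie (id, name, year, director, genre) VALUES (?, ?, ?, ?, ?)",
        new_movie,
    )
    conn.commit()
except sqlite3.Error:
    conn.rollback()

cursor.execute("SELECT name, year FROM movie WHERE id = ?", (1,))
inserted = cursor.fetchone()
print(inserted)  # ('Bruce Almighty', 2003)
conn.close()
```

Passing parameters this way avoids quoting mistakes and protects against SQL injection.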

Reading Rows from a Table

Once a new row is inserted in the database, you can fetch the data in three ways using the cursor object:

  • cursor.fetchall() – can be used to get all rows
  • cursor.fetchmany() – can be used to get a selected number of rows
  • cursor.fetchone() – can be used to get a single row: the next one in the result set, which is the first row when called right after a query

For simplicity, we will use the “select all” SQL query and use a for loop over the result set of the fetchall method to print individual records.

#!/usr/bin/python
import MySQLdb

dbconnect = MySQLdb.connect("localhost", "pythonuser", "pythonpwd123", "pythondb")
cursor = dbconnect.cursor()

query = "SELECT * FROM movie"
try:
    cursor.execute(query)
    resultList = cursor.fetchall()
    # Each row is a tuple with the columns in table order
    for row in resultList:
        print("Movie ID =", row[0])
        print("Name =", row[1])
        print("Year =", row[2])
        print("Director =", row[3])
        print("Genre =", row[4])
except MySQLdb.Error:
    print("Encountered error while retrieving data from database")
finally:
    dbconnect.close()

Output:

Movie ID = 1
Name = Bruce Almighty
Year = 2003
Director = Tom Shaydac
Genre = Comedy
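
The three fetch methods listed earlier differ in how much of the result set they consume. Here is a small sketch of that behavior, using the built-in sqlite3 module as a stand-in for MySQLdb (an assumption so that it runs without a server) and four made-up sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, name TEXT)")
# Hypothetical sample rows, only for illustration
cursor.executemany(
    "INSERT INTO movie VALUES (?, ?)",
    [(1, "Movie A"), (2, "Movie B"), (3, "Movie C"), (4, "Movie D")],
)
conn.commit()

cursor.execute("SELECT id, name FROM movie ORDER BY id")
first = cursor.fetchone()       # the next row: (1, 'Movie A')
next_two = cursor.fetchmany(2)  # the following two rows, as a list
rest = cursor.fetchall()        # whatever is left: [(4, 'Movie D')]
after_end = cursor.fetchone()   # None once the result set is exhausted

print(first, next_two, rest, after_end)
conn.close()
```

Each call picks up where the previous one stopped, so mixing the methods on one cursor walks through the result set exactly once.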

Updating a Row

Let us now update the Genre of “Bruce Almighty” from Comedy to Satire:

import MySQLdb

dbconnect = MySQLdb.connect("localhost", "pythonuser", "pythonpwd123", "pythondb")

# The cursor object obtained below allows SQL queries to be executed in the database session.
cursor = dbconnect.cursor()

updatequery = "update movie set genre = 'Satire' where id = 1"
cursor.execute(updatequery)
dbconnect.commit()

print(cursor.rowcount, "record(s) affected")
dbconnect.close()

Output:

1 record(s) affected   

Deleting a Record

Here is a Python script that demonstrates how to delete a database row:

import MySQLdb

dbconnect = MySQLdb.connect("localhost", "pythonuser", "pythonpwd123", "pythondb")

# The cursor object obtained below allows SQL queries to be executed in the database session.
cursor = dbconnect.cursor()

deletequery = "DELETE FROM movie WHERE id = 1"
cursor.execute(deletequery)
dbconnect.commit()

print(cursor.rowcount, "record(s) deleted")
dbconnect.close()

After executing the above script, you should be able to see the following output if everything goes well.

Output

1 record(s) deleted   

Conclusion

In this article, we learned how to use the Python DB API to connect to a database. Specifically, we saw how a connection can be established to a MySQL database using the MySQLdb implementation of Python DB API. We also learned how to perform transactions with the database.

Planet Python