Getting Started with Kubernetes: A kubectl Cheat Sheet

Introduction

Kubectl is a command-line tool designed to manage Kubernetes objects and clusters. It provides a command-line interface for performing common operations like creating and scaling Deployments, switching contexts, and accessing a shell in a running container.

How to Use This Guide:

  • This guide is in cheat sheet format with self-contained command-line snippets.
  • It is not an exhaustive list of kubectl commands, but contains many common operations and use cases. For a more thorough reference, consult the Kubectl Reference Docs.
  • Jump to any section that is relevant to the task you are trying to complete.

Prerequisites

Sample Deployment

To demonstrate some of the operations and commands in this cheat sheet, we’ll use a sample Deployment that runs 2 replicas of Nginx:

nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

Copy and paste this manifest into a file called nginx-deployment.yaml.

Installing kubectl

Note: These commands have only been tested on an Ubuntu 18.04 machine. To learn how to install kubectl on other operating systems, consult Install and Set Up kubectl from the Kubernetes docs.

First, update your local package index and install required dependencies:

  • sudo apt-get update && sudo apt-get install -y apt-transport-https

Then add the Google Cloud GPG key to APT and make the kubectl package available to your system:

  • curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
  • echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
  • sudo apt-get update

Finally, install kubectl:

  • sudo apt-get install -y kubectl

Test that the installation succeeded using version:

  • kubectl version

Setting Up Shell Autocompletion

Note: These commands have only been tested on an Ubuntu 18.04 machine. To learn how to set up autocompletion on other operating systems, consult Install and Set Up kubectl from the Kubernetes docs.

kubectl includes a shell autocompletion script that you can make available to your system’s existing shell autocompletion software.

Installing kubectl Autocompletion

First, check if you have bash-completion installed:

  • type _init_completion

You should see some script output.

Next, source the kubectl autocompletion script in your ~/.bashrc file:

  • echo 'source <(kubectl completion bash)' >>~/.bashrc
  • . ~/.bashrc

Alternatively, you can add the completion script to the /etc/bash_completion.d directory:

  • kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null

Usage

To use the autocompletion feature, press the TAB key to display available kubectl commands:

  • kubectl TAB TAB
Output
annotate apply autoscale completion cordon delete drain explain kustomize options port-forward rollout set uncordon api-resources attach certificate config cp describe . . .

You can also display available commands after partially typing a command:

  • kubectl d TAB
Output
delete describe diff drain

Connecting, Configuring and Using Contexts

Connecting

To test that kubectl can authenticate with and access your Kubernetes cluster, use cluster-info:

  • kubectl cluster-info

If kubectl can successfully authenticate with your cluster, you should see the following output:

Output
Kubernetes master is running at https://kubernetes_master_endpoint
CoreDNS is running at https://coredns_endpoint

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

kubectl is configured using kubeconfig configuration files. By default, kubectl will look for a file called config in the $HOME/.kube directory. To change this, you can set the $KUBECONFIG environment variable to a custom kubeconfig file, or pass in the custom file at execution time using the --kubeconfig flag:

  • kubectl cluster-info --kubeconfig=path_to_your_kubeconfig_file
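
The lookup order described above can be sketched in a few lines of Python. This is a simplified model for illustration only, not kubectl's actual implementation: the function name resolve_kubeconfig and its parameters are hypothetical, and it only decides which paths would be consulted. It assumes the flag takes priority over the environment variable, which in turn takes priority over the default file.

```python
import os
from pathlib import Path


def resolve_kubeconfig(flag_path=None, environ=None):
    """Return the kubeconfig paths to consult, in a simplified model.

    Precedence modeled here: the --kubeconfig flag, then the KUBECONFIG
    environment variable, then the default ~/.kube/config file.
    """
    environ = os.environ if environ is None else environ
    if flag_path:
        return [Path(flag_path)]
    env_value = environ.get("KUBECONFIG", "")
    if env_value:
        # KUBECONFIG may list several files, separated by the OS path separator.
        return [Path(p) for p in env_value.split(os.pathsep) if p]
    return [Path.home() / ".kube" / "config"]


# With no flag and an empty environment, the default location is used.
print(resolve_kubeconfig(environ={}))
```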

Note: If you’re using a managed Kubernetes cluster, your cloud provider should have made its kubeconfig file available to you.

If you don’t want to use the --kubeconfig flag with every command, and there is no existing ~/.kube/config file, create a directory called ~/.kube in your home directory if it doesn’t already exist, and copy in the kubeconfig file, renaming it to config:

  • mkdir -p ~/.kube
  • cp your_kubeconfig_file ~/.kube/config

Now, run cluster-info once again to test your connection.

Modifying your kubectl Configuration

You can also modify your config using the kubectl config set of commands.

To view your kubectl configuration, use the view subcommand:

  • kubectl config view
Output
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
. . .

Modifying Clusters

To fetch a list of clusters defined in your kubeconfig, use get-clusters:

  • kubectl config get-clusters
Output
NAME
do-nyc1-sammy

To add a cluster to your config, use the set-cluster subcommand:

  • kubectl config set-cluster new_cluster --server=server_address --certificate-authority=path_to_certificate_authority

To delete a cluster from your config, use delete-cluster:

Note: This only deletes the cluster from your config and does not delete the actual Kubernetes cluster.

  • kubectl config delete-cluster cluster_name

Modifying Users

You can perform similar operations for users using set-credentials:

  • kubectl config set-credentials username --client-certificate=/path/to/cert/file --client-key=/path/to/key/file

To delete a user from your config, you can run unset:

  • kubectl config unset users.username

Contexts

A context in Kubernetes is an object that contains a set of access parameters for your cluster. It consists of a cluster, namespace, and user triple. Contexts allow you to quickly switch between different sets of cluster configuration.

To see your current context, you can use current-context:

  • kubectl config current-context
Output
do-nyc1-sammy

To see a list of all configured contexts, run get-contexts:

  • kubectl config get-contexts
Output
CURRENT   NAME            CLUSTER         AUTHINFO              NAMESPACE
*         do-nyc1-sammy   do-nyc1-sammy   do-nyc1-sammy-admin

To set a context, use set-context:

  • kubectl config set-context context_name --cluster=cluster_name --user=user_name --namespace=namespace

You can switch between contexts with use-context:

  • kubectl config use-context context_name
Output
Switched to context "do-nyc1-sammy"

And you can delete a context with delete-context:

  • kubectl config delete-context context_name

Using Namespaces

A Namespace in Kubernetes is an abstraction that allows you to subdivide your cluster into multiple virtual clusters. By using Namespaces you can divide cluster resources among multiple teams and scope objects appropriately. For example, you can have a prod Namespace for production workloads, and a dev Namespace for development and test workloads.

To fetch and print a list of all the Namespaces in your cluster, use get namespace:

  • kubectl get namespace
Output
NAME              STATUS   AGE
default           Active   2d21h
kube-node-lease   Active   2d21h
kube-public       Active   2d21h
kube-system       Active   2d21h

To set a Namespace for your current context, use set-context --current:

  • kubectl config set-context --current --namespace=namespace_name

To create a Namespace, use create namespace:

  • kubectl create namespace namespace_name
Output
namespace/sammy created

Similarly, to delete a Namespace, use delete namespace:

Warning: Deleting a Namespace will delete everything in the Namespace, including running Deployments, Pods, and other workloads. Only run this command if you’re sure you’d like to kill whatever’s running in the Namespace or if you’re deleting an empty Namespace.

  • kubectl delete namespace namespace_name

To fetch all Pods in a given Namespace or to perform other operations on resources in a given Namespace, make sure to include the --namespace flag:

  • kubectl get pods --namespace=namespace_name

Managing Kubernetes Resources

General Syntax

The general syntax for most kubectl management commands is:

  • kubectl command type name flags

Where

  • command is an operation you’d like to perform, like create
  • type is the Kubernetes resource type, like deployment
  • name is the resource’s name, like app-frontend
  • flags are any optional flags you’d like to include

For example, the following command retrieves information about a Deployment named app-frontend:

  • kubectl get deployment app-frontend

Declarative Management and kubectl apply

The recommended approach to managing workloads on Kubernetes is to rely on the cluster’s declarative design as much as possible. This means that instead of running a series of commands to create, update, delete, and restart running Pods, you should define the workloads, services, and systems you’d like to run in YAML manifest files, and provide these files to Kubernetes, which will handle the rest.

In practice, this means using the kubectl apply command, which applies a particular configuration to a given resource. If the target resource doesn’t exist, then Kubernetes will create the resource. If the resource already exists, then Kubernetes will save the current revision and update the resource according to the new configuration. This declarative approach contrasts with the imperative approach of running commands like kubectl create, kubectl edit, and kubectl scale to manage resources. To learn more about the different ways of managing Kubernetes resources, consult Kubernetes Object Management from the Kubernetes docs.

Rolling out a Deployment

For example, to deploy the sample Nginx Deployment to your cluster, use apply and provide the path to the nginx-deployment.yaml manifest file:

  • kubectl apply -f nginx-deployment.yaml
Output
deployment.apps/nginx-deployment created

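The -f flag specifies a filename, directory, or URL containing a valid configuration. To apply every manifest in a directory, pass the directory path to -f:

  • kubectl apply -f manifests_dir

If the directory contains a kustomization.yaml file, use the -k flag instead to build and apply it with Kustomize:

  • kubectl apply -k kustomization_dir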

You can track the rollout status using rollout status:

  • kubectl rollout status deployment/nginx-deployment
Output
Waiting for deployment "nginx-deployment" rollout to finish: 1 of 2 updated replicas are available...
deployment "nginx-deployment" successfully rolled out

An alternative to rollout status is the kubectl get command, along with the -w (watch) flag:

  • kubectl get deployment -w
Output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/2     2            0           3s
nginx-deployment   1/2     2            1           3s
nginx-deployment   2/2     2            2           3s

Using rollout pause and rollout resume, you can pause and resume the rollout of a Deployment:

  • kubectl rollout pause deployment/nginx-deployment
Output
deployment.extensions/nginx-deployment paused
  • kubectl rollout resume deployment/nginx-deployment
Output
deployment.extensions/nginx-deployment resumed

Modifying a Running Deployment

If you’d like to modify a running Deployment, you can make changes to its manifest file and then run kubectl apply again to apply the update. For example, we’ll modify the nginx-deployment.yaml file to change the number of replicas from 2 to 3:

nginx-deployment.yaml
. . .
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
. . .

The kubectl diff command allows you to see a diff between currently running resources, and the changes proposed in the supplied configuration file:

  • kubectl diff -f nginx-deployment.yaml

Now allow Kubernetes to perform the update using apply:

  • kubectl apply -f nginx-deployment.yaml

Running another get deployment should confirm the addition of a third replica.

If you run apply again without modifying the manifest file, Kubernetes will detect that no changes were made and won’t perform any action.

Using rollout history you can see a list of the Deployment’s previous revisions:

  • kubectl rollout history deployment/nginx-deployment
Output
deployment.extensions/nginx-deployment
REVISION  CHANGE-CAUSE
1         <none>

With rollout undo, you can revert a Deployment to any of its previous revisions:

  • kubectl rollout undo deployment/nginx-deployment --to-revision=1

Deleting a Deployment

To delete a running Deployment, use kubectl delete:

  • kubectl delete -f nginx-deployment.yaml
Output
deployment.apps "nginx-deployment" deleted

Imperative Management

You can also use a set of imperative commands to directly manipulate and manage Kubernetes resources.

Creating a Deployment

Use create to create an object from a file, URL, or STDIN. Note that unlike apply, if an object with the same name already exists, the operation will fail. The --dry-run=client flag allows you to preview the result of the operation without actually performing it:

  • kubectl create -f nginx-deployment.yaml --dry-run=client
Output
deployment.apps/nginx-deployment created (dry-run)

We can now create the object:

  • kubectl create -f nginx-deployment.yaml
Output
deployment.apps/nginx-deployment created

Modifying a Running Deployment

Use scale to scale the number of replicas for the Deployment from 2 to 4:

  • kubectl scale --replicas=4 deployment/nginx-deployment
Output
deployment.extensions/nginx-deployment scaled

You can edit any object in-place using kubectl edit. This will open up the object’s manifest in your default editor:

  • kubectl edit deployment/nginx-deployment

You should see the following manifest file in your editor:

nginx-deployment
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: extensions/v1beta1
kind: Deployment
. . .
spec:
  progressDeadlineSeconds: 600
  replicas: 4
  revisionHistoryLimit: 10
  selector:
    matchLabels:
. . .

Change the replicas value from 4 to 2, then save and close the file.

Now run a get to inspect the changes:

  • kubectl get deployment/nginx-deployment
Output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   2/2     2            2           6m40s

We’ve successfully scaled the Deployment back down to 2 replicas on-the-fly. You can update most of a Kubernetes object’s fields in a similar manner.

Another useful command for modifying objects in-place is kubectl patch. Using patch, you can update an object’s fields on-the-fly without having to open up your editor. patch also allows for more complex updates with various merging and patching strategies. To learn more about these, consult Update API Objects in Place Using kubectl patch.

The following command will patch the nginx-deployment object to update the replicas field from 2 to 4; deploy is shorthand for the deployment object.

  • kubectl patch deploy nginx-deployment -p '{"spec": {"replicas": 4}}'
Output
deployment.extensions/nginx-deployment patched
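
To see why only the replicas field changes, it can help to picture the patch being merged into the existing object. The Python sketch below is a simplified, JSON-merge-style illustration. It is not kubectl's implementation (kubectl patch defaults to a strategic merge patch, which treats lists differently), and the merge_patch function is a hypothetical helper written for this example.

```python
import copy


def merge_patch(target, patch):
    """Return a new dict with a JSON-merge-style patch applied (simplified)."""
    result = copy.deepcopy(target)
    for key, value in patch.items():
        if value is None:
            # A null value in the patch removes the key from the target.
            result.pop(key, None)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            # Nested objects are merged field by field.
            result[key] = merge_patch(result[key], value)
        else:
            # Everything else replaces the existing value.
            result[key] = copy.deepcopy(value)
    return result


deployment = {"spec": {"replicas": 2, "selector": {"matchLabels": {"app": "nginx"}}}}
patched = merge_patch(deployment, {"spec": {"replicas": 4}})
print(patched["spec"]["replicas"])  # 4
```

Fields that the patch does not mention, such as the selector, are carried over unchanged.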

We can now inspect the changes:

  • kubectl get deployment/nginx-deployment
Output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   4/4     4            4           18m

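You can also create a Deployment imperatively using the create deployment command, which creates a Deployment from an image provided as a parameter. (Since kubectl 1.18, the run command creates a single Pod rather than a Deployment.)

  • kubectl create deployment nginx-deployment --image=nginx --port=80 --replicas=2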

The expose command lets you quickly expose a running Deployment with a Kubernetes Service, allowing connections from outside your Kubernetes cluster:

  • kubectl expose deploy nginx-deployment --type=LoadBalancer --port=80 --name=nginx-svc
Output
service/nginx-svc exposed

Here we’ve exposed the nginx-deployment Deployment as a LoadBalancer Service, opening up port 80 to external traffic and directing it to container port 80. We name the service nginx-svc. Using the LoadBalancer Service type, a cloud load balancer is automatically provisioned and configured by Kubernetes. To get the Service’s external IP address, use get:

  • kubectl get svc nginx-svc
Output
NAME        TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
nginx-svc   LoadBalancer   10.245.26.242   203.0.113.0   80:30153/TCP   22m

You can access the running Nginx containers by navigating to EXTERNAL-IP in your web browser.

Inspecting Workloads and Debugging

There are several commands you can use to get more information about workloads running in your cluster.

Inspecting Kubernetes Resources

kubectl get fetches a given Kubernetes resource and displays some basic information associated with it:

  • kubectl get deployment -o wide
Output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES   SELECTOR
nginx-deployment   4/4     4            4           29m   nginx        nginx    app=nginx

Since we did not provide a Deployment name or Namespace, kubectl fetches all Deployments in the current Namespace. The -o wide flag adds columns like CONTAINERS, IMAGES, and SELECTOR.

In addition to get, you can use describe to fetch a detailed description of the resource and associated resources:

  • kubectl describe deploy nginx-deployment
Output
Name:                   nginx-deployment
Namespace:              default
CreationTimestamp:      Wed, 11 Sep 2019 12:53:42 -0400
Labels:                 run=nginx-deployment
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               run=nginx-deployment
. . .

The set of information presented will vary depending on the resource type. You can also use this command without specifying a resource name, in which case information will be provided for all resources of that type in the current Namespace.

explain allows you to quickly pull configurable fields for a given resource type:

  • kubectl explain deployment.spec

By appending additional fields you can dive deeper into the field hierarchy:

  • kubectl explain deployment.spec.template.spec

Gaining Shell Access to a Container

To gain shell access into a running container, use exec. First, find the Pod that contains the running container you’d like access to:

  • kubectl get pod
Output
nginx-deployment-8859878f8-7gfw9   1/1   Running   0   109m
nginx-deployment-8859878f8-z7f9q   1/1   Running   0   109m

Let’s exec into the first Pod. Since this Pod has only one container, we don’t need to use the -c flag to specify which container we’d like to exec into.

  • kubectl exec -i -t nginx-deployment-8859878f8-7gfw9 -- /bin/bash
Output
root@nginx-deployment-8859878f8-7gfw9:/#

You now have shell access to the Nginx container. The -i flag passes STDIN to the container, and -t gives you an interactive TTY. The -- double-dash acts as a separator for the kubectl command and the command you’d like to run inside the container. In this case, we are running /bin/bash.

To run commands inside the container without opening a full shell, omit the -i and -t flags, and substitute the command you’d like to run instead of /bin/bash:

  • kubectl exec nginx-deployment-8859878f8-7gfw9 -- ls
Output
bin boot dev etc home lib lib64 media . . .

Fetching Logs

Another useful command is logs, which prints logs for Pods and containers, including terminated containers.

To stream logs to your terminal output, you can use the -f flag:

  • kubectl logs -f nginx-deployment-8859878f8-7gfw9
Output
10.244.2.1 - - [12/Sep/2019:17:21:33 +0000] "GET / HTTP/1.1" 200 612 "-" "203.0.113.0" "-"
2019/09/16 17:21:34 [error] 6#6: *1 open() "/usr/share/nginx/html/favicon.ico" failed (2: No such file or directory), client: 10.244.2.1, server: localhost, request: "GET /favicon.ico HTTP/1.1", host: "203.0.113.0", referrer: "http://203.0.113.0"
. . .

This command will keep running in your terminal until interrupted with a CTRL+C. You can omit the -f flag if you’d like to print log output and exit immediately.

You can also use the -p flag to fetch logs for a terminated container. When this option is used within a Pod that had a prior running container instance, logs will print output from the terminated container:

  • kubectl logs -p nginx-deployment-8859878f8-7gfw9

The -c flag allows you to specify the container you’d like to fetch logs from, if the Pod has multiple containers. You can use the --all-containers=true flag to fetch logs from all containers in the Pod.

Port Forwarding and Proxying

To gain network access to a Pod, you can use port-forward:

  • sudo kubectl port-forward pod/nginx-deployment-8859878f8-7gfw9 80:80
Output
Forwarding from 127.0.0.1:80 -> 80
Forwarding from [::1]:80 -> 80

In this case we use sudo because local port 80 is a protected port. For most other ports you can omit sudo and run the kubectl command as your system user.

Here we forward local port 80 (preceding the colon) to the Pod’s container port 80 (after the colon).

You can also use deploy/nginx-deployment as the resource type and name to forward to. If you do this, the local port will be forwarded to the Pod selected by the Deployment.

The proxy command can be used to access the Kubernetes API server locally:

  • kubectl proxy --port=8080
Output
Starting to serve on 127.0.0.1:8080

In another shell, use curl to explore the API:

  • curl http://localhost:8080/api/
Output
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "203.0.113.0:443"
    }
  ]
}

Close the proxy by pressing CTRL+C.

Conclusion

This guide covers some of the more common kubectl commands you may use when managing a Kubernetes cluster and workloads you’ve deployed to it.

You can learn more about kubectl by consulting the official Kubernetes reference documentation.

There are many more commands and variations that you may find useful as part of your work with kubectl. To learn more about all of your available options, you can run:

  • kubectl --help

DigitalOcean Community Tutorials

Real Python: Getting Started With Async Features in Python

Have you heard of asynchronous programming in Python? Are you curious to know more about Python async features and how you can use them in your work? Perhaps you’ve even tried to write threaded programs and run into some issues. If you’re looking to understand how to use Python async features, then you’ve come to the right place.

In this article, you’ll learn:

  • What a synchronous program is
  • What an asynchronous program is
  • Why you might want to write an asynchronous program
  • How to use Python async features

All of the example code in this article has been tested with Python 3.7.2.

Understanding Asynchronous Programming

A synchronous program is executed one step at a time. Even with conditional branching, loops and function calls, you can still think about the code in terms of taking one execution step at a time. When each step is complete, the program moves on to the next one.

Here are two examples of programs that work this way:

  • Batch processing programs are often created as synchronous programs. You get some input, process it, and create some output. Steps follow one after the other until the program reaches the desired output. The program only needs to pay attention to the steps and their order.

  • Command-line programs are small, quick processes that run in a terminal. These scripts are used to create something, transform one thing into something else, generate a report, or perhaps list out some data. This can be expressed as a series of program steps that are executed sequentially until the program is done.

An asynchronous program behaves differently. It still takes one execution step at a time. The difference is that the system may not wait for an execution step to be completed before moving on to the next one.

This means that the program will move on to future execution steps even though a previous step hasn’t yet finished and is still running elsewhere. This also means that the program knows what to do when a previous step does finish running.
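
To make this concrete, here is a minimal sketch using the asyncio module from the standard library. The names slow_step and events are made up for this illustration, and asyncio.sleep() stands in for any step that finishes elsewhere. The program starts the slow step, moves on, and then reacts once the step completes.

```python
import asyncio

events = []


async def slow_step():
    events.append("slow step started")
    await asyncio.sleep(0.1)          # stands in for work happening elsewhere
    events.append("slow step finished")


async def main():
    task = asyncio.create_task(slow_step())   # start the step without waiting for it
    await asyncio.sleep(0)                    # give the step a chance to begin
    events.append("program moved on")
    await task                                # handle the step once it finishes


asyncio.run(main())
print(events)
# ['slow step started', 'program moved on', 'slow step finished']
```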

Why would you want to write a program in this manner? The rest of this article will help you answer that question and give you the tools you need to elegantly solve interesting asynchronous problems.

Building a Synchronous Web Server

A web server’s basic unit of work is, more or less, the same as batch processing. The server will get some input, process it, and create the output. Written as a synchronous program, this would create a working web server.

It would also be an absolutely terrible web server.

Why? In this case, one unit of work (input, process, output) is not the only purpose. The real purpose is to handle hundreds or even thousands of units of work as quickly as possible. This can happen over long periods of time, and several work units may even arrive all at once.

Can a synchronous web server be made better? Sure, you could optimize the execution steps so that all the work coming in is handled as quickly as possible. Unfortunately, there are limitations to this approach. The result could be a web server that doesn’t respond fast enough, can’t handle enough work, or even one that times out when work gets stacked up.

Note: There are other limitations you might see if you tried to optimize the above approach. These include network speed, file IO speed, database query speed, and the speed of other connected services, to name a few. What these all have in common is that they are all IO functions. All of these items are orders of magnitude slower than the CPU’s processing speed.

In a synchronous program, if an execution step starts a database query, then the CPU is essentially idle until the database query is returned. For batch-oriented programs, this isn’t a priority most of the time. Processing the results of that IO operation is the goal. Often, this can take longer than the IO operation itself. Any optimization efforts would be focused on the processing work, not the IO.

Asynchronous programming techniques allow your programs to take advantage of relatively slow IO processes by freeing the CPU to do other work.
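
The difference can be measured with a small sketch. Here time.sleep() and asyncio.sleep() are stand-ins for a slow database query, and the function names are assumptions made for this example rather than part of any real database API. Three blocking queries take roughly the sum of their delays, while three non-blocking queries take roughly as long as the slowest one.

```python
import asyncio
import time


def blocking_query(delay):
    time.sleep(delay)               # the CPU sits idle while "waiting on the database"
    return delay


async def non_blocking_query(delay):
    await asyncio.sleep(delay)      # control is handed back while waiting
    return delay


def run_sync(delays):
    start = time.perf_counter()
    results = [blocking_query(d) for d in delays]
    return results, time.perf_counter() - start


async def run_async(delays):
    start = time.perf_counter()
    results = await asyncio.gather(*(non_blocking_query(d) for d in delays))
    return list(results), time.perf_counter() - start


delays = [0.1, 0.1, 0.1]
sync_results, sync_elapsed = run_sync(delays)
async_results, async_elapsed = asyncio.run(run_async(delays))
print(f"synchronous:  {sync_elapsed:.2f}s")
print(f"asynchronous: {async_elapsed:.2f}s")
```

Exact timings will vary from machine to machine, but the asynchronous run should finish in about a third of the time.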

Thinking Differently About Programming

When you start trying to understand asynchronous programming, you might see a lot of discussion about the importance of blocking, or writing non-blocking code. (Personally, I struggled to get a good grasp of these concepts from the people I asked and the documentation I read.)

What is non-blocking code? What’s blocking code, for that matter? Would the answers to these questions help you write a better web server? If so, how could you do it? Let’s find out!

Writing asynchronous programs requires that you think differently about programming. While this new way of thinking can be hard to wrap your head around, it’s also an interesting exercise. That’s because the real world is almost entirely asynchronous, and so is how you interact with it.

Imagine this: you’re a parent trying to do several things at once. You have to balance the checkbook, do the laundry, and keep an eye on the kids. Somehow, you’re able to do all of these things at the same time without even thinking about it! Let’s break it down:

  • Balancing the checkbook is a synchronous task. One step follows another until it’s done. You’re doing all the work yourself.

  • However, you can break away from the checkbook to do laundry. You unload the dryer, move clothes from the washer to the dryer, and start another load in the washer.

  • Working with the washer and dryer is a synchronous task, but the bulk of the work happens after the washer and dryer are started. Once you’ve got them going, you can walk away and get back to the checkbook task. At this point, the washer and dryer tasks have become asynchronous. The washer and dryer will run independently until the buzzer goes off (notifying you that the task needs attention).

  • Watching your kids is another asynchronous task. Once they are set up and playing, they can do so independently for the most part. This changes when someone needs attention, like when someone gets hungry or hurt. When one of your kids yells in alarm, you react. The kids are a long-running task with high priority. Watching them supersedes any other tasks you might be doing, like the checkbook or laundry.

These examples can help to illustrate the concepts of blocking and non-blocking code. Let’s think about this in programming terms. In this example, you’re like the CPU. While you’re moving the laundry around, you (the CPU) are busy and blocked from doing other work, like balancing the checkbook. But that’s okay because the task is relatively quick.

On the other hand, starting the washer and dryer does not block you from performing other tasks. It’s an asynchronous function because you don’t have to wait for it to finish. Once it’s started, you can go back to something else. This is called a context switch: the context of what you’re doing has changed, and the machine’s buzzer will notify you sometime in the future when the laundry task is complete.
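
Here is one way the washer and its buzzer might look in code, as a rough sketch built on asyncio. The names run_washer, buzzer, and parent are invented for this illustration. A done callback plays the role of the buzzer, and the parent keeps working on the checkbook while the washer runs.

```python
import asyncio

log = []


async def run_washer():
    await asyncio.sleep(0.05)       # the machine runs on its own
    return "washer"


def buzzer(task):
    # Called by the event loop when the washer task completes.
    log.append(f"buzzer: {task.result()} needs attention")


async def parent():
    washer = asyncio.create_task(run_washer())   # start it and walk away
    washer.add_done_callback(buzzer)
    for entry in range(3):
        log.append(f"balanced checkbook entry {entry}")
        await asyncio.sleep(0)                   # brief pause between entries
    await washer                                 # nothing left to do but wait


asyncio.run(parent())
print(log)
```

The checkbook entries are logged first, and the buzzer message arrives last, once the washer finishes.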

As a human, this is how you work all the time. You naturally juggle multiple things at once, often without thinking about it. As a developer, the trick is how to translate this kind of behavior into code that does the same kind of thing.

Programming Parents: Not as Easy as It Looks!

If you recognize yourself (or your parents) in the example above, then that’s great! You’ve got a leg up in understanding asynchronous programming. Again, you’re able to switch contexts between competing tasks fairly easily, picking up some tasks and resuming others. Now you’re going to try and program this behavior into virtual parents!

Thought Experiment #1: The Synchronous Parent

How would you create a parent program to do the above tasks in a completely synchronous manner? Since watching the kids is a high-priority task, perhaps your program would do just that. The parent watches over the kids while waiting for something to happen that might need their attention. However, nothing else (like the checkbook or laundry) would get done in this scenario.

Now, you can re-prioritize the tasks any way you want, but only one of them would happen at any given time. This is the result of a synchronous, step-by-step approach. Like the synchronous web server described above, this would work, but it might not be the best way to live. The parent wouldn’t be able to complete any other tasks until the kids fell asleep. All other tasks would happen afterward, well into the night. (A couple of weeks of this and many real parents might jump out the window!)

Thought Experiment #2: The Polling Parent

If you used polling, then you could change things up so that multiple tasks are completed. In this approach, the parent would periodically break away from the current task and check to see if any other tasks need attention.

Let’s make the polling interval something like fifteen minutes. Now, every fifteen minutes your parent checks to see if the washer, dryer or kids need any attention. If not, then the parent can go back to work on the checkbook. However, if any of those tasks do need attention, then the parent will take care of it before going back to the checkbook. This cycle repeats each time the polling interval elapses.

This approach works as well since multiple tasks are getting attention. However, there are a couple of problems:

  1. The parent may spend a lot of time checking on things that don’t need attention: The washer and dryer haven’t yet finished, and the kids don’t need any attention unless something unexpected happens.

  2. The parent may miss completed tasks that do need attention: For instance, if the washer finished its cycle at the beginning of the polling interval, then it wouldn’t get any attention for up to fifteen minutes! What’s more, watching the kids is supposedly the highest priority task. They couldn’t tolerate fifteen minutes with no attention when something might be going drastically wrong.

You could address these issues by shortening the polling interval, but now your parent (the CPU) would be spending more time context switching between tasks. This is when you start to hit a point of diminishing returns. (Once again, a couple of weeks living like this and, well… See the previous comment about windows and jumping.)
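
The trade-off can be put into numbers with a short sketch. The two helper functions below are hypothetical and work in whole minutes, assuming polls happen at minute 0 and at every multiple of the interval after that. A shorter interval reduces how long a finished task waits, but increases how many checks the parent performs.

```python
def polling_delay(finished_at, interval):
    """Minutes a task waits between finishing and the next poll."""
    next_poll = -(-finished_at // interval) * interval   # round up to a poll time
    return next_poll - finished_at


def polls_in(duration, interval):
    """Number of checks the parent performs over the given duration."""
    return duration // interval


# The washer finishes one minute after the poll at minute 0.
print(polling_delay(finished_at=1, interval=15))   # 14
print(polling_delay(finished_at=1, interval=5))    # 4

# Checks performed over two hours at each interval.
print(polls_in(duration=120, interval=15))         # 8
print(polls_in(duration=120, interval=5))          # 24
```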

Thought Experiment #3: The Threading Parent

“If I could only clone myself…” If you’re a parent, then you’ve probably had similar thoughts! Since you’re programming virtual parents, you can essentially do this by using threading. This is a mechanism that allows multiple sections of one program to run at the same time. Each section of code that runs independently is known as a thread, and all threads share the same memory space.

If you think of each task as a part of one program, then you can separate them and run them as threads. In other words, you can “clone” the parent, creating one instance for each task: watching the kids, monitoring the washer, monitoring the dryer, and balancing the checkbook. All of these “clones” are running independently.

This sounds like a pretty nice solution, but there are some issues here as well. One is that you’ll have to explicitly tell each parent instance what to do in your program. This can lead to some problems since all instances share everything in the program space.

For example, say that Parent A is monitoring the dryer. Parent A sees that the clothes are dry, so they take control of the dryer and begin unloading the clothes. At the same time, Parent B sees that the washer is done, so they take control of the washer and begin removing clothes. However, Parent B also needs to take control of the dryer so they can put the wet clothes inside. This can’t happen, because Parent A currently has control of the dryer.

After a short while, Parent A has finished unloading clothes. Now they want to take control of the washer and start moving clothes into the empty dryer. This can’t happen, either, because Parent B currently has control of the washer!

These two parents are now deadlocked. Both have control of their own resource and want control of the other resource. They’ll wait forever for the other parent instance to release control. As the programmer, you’d have to write code to work this situation out.
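Here is a hedged sketch of that washer and dryer standoff using the standard threading module. So that the example finishes instead of hanging forever, each parent gives up on the second machine after a short timeout. The names parent, run_demo, and in_step, as well as the timeout value, are assumptions chosen for this illustration.

```python
import threading

washer = threading.Lock()
dryer = threading.Lock()
# Reusable barrier that keeps the two parents in step for the demo
in_step = threading.Barrier(2)
got_second = {}


def parent(name, first, second):
    with first:  # take control of this parent's own machine
        in_step.wait()  # both parents now hold one machine each
        # Try for the other machine, giving up after a short timeout
        acquired = second.acquire(timeout=0.2)
        got_second[name] = acquired
        in_step.wait()  # neither parent lets go until both have tried
        if acquired:
            second.release()


def run_demo():
    got_second.clear()
    parent_a = threading.Thread(target=parent, args=("A", dryer, washer))
    parent_b = threading.Thread(target=parent, args=("B", washer, dryer))
    parent_a.start()
    parent_b.start()
    parent_a.join()
    parent_b.join()
    return dict(got_second)


print(run_demo())
```

Neither parent ever gets the second machine. Without the timeout, both acquire calls would block forever, which is the deadlock described above. A common way out is to have every thread take the locks in the same order.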

Note: Threaded programs allow you to create multiple, parallel paths of execution that all share the same memory space. This is both an advantage and a disadvantage. Any memory shared between threads is subject to one or more threads trying to use the same shared memory at the same time. This can lead to data corruption, data read in an invalid state, and data that’s just messy in general.

In threaded programming, the context switch happens under the system’s control, not the programmer’s. The system decides when to switch contexts and when to give threads access to shared data, thereby changing the context of how the memory is being used. All of these kinds of problems are manageable in threaded code, but it’s difficult to get right, and hard to debug when it’s wrong.

Here’s another issue that might arise from threading. Suppose that a child gets hurt and needs to be taken to urgent care. Parent C has been assigned the task of watching over the kids, so they take the child right away. At the urgent care, Parent C needs to write a fairly large check to cover the cost of seeing the doctor.

Meanwhile, Parent D is at home working on the checkbook. They’re unaware of this large check being written, so they’re very surprised when the family checking account is suddenly overdrawn!

Remember, these two parent instances are working within the same program. The family checking account is a shared resource, so you’d have to work out a way for the child-watching parent to inform the checkbook-balancing parent. Alternatively, you’d need to provide some kind of locking mechanism so that only one parent at a time can read and update the checkbook resource.
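One possible shape for such a locking mechanism is sketched below with threading.Lock. The CheckingAccount class, the amounts, and the helper names are hypothetical and exist only for this illustration.

```python
import threading


class CheckingAccount:
    """Hypothetical shared checkbook guarded by a lock."""

    def __init__(self, balance):
        self.balance = balance
        self._lock = threading.Lock()

    def write_check(self, amount):
        """Deduct amount if funds allow; return True if the check clears."""
        with self._lock:  # only one parent may look at and update the balance
            if amount > self.balance:
                return False
            self.balance -= amount
            return True


def write_checks(account, amount, count, cleared, name):
    # Record how many of this parent's checks cleared
    cleared[name] = sum(account.write_check(amount) for _ in range(count))


def run_demo():
    account = CheckingAccount(600)
    cleared = {}
    parents = [
        threading.Thread(target=write_checks, args=(account, 10, 50, cleared, name))
        for name in ("C", "D")
    ]
    for p in parents:
        p.start()
    for p in parents:
        p.join()
    return account.balance, sum(cleared.values())


balance, cleared_count = run_demo()
print(f"Final balance: {balance}, checks cleared: {cleared_count}")
```

Because the balance check and the deduction happen while the lock is held, the account can never be overdrawn, no matter how the system interleaves the two threads.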

Using Python Async Features in Practice

Now you’re going to take some of the approaches outlined in the thought experiments above and turn them into functioning Python programs.

All of the examples in this article have been tested with Python 3.7.2. The requirements.txt file indicates which modules you’ll need to install to run all the examples. If you haven’t yet downloaded the file, you can do so now:

Download Code: Click here to download the code you’ll use to learn about async features in Python in this tutorial.

You also might want to set up a Python virtual environment to run the code so you don’t interfere with your system Python.

Synchronous Programming

This first example shows a somewhat contrived way of having a task retrieve work from a queue and process that work. A queue in Python is a nice FIFO (first in first out) data structure. It provides methods to put things in a queue and take them out again in the order they were inserted.

In this case, the work is to get a number from the queue and have a loop count up to that number. It prints to the console when the loop begins, and again to output the total. This program demonstrates one way for multiple synchronous tasks to process the work in a queue.
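Before looking at the full program, the first-in, first-out behavior of queue.Queue can be seen in isolation. This short sketch uses the same work values as the example; the retrieved list is added only for illustration.

```python
import queue

work_queue = queue.Queue()

# Items go in at the back of the queue...
for work in [15, 10, 5, 2]:
    work_queue.put(work)

# ...and come out from the front, in the order they were inserted
retrieved = []
while not work_queue.empty():
    retrieved.append(work_queue.get())

print(retrieved)  # [15, 10, 5, 2]
```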

The program named example_1.py in the repository is listed in full below:

 1 import queue
 2
 3 def task(name, work_queue):
 4     if work_queue.empty():
 5         print(f"Task {name} nothing to do")
 6     else:
 7         while not work_queue.empty():
 8             count = work_queue.get()
 9             total = 0
10             print(f"Task {name} running")
11             for x in range(count):
12                 total += 1
13             print(f"Task {name} total: {total}")
14
15 def main():
16     """
17     This is the main entry point for the program.
18     """
19     # Create the queue of 'work'
20     work_queue = queue.Queue()
21
22     # Put some 'work' in the queue
23     for work in [15, 10, 5, 2]:
24         work_queue.put(work)
25
26     # Create some synchronous tasks
27     tasks = [
28         (task, "One", work_queue),
29         (task, "Two", work_queue)
30     ]
31
32     # Run the tasks
33     for t, n, q in tasks:
34         t(n, q)
35
36 if __name__ == "__main__":
37     main()

Let’s take a look at what each line does:

  • Line 1 imports the queue module. This is where the program stores work to be done by the tasks.
  • Lines 3 to 13 define task(). This function pulls work out of work_queue and processes the work until there isn’t any more to do.
  • Line 15 defines main() to run the program tasks.
  • Line 20 creates the work_queue. All tasks use this shared resource to retrieve work.
  • Lines 23 to 24 put work in work_queue. In this case, it’s just a random count of values for the tasks to process.
  • Lines 27 to 30 create a list of task tuples, with the parameter values those tasks will be passed.
  • Lines 33 to 34 iterate over the list of task tuples, calling each one and passing the previously defined parameter values.
  • Lines 36 to 37 call main() to run the program.

The task in this program is just a function accepting a string and a queue as parameters. When executed, it looks for anything in the queue to process. If there is work to do, then it pulls values off the queue, starts a for loop to count up to that value, and outputs the total at the end. It continues getting work off the queue until there is nothing left and it exits.

When this program is run, it produces the output you see below:

Task One running
Task One total: 15
Task One running
Task One total: 10
Task One running
Task One total: 5
Task One running
Task One total: 2
Task Two nothing to do

This shows that Task One does all the work. The while loop that Task One hits within task() consumes all the work on the queue and processes it. When that loop exits, Task Two gets a chance to run. However, it finds that the queue is empty, so Task Two prints a statement that says it has nothing to do and then exits. There’s nothing in the code to allow both Task One and Task Two to switch contexts and work together.

Simple Cooperative Concurrency

The next version of the program allows the two tasks to work together. Adding a yield statement means the loop will yield control at the specified point while still maintaining its context. This way, the yielding task can be restarted later.

The yield statement turns task() into a generator. A generator function is called just like any other function in Python, but when the yield statement is executed, control is returned to the caller of the function. This is essentially a context switch, as control moves from the generator function to the caller.

The interesting part is that control can be given back to the generator function by calling next() on the generator. This is a context switch back to the generator function, which picks up execution with all function variables that were defined before the yield still intact.
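A stripped-down generator makes this back-and-forth easier to see. The counter() function below is a hypothetical example, not part of the tutorial’s code:

```python
def counter(name, limit):
    total = 0
    while total < limit:
        total += 1
        print(f"{name} yielding with total={total}")
        yield total  # control returns to the caller here


# Calling the generator function doesn't run its body yet
gen = counter("demo", 3)

first = next(gen)   # runs the body up to the first yield
second = next(gen)  # resumes after the yield with total still intact
print(first, second)  # 1 2
```

Each call to next() resumes the function exactly where it left off, with the local variable total preserved between calls.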

The while loop in main() takes advantage of this when it calls next(t). This statement restarts the task at the point where it previously yielded. All of this means that you’re in control when the context switch happens: when the yield statement is executed in task().

This is a form of cooperative multitasking. The program is yielding control of its current context so that something else can run. In this case, it allows the while loop in main() to run two instances of task() as a generator function. Each instance consumes work from the same queue. This is sort of clever, but it’s also a lot of work to get the same results as the first program. The program example_2.py demonstrates this simple concurrency and is listed below:

 1 import queue
 2
 3 def task(name, queue):
 4     while not queue.empty():
 5         count = queue.get()
 6         total = 0
 7         print(f"Task {name} running")
 8         for x in range(count):
 9             total += 1
10             yield
11         print(f"Task {name} total: {total}")
12
13 def main():
14     """
15     This is the main entry point for the program.
16     """
17     # Create the queue of 'work'
18     work_queue = queue.Queue()
19
20     # Put some 'work' in the queue
21     for work in [15, 10, 5, 2]:
22         work_queue.put(work)
23
24     # Create some tasks
25     tasks = [
26         task("One", work_queue),
27         task("Two", work_queue)
28     ]
29
30     # Run the tasks
31     done = False
32     while not done:
33         for t in tasks:
34             try:
35                 next(t)
36             except StopIteration:
37                 tasks.remove(t)
38             if len(tasks) == 0:
39                 done = True
40
41 if __name__ == "__main__":
42     main()

Here’s what’s happening in the code above:

  • Lines 3 to 11 define task() as before, but the addition of yield on Line 10 turns the function into a generator. This is where the context switch is made and control is handed back to the while loop in main().
  • Lines 25 to 28 create the task list, but in a slightly different manner than you saw in the previous example code. In this case, each task is called with its parameters as it’s entered in the tasks list variable. This is necessary to get the task() generator function running the first time.
  • Lines 34 to 39 are the modifications to the while loop in main() that allow task() to run cooperatively. This is where control returns to each instance of task() when it yields, allowing the loop to continue and run another task.
  • Line 35 gives control back to task(), and continues its execution after the point where yield was called.
  • Line 39 sets the done variable. The while loop ends when all tasks have been completed and removed from tasks.

This is the output produced when you run this program:

Task One running
Task Two running
Task Two total: 10
Task Two running
Task One total: 15
Task One running
Task Two total: 5
Task One total: 2

You can see that both Task One and Task Two are running and consuming work from the queue. This is what’s intended, as both tasks are processing work, and each is responsible for two items in the queue. This is interesting, but again, it takes quite a bit of work to achieve these results.

The trick here is using the yield statement, which turns task() into a generator and performs a context switch. The program uses this context switch to give control to the while loop in main(), allowing two instances of a task to run cooperatively.

Notice how Task Two outputs its total first. This might lead you to think that the tasks are running asynchronously. However, this is still a synchronous program. It’s structured so the two tasks can trade contexts back and forth. The reason why Task Two outputs its total first is that it’s only counting to 10, while Task One is counting to 15. Task Two simply arrives at its total first, so it gets to print its output to the console before Task One.
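The same trading of contexts can be written as a tiny round-robin scheduler. The sketch below is a variation on the idea, not the tutorial’s program: it keeps the generators in a collections.deque so that nothing is removed from a list while it’s being iterated. The countdown() and run_round_robin() names are assumptions made for this example.

```python
from collections import deque


def countdown(name, n, log):
    while n > 0:
        log.append(f"{name}:{n}")
        n -= 1
        yield  # hand control back to the scheduler


def run_round_robin(generators):
    ready = deque(generators)
    while ready:
        gen = ready.popleft()  # take the next generator in line
        try:
            next(gen)  # let it run until its next yield
        except StopIteration:
            continue  # finished, so don't put it back in line
        ready.append(gen)  # still has work, so go to the back of the line


log = []
run_round_robin([countdown("One", 2, log), countdown("Two", 3, log)])
print(log)  # ['One:2', 'Two:3', 'One:1', 'Two:2', 'Two:1']
```

The two generators take turns until the shorter one finishes, after which the remaining one runs alone. Everything still happens synchronously, one step at a time.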

Cooperative Concurrency With Blocking Calls

The next version of the program is the same as the last, except for the addition of time.sleep(delay) in the body of your task loop. This adds a delay based on the value retrieved from the work queue to every iteration of the task loop. The delay is added to simulate the effect of a blocking call occurring in your task.

A blocking call is code that stops the CPU from doing anything else for some time. In the thought experiments above, if a parent wasn’t able to break away from balancing the checkbook until it was complete, then that would be a blocking call.

time.sleep(delay) does the same thing in this example, because the CPU can’t do anything else but wait for the delay to expire.

The ET class imported from lib.elapsed_time provides a way to get the elapsed time from when an instance of the class is created until it’s called as a function. The program example_3.py is listed below:

 1 import time
 2 import queue
 3 from lib.elapsed_time import ET
 4
 5 def task(name, queue):
 6     while not queue.empty():
 7         delay = queue.get()
 8         et = ET()
 9         print(f"Task {name} running")
10         time.sleep(delay)
11         print(f"Task {name} total elapsed time: {et():.1f}")
12         yield
13
14 def main():
15     """
16     This is the main entry point for the program.
17     """
18     # Create the queue of 'work'
19     work_queue = queue.Queue()
20
21     # Put some 'work' in the queue
22     for work in [15, 10, 5, 2]:
23         work_queue.put(work)
24
25     tasks = [
26         task("One", work_queue),
27         task("Two", work_queue)
28     ]
29
30     # Run the tasks
31     et = ET()
32     done = False
33     while not done:
34         for t in tasks:
35             try:
36                 next(t)
37             except StopIteration:
38                 tasks.remove(t)
39             if len(tasks) == 0:
40                 done = True
41
42     print(f"\nTotal elapsed time: {et():.1f}")
43
44 if __name__ == "__main__":
45     main()

Here’s what’s different in the code above:

  • Line 1 imports the time module to give the program access to time.sleep().
  • Line 10 changes task() to include a time.sleep(delay) to mimic an IO delay. This replaces the for loop that did the counting in example_1.py and example_2.py.

When you run this program, you’ll see the following output:

Task One running
Task One total elapsed time: 15.0
Task Two running
Task Two total elapsed time: 10.0
Task One running
Task One total elapsed time: 5.0
Task Two running
Task Two total elapsed time: 2.0

Total elapsed time: 32.0

As before, both Task One and Task Two are running, consuming work from the queue and processing it. However, even with the addition of the delay, you can see that cooperative concurrency hasn’t gotten you anything. The delay stops the processing of the entire program, and the CPU just waits for the IO delay to be over.

This is exactly what’s meant by blocking code in the Python async documentation. You’ll notice that the time it takes to run the entire program is just the cumulative time of all the delays. Running tasks this way is not a win.
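The ET helper imported from lib.elapsed_time in example_3.py isn’t listed in this article. A minimal stand-in might look like the sketch below, assuming only the behavior described earlier: the timer starts when the instance is created, and calling the instance returns the seconds elapsed. The use of time.perf_counter() is an assumption, and the real module may differ.

```python
import time


class ET:
    """Hypothetical stand-in for lib.elapsed_time.ET."""

    def __init__(self):
        # Remember when this instance was created
        self.start_time = time.perf_counter()

    def __call__(self):
        # Seconds elapsed since the instance was created
        return time.perf_counter() - self.start_time


et = ET()
time.sleep(0.1)
print(f"Elapsed: {et():.1f}")
```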

Cooperative Concurrency With Non-Blocking Calls

The next version of the program has been modified quite a bit. It makes use of Python async features using async/await and the asyncio package provided in Python 3.

The time and queue modules have been replaced with the asyncio package. This gives your program access to asynchronous friendly (non-blocking) sleep and queue functionality. The change to task() defines it as asynchronous with the addition of the async prefix on line 4. This indicates to Python that the function will be asynchronous.

The other big change is removing the time.sleep(delay) and yield statements, and replacing them with await asyncio.sleep(delay). This creates a non-blocking delay that will perform a context switch back to the caller main().

The while loop inside main() no longer exists. Instead of the tasks list, there’s a call to await asyncio.gather(...). This tells asyncio two things:

  1. Create two tasks based on task() and start running them.
  2. Wait for both of these to be completed before moving forward.

The last line of the program, asyncio.run(main()), runs main(). This creates what’s known as an event loop. It’s this loop that will run main(), which in turn will run the two instances of task().

The event loop is at the heart of the Python async system. It runs all the code, including main(). When task code is executing, the CPU is busy doing work. When the await keyword is reached, a context switch occurs, and control passes back to the event loop. The event loop looks at all the tasks waiting for an event (in this case, an asyncio.sleep(delay) timeout) and passes control to a task with an event that’s ready.

await asyncio.sleep(delay) is non-blocking with regard to the CPU. Instead of waiting for the delay to time out, the CPU registers a sleep event on the event loop task queue and performs a context switch by passing control to the event loop. The event loop continuously looks for completed events and passes control back to the task waiting for that event. In this way, the CPU can stay busy if work is available, while the event loop monitors the events that will happen in the future.
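A small, self-contained sketch shows the event loop switching between two coroutines at their await points. The nap() coroutine and the delay values are assumptions made for this illustration:

```python
import asyncio


async def nap(name, delay, finished):
    await asyncio.sleep(delay)  # hands control back to the event loop
    finished.append(name)  # runs once the sleep event is ready
    return name


async def main():
    finished = []
    results = await asyncio.gather(
        nap("slow", 0.2, finished),
        nap("fast", 0.05, finished),
    )
    return results, finished


results, finished = asyncio.run(main())
print(results)   # ['slow', 'fast'] because gather() keeps argument order
print(finished)  # ['fast', 'slow'] because that's the order they completed
```

While the slow coroutine is waiting on its sleep event, the event loop passes control to the fast one, which finishes first even though it was started second.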

Note: An asynchronous program runs in a single thread of execution. The context switch from one section of code to another that would affect data is completely in your control. This means you can atomize and complete all shared memory data access before making a context switch. This simplifies the shared memory problem inherent in threaded code.
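To illustrate the point in the note, the sketch below lets five coroutines write checks against one shared balance without any lock. The account dictionary and the write_check() coroutine are hypothetical. The key assumption is that the balance check and the update happen with no await between them.

```python
import asyncio

account = {"balance": 100}


async def write_check(amount):
    # The check and the update happen with no await in between,
    # so no other coroutine can run between the two steps.
    if amount > account["balance"]:
        cleared = False
    else:
        account["balance"] -= amount
        cleared = True
    await asyncio.sleep(0)  # the only point where a context switch can occur
    return cleared


async def main():
    return await asyncio.gather(*(write_check(30) for _ in range(5)))


outcomes = asyncio.run(main())
print(outcomes)            # [True, True, True, False, False]
print(account["balance"])  # 10
```

If an await were placed between the check and the update, another coroutine could run in the gap, and the shared data would need protection again.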

The example_4.py code is listed below:

 1 import asyncio
 2 from lib.elapsed_time import ET
 3
 4 async def task(name, work_queue):
 5     while not work_queue.empty():
 6         delay = await work_queue.get()
 7         et = ET()
 8         print(f"Task {name} running")
 9         await asyncio.sleep(delay)
10         print(f"Task {name} total elapsed time: {et():.1f}")
11
12 async def main():
13     """
14     This is the main entry point for the program.
15     """
16     # Create the queue of 'work'
17     work_queue = asyncio.Queue()
18
19     # Put some 'work' in the queue
20     for work in [15, 10, 5, 2]:
21         await work_queue.put(work)
22
23     # Run the tasks
24     et = ET()
25     await asyncio.gather(
26         asyncio.create_task(task("One", work_queue)),
27         asyncio.create_task(task("Two", work_queue)),
28     )
29     print(f"\nTotal elapsed time: {et():.1f}")
30
31 if __name__ == "__main__":
32     asyncio.run(main())

Here’s what’s different between this program and example_3.py:

  • Line 1 imports asyncio to gain access to Python async functionality. This replaces the time import.
  • Line 4 shows the addition of the async keyword in front of the task() definition. This informs the program that task can run asynchronously.
  • Line 9 replaces time.sleep(delay) with the non-blocking asyncio.sleep(delay), which also yields control (or switches contexts) back to the main event loop.
  • Line 17 creates the non-blocking asynchronous work_queue.
  • Lines 20 to 21 put work into work_queue in an asynchronous manner using the await keyword.
  • Lines 25 to 28 create the two tasks and gather them together, so the program will wait for both tasks to complete.
  • Line 32 starts the program running asynchronously. It also starts the internal event loop.

When you look at the output of this program, notice how both Task One and Task Two start at the same time, then wait at the mock IO call:

Task One running
Task Two running
Task Two total elapsed time: 10.0
Task Two running
Task One total elapsed time: 15.0
Task One running
Task Two total elapsed time: 5.0
Task One total elapsed time: 2.0

Total elapsed time: 17.0

This indicates that await asyncio.sleep(delay) is non-blocking, and that other work is being done.

At the end of the program, you’ll notice the total elapsed time is essentially half the time it took for example_3.py to run. That’s the advantage of a program that uses Python async features! Each task was able to run await asyncio.sleep(delay) at the same time. The total execution time of the program is now less than the sum of its parts. You’ve broken away from the synchronous model!

Synchronous (Blocking) HTTP Calls

The next version of the program is kind of a step forward as well as a step back. The program is doing some actual work with real IO by making HTTP requests to a list of URLs and getting the page contents. However, it’s doing so in a blocking (synchronous) manner.

The program has been modified to import the wonderful requests module to make the actual HTTP requests. Also, the queue now contains a list of URLs, rather than numbers. In addition, task() no longer increments a counter. Instead, requests gets the contents of a URL retrieved from the queue, and prints how long it took to do so.

The example_5.py code is listed below:

 1 import queue
 2 import requests
 3 from lib.elapsed_time import ET
 4
 5 def task(name, work_queue):
 6     with requests.Session() as session:
 7         while not work_queue.empty():
 8             url = work_queue.get()
 9             print(f"Task {name} getting URL: {url}")
10             et = ET()
11             session.get(url)
12             print(f"Task {name} total elapsed time: {et():.1f}")
13             yield
14
15 def main():
16     """
17     This is the main entry point for the program.
18     """
19     # Create the queue of 'work'
20     work_queue = queue.Queue()
21
22     # Put some 'work' in the queue
23     for url in [
24         "http://google.com",
25         "http://yahoo.com",
26         "http://linkedin.com",
27         "http://apple.com",
28         "http://microsoft.com",
29         "http://facebook.com",
30         "http://twitter.com"
31     ]:
32         work_queue.put(url)
33
34     tasks = [
35         task("One", work_queue),
36         task("Two", work_queue)
37     ]
38
39     # Run the tasks
40     et = ET()
41     done = False
42     while not done:
43         for t in tasks:
44             try:
45                 next(t)
46             except StopIteration:
47                 tasks.remove(t)
48             if len(tasks) == 0:
49                 done = True
50
51     print(f"\nTotal elapsed time: {et():.1f}")
52
53 if __name__ == "__main__":
54     main()

Here’s what’s happening in this program:

  • Line 2 imports requests, which provides a convenient way to make HTTP calls.
  • Line 11 introduces a delay, similar to example_3.py. However, this time it calls session.get(url), which returns the contents of the URL retrieved from work_queue.
  • Lines 23 to 32 put the list of URLs into work_queue.

When you run this program, you’ll see the following output:

Task One getting URL: http://google.com
Task One total elapsed time: 0.3
Task Two getting URL: http://yahoo.com
Task Two total elapsed time: 0.8
Task One getting URL: http://linkedin.com
Task One total elapsed time: 0.4
Task Two getting URL: http://apple.com
Task Two total elapsed time: 0.3
Task One getting URL: http://microsoft.com
Task One total elapsed time: 0.5
Task Two getting URL: http://facebook.com
Task Two total elapsed time: 0.5
Task One getting URL: http://twitter.com
Task One total elapsed time: 0.4

Total elapsed time: 3.2

Just like in earlier versions of the program, yield turns task() into a generator. It also performs a context switch that lets the other task instance run.

Each task gets a URL from the work queue, retrieves the contents of the page, and reports how long it took to get that content.

As before, yield allows both your tasks to run cooperatively. However, since this program is running synchronously, each session.get() call blocks the CPU until the page is retrieved. Note the total time it took to run the entire program at the end. This will be meaningful for the next example.

Asynchronous (Non-Blocking) HTTP Calls

This version of the program modifies the previous one to use Python async features. It also imports the aiohttp module, which is a library to make HTTP requests in an asynchronous fashion using asyncio.

The tasks here have been modified to remove the yield call since the code to make the HTTP GET call is no longer blocking. It also performs a context switch back to the event loop.

The example_6.py program is listed below:

 1 import asyncio
 2 import aiohttp
 3 from lib.elapsed_time import ET
 4
 5 async def task(name, work_queue):
 6     async with aiohttp.ClientSession() as session:
 7         while not work_queue.empty():
 8             url = await work_queue.get()
 9             print(f"Task {name} getting URL: {url}")
10             et = ET()
11             async with session.get(url) as response:
12                 await response.text()
13             print(f"Task {name} total elapsed time: {et():.1f}")
14
15 async def main():
16     """
17     This is the main entry point for the program.
18     """
19     # Create the queue of 'work'
20     work_queue = asyncio.Queue()
21
22     # Put some 'work' in the queue
23     for url in [
24         "http://google.com",
25         "http://yahoo.com",
26         "http://linkedin.com",
27         "http://apple.com",
28         "http://microsoft.com",
29         "http://facebook.com",
30         "http://twitter.com",
31     ]:
32         await work_queue.put(url)
33
34     # Run the tasks
35     et = ET()
36     await asyncio.gather(
37         asyncio.create_task(task("One", work_queue)),
38         asyncio.create_task(task("Two", work_queue)),
39     )
40     print(f"\nTotal elapsed time: {et():.1f}")
41
42 if __name__ == "__main__":
43     asyncio.run(main())

Here’s what’s happening in this program:

  • Line 2 imports the aiohttp library, which provides an asynchronous way to make HTTP calls.
  • Line 5 marks task() as an asynchronous function.
  • Line 6 creates an aiohttp session context manager.
  • Line 11 creates an aiohttp response context manager. It also makes an HTTP GET call to the URL taken from work_queue.
  • Line 12 uses the response to get the text retrieved from the URL asynchronously.

When you run this program, you’ll see the following output:

Task One getting URL: http://google.com
Task Two getting URL: http://yahoo.com
Task One total elapsed time: 0.3
Task One getting URL: http://linkedin.com
Task One total elapsed time: 0.3
Task One getting URL: http://apple.com
Task One total elapsed time: 0.3
Task One getting URL: http://microsoft.com
Task Two total elapsed time: 0.9
Task Two getting URL: http://facebook.com
Task Two total elapsed time: 0.4
Task Two getting URL: http://twitter.com
Task One total elapsed time: 0.5
Task Two total elapsed time: 0.3

Total elapsed time: 1.7

Take a look at the total elapsed time, as well as the individual times to get the contents of each URL. You’ll see that the duration is about half the cumulative time of all the HTTP GET calls. This is because the HTTP GET calls are running asynchronously. In other words, you’re effectively taking better advantage of the CPU by allowing it to make multiple requests at once.

Because the CPU is so fast, this example could likely create as many tasks as there are URLs. In this case, the program’s run time would be that of the single slowest URL retrieval.
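The one-task-per-URL idea can be sketched without touching the network. In the example below, asyncio.sleep() stands in for the HTTP GET call, and the example.com URLs and their delays are made up for the illustration. A real version would use an aiohttp session as in example_6.py.

```python
import asyncio
import time

# Hypothetical per-URL delays in seconds, standing in for real downloads
SIMULATED_DELAYS = {
    "http://a.example.com": 0.05,
    "http://b.example.com": 0.10,
    "http://c.example.com": 0.20,
    "http://d.example.com": 0.15,
}


async def fetch(url):
    # Simulated non-blocking download; not a real HTTP request
    await asyncio.sleep(SIMULATED_DELAYS[url])
    return url


async def main():
    start = time.perf_counter()
    # One coroutine per URL, all waiting at the same time
    fetched = await asyncio.gather(*(fetch(url) for url in SIMULATED_DELAYS))
    return fetched, time.perf_counter() - start


fetched, elapsed = asyncio.run(main())
print(f"Fetched {len(fetched)} URLs in {elapsed:.2f} seconds")
```

The delays add up to half a second, but the run takes roughly as long as the slowest simulated download, which is about a fifth of a second.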

Conclusion

This article has given you the tools you need to start making asynchronous programming techniques a part of your repertoire. Using Python async features gives you programmatic control of when context switches take place. This means that many of the tougher issues you might see in threaded programming are easier to deal with.

Asynchronous programming is a powerful tool, but it isn’t useful for every kind of program. If you’re writing a program that calculates pi to the millionth decimal place, for instance, then asynchronous code won’t help you. That kind of program is CPU bound, without much IO. However, if you’re trying to implement a server or a program that performs IO (like file or network access), then using Python async features could make a huge difference.
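A short sketch shows why CPU-bound work gains nothing from async features. The crunch() coroutine below is a hypothetical example that never awaits, so the event loop has no chance to switch to the other coroutine until the first has finished.

```python
import asyncio


async def crunch(name, n, events):
    events.append(f"{name} start")
    total = sum(i * i for i in range(n))  # pure CPU work with no await
    events.append(f"{name} end")
    return total


async def main():
    events = []
    await asyncio.gather(
        crunch("A", 100_000, events),
        crunch("B", 100_000, events),
    )
    return events


events = asyncio.run(main())
print(events)  # ['A start', 'A end', 'B start', 'B end']
```

The two coroutines run back to back, exactly as two ordinary function calls would.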

To sum it up, you’ve learned:

  • What synchronous programs are
  • How asynchronous programs are different, but also powerful and manageable
  • Why you might want to write asynchronous programs
  • How to use the built-in async features in Python

You can get the code for all of the example programs used in this tutorial:

Download Code: Click here to download the code you’ll use to learn about async features in Python in this tutorial.

Now that you’re equipped with these powerful skills, you can take your programs to the next level!



Planet Python

Getting Started with Kubernetes: A kubectl Cheat Sheet

Introduction

Kubectl is a command-line tool designed to manage Kubernetes objects and clusters. It provides a command-line interface for performing common operations like creating and scaling Deployments, switching contexts, and accessing a shell in a running container.

How to Use This Guide:

  • This guide is in cheat sheet format with self-contained command-line snippets.
  • It is not an exhaustive list of kubectl commands, but contains many common operations and use cases. For a more thorough reference, consult the Kubectl Reference Docs.
  • Jump to any section that is relevant to the task you are trying to complete.

Prerequisites

Sample Deployment

To demonstrate some of the operations and commands in this cheat sheet, we’ll use a sample Deployment that runs 2 replicas of Nginx:

nginx-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

Copy and paste this manifest into a file called nginx-deployment.yaml.

Installing kubectl

Note: These commands have only been tested on an Ubuntu 18.04 machine. To learn how to install kubectl on other operating systems, consult Install and Set Up kubectl from the Kubernetes docs.

First, update your local package index and install required dependencies:

  • sudo apt-get update && sudo apt-get install -y apt-transport-https

Then add the Google Cloud GPG key to APT and make the kubectl package available to your system:

  • curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
  • echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee -a /etc/apt/sources.list.d/kubernetes.list
  • sudo apt-get update

Finally, install kubectl:

  • sudo apt-get install -y kubectl

Test that the installation succeeded using version:

  • kubectl version

Setting Up Shell Autocompletion

Note: These commands have only been tested on an Ubuntu 18.04 machine. To learn how to set up autocompletion on other operating systems, consult Install and Set Up kubectl from the Kubernetes docs.

kubectl includes a shell autocompletion script that you can make available to your system’s existing shell autocompletion software.

Installing kubectl Autocompletion

First, check if you have bash-completion installed:

  • type _init_completion

You should see some script output.

Next, source the kubectl autocompletion script in your ~/.bashrc file:

  • echo 'source <(kubectl completion bash)' >>~/.bashrc
  • . ~/.bashrc

Alternatively, you can add the completion script to the /etc/bash_completion.d directory:

  • kubectl completion bash | sudo tee /etc/bash_completion.d/kubectl > /dev/null

Usage

To use the autocompletion feature, press the TAB key to display available kubectl commands:

  • kubectl TAB TAB
Output
annotate apply autoscale completion cordon delete drain explain kustomize options port-forward rollout set uncordon api-resources attach certificate config cp describe . . .

You can also display available commands after partially typing a command:

  • kubectl d TAB
Output
delete describe diff drain

Connecting, Configuring and Using Contexts

Connecting

To test that kubectl can authenticate with and access your Kubernetes cluster, use cluster-info:

  • kubectl cluster-info

If kubectl can successfully authenticate with your cluster, you should see the following output:

Output
Kubernetes master is running at https://kubernetes_master_endpoint
CoreDNS is running at https://coredns_endpoint

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

kubectl is configured using kubeconfig configuration files. By default, kubectl will look for a file called config in the $HOME/.kube directory. To change this, you can set the $KUBECONFIG environment variable to a custom kubeconfig file, or pass in the custom file at execution time using the --kubeconfig flag:

  • kubectl cluster-info --kubeconfig=path_to_your_kubeconfig_file
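
The lookup order described above can be sketched in a few lines of Python. This is only a simplified model of the precedence (flag, then environment variable, then default file), not kubectl's actual implementation; the function name resolve_kubeconfig and its parameters are hypothetical names chosen for this illustration.

```python
import os
from pathlib import Path


def resolve_kubeconfig(flag_value=None, environ=None, home=None):
    """Return the kubeconfig path(s) that would be consulted, highest precedence first.

    Hypothetical helper: a simplified model of the lookup order, for illustration only.
    """
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)

    # 1. An explicit --kubeconfig flag takes precedence over everything else.
    if flag_value:
        return [Path(flag_value)]

    # 2. KUBECONFIG may list several files, separated by the OS path separator.
    env_value = environ.get("KUBECONFIG", "")
    paths = [Path(p) for p in env_value.split(os.pathsep) if p]
    if paths:
        return paths

    # 3. Otherwise fall back to the default location in the home directory.
    return [home / ".kube" / "config"]


print(resolve_kubeconfig(None, {}, "/home/sammy"))
```

Passing the environment and home directory in as arguments keeps the sketch easy to experiment with, since you can try different combinations without touching your real shell environment.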

Note: If you’re using a managed Kubernetes cluster, your cloud provider should have made its kubeconfig file available to you.

If you don’t want to use the --kubeconfig flag with every command, and there is no existing ~/.kube/config file, create a directory called ~/.kube in your home directory if it doesn’t already exist, and copy in the kubeconfig file, renaming it to config:

  • mkdir -p ~/.kube
  • cp your_kubeconfig_file ~/.kube/config

Now, run cluster-info once again to test your connection.

Modifying your kubectl Configuration

You can also modify your config using the kubectl config set of commands.

To view your kubectl configuration, use the view subcommand:

  • kubectl config view
Output
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
. . .

Modifying Clusters

To fetch a list of clusters defined in your kubeconfig, use get-clusters:

  • kubectl config get-clusters
Output
NAME
do-nyc1-sammy

To add a cluster to your config, use the set-cluster subcommand:

  • kubectl config set-cluster new_cluster --server=server_address --certificate-authority=path_to_certificate_authority

To delete a cluster from your config, use delete-cluster:

Note: This only deletes the cluster from your config and does not delete the actual Kubernetes cluster.

  • kubectl config delete-cluster cluster_name

Modifying Users

You can perform similar operations for users using set-credentials:

  • kubectl config set-credentials username --client-certificate=/path/to/cert/file --client-key=/path/to/key/file

To delete a user from your config, you can run unset:

  • kubectl config unset users.username

Contexts

A context in Kubernetes is an object that contains a set of access parameters for your cluster. It consists of a cluster, namespace, and user triple. Contexts allow you to quickly switch between different sets of cluster configuration.
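
To make the idea of a context as a named (cluster, user, namespace) triple more concrete, here is a small in-memory model in Python. It is only a sketch of the concept: the Context and KubeConfig classes are hypothetical names invented for this example, and the "dev" and "prod" context names are made up. It does not read or write a real kubeconfig file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """A named set of access parameters: which cluster, as which user, in which namespace."""
    cluster: str
    user: str
    namespace: str = "default"


class KubeConfig:
    """Hypothetical, minimal stand-in for the contexts section of a kubeconfig."""

    def __init__(self):
        self.contexts = {}
        self.current = None

    def set_context(self, name, cluster, user, namespace="default"):
        # Creates the context if it is new, or overwrites it if it already exists.
        self.contexts[name] = Context(cluster, user, namespace)

    def use_context(self, name):
        # Switching only changes which stored triple is active.
        if name not in self.contexts:
            raise KeyError(f"no context exists with the name: {name!r}")
        self.current = name
        return self.contexts[name]


config = KubeConfig()
config.set_context("dev", cluster="do-nyc1-sammy", user="do-nyc1-sammy-admin", namespace="dev")
config.set_context("prod", cluster="do-nyc1-sammy", user="do-nyc1-sammy-admin", namespace="prod")

active = config.use_context("prod")
print(config.current, active.namespace)
```

Both contexts here point at the same cluster and user and differ only in namespace, which is a common way to use contexts to switch between environments quickly.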

To see your current context, you can use current-context:

  • kubectl config current-context
Output
do-nyc1-sammy

To see a list of all configured contexts, run get-contexts:

  • kubectl config get-contexts
Output
CURRENT   NAME            CLUSTER         AUTHINFO              NAMESPACE
*         do-nyc1-sammy   do-nyc1-sammy   do-nyc1-sammy-admin

To create or modify a context, use set-context:

  • kubectl config set-context context_name --cluster=cluster_name --user=user_name --namespace=namespace

You can switch between contexts with use-context:

  • kubectl config use-context context_name
Output
Switched to context "do-nyc1-sammy"

And you can delete a context with delete-context:

  • kubectl config delete-context context_name

Using Namespaces

A Namespace in Kubernetes is an abstraction that allows you to subdivide your cluster into multiple virtual clusters. By using Namespaces you can divide cluster resources among multiple teams and scope objects appropriately. For example, you can have a prod Namespace for production workloads, and a dev Namespace for development and test workloads.

To fetch and print a list of all the Namespaces in your cluster, use get namespace:

  • kubectl get namespace
Output
NAME              STATUS   AGE
default           Active   2d21h
kube-node-lease   Active   2d21h
kube-public       Active   2d21h
kube-system       Active   2d21h

To set a Namespace for your current context, use set-context --current:

  • kubectl config set-context --current --namespace=namespace_name

To create a Namespace, use create namespace:

  • kubectl create namespace namespace_name
Output
namespace/namespace_name created

Similarly, to delete a Namespace, use delete namespace:

Warning: Deleting a Namespace will delete everything in the Namespace, including running Deployments, Pods, and other workloads. Only run this command if you’re sure you’d like to kill whatever’s running in the Namespace or if you’re deleting an empty Namespace.

  • kubectl delete namespace namespace_name

To fetch all Pods in a given Namespace or to perform other operations on resources in a given Namespace, make sure to include the --namespace flag:

  • kubectl get pods --namespace=namespace_name

Managing Kubernetes Resources

General Syntax

The general syntax for most kubectl management commands is:

  • kubectl command type name flags

Where

  • command is an operation you’d like to perform, like create
  • type is the Kubernetes resource type, like deployment
  • name is the resource’s name, like app-frontend (Kubernetes resource names may not contain underscores)
  • flags are any optional flags you’d like to include

For example, the following command retrieves information about a Deployment named app-frontend:

  • kubectl get deployment app-frontend
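
If you drive kubectl from a script, the command, type, name, flags pattern maps naturally onto an argument list. The following Python sketch assembles such a list; kubectl_argv is a hypothetical helper written for this guide, and it only builds the arguments without running kubectl.

```python
import shlex


def kubectl_argv(command, resource_type=None, name=None, **flags):
    """Build a kubectl argument list following: kubectl command type name flags."""
    argv = ["kubectl", command]
    if resource_type:
        argv.append(resource_type)
    if name:
        argv.append(name)
    for key, value in flags.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            argv.append(flag)  # boolean switch such as --all-namespaces
        else:
            argv.append(f"{flag}={value}")
    return argv


argv = kubectl_argv("get", "deployment", "nginx-deployment", namespace="default", output="wide")
print(shlex.join(argv))
# prints: kubectl get deployment nginx-deployment --namespace=default --output=wide
```

Keeping the arguments as a list (rather than one long string) means they can be handed to subprocess.run without any shell quoting issues.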

Declarative Management and kubectl apply

The recommended approach to managing workloads on Kubernetes is to rely on the cluster’s declarative design as much as possible. This means that instead of running a series of commands to create, update, delete, and restart running Pods, you should define the workloads, services, and systems you’d like to run in YAML manifest files, and provide these files to Kubernetes, which will handle the rest.

In practice, this means using the kubectl apply command, which applies a particular configuration to a given resource. If the target resource doesn’t exist, then Kubernetes will create the resource. If the resource already exists, then Kubernetes will save the current revision, and update the resource according to the new configuration. This declarative approach exists in contrast to the imperative approach of running the kubectl create, kubectl edit, and kubectl scale set of commands to manage resources. To learn more about the different ways of managing Kubernetes resources, consult Kubernetes Object Management from the Kubernetes docs.
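
The difference between the two approaches can be illustrated with a toy in-memory model in Python. This is a deliberately simplified sketch: FakeCluster and AlreadyExists are hypothetical names, and the real kubectl apply computes a merge against the last applied configuration rather than replacing the whole object as done here.

```python
class AlreadyExists(Exception):
    """Raised by the toy create() when the object is already present."""


class FakeCluster:
    """Hypothetical in-memory stand-in for a cluster's object store."""

    def __init__(self):
        self.objects = {}

    def create(self, name, spec):
        # Imperative: fails if the object already exists.
        if name in self.objects:
            raise AlreadyExists(name)
        self.objects[name] = dict(spec)
        return "created"

    def apply(self, name, spec):
        # Declarative: create if missing, update if different, otherwise do nothing.
        if name not in self.objects:
            self.objects[name] = dict(spec)
            return "created"
        if self.objects[name] == spec:
            return "unchanged"
        self.objects[name] = dict(spec)
        return "configured"


cluster = FakeCluster()
results = [
    cluster.apply("nginx-deployment", {"replicas": 2}),
    cluster.apply("nginx-deployment", {"replicas": 2}),
    cluster.apply("nginx-deployment", {"replicas": 3}),
]
print(results)
```

Because apply describes the desired end state, running it repeatedly with the same input is safe, whereas repeating create on an existing object is an error.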

Rolling out a Deployment

For example, to deploy the sample Nginx Deployment to your cluster, use apply and provide the path to the nginx-deployment.yaml manifest file:

  • kubectl apply -f nginx-deployment.yaml
Output
deployment.apps/nginx-deployment created

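The -f flag is used to specify a filename, directory, or URL containing a valid configuration. If you’d like to apply all manifests from a directory, pass the directory to -f:

  • kubectl apply -f manifests_dir

The -k flag is different: it builds and applies a directory that contains a kustomization.yaml file:

  • kubectl apply -k kustomization_dir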

You can track the rollout status using rollout status:

  • kubectl rollout status deployment/nginx-deployment
Output
Waiting for deployment "nginx-deployment" rollout to finish: 1 of 2 updated replicas are available...
deployment "nginx-deployment" successfully rolled out

An alternative to rollout status is the kubectl get command, along with the -w (watch) flag:

  • kubectl get deployment -w
Output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   0/2     2            0           3s
nginx-deployment   1/2     2            1           3s
nginx-deployment   2/2     2            2           3s

Using rollout pause and rollout resume, you can pause and resume the rollout of a Deployment:

  • kubectl rollout pause deployment/nginx-deployment
Output
deployment.apps/nginx-deployment paused
  • kubectl rollout resume deployment/nginx-deployment
Output
deployment.apps/nginx-deployment resumed

Modifying a Running Deployment

If you’d like to modify a running Deployment, you can make changes to its manifest file and then run kubectl apply again to apply the update. For example, we’ll modify the nginx-deployment.yaml file to change the number of replicas from 2 to 3:

nginx-deployment.yaml
. . .
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
. . .

The kubectl diff command allows you to see a diff between currently running resources, and the changes proposed in the supplied configuration file:

  • kubectl diff -f nginx-deployment.yaml

Now allow Kubernetes to perform the update using apply:

  • kubectl apply -f nginx-deployment.yaml

Running another get deployment should confirm the addition of a third replica.

If you run apply again without modifying the manifest file, Kubernetes will detect that no changes were made and won’t perform any action.

Using rollout history you can see a list of the Deployment’s previous revisions:

  • kubectl rollout history deployment/nginx-deployment
Output
deployment.apps/nginx-deployment
REVISION  CHANGE-CAUSE
1         <none>

With rollout undo, you can revert a Deployment to any of its previous revisions:

  • kubectl rollout undo deployment/nginx-deployment --to-revision=1

Deleting a Deployment

To delete a running Deployment, use kubectl delete:

  • kubectl delete -f nginx-deployment.yaml
Output
deployment.apps "nginx-deployment" deleted

Imperative Management

You can also use a set of imperative commands to directly manipulate and manage Kubernetes resources.

Creating a Deployment

Use create to create an object from a file, URL, or STDIN. Note that unlike apply, if an object with the same name already exists, the operation will fail. The --dry-run=client flag allows you to preview the result of the operation without actually sending it to the cluster (older kubectl releases accepted a bare --dry-run):

  • kubectl create -f nginx-deployment.yaml --dry-run=client
Output
deployment.apps/nginx-deployment created (dry-run)

We can now create the object:

  • kubectl create -f nginx-deployment.yaml
Output
deployment.apps/nginx-deployment created

Modifying a Running Deployment

Use scale to scale the number of replicas for the Deployment from 2 to 4:

  • kubectl scale --replicas=4 deployment/nginx-deployment
Output
deployment.apps/nginx-deployment scaled

You can edit any object in-place using kubectl edit. This will open up the object’s manifest in your default editor:

  • kubectl edit deployment/nginx-deployment

You should see the following manifest file in your editor:

nginx-deployment
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: apps/v1
kind: Deployment
. . .
spec:
  progressDeadlineSeconds: 600
  replicas: 4
  revisionHistoryLimit: 10
  selector:
    matchLabels:
. . .

Change the replicas value from 4 to 2, then save and close the file.

Now run a get to inspect the changes:

  • kubectl get deployment/nginx-deployment
Output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   2/2     2            2           6m40s

We’ve successfully scaled the Deployment back down to 2 replicas on-the-fly. You can update most of a Kubernetes object’s fields in a similar manner.

Another useful command for modifying objects in-place is kubectl patch. Using patch, you can update an object’s fields on-the-fly without having to open up your editor. patch also allows for more complex updates with various merging and patching strategies. To learn more about these, consult Update API Objects in Place Using kubectl patch.

The following command will patch the nginx-deployment object to update the replicas field from 2 to 4; deploy is shorthand for the deployment object.

  • kubectl patch deploy nginx-deployment -p '{"spec": {"replicas": 4}}'
Output
deployment.apps/nginx-deployment patched
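
To get a feel for what a patch like this does to the object, here is a minimal Python sketch of JSON-merge-style patching (the semantics described in RFC 7396). Note the assumptions: merge_patch is a hypothetical helper, the deployment dictionary is a trimmed, hand-written stand-in for the real object, and kubectl patch actually defaults to a strategic merge patch, which treats some lists differently. For a simple scalar field such as replicas the result is the same.

```python
import copy


def merge_patch(target, patch):
    """Return a new object with a JSON-merge-style patch applied (simplified sketch)."""
    if not isinstance(patch, dict):
        # Scalars and lists simply replace whatever was there before.
        return copy.deepcopy(patch)
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)  # null in a merge patch means "delete this key"
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Trimmed, hand-written stand-in for the Deployment object.
deployment = {
    "metadata": {"name": "nginx-deployment"},
    "spec": {"replicas": 2, "selector": {"matchLabels": {"app": "nginx"}}},
}

patched = merge_patch(deployment, {"spec": {"replicas": 4}})
print(patched["spec"])
```

Only the fields named in the patch change; sibling fields such as the selector are carried over untouched, which is why a patch can be so much shorter than the full manifest.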

We can now inspect the changes:

  • kubectl get deployment/nginx-deployment
Output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   4/4     4            4           18m

You can also create a Deployment imperatively from an image using the create deployment command. (Older kubectl releases used run for this; in current releases, run creates a single Pod rather than a Deployment.)

  • kubectl create deployment nginx-deployment --image=nginx --port=80 --replicas=2

The expose command lets you quickly expose a running Deployment with a Kubernetes Service, allowing connections from outside your Kubernetes cluster:

  • kubectl expose deploy nginx-deployment --type=LoadBalancer --port=80 --name=nginx-svc
Output
service/nginx-svc exposed

Here we’ve exposed the nginx-deployment Deployment as a LoadBalancer Service, opening up port 80 to external traffic and directing it to container port 80. We name the service nginx-svc. Using the LoadBalancer Service type, a cloud load balancer is automatically provisioned and configured by Kubernetes. To get the Service’s external IP address, use get:

  • kubectl get svc nginx-svc
Output
NAME        TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
nginx-svc   LoadBalancer   10.245.26.242   203.0.113.0   80:30153/TCP   22m

You can access the running Nginx containers by navigating to EXTERNAL-IP in your web browser.

Inspecting Workloads and Debugging

There are several commands you can use to get more information about workloads running in your cluster.

Inspecting Kubernetes Resources

kubectl get fetches a given Kubernetes resource and displays some basic information associated with it:

  • kubectl get deployment -o wide
Output
NAME               READY   UP-TO-DATE   AVAILABLE   AGE   CONTAINERS   IMAGES   SELECTOR
nginx-deployment   4/4     4            4           29m   nginx        nginx    app=nginx

Since we did not provide a Deployment name or Namespace, kubectl fetches all Deployments in the current Namespace. The -o wide output format adds extra columns like CONTAINERS, IMAGES, and SELECTOR.
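
The -o flag also accepts machine-readable formats such as json, which is usually easier to consume from a script than the table output. The Python sketch below shows one way such output might be processed. The SAMPLE string is a trimmed, hand-written stand-in for real kubectl get deployment -o json output, and readiness is a hypothetical helper name.

```python
import json

# Hand-written, heavily trimmed stand-in for `kubectl get deployment -o json` output.
SAMPLE = """
{
  "kind": "List",
  "items": [
    {
      "metadata": {"name": "nginx-deployment"},
      "spec": {"replicas": 4},
      "status": {"readyReplicas": 4}
    }
  ]
}
"""


def readiness(raw_json):
    """Return (name, "ready/desired") pairs for each Deployment in the list."""
    rows = []
    for item in json.loads(raw_json)["items"]:
        desired = item["spec"]["replicas"]
        # readyReplicas can be absent while no replicas are ready yet.
        ready = item.get("status", {}).get("readyReplicas", 0)
        rows.append((item["metadata"]["name"], f"{ready}/{desired}"))
    return rows


print(readiness(SAMPLE))
# prints: [('nginx-deployment', '4/4')]
```

In a real script you would replace SAMPLE with the captured standard output of the kubectl command.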

In addition to get, you can use describe to fetch a detailed description of the resource and associated resources:

  • kubectl describe deploy nginx-deployment
Output
Name:                   nginx-deployment
Namespace:              default
CreationTimestamp:      Wed, 11 Sep 2019 12:53:42 -0400
Labels:                 run=nginx-deployment
Annotations:            deployment.kubernetes.io/revision: 1
Selector:               run=nginx-deployment
. . .

The set of information presented will vary depending on the resource type. You can also use this command without specifying a resource name, in which case information will be provided for all resources of that type in the current Namespace.

explain allows you to quickly pull configurable fields for a given resource type:

  • kubectl explain deployment.spec

By appending additional fields you can dive deeper into the field hierarchy:

  • kubectl explain deployment.spec.template.spec

Gaining Shell Access to a Container

To gain shell access into a running container, use exec. First, find the Pod that contains the running container you’d like access to:

  • kubectl get pod
Output
nginx-deployment-8859878f8-7gfw9   1/1   Running   0   109m
nginx-deployment-8859878f8-z7f9q   1/1   Running   0   109m

Let’s exec into the first Pod. Since this Pod has only one container, we don’t need to use the -c flag to specify which container we’d like to exec into.

  • kubectl exec -i -t nginx-deployment-8859878f8-7gfw9 -- /bin/bash
Output
root@nginx-deployment-8859878f8-7gfw9:/#

You now have shell access to the Nginx container. The -i flag passes STDIN to the container, and -t gives you an interactive TTY. The -- double-dash acts as a separator for the kubectl command and the command you’d like to run inside the container. In this case, we are running /bin/bash.

To run commands inside the container without opening a full shell, omit the -i and -t flags, and substitute the command you’d like to run instead of /bin/bash:

  • kubectl exec nginx-deployment-8859878f8-7gfw9 -- ls
Output
bin boot dev etc home lib lib64 media . . .

Fetching Logs

Another useful command is logs, which prints logs for Pods and containers, including terminated containers.

To stream logs to your terminal output, you can use the -f flag:

  • kubectl logs -f nginx-deployment-8859878f8-7gfw9
Output
10.244.2.1 - - [12/Sep/2019:17:21:33 +0000] "GET / HTTP/1.1" 200 612 "-" "203.0.113.0" "-"
2019/09/16 17:21:34 [error] 6#6: *1 open() "/usr/share/nginx/html/favicon.ico" failed (2: No such file or directory), client: 10.244.2.1, server: localhost, request: "GET /favicon.ico HTTP/1.1", host: "203.0.113.0", referrer: "http://203.0.113.0"
. . .

This command will keep running in your terminal until interrupted with a CTRL+C. You can omit the -f flag if you’d like to print log output and exit immediately.

You can also use the -p flag to fetch logs for a terminated container. When this option is used within a Pod that had a prior running container instance, logs will print output from the terminated container:

  • kubectl logs -p nginx-deployment-8859878f8-7gfw9

The -c flag allows you to specify the container you’d like to fetch logs from, if the Pod has multiple containers. You can use the --all-containers=true flag to fetch logs from all containers in the Pod.

Port Forwarding and Proxying

To gain network access to a Pod, you can use port-forward:

  • sudo kubectl port-forward pod/nginx-deployment-8859878f8-7gfw9 80:80
Output
Forwarding from 127.0.0.1:80 -> 80
Forwarding from [::1]:80 -> 80

In this case we use sudo because local port 80 is a protected port. For most other ports you can omit sudo and run the kubectl command as your system user.

Here we forward local port 80 (preceding the colon) to the Pod’s container port 80 (after the colon).
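
The local:remote notation is easy to misread, so here is a short Python sketch that splits a port mapping into its two halves and flags local ports that typically require elevated privileges on Linux (those below 1024). parse_port_spec is a hypothetical helper written for this guide, not part of kubectl.

```python
def parse_port_spec(spec):
    """Split a port-forward mapping into (local_port, remote_port, privileged)."""
    local, sep, remote = spec.partition(":")
    if not sep:
        # A bare "80" uses the same port number on both sides.
        remote = local
    remote_port = int(remote)
    # An empty local part (":80") leaves the choice of local port to kubectl.
    local_port = int(local) if local else None
    privileged = local_port is not None and local_port < 1024
    return local_port, remote_port, privileged


for spec in ("80:80", "8080:80", ":80"):
    print(spec, parse_port_spec(spec))
```

Choosing a local port of 1024 or above, such as 8080:80, is the usual way to avoid needing sudo.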

You can also use deploy/nginx-deployment as the resource type and name to forward to. If you do this, the local port will be forwarded to a Pod selected by the Deployment.

The proxy command can be used to access the Kubernetes API server locally:

  • kubectl proxy --port=8080
Output
Starting to serve on 127.0.0.1:8080

In another shell, use curl to explore the API:

  • curl http://localhost:8080/api/
Output
{
  "kind": "APIVersions",
  "versions": [
    "v1"
  ],
  "serverAddressByClientCIDRs": [
    {
      "clientCIDR": "0.0.0.0/0",
      "serverAddress": "203.0.113.0:443"
    }
  ]
}

Close the proxy by hitting CTRL+C.

Conclusion

This guide covers some of the more common kubectl commands you may use when managing a Kubernetes cluster and workloads you’ve deployed to it.

You can learn more about kubectl by consulting the official Kubernetes reference documentation.

There are many more commands and variations that you may find useful as part of your work with kubectl. To learn more about all of your available options, you can run:

  • kubectl --help

DigitalOcean Community Tutorials

Stack Abuse: Getting Started with Python’s Wikipedia API

Introduction

In this article, we will be using the Wikipedia API to retrieve data from Wikipedia. Data scraping has seen a rapid surge owing to the increasing use of data analytics and machine learning tools. The Internet is the single largest source of information, and therefore it is important to know how to fetch data from various sources. And with Wikipedia being one of the largest and most popular sources for information on the Internet, this is a natural place to start.

In this article, we will see how to use Python’s Wikipedia API to fetch a variety of information from the Wikipedia website.

Installation

In order to extract data from Wikipedia, we must first install the Python Wikipedia library, which wraps the official Wikipedia API. This can be done by entering the command below in your command prompt or terminal:

$   pip install wikipedia 

Once the installation is done, we can use the Wikipedia API in Python to extract information from Wikipedia. In order to call the methods of the Wikipedia module in Python, we need to import it using the following command.

import wikipedia   

Searching Titles and Suggestions

The search() method runs a Wikipedia search for the query supplied as its argument and returns a list of matching article titles. For example:

import wikipedia
print(wikipedia.search("Bill"))

Output:

['Bill', 'The Bill', 'Bill Nye', 'Bill Gates', 'Bills, Bills, Bills', 'Heartbeat bill', 'Bill Clinton', 'Buffalo Bill', 'Bill & Ted', 'Kill Bill: Volume 1'] 

As you see in the output, the searched title along with the related search suggestions are displayed. You can configure the number of search titles returned by passing a value for the results parameter, as shown here:

import wikipedia
print(wikipedia.search("Bill", results=2))

Output:

['Bill', 'The Bill'] 

The above code prints only 2 search results of the query since that is how many we requested to be returned.

Let’s say we need to get the Wikipedia search suggestions for a search title, “Bill Cliton” that is incorrectly entered or has a typo. The suggest() method returns suggestions related to the search query entered as a parameter to it, or it will return “None” if no suggestions were found.

Let’s try it out here:

import wikipedia
print(wikipedia.suggest("Bill cliton"))

Output:

bill clinton   

You can see that it took our incorrect entry, “Bill cliton”, and returned the correct suggestion of “bill clinton”.

Extracting Wikipedia Article Summary

We can extract the summary of a Wikipedia article using the summary() method. The article for which the summary needs to be extracted is passed as a parameter to this method.

Let’s extract the summary for “Ubuntu”:

print(wikipedia.summary("Ubuntu"))   

Output:

Ubuntu ( (listen)) is a free and open-source Linux distribution based on Debian. Ubuntu is officially released in three editions: Desktop, Server, and Core (for the internet of things devices and robots). Ubuntu is a popular operating system for cloud computing, with support for OpenStack.Ubuntu is released every six months, with long-term support (LTS) releases every two years. The latest release is 19.04 ("Disco Dingo"), and the most recent long-term support release is 18.04 LTS ("Bionic Beaver"), which is supported until 2028. Ubuntu is developed by Canonical and the community under a meritocratic governance model. Canonical provides security updates and support for each Ubuntu release, starting from the release date and until the release reaches its designated end-of-life (EOL) date. Canonical generates revenue through the sale of premium services related to Ubuntu. Ubuntu is named after the African philosophy of Ubuntu, which Canonical translates as "humanity to others" or "I am what I am because of who we all are".   

The whole summary is printed in the output. We can customize the number of sentences in the summary text to be displayed by configuring the sentences argument of the method.

print(wikipedia.summary("Ubuntu", sentences=2))   

Output:

Ubuntu ( (listen)) is a free and open-source Linux distribution based on Debian. Ubuntu is officially released in three editions: Desktop, Server, and Core (for the internet of things devices and robots).   

As you can see, only 2 sentences of Ubuntu’s text summary are printed.

However, keep in mind that wikipedia.summary will raise a DisambiguationError if the title is ambiguous, and a PageError if the page does not exist. Let’s see an example.

print(wikipedia.summary("key"))   

The above code throws a DisambiguationError since there are many articles that would match “key”.

Output:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/wikipedia/util.py", line 28, in __call__
    ret = self._cache[key] = self.fn(*args, **kwargs)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 231, in summary
    page_info = page(title, auto_suggest=auto_suggest, redirect=redirect)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 276, in page
    return WikipediaPage(title, redirect=redirect, preload=preload)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 299, in __init__
    self.__load(redirect=redirect, preload=preload)
  File "/Library/Python/2.7/site-packages/wikipedia/wikipedia.py", line 393, in __load
    raise DisambiguationError(getattr(self, 'title', page['title']), may_refer_to)
wikipedia.exceptions.DisambiguationError: "Key" may refer to:
Key (cryptography)
Key (lock)
Key (map)
...

If you had wanted the summary on a “cryptography key”, for example, then you’d have to enter it as the following:

print(wikipedia.summary("Key (cryptography)"))   

With the more specific query we now get the correct summary in the output.
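
In a script, you would usually want to catch this error rather than let it stop the program. The sketch below shows one possible pattern. To keep it runnable without network access, it uses stand-ins: FakeDisambiguationError and fake_summary are hypothetical test doubles, and summary_or_options is a helper name invented for this example. It assumes the real exception exposes its candidate titles through an options attribute.

```python
class FakeDisambiguationError(Exception):
    """Hypothetical stand-in for wikipedia.exceptions.DisambiguationError."""

    def __init__(self, title, options):
        super().__init__(title)
        self.title = title
        self.options = options


def summary_or_options(title, fetch, ambiguous_error):
    """Return the summary, or the candidate titles if the query is ambiguous."""
    try:
        return {"summary": fetch(title), "options": []}
    except ambiguous_error as err:
        return {"summary": None, "options": list(err.options)}


def fake_summary(title):
    """Stand-in for wikipedia.summary that mimics the ambiguous "key" lookup."""
    if title == "key":
        raise FakeDisambiguationError("Key", ["Key (cryptography)", "Key (lock)", "Key (map)"])
    return f"Summary of {title}"


print(summary_or_options("key", fake_summary, FakeDisambiguationError))
print(summary_or_options("Key (cryptography)", fake_summary, FakeDisambiguationError))
```

With the library installed, the same helper could be called as summary_or_options("key", wikipedia.summary, wikipedia.exceptions.DisambiguationError), after which you can present the returned options to the user or pick one programmatically.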

Retrieving Full Wikipedia Page Data

In order to get the contents, categories, coordinates, images, links and other metadata of a Wikipedia page, we must first get the Wikipedia page object or the page ID for the page. To do this, the page() method is used with the page title passed as an argument to the method.

Look at the following example:

wikipedia.page("Ubuntu")   

This method call will return a WikipediaPage object, which we’ll explore more in the next few sections.

Extracting Metadata of a Page

To get the complete plain text content of a Wikipedia page (excluding images, tables, etc.), we can use the content attribute of the page object.

print(wikipedia.page("Python").content)   

Output:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aims to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and  functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles.   ... 

Similarly, we can get the URL of the page using the url attribute:

print(wikipedia.page("Python").url)   

Output:

https://en.wikipedia.org/wiki/Python_(programming_language)   

We can get the URLs of external links on a Wikipedia page by using the references property of the WikipediaPage object.

print(wikipedia.page("Python").references)   

Output:

[u'http://www.computerworld.com.au/index.php/id;66665771', u'http://neopythonic.blogspot.be/2009/04/tail-recursion-elimination.html', u'http://www.amk.ca/python/writing/gvr-interview', u'http://cdsweb.cern.ch/journal/CERNBulletin/2006/31/News%20Articles/974627?ln=en', u'http://www.2ality.com/2013/02/javascript-influences.html', ...] 

The title property of the WikipediaPage object can be used to extract the title of the page.

print(wikipedia.page("Python").title)   

Output:

Python (programming language)   

Similarly, the categories attribute can be used to get the list of categories of a Wikipedia page:

print(wikipedia.page("Python").categories)   

Output

['All articles containing potentially dated statements', 'Articles containing potentially dated statements from August 2016', 'Articles containing potentially dated statements from December 2018', 'Articles containing potentially dated statements from March 2018', 'Articles with Curlie links', 'Articles with short description', 'Class-based programming languages', 'Computational notebook', 'Computer science in the Netherlands', 'Cross-platform free software', 'Cross-platform software', 'Dutch inventions', 'Dynamically typed programming languages', 'Educational programming languages', 'Good articles', 'High-level programming languages', 'Information technology in the Netherlands', 'Object-oriented programming languages', 'Programming languages', 'Programming languages created in 1991', 'Python (programming language)', 'Scripting languages', 'Text-oriented programming languages', 'Use dmy dates from August 2015', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with SUDOC identifiers'] 

The links attribute of the WikipediaPage object can be used to get the list of titles of the pages that the page links to.

print(wikipedia.page("Ubuntu").links)   

Output

[u'/e/ (operating system)', u'32-bit', u'4MLinux', u'ALT Linux', u'AMD64', u'AOL', u'APT (Debian)', u'ARM64', u'ARM architecture', u'ARM v7', ...] 

Finding Pages Based on Coordinates

The geosearch() method is used to do a Wikipedia geo search using latitude and longitude arguments supplied as float or decimal numbers to the method.

print(wikipedia.geosearch(37.787, -122.4))   

Output:

['140 New Montgomery', 'New Montgomery Street', 'Cartoon Art Museum', 'San Francisco Bay Area Planning and Urban Research Association', 'Academy of Art University', 'The Montgomery (San Francisco)', 'California Historical Society', 'Palace Hotel Residential Tower', 'St. Regis Museum Tower', 'Museum of the African Diaspora'] 

As you see, the above method returns articles based on the coordinates provided.
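
Under the hood, a geo search like this is a query against the MediaWiki API. As a rough illustration of what such a request might look like (this is a sketch, not the library's actual code), the following snippet builds a geosearch URL with the standard library only. The geosearch_url helper and its default values are assumptions made for this example, and no request is sent.

```python
from urllib.parse import urlencode


def geosearch_url(latitude, longitude, radius=1000, results=10, lang="en"):
    """Build a MediaWiki API geosearch URL (hypothetical helper, illustration only)."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gscoord": f"{latitude}|{longitude}",  # latitude and longitude joined by a pipe
        "gsradius": radius,                    # search radius in meters
        "gslimit": results,                    # maximum number of pages to return
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


url = geosearch_url(37.787, -122.4)
print(url)
```

Seeing the raw parameters makes it clearer why the library asks for both coordinates as numbers: they are combined into a single coordinate string before the request is made.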

Conversely, for a geotagged article, the coordinates property of the WikipediaPage object returns the page’s latitude and longitude as a tuple. For example:

print(wikipedia.page("140 New Montgomery").coordinates)   

This prints a (latitude, longitude) tuple of Decimal values for the article.

Language Settings

You can customize the language of a Wikipedia page to your native language, provided the page exists in your native language. To do so, you can use the set_lang() method. Each language has a standard prefix code which is passed as an argument to the method. For example, let’s get the first 2 sentences of the summary text of “Ubuntu” wiki page in the German language.

wikipedia.set_lang("de")
print(wikipedia.summary("ubuntu", sentences=2))

Output

Ubuntu (auch Ubuntu Linux) ist eine Linux-Distribution, die auf Debian basiert. Der Name Ubuntu bedeutet auf Zulu etwa „Menschlichkeit“ und bezeichnet eine afrikanische Philosophie.   

You can check the list of currently supported languages along with their prefixes, as follows:

print(wikipedia.languages())   

Retrieving Images in a Wikipedia Page

The images list of the WikipediaPage object can be used to fetch images from a Wikipedia page. For instance, the following script returns the first image from Wikipedia’s Ubuntu page:

print(wikipedia.page("ubuntu").images[0])   

Output

https://upload.wikimedia.org/wikipedia/commons/1/1d/Bildschirmfoto_zu_ubuntu_704.png   

The above code returns the URL of the image present at index 0 in the Wikipedia page.

To see the image, you can copy and paste the above URL into your browser.

Retrieving Full HTML Page Content

To get the full Wikipedia page in HTML format, you can use the following script:

print(wikipedia.page("Ubuntu").html())   

Output

<div class="mw-parser-output"><div role="note" class="hatnote navigation-not-searchable">For the African philosophy, see <a href="/wiki/Ubuntu_philosophy" title="Ubuntu philosophy">Ubuntu philosophy</a>. For other uses, see <a href="/wiki/Ubuntu_(disambiguation)" class="mw-disambig" title="Ubuntu (disambiguation)">Ubuntu (disambiguation)</a>.</div>   <div class="shortdescription nomobile noexcerpt noprint searchaux" style="display:none">Linux distribution based on Debian</div>   ... 

As seen in the output, the entire page in HTML format is displayed. This can take a bit longer to load if the page size is large, so keep in mind that it can raise an HTTPTimeoutError when a request to the server times out.

Conclusion

In this tutorial, we had a glimpse of using the Wikipedia API for extracting data from the web. We saw how to get a variety of information such as a page’s title, category, links, images, and retrieve articles based on geo-locations.

Planet Python