Real Python: How to Use sorted() and sort() in Python

All programmers will have to write code to sort items or data at some point. Sorting can be critical to the user experience in your application, whether it’s ordering a user’s most recent activity by timestamp, or putting a list of email recipients in alphabetical order by last name. Python sorting functionality offers robust features to do basic sorting or customize ordering at a granular level.

In this guide, you’ll learn how to sort various types of data in different data structures, customize the order, and work with two different methods of sorting in Python.

By the end of this tutorial, you’ll know how to:

  • Implement basic Python sorting and ordering on data structures
  • Differentiate between sorted() and .sort()
  • Customize a complex sort order in your code based on unique requirements

For this tutorial, you’ll need a basic understanding of lists and tuples as well as sets. Those data structures will be used in this tutorial, and some basic operations will be performed on them. Also, this tutorial uses Python 3, so example output in this tutorial might vary slightly if you’re using Python 2.


Ordering Values With sorted()

To get started with Python sorting, you’re first going to see how to sort both numeric data and string data.

Sorting Numbers

You can use Python to sort a list by using sorted(). In this example, a list of integers is defined, and then sorted() is called with the numbers variable as the argument:

>>> numbers = [6, 9, 3, 1]
>>> sorted(numbers)
[1, 3, 6, 9]
>>> numbers
[6, 9, 3, 1]

The output from this code is a new, sorted list. When the original variable is printed, the initial values are unchanged.

This example shows four important characteristics of sorted():

  1. The function sorted() did not have to be defined. It’s a built-in function that is available in a standard installation of Python.
  2. sorted(), with no additional arguments or parameters, is ordering the values in numbers in an ascending order, meaning smallest to largest.
  3. The original numbers variable is unchanged because sorted() provides sorted output and does not change the original value in place.
  4. When sorted() is called, it provides an ordered list as a return value.

This last point means that sorted() can be used on a list, and the output can immediately be assigned to a variable:

>>> numbers = [6, 9, 3, 1]
>>> numbers_sorted = sorted(numbers)
>>> numbers_sorted
[1, 3, 6, 9]
>>> numbers
[6, 9, 3, 1]

In this example, there is now a new variable numbers_sorted that stores the output of sorted().

You can confirm all of these observations by calling help() on sorted(). The optional arguments key and reverse will be covered later in the tutorial:

>>> # Python 3
>>> help(sorted)
Help on built-in function sorted in module builtins:

sorted(iterable, /, *, key=None, reverse=False)
    Return a new list containing all items from the iterable in ascending order.

    A custom key function can be supplied to customize the sort order, and the
    reverse flag can be set to request the result in descending order.

Technical Detail: If you’re transitioning from Python 2 and are familiar with its function of the same name, you should be aware of a couple of important changes in Python 3:

  1. Python 3’s sorted() does not have a cmp parameter. Instead, only key is used to introduce custom sorting logic.
  2. key and reverse must be passed as keyword arguments, unlike in Python 2, where they could be passed as positional arguments.

If you need to convert a Python 2 cmp function to a key function, then check out functools.cmp_to_key(). This tutorial will not cover any examples using Python 2.
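As a rough sketch of that conversion, the snippet below wraps a Python 2 style comparison function with functools.cmp_to_key() so that it can be passed as key. The function name compare_length and the sample words are made up for illustration:

```python
from functools import cmp_to_key


# Hypothetical Python 2 style comparison function: it returns a negative
# number, zero, or a positive number depending on how a and b should be ordered.
def compare_length(a, b):
    return len(a) - len(b)


words = ['banana', 'pie', 'Washington', 'book']

# cmp_to_key() wraps the comparison function in an object usable as a key.
words_by_length = sorted(words, key=cmp_to_key(compare_length))
print(words_by_length)  # ['pie', 'book', 'banana', 'Washington']
```

When the ordering can be expressed with a single-argument function (here, key=len would give the same result), a plain key function is usually the simpler choice.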

sorted() can be used on tuples and sets very similarly:

>>> numbers_tuple = (6, 9, 3, 1)
>>> numbers_set = {5, 5, 10, 1, 0}
>>> numbers_tuple_sorted = sorted(numbers_tuple)
>>> numbers_set_sorted = sorted(numbers_set)
>>> numbers_tuple_sorted
[1, 3, 6, 9]
>>> numbers_set_sorted
[0, 1, 5, 10]

Notice how even though the input was a set and a tuple, the output is a list because sorted() returns a new list by definition. The returned object can be cast to a new type if it needs to match the input type. Be careful if attempting to cast the resulting list back to a set, as a set by definition is unordered:

>>> numbers_tuple = (6, 9, 3, 1)
>>> numbers_set = {5, 5, 10, 1, 0}
>>> numbers_tuple_sorted = sorted(numbers_tuple)
>>> numbers_set_sorted = sorted(numbers_set)
>>> numbers_tuple_sorted
[1, 3, 6, 9]
>>> numbers_set_sorted
[0, 1, 5, 10]
>>> tuple(numbers_tuple_sorted)
(1, 3, 6, 9)
>>> set(numbers_set_sorted)
{0, 1, 10, 5}

The numbers_set_sorted value when cast to a set is not ordered, as expected. The other variable, numbers_tuple_sorted, retained the sorted order.

Sorting Strings

str types sort similarly to other iterables, like list and tuple. The example below shows how sorted() iterates through each character in the value passed to it and orders them in the output:

>>> string_number_value = '34521'
>>> string_value = 'I like to sort'
>>> sorted_string_number = sorted(string_number_value)
>>> sorted_string = sorted(string_value)
>>> sorted_string_number
['1', '2', '3', '4', '5']
>>> sorted_string
[' ', ' ', ' ', 'I', 'e', 'i', 'k', 'l', 'o', 'o', 'r', 's', 't', 't']

sorted() will treat a str like a list and iterate through each element. In a str, each element means each character in the str. sorted() will not treat a sentence differently, and it will sort each character, including spaces.

.split() can change this behavior and clean up the output, and .join() can put it all back together. We will cover the specific order of the output and why it is that way shortly:

>>> string_value = 'I like to sort'
>>> sorted_string = sorted(string_value.split())
>>> sorted_string
['I', 'like', 'sort', 'to']
>>> ' '.join(sorted_string)
'I like sort to'

The original sentence in this example is converted into a list of words instead of leaving it as a str. That list is then sorted and combined to form a str again instead of a list.

Limitations and Gotchas With Python Sorting

It is worth noting some limitations and odd behavior that can arise when you’re using Python to sort values besides integers.

Lists With Non-Comparable Data Types Can’t Be sorted()

There are data types that can’t be compared to each other using just sorted() because they are too different. Python will return an error if you attempt to use sorted() on a list containing non-comparable data. In this example, a None and an int in the same list can’t be sorted because of their incompatibility:

>>> mixed_types = [None, 0]
>>> sorted(mixed_types)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'int' and 'NoneType'

This error shows why Python can’t sort the values given to it. It’s trying to put the values in order by using the less than operator (<) to determine which value is lower in sorting order. You can replicate this error by manually comparing the two values:

>>> None < 0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'NoneType' and 'int'

The same TypeError is thrown when you try to compare two non-comparable values without using sorted().

If the values within the list can be compared and will not throw a TypeError, then the list can be sorted. This prevents sorting iterables with intrinsically unorderable values and producing output that may not make sense.

For example, should the number 1 come before the word apple? However, if an iterable contains a combination of integers and strings that are all numbers, they can be cast to comparable data types by using a list comprehension:

>>> mixed_numbers = [5, "1", 100, "34"]
>>> sorted(mixed_numbers)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'str' and 'int'
>>> # List comprehension to convert all values to integers
>>> [int(x) for x in mixed_numbers]
[5, 1, 100, 34]
>>> sorted([int(x) for x in mixed_numbers])
[1, 5, 34, 100]

Each element in mixed_numbers has int() called on it to convert any str values to int values. sorted() is then called and can successfully compare each element and provide a sorted output.

Python can also compare values of closely related types. In Python, bool is a subclass of int, so True behaves like 1 and False behaves like 0 in comparisons. In the example below, the evaluation of 1 <= 0 is a false statement, so the output of the evaluation will be False.

Even though the elements in the list look different, they are all Booleans or integers, so they can be compared to each other using sorted():

>>> similar_values = [False, 0, 1, 'A' == 'B', 1 <= 0]
>>> sorted(similar_values)
[False, 0, False, False, 1]

'A' == 'B' and 1 <= 0 both evaluate to False and are returned in the ordered output.

This example illustrates an important aspect of sorting: sort stability. In Python, when you sort equal values, they will retain their original order in the output. Even though the 1 moved, all the other values are equal so they retain their original order relative to each other. In the example below, all the values are considered equal and will retain their original positions:

>>> false_values = [False, 0, 0, 1 == 2, 0, False, False]
>>> sorted(false_values)
[False, 0, 0, False, 0, False, False]

If you inspect the original order and the sorted output, you will see that 1 == 2 evaluates to False, and all sorted output is in the original order.
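Stability can be made a little more visible with values that compare as equal but print differently. The sketch below is illustrative only and uses made-up variable names. It relies on 1.0, True, and 1 all being equal to each other:

```python
# 1.0, True, and 1 all compare as equal, while 0 is smaller than each of them.
equal_values = [1.0, True, 1, 0]
sorted_equal = sorted(equal_values)

# 0 moves to the front; the three equal values keep their original relative order.
print(sorted_equal)  # [0, 1.0, True, 1]
```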

When You’re Sorting Strings, Case Matters

sorted() can be used on a list of strings to sort the values in ascending order, which appears to be alphabetically by default:

>>> names = ['Harry', 'Suzy', 'Al', 'Mark']
>>> sorted(names)
['Al', 'Harry', 'Mark', 'Suzy']

However, Python is using the Unicode Code Point of the first letter in each string to determine ascending sort order. This means that sorted() will not treat the names Al and al the same. This example uses ord() to return the Unicode Code Point of the first letter in each string:

>>> names_with_case = ['harry', 'Suzy', 'al', 'Mark']
>>> sorted(names_with_case)
['Mark', 'Suzy', 'al', 'harry']
>>> # List comprehension for Unicode Code Point of first letter in each word
>>> [(ord(name[0]), name[0]) for name in sorted(names_with_case)]
[(77, 'M'), (83, 'S'), (97, 'a'), (104, 'h')]

name[0] is returning the first character in each element of sorted(names_with_case), and ord() is providing the Unicode Code Point. Even though a comes before M in the alphabet, the code point for M comes before a, so the sorted output has M first.

If the first letter is the same, then sorted() will use the second character to determine order, and the third character if that is the same, and so on, all the way to the end of the string:

>>> very_similar_strs = ['hhhhhd', 'hhhhha', 'hhhhhc', 'hhhhhb']
>>> sorted(very_similar_strs)
['hhhhha', 'hhhhhb', 'hhhhhc', 'hhhhhd']

Each value of very_similar_strs is identical except for the last character. sorted() will compare the strings, and because the first five characters are the same, the output will be based on the sixth character.

Strings that share the same leading characters end up sorted shortest to longest, because a shorter string that runs out of characters to compare is ordered before a longer string that starts with it:

>>> different_lengths = ['hhhh', 'hh', 'hhhhh', 'h']
>>> sorted(different_lengths)
['h', 'hh', 'hhhh', 'hhhhh']

The shortest string, h, is ordered first with the longest, hhhhh, ordered last.

Using sorted() With a reverse Argument

As shown in the help() documentation for sorted(), there is an optional keyword argument called reverse, which will change the sorting behavior based on the Boolean assigned to it. If reverse is assigned True, then the sorting will be in descending order:

>>> names = ['Harry', 'Suzy', 'Al', 'Mark']
>>> sorted(names)
['Al', 'Harry', 'Mark', 'Suzy']
>>> sorted(names, reverse=True)
['Suzy', 'Mark', 'Harry', 'Al']

The sorting logic remains the same, meaning that the names are still being sorted by their first letter. But the output has been reversed with the reverse keyword set to True.

When False is assigned, the ordering will remain ascending. Any of the previous examples can be used to see the behavior of reverse using both True or False:

>>> names_with_case = ['harry', 'Suzy', 'al', 'Mark']
>>> sorted(names_with_case, reverse=True)
['harry', 'al', 'Suzy', 'Mark']
>>> similar_values = [False, 1, 'A' == 'B', 1 <= 0]
>>> sorted(similar_values, reverse=True)
[1, False, False, False]
>>> numbers = [6, 9, 3, 1]
>>> sorted(numbers, reverse=False)
[1, 3, 6, 9]

sorted() With a key Argument

One of the most powerful components of sorted() is the keyword argument called key. This argument expects a function to be passed to it, and that function will be used on each value in the list being sorted to determine the resulting order.

To demonstrate a basic example, let’s assume the requirement for ordering a specific list is the length of the strings in the list, shortest to longest. The function to return the length of a string, len(), will be used with the key argument:

>>> word = 'paper'
>>> len(word)
5
>>> words = ['banana', 'pie', 'Washington', 'book']
>>> sorted(words, key=len)
['pie', 'book', 'banana', 'Washington']

The resulting order is a list with a string order of shortest to longest. The length of each element in the list is determined by len() and then returned in ascending order.

Let’s return to the earlier example of sorting by first letter when the case is different. key can be used to solve that problem by converting the entire string to lowercase:

>>> names_with_case = ['harry', 'Suzy', 'al', 'Mark']
>>> sorted(names_with_case)
['Mark', 'Suzy', 'al', 'harry']
>>> sorted(names_with_case, key=str.lower)
['al', 'harry', 'Mark', 'Suzy']

The output values have not been converted to lowercase because key does not manipulate the data in the original list. During sorting, the function passed to key is being called on each element to determine sort order, but the original values will be in the output.

There are two main limitations when you’re using functions with the key argument.

First, the number of required arguments in the function passed to key must be one.

The example below shows the definition of an addition function that takes two arguments. When that function is used in key on a list of numbers, it fails because it is missing a second argument. Each time add() is called during the sort, it is only receiving one element from the list at a time:

>>> def add(x, y):
...     return x + y
...
>>> values_to_add = [1, 2, 3]
>>> sorted(values_to_add, key=add)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: add() missing 1 required positional argument: 'y'

The second limitation is that the function used with key must be able to handle all the values in the iterable. For example, you have a list of numbers represented as strings to be used in sorted(), and key is going to attempt to convert them to numbers using int. If a value in the iterable can’t be cast to an integer, then the function will fail:

>>> values_to_cast = ['1', '2', '3', 'four']
>>> sorted(values_to_cast, key=int)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: 'four'

Each numeric value as a str can be converted to int, but four can’t. This raises a ValueError, which explains that four is not a valid literal for int().
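One way around this limitation is a key function that handles the problem values itself. The sketch below is only one possible approach, and the helper name int_or_last is hypothetical. It returns a tuple so that anything int() rejects is ordered after the numeric strings:

```python
def int_or_last(value):
    """Hypothetical key: numeric strings sort by value, anything else goes last."""
    try:
        return (0, int(value))
    except ValueError:
        # Non-numeric strings share one key, so they keep their relative order.
        return (1, 0)


values_to_cast = ['3', 'four', '1', '2']
safely_sorted = sorted(values_to_cast, key=int_or_last)
print(safely_sorted)  # ['1', '2', '3', 'four']
```

Whether unconvertible values should go last, go first, or be filtered out beforehand depends on the data, so treat the tuple layout here as an assumption rather than a rule.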

The key functionality is extremely powerful because almost any function, built-in or user-defined, can be used to manipulate the output order.

If the ordering requirement is to order an iterable by the last letter in each string (and if the letter is the same, then to use the next letter), then a function can be defined and then used in the sorting. The example below defines a function that reverses the string passed to it, and then that function is used as the argument for key:

>>> def reverse_word(word):
...     return word[::-1]
...
>>> words = ['banana', 'pie', 'Washington', 'book']
>>> sorted(words, key=reverse_word)
['banana', 'pie', 'book', 'Washington']

The word[::-1] slice syntax is used to reverse a string. Each element will have reverse_word() applied to it, and the sorting order will be based on the characters in the backwards word.

Instead of writing a standalone function, you can use a lambda function defined in the key argument.

A lambda is an anonymous function that:

  1. Must be defined inline
  2. Doesn’t have a name
  3. Can’t contain statements
  4. Will execute just like a function

In the example below, the key is defined as a lambda with no name, the argument taken by the lambda is x, and x[::-1] is the operation that will be performed on the argument:

>>> words = ['banana', 'pie', 'Washington', 'book']
>>> sorted(words, key=lambda x: x[::-1])
['banana', 'pie', 'book', 'Washington']

x[::-1] is called on each element and reverses the word. That reversed output is then used for sorting, but the original words are still returned.

If the requirement changes, and the order should be reversed as well, then the reverse keyword can be used alongside the key argument:

>>> words = ['banana', 'pie', 'Washington', 'book']
>>> sorted(words, key=lambda x: x[::-1], reverse=True)
['Washington', 'book', 'pie', 'banana']

lambda functions are also useful when you need to sort class objects based on a property. If you have a group of students and need to sort them by their final grade, highest to lowest, then a lambda can be used to get the grade property from the class:

>>> from collections import namedtuple

>>> StudentFinal = namedtuple('StudentFinal', 'name grade')
>>> bill = StudentFinal('Bill', 90)
>>> patty = StudentFinal('Patty', 94)
>>> bart = StudentFinal('Bart', 89)
>>> students = [bill, patty, bart]
>>> sorted(students, key=lambda x: getattr(x, 'grade'), reverse=True)
[StudentFinal(name='Patty', grade=94), StudentFinal(name='Bill', grade=90), StudentFinal(name='Bart', grade=89)]

This example uses namedtuple to produce classes with name and grade attributes. The lambda calls getattr() on each element and returns the value for grade.

reverse is set to True to flip the ascending output to descending so that the highest grades are ordered first.
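As an alternative to the lambda, the standard library’s operator.attrgetter() can build the same kind of key function. The sketch below repeats the student data so that it runs on its own. The variable name by_grade is illustrative:

```python
from collections import namedtuple
from operator import attrgetter

StudentFinal = namedtuple('StudentFinal', 'name grade')
students = [
    StudentFinal('Bill', 90),
    StudentFinal('Patty', 94),
    StudentFinal('Bart', 89),
]

# attrgetter('grade') returns a callable that fetches .grade from its argument,
# which is what the lambda with getattr() does by hand.
by_grade = sorted(students, key=attrgetter('grade'), reverse=True)
print([student.name for student in by_grade])  # ['Patty', 'Bill', 'Bart']
```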

The possibilities are endless for how ordering can be done when you leverage both the key and reverse keyword arguments on sorted(). Code can be kept clean and short when you use a basic lambda for a small function, or you can write a whole new function, import it, and use it in the key argument.

Ordering Values With .sort()

The very similarly named .sort() differs quite a bit from the sorted() built-in. They accomplish more or less the same thing, but the help() documentation for list.sort() highlights two of the most critical differences between .sort() and sorted():

>>> # Python2
Help on method_descriptor:

sort(...)
    L.sort(cmp=None, key=None, reverse=False) -- stable sort *IN PLACE*;
    cmp(x, y) -> -1, 0, 1

>>> # Python3
>>> help(list.sort)
Help on method_descriptor:

sort(...)
    L.sort(key=None, reverse=False) -> None -- stable sort *IN PLACE*

First, .sort() is a method of the list class and can only be used with lists. It is not a built-in function that accepts any iterable.

Second, .sort() returns None and modifies the values in place. Let’s take a look at the impacts of both of these differences in code:

>>> values_to_sort = [5, 2, 6, 1]
>>> # Try to call .sort() like sorted()
>>> sort(values_to_sort)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'sort' is not defined

>>> # Try to use .sort() on a tuple
>>> tuple_val = (5, 1, 3, 5)
>>> tuple_val.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'tuple' object has no attribute 'sort'

>>> # Sort the list and assign to new variable
>>> sorted_values = values_to_sort.sort()
>>> print(sorted_values)
None

>>> # Print original variable
>>> print(values_to_sort)
[1, 2, 5, 6]

There are some pretty dramatic differences in how .sort() operates compared to sorted() in this code example:

  1. There is no ordered output of .sort(), so the assignment to a new variable only passes a None type.
  2. The values_to_sort list has been changed in place, and the original order is not maintained in any way.

These differences in behavior make .sort() and sorted() absolutely not interchangeable in code, and they can produce wildly unexpected outcomes if one is used in the wrong way.

.sort() has the same key and reverse optional keyword arguments that produce the same robust functionality as sorted(). Here, you can sort a list of phrases by the second letter of the third word and return the list in reverse:

>>> phrases = ['when in rome',
...     'what goes around comes around',
...     'all is fair in love and war'
...     ]
>>> phrases.sort(key=lambda x: x.split()[2][1], reverse=True)
>>> phrases
['what goes around comes around', 'when in rome', 'all is fair in love and war']

In this sample, a lambda is used to do the following:

  1. Split each phrase into a list of words
  2. Find the third element, or word in this case
  3. Find the second letter in that word
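The three steps above can also be written out as a named function, which may be easier to read and test than a dense lambda. This is a sketch only, and the function name second_letter_of_third_word is made up for illustration:

```python
def second_letter_of_third_word(phrase):
    words = phrase.split()   # 1. Split the phrase into a list of words
    third_word = words[2]    # 2. Find the third element
    return third_word[1]     # 3. Find the second letter in that word


phrases = [
    'when in rome',
    'what goes around comes around',
    'all is fair in love and war',
]

# Same in-place sort as before, with the named function in place of the lambda.
phrases.sort(key=second_letter_of_third_word, reverse=True)
print(phrases)
# ['what goes around comes around', 'when in rome', 'all is fair in love and war']
```

Like the lambda, this key raises an IndexError for any phrase with fewer than three words or a third word shorter than two letters.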

When to Use sorted() and When to Use .sort()

You’ve seen the differences between sorted() and .sort(), but when do you use which?

Let’s say there is a 5k race coming up: The First Annual Python 5k. The data from the race needs to be captured and sorted. The data that needs to be captured is the runner’s bib number and the number of seconds it took to finish the race:

>>> from collections import namedtuple

>>> Runner = namedtuple('Runner', 'bibnumber duration')

As the runners cross the finish line, each Runner will be added to a list called runners. In 5k races, not all runners cross the starting line at the same time, so the first person to cross the finish line might not actually be the fastest person:

>>> runners = []
>>> runners.append(Runner('2528567', 1500))
>>> runners.append(Runner('7575234', 1420))
>>> runners.append(Runner('2666234', 1600))
>>> runners.append(Runner('2425234', 1490))
>>> runners.append(Runner('1235234', 1620))
>>> # Thousands and Thousands of entries later...
>>> runners.append(Runner('2526674', 1906))

Each time a runner crosses the finish line, their bib number and their total duration in seconds is added to runners.

Now, the dutiful programmer in charge of handling the outcome data sees this list, knows that the top 5 fastest participants are the winners that get prizes, and the remaining runners are going to be sorted by fastest time.

There are no requirements for multiple types of sorting by various attributes. The list is a reasonable size. There is no mention of storing the list somewhere. Just sort by duration and grab the five participants with the lowest duration:

>>> runners.sort(key=lambda x: getattr(x, 'duration'))
>>> top_five_runners = runners[:5]

The programmer chooses to use a lambda in the key argument to get the duration attribute from each runner and sort runners in place using .sort(). After runners is sorted, the first 5 elements are stored in top_five_runners.

Mission accomplished! The race director comes over and informs the programmer that since the current release of Python is 3.7, they have decided that every thirty-seventh person that crossed the finish line is going to get a free gym bag.

At this point, the programmer starts to sweat because the list of runners has been irreversibly changed. There is no way to recover the original list of runners in the order they finished and find every thirty-seventh person.

If you’re working with important data, and there is even a remote possibility that the original data will need to be recovered, then .sort() is not the best option. If the data is a copy, or if it is unimportant working data, or if no one will mind losing it because it can be retrieved, then .sort() can be a fine option.
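If .sort() is still preferred, one way to stay on the safe side is to sort a copy. The sketch below uses a shortened, illustrative runner list and a made-up variable name, runners_working_copy. It relies on list.copy(), which makes a shallow copy of the list:

```python
from collections import namedtuple

Runner = namedtuple('Runner', 'bibnumber duration')
runners = [
    Runner('2528567', 1500),
    Runner('7575234', 1420),
    Runner('2666234', 1600),
]

# Sort a shallow copy in place; the original finish order stays untouched.
runners_working_copy = runners.copy()
runners_working_copy.sort(key=lambda runner: runner.duration)

print(runners_working_copy[0])  # Runner(bibnumber='7575234', duration=1420)
print(runners[0])               # Runner(bibnumber='2528567', duration=1500)
```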

Alternatively, the runners could have been sorted using sorted() and using the same lambda:

>>> runners_by_duration = sorted(runners, key=lambda x: getattr(x, 'duration'))
>>> top_five_runners = runners_by_duration[:5]

In this scenario with sorted(), the original list of runners is still intact and has not been overwritten. The impromptu requirement of finding every thirty-seventh person to cross the finish line can be accomplished by interacting with the original values:

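>>> every_thirtyseventh_runners = runners[36::37]

every_thirtyseventh_runners is created by using a start index and a stride in the list slice syntax on runners, which still contains the original order in which the runners crossed the finish line. The slice starts at index 36, which is the thirty-seventh finisher because list indexes start at 0, and then takes every thirty-seventh runner after that.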

How to Sort in Python: Conclusion

.sort() and sorted() can provide exactly the sort order you need if you use them properly with both the reverse and key optional keyword arguments.

The two have very different characteristics when it comes to output and in-place modifications, so think through any application functionality or program that will be using .sort(), as it can irrevocably overwrite data.

For the avid Pythonistas looking for a challenge with sorting, try using more complex data types in sorting: nested iterables. Also, feel free to dive into the open source Python code implementations for the built-ins and read about the sort algorithm used in Python called Timsort.
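As a small starting point for the nested-iterable challenge, the sketch below sorts a list of tuples by more than one field. The data and variable names are invented for illustration. The key returns a tuple, so ties on the first field are broken by the second:

```python
# Each result is a (bib number, duration in seconds) pair; the values are made up.
results = [
    ('2528567', 1500),
    ('7575234', 1420),
    ('1235234', 1500),
    ('2425234', 1490),
]

# Sort by duration first, then by bib number for runners with equal durations.
ordered = sorted(results, key=lambda pair: (pair[1], pair[0]))
print(ordered)
# [('7575234', 1420), ('2425234', 1490), ('1235234', 1500), ('2528567', 1500)]
```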



Planet Python

Real Python: How to Get the Most Out of PyCon

Congratulations! You’re going to PyCon!

Whether this is your first time or not, going to a conference full of people who love the same thing as you is always a fun experience. There’s so much more to PyCon than just a bunch of people talking about the Python language, and that can be intimidating for first-time attendees. This guide will help you navigate all there is to see and do at PyCon.

PyCon (US/North America) is the biggest conference centered around the Python language. Originally formed in 2003, this conference has grown exponentially and has even spawned several other PyCons and workshops around the world.

This year will mark my fourth year in a row attending PyCon, so I thought that I would share some of the notes that I’ve taken over the years to help you get the most out of your PyCon experience.

Everyone who attends PyCon will have a different experience, and that’s what makes the conference truly unique. This guide is meant to help you, but you don’t need to follow it strictly.

This guide will have links that are specific to PyCon 2019, but it should be useful for future PyCons as well.

By the end of this article, you’ll know:

  • What PyCon involves
  • What to do before you go
  • What to do while you’re at PyCon
  • What to do after PyCon
  • How to have a great PyCon


What PyCon Involves

Before we go into how to get the most out of PyCon, let’s first establish what PyCon involves.

PyCon is broken up into three stages, as described by the official PyCon site:

  1. Tutorials: We start with two days of 3-hour classes during which you get to learn in depth with instructors. These are great to go to since the class sizes are small, and you can ask questions of the instructors. I highly recommend going to at least one of these if you can, but they do have an additional cost of $150 per class.

  2. Conference: Next, we have three days of talks. Each presentation lasts for 30 to 45 minutes, and there are about 5 talks going on at a time. But that’s not all: there are also open spaces, sponsors, lightning talks, dinners, and so much more.

  3. Sprints: During this stage, you can take what you’ve learned and apply it! This is a 3-day exercise in which people group up to work on various open source projects all related to Python. If you’ve got the time, I highly suggest that you go to this as it’s a great way to practice what you’ve learned, become associated with an open source project, and network with very smart and talented people.

Since most PyCon attendees go to the Conference part, that will be the focus of this article. However, don’t let that deter you from attending the Tutorials or Sprints if you can!

I’ve personally found that I’ve learned more technical skills by attending the Tutorials rather than listening to the talks. The Sprints are great for networking and applying the skills that you’ve already got as well as learning new ones from the people you’ll be working with.

What to Do Before You Go

In general, the more prepared you are for something, the better your experience will be. The same applies for PyCon.

It’s really helpful to plan and prepare ahead of time, which you’re already doing just by reading this article!

Look through the talk schedule and see which talks sound most interesting to you. This doesn’t mean you need to plan out all of the talks you are going to see for every slot possible, but it helps to get an idea of which topics are going to be presented so you can decide what you’re most interested in.

Getting the Guidebook app will help you plan your schedule. This app lets you view the schedule for the talks and add reminders for the ones you want to attend. If you’re having a hard time picking which talks to go to, you can come prepared with a question or problem you need solved. Doing this can help you focus on the topics that are important to you.

If you can, come the day before to check in and attend the opening reception. The line up to check in on the first day is always long, so you’ll save time if you check in the day before. There’s also usually an opening reception that evening, so you can meet other attendees and speakers, as well as get a chance to check out the various sponsors and their booths.

If you’re brand new to PyCon, there’s also a Newcomer Orientation that can help you get caught up on what the conference involves and how you can participate.

Recap:

Here’s what to do before you go:

  1. Look at the talk schedule.
  2. Get the Guidebook App.
  3. Come up with a specific question you want answered.
  4. Check in the day before the conference.

What to Do at PyCon

It’s okay if you’re nervous or excited on your first day at the conference.

There will be a lot of people from all different walks of life, and that’s what makes it so great. You may see some of your Python heroes, such as Guido van Rossum, and have a chance to go up to them and say hello.

Guido van Rossum With the Python Staff of Enlightenment

The Python community is very welcoming! But there are also some designated quiet rooms, where speakers and others will go to work in peace. You should refrain from talking to anyone in those rooms to allow them that safe space.

Let’s break down the conference into some key elements and see how you can get the most out of them:

  • Talks
  • Open spaces
  • Sponsors
  • Volunteering opportunities
  • Lightning talks
  • After hours activities
  • Time for yourself

Talks

Go to as many of the talks as you want, but you don’t need to go to all of them. You’ll only find yourself getting stressed as you run from room to room. Instead, make sure that you hit up the talks that you selected before coming. Although it’s pretty rare, the schedule can change, so check the schedule board daily to catch any changes.

‘Data Visualization in Mixed Reality with Python’ Talk Given by Anna Nicanorova at PyCon 2018

If there’s a conflict between two talks that you really want to go to, remember that each talk is recorded and uploaded onto YouTube. Sometimes, they’re even available that same day! Pick the talk that is going to be the most relevant to your situation and seems the most interesting to you. Then take note of the talk that you missed to watch it later that evening or the next day when you have some free time.

Even though all talks are available online, actually going to the talks is still valuable. You’ll retain the information better if you attend, and you’ll have the opportunity to ask questions directly of the presenter.

Keep in mind that you don’t need to see all of the “celebrity” speakers. These speakers get a lot of attention in the Python community and may seem like more worthwhile speakers because of that. But the process of getting a talk approved at PyCon is rigorous and ensures that each speaker and topic are worth listening to.

In fact, sometimes it’s better to go see some of the less famous speakers because you can get better seats and have a greater chance to ask questions.

When you go to a talk, remember to silence your phone and computer. Noise can be very distracting to the audience and the speaker. It might be helpful to put away all your devices and simply listen or take notes using a pad and paper.

Try to think of questions that you may have about what’s being discussed. Typically, there’s time allotted at the end for questions from the audience. If that doesn’t happen, the presenters are usually very accommodating about answering questions in the hall afterwards.

Recap:

Here’s what you need to know about talks at PyCon:

  1. You don’t need to go to all of the talks and will be able to watch them on YouTube.
  2. All talks and speakers are amazing.
  3. Keep the noise to a minimum during the talks.
  4. Think of questions to ask.
  5. Talk to the speakers afterwards.

Open Spaces

Open Spaces are rooms that can be booked by conference attendees. There are 1-hour slots throughout the day available to anyone that wants to use them. These rooms have been used as places to teach people, hold meetups, and even do yoga classes. They’re open for whatever activities you need them for, as long as you follow the Code of Conduct of course.

Open Spaces Board at PyCon 2018

You might want to go to these open spaces instead of a talk or a sponsor booth. Be sure to check out the open space board each day since it constantly changes. It might be helpful to take a picture of the board to reference later.

Feel free to create your own open space. Remember that problem or specific question that you were looking for help with? Sign up for a room and ask for advice on that topic! You never know who might show up to help out.

I did that one year. While I was working on one of my open source projects, I ran into a testing problem that I just couldn’t solve. I grabbed an open space for an hour and asked for help. I learned all about the mock module for testing. That saved me literally hours of work!

If you’re an expert or even have some limited knowledge about a topic that you want to share, feel free to grab an open space for that as well! These spaces are intended to be whatever you want and need them to be, so don’t be shy about using them. In fact, be on the lookout for a Real Python open space and feel free to join us as we talk about what we envision for the future.

Recap:

Here’s what you need to know about open spaces at PyCon:

  1. Open spaces can be whatever you need them to be.
  2. Use an open space slot to ask for help.
  3. Use an open space slot to teach others.

Sponsors

Visiting sponsors is a great way to get to know some of the companies that are using Python in their everyday businesses. There are some very big names that come almost every year: Microsoft, JetBrains, and Google, to name a few. There are over 100 sponsors at PyCon 2019! Going to a sponsor booth is great for many reasons, not just the wonderful SWAG that everyone gives out. Here’s some of ours:

Real Python Swag: Meet us at PyCon and get some stickers

The greatest thing about sponsors being there is that you can speak with the actual developers of the tools and software you use. Let’s say you have a problem with an Anaconda install on a Windows environment. You can go straight up to them and ask questions! This is a great opportunity to talk to the developers and creators of the tools you use.

It’s not just developers that you get to meet either. There are authors and content creators who come too. O’Reilly typically has an author or two a day who comes for you to meet and chat with. This year, JetBrains will be hosting content creators at their booth, where you can meet some of the Real Python Team!

Finally, meeting with the sponsors could lead to an opportunity for a job. A lot of the sponsors are also looking for talented Python developers, and you can apply directly with them at the booths or during the job fair towards the end of the conference. If you’re not looking for a job, it’s still great to see what is out there and what skills these companies are looking for to help better choose what to focus on with your learning.

Recap:

Here’s what you need to know about sponsors at PyCon:

  1. Getting sponsor SWAG is great.
  2. Meeting with the developers and content creators is even better.
  3. You can apply for a job or see what skills companies are looking for.

Volunteering Opportunities

Have you ever wished you could contribute or give back to the Python Community? Well, you can at PyCon.

So much work goes into making a conference happen, and none of it would be possible without volunteers. The 2019 PyCon conference is looking for over 300 man-hours of volunteering alone!

That sounds like a lot, but you can still make a difference. An hour or two of your time can be a big help but still not take away from your learning time. You also never know who you might rub shoulders with as a volunteer. You can learn more about helping out in the Call for On-Site Volunteers. There’s a little something for everyone.

Recap:

Here’s what you need to know about volunteering at PyCon:

  1. Just Do It.
  2. Really, Just Do It!

After Hours Activities

Even though the conference ends in the early evening, there’s still more to do after the conference wraps up for the day.

The first thing you should check out, even if only for a little while, is the lightning talks. These are 5-minute “soap boxes” for anyone to share on a topic. These talks cover a wide range of topics: anything from a new open source project to social commentary to philanthropic topics. Did you know that Docker was first publicly announced during a PyCon lightning talk? The lightning talks have become a popular staple for regular attendees.

Also be on the lookout for sponsors doing sponsored dinners with members of their company. This is a great way to build up your network and get a free dinner as well. PyCon does host dinner parties in the evenings, but these cost money and typically sell out really fast.

Even if you don’t get in on the official PyCon dinners or the sponsored ones, there’s always plenty to check out near the conference. Each conference location is selected in part because of the interesting things nearby. When you’re getting ready for the conference, look for fun places to go eat.

Recap:

Here’s what you need to know about after-hours events at PyCon:

  1. Check out the lightning talks.
  2. If there are spots available, sign up for one of the official PyCon dinners.
  3. Check early and often for any sponsored dinners.
  4. Have fun exploring the local cuisine and culture.

New Friends

One of the greatest pieces of advice that I received when I first started going to PyCon was to make a new friend each day.

Our very own Dan Bader and Anthony Shaw

Some of the best times to get to know someone at PyCon are the lunch and snack breaks. Instead of trying to pick an empty table to sit at, find one that already has a person or two and ask if you can join them. Strike up a conversation about what their favorite talk is thus far, or how they use Python in their day to day activities. Pretty soon you’ll make a new friend. You can take a few notes, whether mentally or literally, about your conversation so you can remember that person later on.

You might want to make some business cards with the contact info you want to share with people you meet in order to stay in contact with them. PyCon will actually give you a few to give out, but you can go through them really quickly! Be sure to update any profiles that you share with people such as LinkedIn or GitHub.

Recap:

Here’s what you need to know about making new friends at PyCon:

  1. Take the challenge to meet at least one new person each day.
  2. Make a note of who that person is so that you don’t forget.
  3. Bring business cards to share with people you meet.

What to Do After PyCon

Once the conference is over, there’s still a lot you can do. First off, if you have the time, there are the Sprints, which are a great opportunity for you to hone your skills and even gain new ones as a Python developer. They last for four days after the Conference, but you don’t need to stay for the entire time. Stay for as long as you like, whether that’s for a few hours or a few days.

After you get home, make sure to check out the YouTube videos of talks you missed or made a note to watch again. There are also all of the tutorials, keynote speaker talks, and even the lightning talks to check out. There’s plenty there for you to get your PyCon fix throughout the year.

The greatest thing about PyCon is the feeling of belonging to a community. That’s only made possible by the great people who give back to the Python community, and you can become one of them!

There are lots of ways that you can give back to this great community:

  • Contribute to an open source project that uses Python.
  • Join a local Python meetup group. Don’t have one? Create one!
  • Share with others what you’ve learned.
  • Submit a talk or poster for next year’s PyCon.

Finally, you can start preparing for the next PyCon. When you purchase tickets early, you get a discount on the pricing, but those tickets go pretty fast too. You can also start taking note of any problems or questions that you can’t find the answer to in preparation for selecting talks to check out at the next PyCon.

Recap:

Here’s what you need to know about what to do after the conference:

  1. If you have the time, stay for the Sprints.
  2. Check out the YouTube videos of the talks you missed or loved.
  3. Look for a way to contribute back to the Python community with what you learned.
  4. Start looking at next year’s PyCon.

Welcome to the Greatest Community!

Congratulations! You’re about to attend one of the greatest technical conferences out there.

In this article, you learned:

  1. What PyCon is all about
  2. What you can do before coming to PyCon
  3. What you can do at PyCon
  4. What you can do after PyCon

With the tips in this article, you’ll be able to have a great PyCon. We look forward to seeing you there!



Real Python: Python KeyError Exceptions and How to Handle Them

Python’s KeyError exception is a common exception encountered by beginners. Knowing why a KeyError can be raised and some solutions to prevent it from stopping your program are essential steps to improving as a Python programmer.

By the end of this tutorial, you’ll know:

  • What a Python KeyError usually means
  • Where else you might see a KeyError in the standard library
  • How to handle a KeyError when you see it

What a Python KeyError Usually Means

A Python KeyError exception is what is raised when you try to access a key that isn’t in a dictionary (dict).

Python’s official documentation says that the KeyError is raised when a mapping key is accessed and isn’t found in the mapping. A mapping is a data structure that maps one set of values to another. The most common mapping in Python is the dictionary.

The Python KeyError is a type of LookupError exception and denotes that there was an issue retrieving the key you were looking for. When you see a KeyError, the semantic meaning is that the key being looked for could not be found.

In the example below, you can see a dictionary (ages) defined with the ages of three people. When you try to access a key that is not in the dictionary, a KeyError is raised:

>>> ages = {'Jim': 30, 'Pam': 28, 'Kevin': 33}
>>> ages['Michael']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'Michael'

Here, attempting to access the key 'Michael' in the ages dictionary results in a KeyError being raised. At the bottom of the traceback, you get the relevant information:

  • The fact that a KeyError was raised
  • The key that couldn’t be found, which was 'Michael'

The second-to-last line tells you which line raised the exception. This information is more helpful when you execute Python code from a file.

Note: When an exception is raised in Python, it is done with a traceback. The traceback gives you all the relevant information to be able to determine why the exception was raised and what caused it.

Learning how to read a Python traceback and understanding what it is telling you is crucial to improving as a Python programmer.

In the program below, you can see the ages dictionary defined again. This time, you will be prompted to provide the name of the person to retrieve the age for:

 1 # ages.py
 2
 3 ages = {'Jim': 30, 'Pam': 28, 'Kevin': 33}
 4 person = input('Get age for: ')
 5 print(f'{person} is {ages[person]} years old.')

This code will take the name that you provide at the prompt and attempt to retrieve the age for that person. Whatever you type in at the prompt will be used as the key to the ages dictionary on line 5.

Repeating the failed example from above, we get another traceback, this time with information about the line in the file that the KeyError is raised from:

$ python ages.py
Get age for: Michael
Traceback (most recent call last):
  File "ages.py", line 5, in <module>
    print(f'{person} is {ages[person]} years old.')
KeyError: 'Michael'

The program fails when you give a key that is not in the dictionary. Here, the traceback’s last few lines point to the problem. File "ages.py", line 5, in <module> tells you which line of which file raised the resulting KeyError exception. Then you are shown that line. Finally, the KeyError exception provides the missing key.

So you can see that the KeyError traceback’s final line doesn’t give you enough information on its own, but the lines before it can get you a lot closer to understanding what went wrong.
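
As a small sketch of what the exception itself carries, the snippet below catches the KeyError instead of letting it stop the program and then inspects it. It reuses the ages dictionary from above. The names caught and missing_key are only illustrative.

```python
ages = {'Jim': 30, 'Pam': 28, 'Kevin': 33}

try:
    ages['Michael']
except KeyError as exc:
    # Keep a reference, since `exc` is cleared when the block ends.
    caught = exc
    # For a failed dictionary lookup, the missing key is the
    # exception's first (and only) argument.
    missing_key = exc.args[0]

# KeyError is a subclass of LookupError.
print(isinstance(caught, LookupError))
print(f'Missing key: {missing_key!r}')
```

Catching the exception this way gives you the same key that the final line of the traceback shows, but in a form your program can act on.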

Note: Like the example above, most of the other examples in this tutorial make use of f-strings, which were introduced in Python 3.6.

Where Else You Might See a Python KeyError in the Standard Library

The large majority of the time, a Python KeyError is raised because a key is not found in a dictionary or a dictionary subclass (such as os.environ).

In rare cases, you may also see it raised in other places in Python’s Standard Library, such as in the zipfile module, if an item is not found in a ZIP archive. However, these places keep the same semantic meaning of the Python KeyError, which is not finding the key requested.

In the following example, you can see the zipfile.ZipFile class being used to extract information about a ZIP archive with .getinfo():

>>> from zipfile import ZipFile
>>> zip_file = ZipFile('the_zip_file.zip')
>>> zip_file.getinfo('something')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/python/installation/zipfile.py", line 1304, in getinfo
    'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'something' in the archive"

This doesn’t really look like a dictionary key lookup. Instead, it is a call to zipfile.ZipFile.getinfo() that raises the exception.

The traceback also looks a little different with a little more information given than just the missing key: KeyError: "There is no item named 'something' in the archive".

The final thing to note here is that the line that raised the KeyError isn’t in your code. It is in the zipfile code, but previous lines of the traceback indicate which lines in your code caused the problem.

When You Need to Raise a Python KeyError in Your Own Code

There may be times when it makes sense for you to raise a Python KeyError exception in your own code. This can be done by using the raise keyword and calling the KeyError exception:

raise KeyError(message) 

Usually, the message would be the missing key. However, as in the case of the zipfile package, you could opt to give a bit more information to help the next developer better understand what went wrong.

If you decide to raise a Python KeyError in your own code, just make sure that your use case matches the semantic meaning behind the exception. It should denote that the key being looked for could not be found.
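
Here is a minimal sketch of what that could look like. The Roster class and its .age_of() method are hypothetical and exist only for illustration. The point is that the failure really is a missing key, so KeyError matches the meaning of the error, and the message adds a little context for the next developer:

```python
class Roster:
    """A hypothetical lookup class, used only for illustration."""

    def __init__(self, ages):
        self._ages = dict(ages)

    def age_of(self, name):
        if name not in self._ages:
            # The failure is "key not found", so KeyError fits.
            raise KeyError(f'No person named {name!r} in the roster')
        return self._ages[name]


roster = Roster({'Jim': 30, 'Pam': 28, 'Kevin': 33})
print(roster.age_of('Pam'))

try:
    roster.age_of('Michael')
except KeyError as exc:
    print(exc.args[0])
```

Callers can handle this exception exactly as they would handle a failed dictionary lookup.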

How to Handle a Python KeyError When You See It

When you encounter a KeyError, there are a few standard ways to handle it. Depending on your use case, some of these solutions might be better than others. The ultimate goal is to stop unexpected KeyError exceptions from being raised.

The Usual Solution: .get()

If the KeyError is raised from a failed dictionary key lookup in your own code, you can use .get() to return either the value found at the specified key or a default value.

Much like the age retrieval example from before, the following example shows a better way to get the age from the dictionary using the key provided at the prompt:

 1 # ages.py
 2
 3 ages = {'Jim': 30, 'Pam': 28, 'Kevin': 33}
 4 person = input('Get age for: ')
 5 age = ages.get(person)
 6
 7 if age:
 8     print(f'{person} is {age} years old.')
 9 else:
10     print(f"{person}'s age is unknown.")

Here, line 5 shows how you can get the age value from ages using .get(). This will result in the age variable having the age value found in the dictionary for the key provided or a default value, None in this case.

This time, you will not get a KeyError exception raised because of the use of the safer .get() method to get the age rather than attempting to access the key directly:

$ python ages.py
Get age for: Michael
Michael's age is unknown.

In the example execution above, the KeyError is no longer raised when a bad key is provided. The key 'Michael' is not found in the dictionary, but by using .get(), we get a None returned rather than a raised KeyError.

The age variable will either have the person’s age found in the dictionary or the default value (None by default). You can also specify a different default value in the .get() call by passing a second positional argument. Note that dict.get() does not accept keyword arguments, so writing default=0 raises a TypeError.

This is line 5 from the example above with a different default age specified using .get():

age = ages.get(person, 0)

Here, instead of 'Michael' returning None, it would return 0 because the key isn’t found, and the default value to return is now 0.
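
The three cases can be seen side by side in this short sketch, which reuses the ages dictionary. The variable names known, missing, and fallback are only illustrative:

```python
ages = {'Jim': 30, 'Pam': 28, 'Kevin': 33}

known = ages.get('Pam')            # key exists, so its value is returned
missing = ages.get('Michael')      # key is missing, so None is returned
fallback = ages.get('Michael', 0)  # key is missing, so 0 is returned

print(known, missing, fallback)
```

In none of these calls is the dictionary modified. .get() only reads from it.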

The Rare Solution: Checking for Keys

There are times when you need to determine the existence of a key in a dictionary. In these cases, using .get() might not give you the correct information. Getting a None returned from a call to .get() could mean that the key wasn’t found or that the value found at the key in the dictionary is actually None.

With a dictionary or dictionary-like object, you can use the in operator to determine whether a key is in the mapping. This operator will return a boolean (True or False) value indicating whether the key is found in the dictionary.

In this example, you are getting a response dictionary from calling an API. This response might have an error key value defined in the response, which would indicate that the response is in an error state:

# parse_api_response.py
...
# Assuming you got a `response` from calling an API that might
# have an error key in the `response` if something went wrong

if 'error' in response:
    ...  # Parse the error state
else:
    ...  # Parse the success state

Here, there is a difference in checking to see if the error key exists in the response and getting a default value from the key. This is a rare case where what you are actually looking for is if the key is in the dictionary and not what the value at that key is.
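
To make the ambiguity concrete, here is a small sketch with a made-up settings dictionary in which one key is deliberately stored with the value None. The dictionary and its keys are assumptions for illustration only:

```python
settings = {'timeout': None, 'retries': 3}

# .get() returns None for both a stored None and a missing key.
print(settings.get('timeout'))  # the stored value is None
print(settings.get('verbose'))  # the key is missing

# The in operator tells the two cases apart.
timeout_present = 'timeout' in settings
verbose_present = 'verbose' in settings
print(timeout_present, verbose_present)
```

Both .get() calls give the same result, so only the membership test reveals that 'timeout' is actually present.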

The General Solution: try except

As with any exception, you can always use the try except block to isolate the potential exception-raising code and provide a backup solution.

You can use the try except block in a similar example as before, but this time providing a default message to be printed should a KeyError be raised in the normal case:

# ages.py

ages = {'Jim': 30, 'Pam': 28, 'Kevin': 33}
person = input('Get age for: ')

try:
    print(f'{person} is {ages[person]} years old.')
except KeyError:
    print(f"{person}'s age is unknown.")

Here, you can see the normal case in the try block of printing the person’s name and age. The backup case is in the except block, where if a KeyError is raised in the normal case, then the backup case is to print a different message.

The try except block solution is also a great solution for other places that might not support .get() or the in operator. It is also the best solution if the KeyError is being raised from another person’s code.

Here is an example using the zipfile module again. This time, the try except block gives us a way to stop the KeyError exception from stopping the program:

>>> from zipfile import ZipFile
>>> zip_file = ZipFile('the_zip_file.zip')
>>> try:
...     zip_file.getinfo('something')
... except KeyError:
...     print('Can not find "something"')
...
Can not find "something"

Since the ZipFile class does not provide .get(), like the dictionary does, you need to use the try except solution. In this example, you don’t have to know ahead of time what values are valid to pass to .getinfo().
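
If you don’t have a ZIP file at hand, the following sketch builds a tiny archive in memory and wraps the lookup in a helper. The helper name member_size() and the file name hello.txt are made up for this example:

```python
import io
from zipfile import ZipFile

# Build a small ZIP archive in memory so the example runs on its own.
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('hello.txt', 'hello')


def member_size(archive, name):
    """Return the uncompressed size of a member, or None if it is missing."""
    try:
        return archive.getinfo(name).file_size
    except KeyError:
        return None


with ZipFile(buffer) as archive:
    found_size = member_size(archive, 'hello.txt')
    missing_size = member_size(archive, 'something')

print(found_size, missing_size)
```

Returning None here is a design choice that mirrors what .get() does for dictionaries. Depending on your use case, logging the problem or re-raising a more specific exception might be a better fit.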

Conclusion

You now know some common places where Python’s KeyError exception could be raised and some great solutions you could use to prevent them from stopping your program.

Now, the next time you see a KeyError raised, you will know that it is probably just a bad dictionary key lookup. You will also be able to find all the information you need to determine where the error is coming from by looking at the last few lines of the traceback.

If the problem is a dictionary key lookup in your own code, then you can switch from accessing the key directly on the dictionary to using the safer .get() method with a default return value. If the problem isn’t coming from your own code, then using the try except block is your best bet for controlling your code’s flow.

Exceptions don’t have to be scary. Once you know how to understand the information provided to you in their tracebacks and the root cause of the exception, then you can use these solutions to make your programs flow more predictably.



Real Python: How to Work With a PDF in Python

The Portable Document Format or PDF is a file format that can be used to present and exchange documents reliably across operating systems. While the PDF was originally invented by Adobe, it is now an open standard that is maintained by the International Organization for Standardization (ISO). You can work with a preexisting PDF in Python by using the PyPDF2 package.

PyPDF2 is a pure-Python package that you can use for many different types of PDF operations.

By the end of this article, you’ll know how to do the following:

  • Extract document information from a PDF in Python
  • Rotate pages
  • Merge PDFs
  • Split PDFs
  • Add watermarks
  • Encrypt a PDF

Let’s get started!

History of pyPdf, PyPDF2, and PyPDF4

The original pyPdf package was released way back in 2005. The last official release of pyPdf was in 2010. After a lapse of around a year, a company called Phasit sponsored a fork of pyPdf called PyPDF2. The code was written to be backwards compatible with the original and worked quite well for several years, with its last release being in 2016.

There was a brief series of releases of a package called PyPDF3, and then the project was renamed to PyPDF4. All of these projects do pretty much the same thing, but the biggest difference between pyPdf and PyPDF2+ is that the latter versions added Python 3 support. There is a different Python 3 fork of the original pyPdf for Python 3, but that one has not been maintained for many years.

While PyPDF2 was recently abandoned, the new PyPDF4 does not have full backwards compatibility with PyPDF2. Most of the examples in this article will work perfectly fine with PyPDF4, but there are some that cannot, which is why PyPDF4 is not featured more heavily in this article. Feel free to swap out the imports for PyPDF2 with PyPDF4 and see how it works for you.

pdfrw: An Alternative

Patrick Maupin created a package called pdfrw that can do many of the same things that PyPDF2 does. You can use pdfrw for all of the same sorts of tasks that you will learn how to do in this article for PyPDF2, with the notable exception of encryption.

The biggest difference when it comes to pdfrw is that it integrates with the ReportLab package so that you can take a preexisting PDF and build a new one with ReportLab using some or all of the preexisting PDF.

Installation

You can install PyPDF2 with pip, or with conda if you happen to be using Anaconda instead of regular Python.

Here’s how you would install PyPDF2 with pip:

$ pip install pypdf2

The install is quite quick as PyPDF2 does not have any dependencies. You will likely spend as much time downloading the package as you will installing it.

Now let’s move on and learn how to extract some information from a PDF.

How to Extract Document Information From a PDF in Python

You can use PyPDF2 to extract metadata and some text from a PDF. This can be useful when you’re doing certain types of automation on your preexisting PDF files.

Here are the current types of data that can be extracted:

  • Author
  • Creator
  • Producer
  • Subject
  • Title
  • Number of pages

You need to go find a PDF to use for this example. You can use any PDF you have handy on your machine. To make things easy, I went to Leanpub and grabbed a sample of one of my books for this exercise. The sample you want to download is called reportlab-sample.pdf.

Let’s write some code using that PDF and learn how you can get access to these attributes:

# extract_doc_info.py

from PyPDF2 import PdfFileReader

def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    extract_information(path)

Here you import PdfFileReader from the PyPDF2 package. The PdfFileReader is a class with several methods for interacting with PDF files. In this example, you call .getDocumentInfo(), which will return an instance of DocumentInformation. This contains most of the information that you’re interested in. You also call .getNumPages() on the reader object, which returns the number of pages in the document.

Note: That last code block uses Python 3’s new f-strings for string formatting. If you’d like to learn more, you can check out Python 3’s f-Strings: An Improved String Formatting Syntax (Guide).

The information variable has several instance attributes that you can use to get the rest of the metadata you want from the document. You print out that information and also return it for potential future use.

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

Now you’re ready to learn about rotating PDF pages.

How to Rotate Pages

Occasionally, you will receive PDFs that contain pages that are in landscape mode instead of portrait mode. Or perhaps they are even upside down. This can happen when someone scans a document to PDF or email. You could print the document out and read the paper version or you can use the power of Python to rotate the offending pages.

For this example, you can go and pick out a Real Python article and print it to PDF.

Let’s learn how to rotate a few of the pages of that article with PyPDF2:

# rotate_pages.py

from PyPDF2 import PdfFileReader, PdfFileWriter

def rotate_pages(pdf_path):
    pdf_writer = PdfFileWriter()
    # Use the function's own parameter, not the module-level `path`
    pdf_reader = PdfFileReader(pdf_path)
    # Rotate page 90 degrees to the right
    page_1 = pdf_reader.getPage(0).rotateClockwise(90)
    pdf_writer.addPage(page_1)
    # Rotate page 90 degrees to the left
    page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
    pdf_writer.addPage(page_2)
    # Add a page in normal orientation
    pdf_writer.addPage(pdf_reader.getPage(2))

    with open('rotate_pages.pdf', 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    path = 'Jupyter_Notebook_An_Introduction.pdf'
    rotate_pages(path)

For this example, you need to import the PdfFileWriter in addition to PdfFileReader because you will need to write out a new PDF. rotate_pages() takes in the path to the PDF that you want to modify. Within that function, you will need to create a writer object that you can name pdf_writer and a reader object called pdf_reader.

Next, you can use .getPage() to get the desired page. Here you grab page zero, which is the first page. Then you call the page object’s .rotateClockwise() method and pass in 90 degrees. Then for page two, you call .rotateCounterClockwise() and pass it 90 degrees as well.

Note: The PyPDF2 package only allows you to rotate a page in increments of 90 degrees. You will receive an AssertionError otherwise.

After each call to the rotation methods, you call .addPage(). This will add the rotated version of the page to the writer object. The last page that you add to the writer object is page 3 without any rotation done to it.

Finally, you write out the new PDF using .write(). It takes a file-like object as its parameter. This new PDF will contain three pages. The first two will be rotated in opposite directions and appear in landscape, while the third page keeps its normal orientation.

Now let’s learn how you can merge multiple PDFs into one.

How to Merge PDFs

There are many situations where you will want to take two or more PDFs and merge them together into a single PDF. For example, you might have a standard cover page that needs to go on to many types of reports. You can use Python to help you do that sort of thing.

For this example, you can open up a PDF and print a page out as a separate PDF. Then do that again, but with a different page. That will give you a couple of inputs to use for example purposes.

Let’s go ahead and write some code that you can use to merge PDFs together:

# pdf_merging.py

from PyPDF2 import PdfFileReader, PdfFileWriter


def merge_pdfs(paths, output):
    pdf_writer = PdfFileWriter()

    for path in paths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            # Add each page to the writer object
            pdf_writer.addPage(pdf_reader.getPage(page))

    # Write out the merged PDF
    with open(output, 'wb') as out:
        pdf_writer.write(out)


if __name__ == '__main__':
    paths = ['document1.pdf', 'document2.pdf']
    merge_pdfs(paths, output='merged.pdf')

You can use merge_pdfs() when you have a list of PDFs that you want to merge together. You will also need to know where to save the result, so this function takes a list of input paths and an output path.

Then you loop over the inputs and create a PDF reader object for each of them. Next, you iterate over all the pages in that PDF file and use .addPage() to add each of those pages to the writer object.

Once you’re finished iterating over all of the pages of all of the PDFs in your list, you will write out the result at the end.

One item I would like to point out is that you could enhance this script a bit by adding in a range of pages to be added if you didn’t want to merge all the pages of each PDF. If you’d like a challenge, you could also create a command line interface for this function using Python’s argparse module.
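Both enhancements can be sketched in a few lines. Everything here is an assumption made for illustration: the parse_pages(), add_selected_pages(), and build_parser() names, the --pages option, and the one-based '1-3,5' syntax are hypothetical and not part of PyPDF2. The only PyPDF2 calls used are .getPage() and .addPage(), as in the script above:

```python
import argparse


def parse_pages(spec):
    """Turn a string such as '1-3,5' into zero-based page indexes.

    Hypothetical helper: page numbers in the string are one-based and
    ranges are inclusive, so '1-3,5' becomes [0, 1, 2, 4].
    """
    indexes = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            indexes.extend(range(int(first) - 1, int(last)))
        else:
            indexes.append(int(part) - 1)
    return indexes


def add_selected_pages(pdf_writer, pdf_reader, indexes):
    """Copy only the requested pages from the reader to the writer."""
    for index in indexes:
        pdf_writer.addPage(pdf_reader.getPage(index))


def build_parser():
    """Create a command line interface for a merge script."""
    parser = argparse.ArgumentParser(description="Merge PDFs")
    parser.add_argument("paths", nargs="+", help="input PDF files")
    parser.add_argument("-o", "--output", default="merged.pdf")
    parser.add_argument("--pages", default=None,
                        help="pages to keep from each input, e.g. 1-3,5")
    return parser


# Parse an explicit argument list so the example runs without a shell
args = build_parser().parse_args(
    ["document1.pdf", "document2.pdf", "--pages", "1-3,5"]
)
print(args.paths, args.output, parse_pages(args.pages))
# ['document1.pdf', 'document2.pdf'] merged.pdf [0, 1, 2, 4]
```

Inside merge_pdfs(), you could replace the inner loop with a call to add_selected_pages() whenever a page specification is supplied.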

Let’s find out how to do the opposite of merging!

How to Split PDFs

There are times where you might have a PDF that you need to split up into multiple PDFs. This is especially true of PDFs that contain a lot of scanned-in content, but there are a plethora of good reasons for wanting to split a PDF.

Here’s how you can use PyPDF2 to split your PDF into multiple files:

# pdf_splitting.py

from PyPDF2 import PdfFileReader, PdfFileWriter


def split(path, name_of_split):
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))

        output = f'{name_of_split}{page}.pdf'
        with open(output, 'wb') as output_pdf:
            pdf_writer.write(output_pdf)


if __name__ == '__main__':
    path = 'Jupyter_Notebook_An_Introduction.pdf'
    split(path, 'jupyter_page')

In this example, you once again create a PDF reader object and loop over its pages. For each page in the PDF, you will create a new PDF writer instance and add a single page to it. Then you will write that page out to a uniquely named file. When the script is finished running, you should have each page of the original PDF split into separate PDFs.
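The script names its output files jupyter_page0.pdf, jupyter_page1.pdf, and so on. Once a document has ten or more pages, an alphabetical file listing places jupyter_page10.pdf before jupyter_page2.pdf. One possible fix is to zero-pad the page number. The split_name() helper below is hypothetical and only builds the file name:

```python
def split_name(name_of_split, page, total_pages):
    """Build an output file name with a zero-padded page index.

    Hypothetical helper: the padding width is taken from the largest
    page index, so the names sort in page order.
    """
    width = len(str(max(total_pages - 1, 0)))
    return f"{name_of_split}{page:0{width}d}.pdf"


names = [split_name("jupyter_page", page, 12) for page in range(12)]
print(names[0], names[-1])     # jupyter_page00.pdf jupyter_page11.pdf
print(sorted(names) == names)  # True
```

In split(), you could pass pdf.getNumPages() as total_pages and use the returned string in place of the f-string that builds output.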

Now let’s take a moment to learn how you can add a watermark to your PDF.

How to Add Watermarks

Watermarks are identifying images or patterns on printed and digital documents. Some watermarks can only be seen in special lighting conditions. The reason watermarking is important is that it allows you to protect your intellectual property, such as your images or PDFs. Another term for watermark is overlay.

You can use Python and PyPDF2 to watermark your documents. You need to have a PDF that only contains your watermark image or text.

Let’s learn how to add a watermark now:

# pdf_watermarker.py

from PyPDF2 import PdfFileWriter, PdfFileReader


def create_watermark(input_pdf, output, watermark):
    watermark_obj = PdfFileReader(watermark)
    watermark_page = watermark_obj.getPage(0)

    pdf_reader = PdfFileReader(input_pdf)
    pdf_writer = PdfFileWriter()

    # Watermark all the pages
    for page in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page)
        page.mergePage(watermark_page)
        pdf_writer.addPage(page)

    with open(output, 'wb') as out:
        pdf_writer.write(out)


if __name__ == '__main__':
    create_watermark(
        input_pdf='Jupyter_Notebook_An_Introduction.pdf',
        output='watermarked_notebook.pdf',
        watermark='watermark.pdf')

create_watermark() accepts three arguments:

  1. input_pdf: the PDF file path to be watermarked
  2. output: the path you want to save the watermarked version of the PDF
  3. watermark: a PDF that contains your watermark image or text

In the code, you open up the watermark PDF and grab just the first page from the document as that is where your watermark should reside. Then you create a PDF reader object using the input_pdf and a generic pdf_writer object for writing out the watermarked PDF.

The next step is to iterate over the pages in the input_pdf. This is where the magic happens. You will need to call .mergePage() and pass it the watermark_page. When you do that, it will overlay the watermark_page on top of the current page. Then you add that newly merged page to your pdf_writer object.
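You may not want the watermark on every page, for example when the first page is a cover. Here is a minimal sketch of that variation. watermark_pages() and its skip parameter are hypothetical, and the function relies only on the .getNumPages(), .getPage(), .mergePage(), and .addPage() calls already used in the script:

```python
def watermark_pages(pdf_reader, pdf_writer, watermark_page, skip=()):
    """Overlay watermark_page on every page whose index is not in skip.

    Hypothetical helper: pdf_reader and pdf_writer are expected to
    behave like the PdfFileReader and PdfFileWriter objects used in
    create_watermark(). Skipped pages are copied over unchanged.
    """
    for index in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(index)
        if index not in skip:
            page.mergePage(watermark_page)
        pdf_writer.addPage(page)
```

Calling watermark_pages(pdf_reader, pdf_writer, watermark_page, skip={0}) would leave the first page untouched and watermark the rest.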

Finally, you write the newly watermarked PDF out to disk, and you’re done!

The last topic you will learn about is how PyPDF2 handles encryption.

How to Encrypt a PDF

PyPDF2 currently only supports adding a user password and an owner password to a preexisting PDF. In PDF land, an owner password will basically give you administrator privileges over the PDF and allow you to set permissions on the document. On the other hand, the user password just allows you to open the document.

As far as I can tell, PyPDF2 doesn’t actually allow you to set any permissions on the document even though it does allow you to set the owner password.

Regardless, this is how you can add a password, which will also inherently encrypt the PDF:

# pdf_encrypt.py

from PyPDF2 import PdfFileWriter, PdfFileReader


def add_encryption(input_pdf, output_pdf, password):
    pdf_writer = PdfFileWriter()
    pdf_reader = PdfFileReader(input_pdf)

    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

    pdf_writer.encrypt(user_pwd=password, owner_pwd=None,
                       use_128bit=True)

    with open(output_pdf, 'wb') as fh:
        pdf_writer.write(fh)


if __name__ == '__main__':
    add_encryption(input_pdf='reportlab-sample.pdf',
                   output_pdf='reportlab-encrypted.pdf',
                   password='twofish')

add_encryption() takes in the input and output PDF paths as well as the password that you want to add to the PDF. It then opens a PDF writer and a reader object, as before. Since you will want to encrypt the entire input PDF, you will need to loop over all of its pages and add them to the writer.

The final step is to call .encrypt(), which takes the user password, the owner password, and whether or not 128-bit encryption should be added. The default is for 128-bit encryption to be turned on. If you set it to False, then 40-bit encryption will be applied instead.
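The example hard-codes the password for simplicity. In a real script you would probably keep it out of the source file. One way to do that, sketched with the standard library only, is shown below. The read_password() helper and the PDF_PASSWORD environment variable are assumptions made for this example:

```python
import os
from getpass import getpass


def read_password(env_var="PDF_PASSWORD"):
    """Return the password from the environment or prompt for it.

    Hypothetical helper: PDF_PASSWORD is an arbitrary variable name
    chosen for this sketch. The prompt does not echo what you type.
    """
    password = os.environ.get(env_var)
    if password is None:
        password = getpass("PDF password: ")
    return password
```

You could then call add_encryption() with password=read_password() instead of a string literal.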

Note: PDF encryption uses either RC4 or AES (Advanced Encryption Standard) to encrypt the PDF according to pdflib.com.

Just because you have encrypted your PDF does not mean it is necessarily secure. There are tools to remove passwords from PDFs. If you’d like to learn more, Carnegie Mellon University has an interesting paper on the topic.

Conclusion

The PyPDF2 package is quite useful and is usually pretty fast. You can use PyPDF2 to automate large jobs and leverage its capabilities to help you do your job better!

In this tutorial, you learned how to do the following:

  • Extract metadata from a PDF
  • Rotate pages
  • Merge and split PDFs
  • Add watermarks
  • Add encryption

Also keep an eye on the newer PyPDF4 package as it will likely replace PyPDF2 soon. You might also want to check out pdfrw, which can do many of the same things that PyPDF2 can do.

Further Reading

If you’d like to learn more about working with PDFs in Python, you should check out some of the following resources for more information:



Planet Python

Real Python: Python String Formatting Tips & Best Practices

Learn the four main approaches to string formatting in Python, as well as their strengths and weaknesses. You’ll also get a simple rule of thumb for how to pick the best general purpose string formatting approach in your own programs.



Planet Python