The Code Bits: Introduction to Generators in Python

In this post, we will learn what generators are, how to create them, how they work and how to use them in Python.

Generator function

Generators are functions that allow us to create iterators in Python. They provide a convenient, simple and memory-efficient approach to creating iterators. These are useful when dealing with large amounts of data.

Before starting with generators, it would be good to understand how a for-loop works in Python. It will also be useful to know what an iterable, an iterator and the iterator protocol are.

An example: Generate even numbers using a generator function

Let us start with a simple example. We will be creating a generator function which generates a specific count of even numbers starting from a given value. We will be using this same example throughout this post.

def generate_even_numbers(start, count):
    # Make sure that the first number is even.
    start = start if start % 2 == 0 else start + 1

    while count > 0:
        yield start
        start += 2
        count -= 1

Note that we used a yield statement within the function body to return our data. If you don’t understand it right away, no need to worry, we will get to its roots soon enough!

Let us see how we would use this generator function in a for-loop.

>>> generator_iterator = generate_even_numbers(0, 3)
>>> for num in generator_iterator:
...     print(num)
...
0
2
4

As you can see, we were able to use the value returned by the generator function in a for-loop, so it must have been an iterable.

Generator function returns a generator iterator

Let us check the type of the value returned by the generator function.

>>> generator_iterator = generate_even_numbers(0, 3)
>>> type(generator_iterator)
<class 'generator'>

Okay, so the value returned is of type ‘generator’. This value is usually referred to as the generator iterator, even though the term generator is sometimes used interchangeably to refer to both the generator function as well as the generator iterator.
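The distinction between the two terms can be checked with the standard inspect module. This is a small sketch, not part of the original example; it repeats the generator function so that it runs on its own.

```python
import inspect


def generate_even_numbers(start, count):
    # Same generator function as in the example above.
    start = start if start % 2 == 0 else start + 1
    while count > 0:
        yield start
        start += 2
        count -= 1


generator_iterator = generate_even_numbers(0, 3)

# The function we defined is a generator function...
print(inspect.isgeneratorfunction(generate_even_numbers))  # True
# ...and the object it returns is a generator iterator.
print(inspect.isgenerator(generator_iterator))             # True
# The function itself is not a generator iterator.
print(inspect.isgenerator(generate_even_numbers))          # False
```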

Now let us confirm that the generator_iterator is indeed an iterator. As per the iterator protocol, an iterator must:

  1. return its elements one by one when next() is called on it. When all the elements are exhausted, it must raise StopIteration.

     >>> generator_iterator = generate_even_numbers(0, 3)
     >>> next(generator_iterator)
     0
     >>> next(generator_iterator)
     2
     >>> next(generator_iterator)
     4
     >>> next(generator_iterator)
     Traceback (most recent call last):
       File "<stdin>", line 1, in <module>
     StopIteration

  2. return itself when iter() is called on it.

     >>> generator_iterator = generate_even_numbers(0, 3)
     >>> generator_iterator
     <generator object generate_even_numbers at 0x10cb431b0>
     >>> iter(generator_iterator)
     <generator object generate_even_numbers at 0x10cb431b0>

So now we know that a generator function is a convenient way to create an iterator. But what makes this function different from an ordinary function in Python? How does it return an iterator? The answer lies in the yield statement.
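To see why the generator function counts as convenient, it helps to compare it with an iterator written by hand. The sketch below is an illustration only: the class name EvenNumbers and its attributes are made up here, and it is meant to behave roughly like generate_even_numbers().

```python
class EvenNumbers:
    """Hand-written iterator, roughly equivalent to generate_even_numbers()."""

    def __init__(self, start, count):
        # Make sure that the first number is even.
        self.current = start if start % 2 == 0 else start + 1
        self.remaining = count

    def __iter__(self):
        # An iterator returns itself from iter().
        return self

    def __next__(self):
        # Raise StopIteration once all the elements are exhausted.
        if self.remaining <= 0:
            raise StopIteration
        value = self.current
        self.current += 2
        self.remaining -= 1
        return value


print(list(EvenNumbers(0, 3)))  # [0, 2, 4]
print(list(EvenNumbers(3, 2)))  # [4, 6]
```

All the bookkeeping in __init__() and __next__() is what the generator function handles for us.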

How does the generator function work?

Let us revisit our generator example, now with some prints so that we can clearly understand how it works.

def generate_even_numbers(start, count):
    print("In the generator function")

    # Make sure that the first number is even.
    start = start if start % 2 == 0 else start + 1

    while count > 0:
        print("[count:{}] Hello! Before I yield....".format(count))
        yield start
        print("[count:{}] Hey! I am back!!".format(count))
        start += 2
        count -= 1
    print("[count:{}] That's all I have got...".format(count))

Now let us see its usage.

>>> generator_iterator = generate_even_numbers(3, 2)
>>> for num in generator_iterator:
...     print("Processing even number: {}".format(num))
...
In the generator function
[count:2] Hello! Before I yield....
Processing even number: 4
[count:2] Hey! I am back!!
[count:1] Hello! Before I yield....
Processing even number: 6
[count:1] Hey! I am back!!
[count:0] That's all I have got...

There are a couple of things you should notice:

  1. Lazy evaluation
  2. How the yield statement works

Let us discuss these.

Lazy evaluation

Calling generate_even_numbers(3, 2) just returns a generator iterator; it does not start executing the function body. Notice that "In the generator function" was printed only after the for-loop started. This is called lazy evaluation. The generator starts executing and yielding values only when they are needed, that is, when next() is called. As a result, only one element is held in memory at a time. This makes generators memory efficient and hence useful when dealing with large amounts of data.
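Both claims can be observed directly. The following is a minimal sketch, not part of the original example: the started list is a made-up helper that records when the function body begins to run, and the size comparison relies on sys.getsizeof(), which measures only the container object itself and behaves as shown on CPython.

```python
import sys

started = []  # hypothetical helper: records when the function body runs


def generate_even_numbers(start, count):
    started.append(True)
    start = start if start % 2 == 0 else start + 1
    while count > 0:
        yield start
        start += 2
        count -= 1


gen = generate_even_numbers(0, 100000)
print(started)        # [] : nothing has executed yet

first = next(gen)
print(started)        # [True] : the body started on the first next()
print(first)          # 0

# A list holds references to all 100000 elements at once;
# the generator iterator only keeps its suspended state.
as_list = list(generate_even_numbers(0, 100000))
print(sys.getsizeof(gen) < sys.getsizeof(as_list))  # True
```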

How does the yield statement work?

By now, you may have gathered that the only special thing about a generator function, compared with a normal function, is that it uses yield to return its values. However, the yield statement is very different from a normal return statement.

The yield statement makes a function a generator.

When next() is called on the generator iterator, the generator function executes till a yield statement is encountered. When the yield statement is reached, the execution state of the function is remembered (including the local variables and any try statements) and the function’s execution is temporarily suspended. The value associated with the yield statement is returned by next().

When next() is called again, the generator function resumes execution. The saved local execution state is restored and the statement following the yield is executed first. Execution then continues till the next yield statement is encountered, and so on.

Finally, when next() is called and the function runs to its end (or returns) without reaching another yield, StopIteration is raised. At this point, a for-loop would exit.
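The suspended state described here can be inspected from the outside. This sketch uses inspect.getgeneratorstate() from the standard library; peeking at gi_frame.f_locals is a CPython detail used only for illustration, and the generator function is repeated without the prints so that the example runs on its own.

```python
import inspect


def generate_even_numbers(start, count):
    start = start if start % 2 == 0 else start + 1
    while count > 0:
        yield start
        start += 2
        count -= 1


gen = generate_even_numbers(0, 2)
print(inspect.getgeneratorstate(gen))   # GEN_CREATED: nothing has run yet

print(next(gen))                        # 0
print(inspect.getgeneratorstate(gen))   # GEN_SUSPENDED: paused at the yield
# The local variables are kept while the function is suspended.
print(gen.gi_frame.f_locals["start"], gen.gi_frame.f_locals["count"])  # 0 2

print(next(gen))                        # 2
print(gen.gi_frame.f_locals["start"], gen.gi_frame.f_locals["count"])  # 2 1

# Passing a default to next() avoids the StopIteration traceback.
print(next(gen, None))                  # None
print(inspect.getgeneratorstate(gen))   # GEN_CLOSED: the function has finished
```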

A simpler example: Generator function to yield some strings

Let us make sure that all of that is clear with a simpler example.

def generate_hello_world():
    print("....Started executing the generator function")
    yield "Hello"
    print("....Between yields!")
    yield "World"
    print("....Done with yields!")

Let us see how to use the iterator returned by the generator function using next().

>>> # We get the generator iterator
>>> generator_iterator = generate_hello_world()

>>> # When next() is called, the function executes till the first yield statement
>>> next(generator_iterator)
....Started executing the generator function
'Hello'

>>> # When next() is called again, it picks up where it left off and executes till the next yield statement
>>> next(generator_iterator)
....Between yields!
'World'

>>> # When there are no more yields, calling next() raises StopIteration
>>> next(generator_iterator)
....Done with yields!
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Now let us see how to use the generator function in a for-loop.

>>> for word in generate_hello_world():
...     print(word)
...
....Started executing the generator function
Hello
....Between yields!
World
....Done with yields!
>>>

On a side note, pay attention to how we did not use a separate variable to hold the generator iterator as in our previous examples. We called the generator function directly in the for-loop. This works because of how a for-loop operates in Python. The expression following “in” is evaluated only once. This expression is expected to result in an iterable. In this case, it results in the generator iterator. Then iter() is called on the iterable to get the iterator associated with it. Finally, next() is called repeatedly on the iterator until the iterator is exhausted.
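The steps a for-loop performs can be written out by hand. This is a sketch for illustration: manual_for_loop() is a made-up helper name, and the generator function is repeated without the prints so that the example stands on its own.

```python
def generate_hello_world():
    yield "Hello"
    yield "World"


def manual_for_loop(iterable):
    """Collect items the way a for-loop would fetch them."""
    collected = []
    iterator = iter(iterable)       # iter() is called once on the iterable
    while True:
        try:
            item = next(iterator)   # next() is called repeatedly
        except StopIteration:
            break                   # exhausted: the loop exits quietly
        collected.append(item)      # stands in for the loop body
    return collected


# The call expression is evaluated once, yielding the generator iterator.
print(manual_for_loop(generate_hello_world()))  # ['Hello', 'World']
```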

Conclusion

In this post, we learned how to create generator functions in Python, how they work and how to use them.

Planet Python

Vasudev Ram: Python’s dynamic nature: sticking an attribute onto an object


By Vasudev Ram – Online Python training / SQL training / Linux training


Hi, readers,

[This is a beginner-level Python post.]

Python, being a dynamic language, has some interesting features that some static languages may not have (and vice versa too, of course).

One such feature, which I noticed a while ago, is that you can add an attribute to a Python object even after it has been created. (Conditions apply.)
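The post does not spell out which conditions apply, so the following sketch only shows a few cases that are known to behave this way in Python: instances of ordinary classes accept new attributes, while instances of classes that define __slots__, plain object() instances and built-in values such as integers do not. The class names and the try_stick() helper are made up for illustration.

```python
class Plain:
    pass


class Slotted:
    # __slots__ fixes the set of allowed attributes for instances.
    __slots__ = ("x",)


def try_stick(obj):
    """Return True if an attribute can be stuck onto obj, else False."""
    try:
        obj.color = "red"
        return True
    except AttributeError:
        return False


print(try_stick(Plain()))    # True
print(try_stick(Slotted()))  # False
print(try_stick(object()))   # False
print(try_stick(42))         # False
```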

I had used this feature some time ago to work around some implementation issue in a rudimentary RESTful server that I created as a small teaching project. It was based on the BaseHTTPServer module.

Here is a (different) simple example program, stick_attrs_onto_obj.py, that demonstrates this Python feature.
My informal term for this feature is “sticking an attribute onto an object” after the object is created.

Since the program is simple, and there are enough comments in the code, I will not explain it in detail.

# stick_attrs_onto_obj.py

# A program to show:
# 1) that you can "stick" attributes onto a Python object after it is created, and
# 2) one use of this technique, to count the number of calls to a function.

# Copyright 2019 Vasudev Ram
# Web site: https://vasudevram.github.io
# Blog: https://jugad2.blogspot.com
# Training: https://jugad2.blogspot.com/p/training.html
# Product store: https://gumroad.com/vasudevram
# Twitter: https://twitter.com/vasudevram

from __future__ import print_function

# Define a function.
def foo(arg):
    # Print something to show that the function has been called.
    print("in foo: arg = {}".format(arg))
    # Increment the "stuck-on" int attribute inside the function.
    foo.call_count += 1

# A function is also an object in Python.
# So we can add attributes to it, including after it is defined.
# I call this "sticking" an attribute onto the function object.
# The statement below defines the attribute with an initial value,
# which is changeable later, as we will see.
foo.call_count = 0

# Print its initial value before any calls to the function.
print("foo.call_count = {}".format(foo.call_count))

# Call the function a few times.
for i in range(5):
    foo(i)

# Print the attribute's value after those calls.
print("foo.call_count = {}".format(foo.call_count))

# Call the function a few more times.
for i in range(3):
    foo(i)

# Print the attribute's value after those additional calls.
print("foo.call_count = {}".format(foo.call_count))

And here is the output of the program:

$ python stick_attrs_onto_obj.py
foo.call_count = 0
in foo: arg = 0
in foo: arg = 1
in foo: arg = 2
in foo: arg = 3
in foo: arg = 4
foo.call_count = 5
in foo: arg = 0
in foo: arg = 1
in foo: arg = 2
foo.call_count = 8

There may be other ways to get the call count of a function, including using a profiler, and maybe by using a closure or decorator or other way. But this way is really simple. And as you can see from the code, it is also possible to use it to find the number of calls to the function, between any two points in the program code. For that, we just have to store the call count in a variable at the first point, and subtract that value from the call count at the second point. In the above program, that would be 8 – 5 = 3, which matches the 3 that is the number of calls to function foo made by the 2nd for loop.
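As a rough sketch of the decorator alternative mentioned above, combined with the subtraction between two points: the decorator name count_calls is made up for illustration and is not from the original program. It still relies on the same trick, sticking a call_count attribute onto a function object.

```python
import functools


def count_calls(func):
    """Decorator that sticks a call_count attribute onto the wrapper."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.call_count += 1
        return func(*args, **kwargs)
    wrapper.call_count = 0
    return wrapper


@count_calls
def foo(arg):
    return arg * 2


for i in range(5):
    foo(i)
count_at_first_point = foo.call_count   # store the count at the first point

for i in range(3):
    foo(i)
# Subtract the stored value from the count at the second point.
calls_between = foo.call_count - count_at_first_point

print(foo.call_count)  # 8
print(calls_between)   # 3
```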

Enjoy.


Planet Python

Matillion 101: Completing My First Job in Matillion

If you’ve been following along in this Matillion series, you’ve already set up and logged into Matillion. From here, I’m going to talk through the first part of the first job I created in Matillion, leaning heavily on Holt’s recent post about setting up an API profile.

Pulling in Metadata to Matillion

For my first Matillion job, I wanted to pull all the data from XKCD comics into a single table in Snowflake. I’m a big XKCD fan, and a couple months ago, a friend shared with me that you can get all the comics and metadata through a JSON interface. More information can be found here, where you’ll see the structure of the URLs for each comic. It’s important to note for this Matillion project that each comic has its own URL, so we’ll need to incorporate an iterator to loop through all the URLs to get all the data we want.

Following the instructions in Holt’s blog, I got my initial API profile set up, generated an RSD profile (shown below) and built an orchestration job that took the output from the one URL (https://xkcd.com/1/info.0.json) I put in the URI and dropped it into a table:

orchestration job in Matillion with API profile

However, that’s only one URL. Now, I need to iterate through the other few thousand to get all the data. To iterate through this, I do three things:

  1. Create an environment variable that will allow me to pass my iterated values through to my API
  2. Update my API profile to include this new environment variable
  3. Update my API query properties to look for this new environment variable

Creating an Environment Variable

To create an environment variable, click on Project in the top-left corner and then Manage Environment Variables. Add a new one. Since the value I’m changing is a number to signify the comic number, I choose a type of numeric and name it urliterate:

edit environment variable in Matillion

Updating the API Profile to Include New Environment Variable

Next, I update the API profile. Click on Project > Edit API Profile. Here, I do three things. First, I create a parameter that lets Matillion know how I’m going to refer to our environment variable in the .rsd file. I add a new Other parameter and, for the value, specify that any time I write urliterate in my .rsd file, I’ll be referring to my urliterate environment variable, which is indicated by ${}:

urliterate=${URLIterate}

Next, as the Matillion support explains, “We are creating a dummy column for this parameter, so we can use the parameter in the API Component.” To do this, I name the input the same as our parameter value (urliterate) and specify the type (numeric) (Line 17):

<input name="urliterate" xs:type="numeric"/>

Lastly, on line 20, I specify where in the URL the environment variable is used. Again, the Matillion support explains that “the required format to specify a parameter in the RSD file is”:

[_input.<parametername>].

I add that in to the URI value:

<rsb:set attr="uri" value="https://xkcd.com/[_input.urliterate]/info.0.json" />
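To make the effect of the placeholder concrete, here is a small Python sketch of the substitution. It is only an illustration of the idea, not how Matillion implements it; build_uri() is a made-up helper, and only the URL pattern comes from the profile above.

```python
URI_TEMPLATE = "https://xkcd.com/[_input.urliterate]/info.0.json"


def build_uri(template, urliterate):
    """Swap the [_input.urliterate] placeholder for a comic number."""
    return template.replace("[_input.urliterate]", str(urliterate))


print(build_uri(URI_TEMPLATE, 1))  # https://xkcd.com/1/info.0.json

# One URI per value of the iterated variable.
uris = [build_uri(URI_TEMPLATE, n) for n in range(1, 4)]
print(uris[-1])                    # https://xkcd.com/3/info.0.json
```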

Update the API Query Properties to Search for the Environment Variable

Now that the API profile is updated, I update it in my API query properties as well. For this API component, I want to cycle through our urliterate values in the API profile (xkcd_urloutput). To do this, I input SQL to specify that I want all values from this table where urliterate=${urliterate}.

To see a SQL input option, change from Basic to Advanced mode in the API query properties. For the SQL query, I type:

SELECT *
FROM xkcd_urloutput
WHERE urliterate = '${urliterate}'

Now I’m ready to cycle through this environment variable. However, if I just drop an iterator on top of the API query tool, the results from each subsequent query are going to overwrite the first. To create a full table with all of the data, I set up a transformation job to take the output from one URL, write it to a master table and then append subsequent records to that master table.

A Closer Look at the Iterator Process

Let’s break down that process a little more and what it’s going to look like.

Job 1 – Create an Output_Iterated table where we can append iterated records (and call Job 2):

output_iterated table in Matillion

Job 2 – Have Matillion run through URL 1 (https://xkcd.com/1/info.0.json):

  1. Run API query
  2. Write to xkcd_output
  3. Call Job 3

Matillion run URL query

Job 3 – Take xkcd_output table and write to Output_iterated table:

writing data to output_iterated table in Matillion

Let’s walk through each of those in a bit more detail.

Creating an Output_Iterated Table

The first job, Conduct Iteration, uses a Create Table component called Output Iterated. I manually input the column names in my XKCD dataset. Next in this job, I want to point Matillion to Job 2, which is the Comic URL Input orchestration job we created above, so I drag that in the workflow.

I’ll want to iterate Jobs 2 and 3, but we’ll come back to that more a little later. I need the result of Create Table components for the other jobs in this workflow, so I right-click on the Create Table component and run the component:

create table component and run

The second job was the result of the initial API query setup and calls a specific URL and writes out the results. With only one URL, the last component was an output table, but since I have multiple outputs that I need to append, I call Job 3.

Note: You will need to create the third job before you’re able to pull it into the second.

The last job is appending each individual API query to the master table; I take the xkcd_output that’s a result of the Comic URL Input and write it to the new table I created, Output Iterated. I ensure the column mapping between the two tables matches up in the Column Mapping properties:

column mapping in Matillion

Iterating Through the Full Process

The final step is to iterate through this entire process. Since my API defines each unique URL with a number, the loop iterator was exactly what I needed to pass that value to the API query. Since I also needed to loop through writing out the data from that query, I put the iterator on the Comic URL Input orchestration job within the Conduct Iteration job.

Setting up the properties of iterator, I’ve already set up the environment variable I want to loop through (urliterate) and can now set our Starting Value, End Value and Increment Value. I start with 10 to test my iteration job:

iteration job in Matillion

Now, my first Matillion project is complete! Below are the three jobs we’ve created:

Matillion project individual jobs

How Our Matillion Jobs Work Together

When we run our Conduct Iteration job, we create a table (Output_Iterated) to hold our final results and kick off the first URL. For that URL, we go to the Comic URL Input job, pull the data using the API query and write it to our xkcd_output table; the Write Iterator job then writes that data to our Output_Iterated table.

From there, we go back to the Conduct Iteration job and start the second URL. In the Comic URL Input job, the API query gets data from that URL and writes it to the xkcd_output table (overwriting the data from the first URL). In the Write Iterator job, we then take this second record and append it to the first in our Output_Iterated table. We continue this process through all 10 (or n) URLs. From there, this data is output into Snowflake and I have all data from all 10 URLs:

iterator data in Snowflake
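For readers who think in code, the three jobs can be mimicked with a short Python sketch. This is an analogy only: fake_api_query() is a stand-in that returns a made-up record instead of calling xkcd.com, the lists stand in for the Snowflake tables, and treating the end value as inclusive is an assumption about the loop iterator.

```python
def fake_api_query(comic_number):
    # Stand-in for the API query component; no network call is made.
    return {
        "num": comic_number,
        "uri": "https://xkcd.com/{}/info.0.json".format(comic_number),
    }


def conduct_iteration(start, end, increment=1):
    # Job 1: create the master table once, before the loop.
    output_iterated = []
    # Assumption: the end value is inclusive.
    for urliterate in range(start, end + 1, increment):
        # Job 2: the staging table is overwritten on every pass.
        xkcd_output = [fake_api_query(urliterate)]
        # Job 3: append the staged rows to the master table.
        output_iterated.extend(xkcd_output)
    return output_iterated


rows = conduct_iteration(1, 10)
print(len(rows))                        # 10
print(rows[0]["num"], rows[-1]["num"])  # 1 10
```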

WOOHOO! Now, I can access this data altogether in Tableau or any other BI tool … any XKCD fans have fun ideas to explore?

The post Matillion 101: Completing My First Job in Matillion appeared first on InterWorks.

InterWorks

Real Python: How to Work With a PDF in Python

The Portable Document Format or PDF is a file format that can be used to present and exchange documents reliably across operating systems. While the PDF was originally invented by Adobe, it is now an open standard that is maintained by the International Organization for Standardization (ISO). You can work with a preexisting PDF in Python by using the PyPDF2 package.

PyPDF2 is a pure-Python package that you can use for many different types of PDF operations.

By the end of this article, you’ll know how to do the following:

  • Extract document information from a PDF in Python
  • Rotate pages
  • Merge PDFs
  • Split PDFs
  • Add watermarks
  • Encrypt a PDF

Let’s get started!

History of pyPdf, PyPDF2, and PyPDF4

The original pyPdf package was released way back in 2005. The last official release of pyPdf was in 2010. After a lapse of around a year, a company called Phasit sponsored a fork of pyPdf called PyPDF2. The code was written to be backwards compatible with the original and worked quite well for several years, with its last release being in 2016.

There was a brief series of releases of a package called PyPDF3, and then the project was renamed to PyPDF4. All of these projects do pretty much the same thing, but the biggest difference between pyPdf and PyPDF2+ is that the latter versions added Python 3 support. There is a different Python 3 fork of the original pyPdf for Python 3, but that one has not been maintained for many years.

While PyPDF2 was recently abandoned, the new PyPDF4 does not have full backwards compatibility with PyPDF2. Most of the examples in this article will work perfectly fine with PyPDF4, but there are some that cannot, which is why PyPDF4 is not featured more heavily in this article. Feel free to swap out the imports for PyPDF2 with PyPDF4 and see how it works for you.

pdfrw: An Alternative

Patrick Maupin created a package called pdfrw that can do many of the same things that PyPDF2 does. You can use pdfrw for all of the same sorts of tasks that you will learn how to do in this article for PyPDF2, with the notable exception of encryption.

The biggest difference when it comes to pdfrw is that it integrates with the ReportLab package so that you can take a preexisting PDF and build a new one with ReportLab using some or all of the preexisting PDF.

Installation

Installing PyPDF2 can be done with pip or conda if you happen to be using Anaconda instead of regular Python.

Here’s how you would install PyPDF2 with pip:

$ pip install pypdf2

The install is quite quick as PyPDF2 does not have any dependencies. You will likely spend as much time downloading the package as you will installing it.

Now let’s move on and learn how to extract some information from a PDF.

How to Extract Document Information From a PDF in Python

You can use PyPDF2 to extract metadata and some text from a PDF. This can be useful when you’re doing certain types of automation on your preexisting PDF files.

Here are the current types of data that can be extracted:

  • Author
  • Creator
  • Producer
  • Subject
  • Title
  • Number of pages

You need to go find a PDF to use for this example. You can use any PDF you have handy on your machine. To make things easy, I went to Leanpub and grabbed a sample of one of my books for this exercise. The sample you want to download is called reportlab-sample.pdf.

Let’s write some code using that PDF and learn how you can get access to these attributes:

# extract_doc_info.py

from PyPDF2 import PdfFileReader

def extract_information(pdf_path):
    with open(pdf_path, 'rb') as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()

    txt = f"""
    Information about {pdf_path}:

    Author: {information.author}
    Creator: {information.creator}
    Producer: {information.producer}
    Subject: {information.subject}
    Title: {information.title}
    Number of pages: {number_of_pages}
    """

    print(txt)
    return information

if __name__ == '__main__':
    path = 'reportlab-sample.pdf'
    extract_information(path)

Here you import PdfFileReader from the PyPDF2 package. The PdfFileReader is a class with several methods for interacting with PDF files. In this example, you call .getDocumentInfo(), which will return an instance of DocumentInformation. This contains most of the information that you’re interested in. You also call .getNumPages() on the reader object, which returns the number of pages in the document.

Note: That last code block uses Python 3’s new f-strings for string formatting. If you’d like to learn more, you can check out Python 3’s f-Strings: An Improved String Formatting Syntax (Guide).

The information variable has several instance attributes that you can use to get the rest of the metadata you want from the document. You print out that information and also return it for potential future use.

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

Now you’re ready to learn about rotating PDF pages.

How to Rotate Pages

Occasionally, you will receive PDFs that contain pages that are in landscape mode instead of portrait mode. Or perhaps they are even upside down. This can happen when someone scans a document to PDF or email. You could print the document out and read the paper version or you can use the power of Python to rotate the offending pages.

For this example, you can go and pick out a Real Python article and print it to PDF.

Let’s learn how to rotate a few of the pages of that article with PyPDF2:

# rotate_pages.py

from PyPDF2 import PdfFileReader, PdfFileWriter

def rotate_pages(pdf_path):
    pdf_writer = PdfFileWriter()
    # Use the function's own argument rather than the global `path`.
    pdf_reader = PdfFileReader(pdf_path)
    # Rotate page 90 degrees to the right
    page_1 = pdf_reader.getPage(0).rotateClockwise(90)
    pdf_writer.addPage(page_1)
    # Rotate page 90 degrees to the left
    page_2 = pdf_reader.getPage(1).rotateCounterClockwise(90)
    pdf_writer.addPage(page_2)
    # Add a page in normal orientation
    pdf_writer.addPage(pdf_reader.getPage(2))

    with open('rotate_pages.pdf', 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    path = 'Jupyter_Notebook_An_Introduction.pdf'
    rotate_pages(path)

For this example, you need to import the PdfFileWriter in addition to PdfFileReader because you will need to write out a new PDF. rotate_pages() takes in the path to the PDF that you want to modify. Within that function, you will need to create a writer object that you can name pdf_writer and a reader object called pdf_reader.

Next, you can use .getPage() to get the desired page. Here you grab page zero, which is the first page. Then you call the page object’s .rotateClockwise() method and pass in 90 degrees. Then for the second page, you call .rotateCounterClockwise() and pass it 90 degrees as well.

Note: The PyPDF2 package only allows you to rotate a page in increments of 90 degrees. You will receive an AssertionError otherwise.
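If the angle comes from user input, you may prefer to validate it before handing it to the rotation methods. The helper below is a hypothetical sketch, not part of PyPDF2: check_rotation() is a made-up name, and it raises ValueError with a readable message instead of letting the AssertionError surface.

```python
def check_rotation(angle):
    """Return angle normalised to 0, 90, 180 or 270; reject anything else."""
    if angle % 90 != 0:
        raise ValueError(
            "rotation must be a multiple of 90 degrees, got {}".format(angle)
        )
    return angle % 360


print(check_rotation(90))    # 90
print(check_rotation(450))   # 90
print(check_rotation(-90))   # 270

try:
    check_rotation(45)
except ValueError as error:
    print(error)             # rotation must be a multiple of 90 degrees, got 45
```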

After each call to the rotation methods, you call .addPage(). This will add the rotated version of the page to the writer object. The last page that you add to the writer object is page 3 without any rotation done to it.

Finally you write out the new PDF using .write(). It takes a file-like object as its parameter. This new PDF will contain three pages. The first two will be rotated in opposite directions of each other and be in landscape while the third page is a normal page.

Now let’s learn how you can merge multiple PDFs into one.

How to Merge PDFs

There are many situations where you will want to take two or more PDFs and merge them together into a single PDF. For example, you might have a standard cover page that needs to go on to many types of reports. You can use Python to help you do that sort of thing.

For this example, you can open up a PDF and print a page out as a separate PDF. Then do that again, but with a different page. That will give you a couple of inputs to use for example purposes.

Let’s go ahead and write some code that you can use to merge PDFs together:

# pdf_merging.py

from PyPDF2 import PdfFileReader, PdfFileWriter

def merge_pdfs(paths, output):
    pdf_writer = PdfFileWriter()

    for path in paths:
        pdf_reader = PdfFileReader(path)
        for page in range(pdf_reader.getNumPages()):
            # Add each page to the writer object
            pdf_writer.addPage(pdf_reader.getPage(page))

    # Write out the merged PDF
    with open(output, 'wb') as out:
        pdf_writer.write(out)

if __name__ == '__main__':
    paths = ['document1.pdf', 'document2.pdf']
    merge_pdfs(paths, output='merged.pdf')

You can use merge_pdfs() when you have a list of PDFs that you want to merge together. You will also need to know where to save the result, so this function takes a list of input paths and an output path.

Then you loop over the inputs and create a PDF reader object for each of them. Next you will iterate over all the pages in the PDF file and use .addPage() to add each of those pages to the writer object.

Once you’re finished iterating over all of the pages of all of the PDFs in your list, you will write out the result at the end.

One item I would like to point out is that you could enhance this script a bit by adding in a range of pages to be added if you didn’t want to merge all the pages of each PDF. If you’d like a challenge, you could also create a command line interface for this function using Python’s argparse module.
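As a starting point for that challenge, here is a minimal sketch of the argument parsing using argparse from the standard library. The option names and the build_parser() helper are assumptions made for illustration; wiring the parsed values into merge_pdfs() is left as a comment so that the sketch runs without PyPDF2 installed.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical pdf_merging command line."""
    parser = argparse.ArgumentParser(description="Merge PDFs into one file.")
    parser.add_argument(
        "paths", nargs="+", help="input PDF paths, in merge order"
    )
    parser.add_argument(
        "-o", "--output", default="merged.pdf", help="path of the merged PDF"
    )
    return parser


# Parse an explicit argument list here; a real script would call
# parse_args() with no arguments to read sys.argv instead.
args = build_parser().parse_args(["document1.pdf", "document2.pdf", "-o", "out.pdf"])
print(args.paths)   # ['document1.pdf', 'document2.pdf']
print(args.output)  # out.pdf

# In the real script, the next step would be:
# merge_pdfs(args.paths, output=args.output)
```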

Let’s find out how to do the opposite of merging!

How to Split PDFs

There are times where you might have a PDF that you need to split up into multiple PDFs. This is especially true of PDFs that contain a lot of scanned-in content, but there are a plethora of good reasons for wanting to split a PDF.

Here’s how you can use PyPDF2 to split your PDF into multiple files:

# pdf_splitting.py

from PyPDF2 import PdfFileReader, PdfFileWriter

def split(path, name_of_split):
    pdf = PdfFileReader(path)
    for page in range(pdf.getNumPages()):
        pdf_writer = PdfFileWriter()
        pdf_writer.addPage(pdf.getPage(page))

        output = f'{name_of_split}{page}.pdf'
        with open(output, 'wb') as output_pdf:
            pdf_writer.write(output_pdf)

if __name__ == '__main__':
    path = 'Jupyter_Notebook_An_Introduction.pdf'
    split(path, 'jupyter_page')

In this example, you once again create a PDF reader object and loop over its pages. For each page in the PDF, you will create a new PDF writer instance and add a single page to it. Then you will write that page out to a uniquely named file. When the script is finished running, you should have each page of the original PDF split into separate PDFs.

Now let’s take a moment to learn how you can add a watermark to your PDF.

How to Add Watermarks

Watermarks are identifying images or patterns on printed and digital documents. Some watermarks can only be seen in special lighting conditions. The reason watermarking is important is that it allows you to protect your intellectual property, such as your images or PDFs. Another term for watermark is overlay.

You can use Python and PyPDF2 to watermark your documents. You need to have a PDF that only contains your watermark image or text.

Let’s learn how to add a watermark now:

# pdf_watermarker.py

from PyPDF2 import PdfFileWriter, PdfFileReader

def create_watermark(input_pdf, output, watermark):
    watermark_obj = PdfFileReader(watermark)
    watermark_page = watermark_obj.getPage(0)

    pdf_reader = PdfFileReader(input_pdf)
    pdf_writer = PdfFileWriter()

    # Watermark all the pages
    for page_num in range(pdf_reader.getNumPages()):
        page = pdf_reader.getPage(page_num)
        # Overlay the watermark on top of the current page
        page.mergePage(watermark_page)
        pdf_writer.addPage(page)

    with open(output, 'wb') as out:
        pdf_writer.write(out)

if __name__ == '__main__':
    create_watermark(
        input_pdf='Jupyter_Notebook_An_Introduction.pdf',
        output='watermarked_notebook.pdf',
        watermark='watermark.pdf')

create_watermark() accepts three arguments:

  1. input_pdf: the PDF file path to be watermarked
  2. output: the path where you want to save the watermarked version of the PDF
  3. watermark: a PDF that contains your watermark image or text

In the code, you open up the watermark PDF and grab just the first page from the document as that is where your watermark should reside. Then you create a PDF reader object using the input_pdf and a generic pdf_writer object for writing out the watermarked PDF.

The next step is to iterate over the pages in the input_pdf. This is where the magic happens. You will need to call .mergePage() and pass it the watermark_page. When you do that, it will overlay the watermark_page on top of the current page. Then you add that newly merged page to your pdf_writer object.

Finally, you write the newly watermarked PDF out to disk, and you’re done!

The last topic you will learn about is how PyPDF2 handles encryption.

How to Encrypt a PDF

PyPDF2 currently only supports adding a user password and an owner password to a preexisting PDF. In PDF land, an owner password will basically give you administrator privileges over the PDF and allow you to set permissions on the document. On the other hand, the user password just allows you to open the document.

As far as I can tell, PyPDF2 doesn’t actually allow you to set any permissions on the document even though it does allow you to set the owner password.

Regardless, this is how you can add a password, which will also inherently encrypt the PDF:

# pdf_encrypt.py

from PyPDF2 import PdfFileWriter, PdfFileReader

def add_encryption(input_pdf, output_pdf, password):
    pdf_writer = PdfFileWriter()
    pdf_reader = PdfFileReader(input_pdf)

    for page in range(pdf_reader.getNumPages()):
        pdf_writer.addPage(pdf_reader.getPage(page))

    pdf_writer.encrypt(user_pwd=password, owner_pwd=None,
                       use_128bit=True)

    with open(output_pdf, 'wb') as fh:
        pdf_writer.write(fh)

if __name__ == '__main__':
    add_encryption(input_pdf='reportlab-sample.pdf',
                   output_pdf='reportlab-encrypted.pdf',
                   password='twofish')

add_encryption() takes in the input and output PDF paths as well as the password that you want to add to the PDF. It then opens a PDF writer and a reader object, as before. Since you will want to encrypt the entire input PDF, you will need to loop over all of its pages and add them to the writer.

The final step is to call .encrypt(), which takes the user password, the owner password, and whether or not 128-bit encryption should be added. The default is for 128-bit encryption to be turned on. If you set it to False, then 40-bit encryption will be applied instead.

Note: PDF encryption uses either RC4 or AES (Advanced Encryption Standard) to encrypt the PDF according to pdflib.com.

Just because you have encrypted your PDF does not mean it is necessarily secure. There are tools to remove passwords from PDFs. If you’d like to learn more, Carnegie Mellon University has an interesting paper on the topic.

Conclusion

The PyPDF2 package is quite useful and is usually pretty fast. You can use PyPDF2 to automate large jobs and leverage its capabilities to help you do your job better!

In this tutorial, you learned how to do the following:

  • Extract metadata from a PDF
  • Rotate pages
  • Merge and split PDFs
  • Add watermarks
  • Add encryption

Also keep an eye on the newer PyPDF4 package as it will likely replace PyPDF2 soon. You might also want to check out pdfrw, which can do many of the same things that PyPDF2 can do.




Planet Python

codingdirectional: Reverse a number with Python

In this snippet, we are going to create a Python function that reverses the digits of a number. This is one of the questions on Codewars. If you pass -123 to the function, you will get -321. If you pass 1000, you will get 1, because the leading zeros of the reversed number are dropped. Below is the entire solution.

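def reverse_number(n):
    # Work on the digits as a list of characters.
    num_list = list(str(n))
    num_list.reverse()

    # For a negative number the minus sign ends up last; move it to the front.
    if "-" in num_list:
        num_list.pop()
        num_list.insert(0, "-")

    # int() drops any leading zeros, so 1000 becomes 1.
    return int("".join(num_list))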
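For comparison, here is a shorter sketch of the same idea that uses string slicing instead of a list. This is not the original solution, just one possible alternative, and the name reverse_number_sliced is made up for this example.

```python
def reverse_number_sliced(n):
    """Reverse the digits of an integer, keeping its sign."""
    sign = -1 if n < 0 else 1
    # str(abs(n))[::-1] reverses the digits; int() drops leading zeros.
    return sign * int(str(abs(n))[::-1])


print(reverse_number_sliced(-123))  # -321
print(reverse_number_sliced(1000))  # 1
```

Handling the sign separately avoids having to move the "-" character around after reversing.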

If you follow my website, you know that I usually write simple Python code and post it here. Starting with the next article, however, I will stop posting Python code for a while and instead talk about my Python journey and the cool software related to Python. I hope you will appreciate this new style of writing, and that it makes learning Python a lot more fun than staring at boring and sometimes long Python snippets.


If you have another solution to this problem, leave a comment below.
