How to Install PyPDF2 for Python on Mac

To check whether PyPDF2 is available from conda-forge, run conda search pypdf2 --channel conda-forge. Conda-forge is a community-led conda channel of installable packages. To provide high-quality builds, the build process has been automated within the conda-forge GitHub organization, which contains one repository for each installable package.

If you want an IDE for Python, one option is to install Eclipse IDE for JavaScript and Web Developers (based on Eclipse Neon 4.6) and then add the PyDev plugin. PyDev and Eclipse work together as a Python IDE that is updated regularly by the Python community. If you only need a text editor, TextWrangler is a free download from the App Store.

The PyPDF2 package is a pure-Python PDF library that you can use for splitting, merging, cropping and transforming pages in your PDFs. According to the PyPDF2 website, you can also use PyPDF2 to add data, viewing options and passwords to the PDFs too. Finally you can use PyPDF2 to extract text and metadata from your PDFs.

PyPDF2 is actually a fork of the original pyPdf, which was written by Mathieu Fenniak and released in 2005. However, the original pyPdf’s last release was in 2014. A company called Phaseit, Inc. spoke with Mathieu and ended up sponsoring PyPDF2 as a fork of pyPdf.

At the time of writing this book, the PyPDF2 package hasn’t had a release since 2016. However, it is still a solid and useful package that is worth your time to learn.

The following lists what we will be learning in this article:

Extracting metadata
Splitting documents
Merging 2 PDF files into 1
Rotating pages
Overlaying / Watermarking Pages
Encrypting / decrypting

Let’s start by learning how to install PyPDF2!

Installation

PyPDF2 is a pure Python package, so you can install it using pip (assuming pip is in your system’s path):

    python -m pip install pypdf2

As usual, you should install 3rd party Python packages to a Python virtual environment to make sure that it works the way you want it to.

Extracting Metadata from PDFs

You can use PyPDF2 to extract a fair amount of useful data from any PDF. For example, you can learn the author of the document, its title and subject, and how many pages there are. Let’s find out how by downloading the sample of this book from Leanpub. The sample I downloaded was called “reportlab-sample.pdf”. I will include this PDF for you to use in the GitHub source code as well.

Here’s the code:
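A minimal sketch of a get_info function along the lines described in this section. It assumes a PyPDF2 release older than 3.0, where the legacy names PdfFileReader, getDocumentInfo and getNumPages are still available. The FIELDS tuple and the info_to_dict helper are illustrative names, not part of PyPDF2.

```python
# Attribute names read from PyPDF2's DocumentInformation object.
FIELDS = ('author', 'creator', 'producer', 'subject', 'title')


def info_to_dict(info):
    """Copy the common metadata attributes into a plain dict (hypothetical helper)."""
    return {name: getattr(info, name, None) for name in FIELDS}


def get_info(path):
    """Return (metadata dict, page count) for the PDF at `path`."""
    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader  # legacy name; assumes PyPDF2 < 3.0

    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Read everything we need while the file is still open.
        metadata = info_to_dict(pdf.getDocumentInfo())
        number_of_pages = pdf.getNumPages()
    return metadata, number_of_pages
```

Calling get_info('reportlab-sample.pdf') should then give you a dictionary of the five fields plus the page count. Any field the PDF does not define comes back as None.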

Here we import the PdfFileReader class from PyPDF2. This class gives us the ability to read a PDF and extract data from it using various accessor methods. The first thing we do is create our own get_info function that accepts a PDF file path as its only argument. Then we open the file in read-only binary mode. Next we pass that file handle into PdfFileReader and create an instance of it.

Now we can extract some information from the PDF by using the getDocumentInfo method. This will return an instance of PyPDF2.pdf.DocumentInformation, which has the following useful attributes, among others:

author
creator
producer
subject
title

If you print out the DocumentInformation object, this is what you will see:

We can also get the number of pages in the PDF by calling the getNumPages method.

Extracting Text from PDFs

PyPDF2 has limited support for extracting text from PDFs. It doesn’t have built-in support for extracting images, unfortunately. I have seen some recipes on StackOverflow that use PyPDF2 to extract images, but the code examples seem to be pretty hit or miss.

Let’s try to extract the text from the first page of the PDF that we downloaded in the previous section:
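A minimal sketch of such an extractor, assuming a PyPDF2 release older than 3.0 (legacy names PdfFileReader, getPage and extractText). The function names are illustrative, and has_visible_text is a hypothetical helper for checking whether anything other than whitespace came back.

```python
def has_visible_text(text):
    """Return True if `text` contains anything other than whitespace."""
    return bool(text and text.strip())


def text_extractor(path, page_number=1):
    """Return the text PyPDF2 can extract from one page of the PDF at `path`."""
    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader  # legacy name; assumes PyPDF2 < 3.0

    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Page indexes are zero-based, so 1 selects the second page.
        page = pdf.getPage(page_number)
        return page.extractText()
```

You could call text_extractor('reportlab-sample.pdf') and pass the result to has_visible_text to see whether the page yielded any usable text.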

You will note that this code starts out in much the same way as our previous example. We still need to create an instance of PdfFileReader. But this time, we grab a page using the getPage method. PyPDF2 is zero-based, much like most things in Python, so when you pass it a one, it actually grabs the second page. The first page in this case is just an image, so it wouldn’t have any text.

Interestingly, if you run this example against the sample PDF, you will find that it doesn’t return any text. All I got was a series of line break characters. Unfortunately, PyPDF2 has pretty limited support for extracting text. Even when it is able to extract text, the text may not be in the order you expect and the spacing may be different as well.

To get this example code to work, you will need to try running it against a different PDF. I found one on the United States Internal Revenue Service website here: https://www.irs.gov/pub/irs-pdf/fw9.pdf

This is a W9 form for people who are self-employed or contract employees. It can be used in other situations too. Anyway, I downloaded it as w9.pdf. If you use that PDF instead of the sample one, it will happily extract some of the text from page 2. I won’t reproduce the output here as it is kind of lengthy though.

Splitting PDFs

The PyPDF2 package gives you the ability to split up a single PDF into multiple ones. You just need to tell it how many pages you want. For this example, we will open up the W9 PDF from the previous example and loop over all six of its pages. We will split off each page and turn it into its own standalone PDF.

Let’s find out how:
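A minimal sketch of a pdf_splitter function, assuming a PyPDF2 release older than 3.0 (legacy names PdfFileReader, PdfFileWriter, getNumPages, getPage and addPage). The page_filename helper and the exact output file name pattern are assumptions made for illustration.

```python
import os


def page_filename(path, page_index):
    """Build the output name: input name without extension, 'page', page number + 1."""
    base = os.path.splitext(os.path.basename(path))[0]
    return '{}_page_{}.pdf'.format(base, page_index + 1)


def pdf_splitter(path):
    """Write each page of the PDF at `path` to its own file in the current directory."""
    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    with open(path, 'rb') as src:
        pdf = PdfFileReader(src)
        for page_index in range(pdf.getNumPages()):
            # A fresh writer per page gives one single-page PDF per iteration.
            writer = PdfFileWriter()
            writer.addPage(pdf.getPage(page_index))
            with open(page_filename(path, page_index), 'wb') as out:
                writer.write(out)
```

Running pdf_splitter('w9.pdf') on the six-page form would produce w9_page_1.pdf through w9_page_6.pdf.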

For this example, we need to import both the PdfFileReader and the PdfFileWriter. Then we create a fun little function called pdf_splitter. It accepts the path of the input PDF. The first line of this function will grab the name of the input file, minus the extension. Next we open the PDF up and create a reader object. Then we loop over all the pages using the reader object’s getNumPages method.

Inside of the for loop, we create an instance of PdfFileWriter. We then add a page to our writer object using its addPage method. This method accepts a page object, so to get the page object, we call the reader object’s getPage method. Now we have added one page to our writer object. The next step is to create a unique file name, which we do by using the original file name plus the word “page” plus the page number + 1. We add the one because PyPDF2’s page numbers are zero-based, so page 0 is actually page 1.

Finally we open the new file name in write-binary mode and use the PDF writer object’s write method to write the object’s contents to disk.

Merging Multiple PDFs Together

Now that we have a bunch of PDFs, let’s learn how we might take them and merge them back together. One useful use case for doing this is for businesses to merge their dailies into a single PDF. I have needed to merge PDFs for work and for fun. One project that sticks out in my mind is scanning documents in. Depending on the scanner you have, you might end up scanning a document into multiple PDFs, so being able to join them together again can be wonderful.

When the original pyPdf came out, the only way to get it to merge multiple PDFs together was like this:
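A minimal sketch of that page-by-page approach, assuming a PyPDF2 release older than 3.0 (legacy names PdfFileReader, PdfFileWriter, getNumPages, getPage and addPage). The page_order_key helper is a hypothetical addition that keeps files such as w9_page_10.pdf after w9_page_2.pdf when sorting.

```python
import re
from contextlib import ExitStack


def page_order_key(path):
    """Sort key: the number just before '.pdf', or 0 if there is none."""
    match = re.search(r'(\d+)\.pdf$', path)
    return int(match.group(1)) if match else 0


def merge_pages(output_path, input_paths):
    """Copy every page of every input PDF into one output PDF."""
    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    writer = PdfFileWriter()
    # The input files must stay open until write() has finished.
    with ExitStack() as stack:
        for path in sorted(input_paths, key=page_order_key):
            f = stack.enter_context(open(path, 'rb'))
            reader = PdfFileReader(f)
            for page_index in range(reader.getNumPages()):
                writer.addPage(reader.getPage(page_index))
        with open(output_path, 'wb') as out:
            writer.write(out)
```

The input paths could come from glob.glob('w9_page_*.pdf') if you split the W9 form into files with that naming pattern.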

Here we create a PdfFileWriter object and several PdfFileReader objects. For each PDF path, we create a PdfFileReader object and then loop over its pages, adding each and every page to our writer object. Then we write out the writer object’s contents to disk.

PyPDF2 made this a bit simpler by creating a PdfFileMerger class:
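A minimal sketch using PdfFileMerger, assuming a PyPDF2 release older than 3.0 where that legacy class name is still usable. The function name and the empty-input check are illustrative choices.

```python
def merge_pdfs(output_path, input_paths):
    """Append whole PDF documents, in the order given, into one output PDF."""
    input_paths = list(input_paths)
    if not input_paths:
        raise ValueError('expected at least one input PDF')

    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileMerger  # legacy name; assumes PyPDF2 < 3.0

    pdf_merger = PdfFileMerger()
    for path in input_paths:
        # append() takes a path or an open file and adds the entire document.
        pdf_merger.append(path)
    with open(output_path, 'wb') as out:
        pdf_merger.write(out)
    pdf_merger.close()
```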

Here we just need to create the PdfFileMerger object and then loop through the PDF paths, appending them to our merging object. PyPDF2 will automatically append the entire document so you don’t need to loop through all the pages of each document yourself. Then we just write it out to disk.

The PdfFileMerger class also has a merge method that you can use. Its code definition looks like this:

Basically the merge method allows you to tell PyPDF2 where to merge a document by page number. So if you have created a merging object with 3 pages in it, you can tell the merging object to merge the next document in at a specific position. This allows the developer to do some pretty complex merging operations. Give it a try and see what you can do!
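A small sketch of positional merging, assuming a PyPDF2 release older than 3.0 and that merge takes the position first and the file second. The merged_order function is a hypothetical model of where pages end up, based on my understanding that the new pages are spliced in at the given zero-based index, so treat that as an assumption to verify against your own files.

```python
def merged_order(existing_pages, new_pages, position):
    """Model the result: new pages are spliced in at index `position`."""
    result = list(existing_pages)
    result[position:position] = new_pages
    return result


def insert_pdf(base_path, extra_path, position, output_path):
    """Insert all pages of `extra_path` into `base_path` at `position`."""
    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileMerger  # legacy name; assumes PyPDF2 < 3.0

    pdf_merger = PdfFileMerger()
    pdf_merger.append(base_path)
    pdf_merger.merge(position, extra_path)
    with open(output_path, 'wb') as out:
        pdf_merger.write(out)
    pdf_merger.close()
```

With a three-page base document, a position of 1 would place the new pages between the original first and second pages under this model.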

Rotating Pages

PyPDF2 gives you the ability to rotate pages. However, you must rotate in 90-degree increments. You can rotate the PDF pages either clockwise or counter-clockwise. Here’s a simple example:
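A minimal sketch of such a rotation script, assuming a PyPDF2 release older than 3.0 (legacy page methods rotateClockwise and rotateCounterClockwise) and an input PDF with at least three pages. The check_angle helper and the default output name are illustrative.

```python
def check_angle(angle):
    """Reject angles that are not a multiple of 90 degrees."""
    if angle % 90 != 0:
        raise ValueError('rotation must be a multiple of 90 degrees')
    return angle


def rotate_pages(pdf_path, output_path='rotate_pages.pdf', angle=90):
    """Rotate page 1 clockwise and page 2 counter-clockwise; copy page 3 as is."""
    angle = check_angle(angle)

    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    with open(pdf_path, 'rb') as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()

        page_1 = reader.getPage(0)
        page_1.rotateClockwise(angle)
        writer.addPage(page_1)

        page_2 = reader.getPage(1)
        page_2.rotateCounterClockwise(angle)
        writer.addPage(page_2)

        # The third page keeps its normal orientation.
        writer.addPage(reader.getPage(2))

        with open(output_path, 'wb') as out:
            writer.write(out)
```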

Here we create our PDF reader and writer objects as before. Then we get the first and second pages of the PDF that we passed in. We then rotate the first page 90 degrees clockwise or to the right. Then we rotate the second page 90 degrees counter-clockwise. Finally we add the third page in its normal orientation to the writer object and write out our new 3-page PDF file.

If you open the PDF, you will find that the first two pages are now rotated in opposite directions of each other with the third page in its normal orientation.

Overlaying / Watermarking Pages

PyPDF2 also supports merging PDF pages together, or overlaying pages on top of each other. This can be useful if you want to watermark the pages in your PDF. For example, one of the eBook distributors I use will “watermark” the PDF versions of my book with the buyer’s email address. Another use case that I have seen is to add printer control marks to the edge of the page to tell the printer when a certain document has reached its end.

For this example we will take one of the logos I use for my blog, “The Mouse vs. the Python”, and overlay it on top of the W9 form from earlier:
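A minimal sketch of a watermarking function, assuming a PyPDF2 release older than 3.0 (legacy page method mergePage) and a watermark stored as the first page of its own PDF. The function name, parameter names and the same-path check are illustrative choices.

```python
import os


def create_watermark(input_pdf, output_pdf, watermark_pdf):
    """Overlay the first page of `watermark_pdf` on every page of `input_pdf`."""
    # Writing to the file we are still reading from would corrupt it.
    if os.path.abspath(input_pdf) == os.path.abspath(output_pdf):
        raise ValueError('output_pdf must differ from input_pdf')

    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    with open(watermark_pdf, 'rb') as wm_file, open(input_pdf, 'rb') as src:
        watermark_page = PdfFileReader(wm_file).getPage(0)
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for page_index in range(reader.getNumPages()):
            page = reader.getPage(page_index)
            # Draw the watermark page's content on top of this page.
            page.mergePage(watermark_page)
            writer.addPage(page)
        with open(output_pdf, 'wb') as out:
            writer.write(out)
```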

The first thing we do here is extract the watermark page from the PDF. Then we open the PDF that we want to apply the watermark to. We use a for loop to iterate over each of its pages and call the page object’s mergePage method to apply the watermark. Next we add that watermarked page to our PDF writer object. Once the loop finishes, we write our new watermarked version out to disk.

Here’s what the first page looked like:

That was pretty easy.

PDF Encryption

The PyPDF2 package also supports adding a password and encryption to your existing PDFs. As you may recall from Chapter 10, PDFs support a user password and an owner password. The user password only allows the user to open and read a PDF, but may have some restrictions applied to the PDF that could prevent the user from printing, for example. As far as I can tell, you can’t actually apply any restrictions using PyPDF2 or it’s just not documented well.

Here’s how to add a password to a PDF with PyPDF2:
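A minimal sketch of adding a password, assuming a PyPDF2 release older than 3.0 where PdfFileWriter.encrypt accepts user_pwd, owner_pwd and use_128bit. The function name and the empty-password check are illustrative.

```python
def add_encryption(input_pdf, output_pdf, password):
    """Copy `input_pdf` to `output_pdf`, protected with `password`."""
    if not password:
        raise ValueError('password must not be empty')

    # Imported here so this file still loads where PyPDF2 is not installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter  # assumes PyPDF2 < 3.0

    with open(input_pdf, 'rb') as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for page_index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(page_index))
        # With owner_pwd=None the owner password falls back to the user password.
        writer.encrypt(user_pwd=password, owner_pwd=None, use_128bit=True)
        with open(output_pdf, 'wb') as out:
            writer.write(out)
```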

All we did here was create a set of PDF reader and writer objects and read all the pages with the reader. Then we added those pages to the writer object and set the specified password. If you only set the user password, then the owner password is set to the user password automatically. Whenever you add a password, 128-bit encryption is applied by default. If you set the use_128bit argument to False, then the PDF will be encrypted with 40-bit encryption instead.

Wrapping Up

We covered a lot of useful information in this article. You learned how to extract metadata and text from your PDFs. We found out how to split and merge PDFs. You also learned how to rotate pages in a PDF and apply watermarks. Finally we discovered that PyPDF2 can add encryption and passwords to our PDFs.

PyCharm Installation

1. Go to this link: https://www.jetbrains.com/pycharm/download/ and download the community edition.

2. Install the downloaded file.
Mac: Double click the .dmg file and drag PyCharm to the Application Folder.
Windows: Double click the .exe file and follow the installation steps for the default PyCharm installation.

Launch PyCharm

Mac: Go to the Applications folder and click on the PyCharm icon. Alternatively, you can drag the icon to your dock to open the IDE quickly by clicking on the icon in dock.

Windows: If you followed the default installation process, you will see the PyCharm icon on your desktop. If you don’t find the icon, go to the PyCharm folder, C:\Program Files (x86)\JetBrains\PyCharm 2017.1\bin (the path may be different on your system), and click on the PyCharm.exe file to launch the IDE.

What’s Next

We will create a Python project in PyCharm IDE in the next tutorial.
