EzDevInfo.com

pyPdf

Pure-Python PDF library. This repository is no longer maintained; please see https://github.com/knowah/PyPDF2/ instead.

Language : Python

Generating & Merging PDF Files in Python

I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc).

From the user side, the simplest way to do this would be to start with a PDF file containing the static content, and then use Python to add just the dynamic parts. Is this a simple process?

From a bit of searching, it seems that I can use reportlab for creating content and pyPdf for merging PDFs together. Is this the best approach? Or is there a really funky way that I haven't come across yet?

Thanks!

Source: (StackOverflow)

pypdf Merging multiple pdf files into one pdf

I have 1000+ PDF files that need to be merged into one PDF:

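from pyPdf import PdfFileReader, PdfFileWriter

output = PdfFileWriter()
# placeholder names: filename0000 ... filename1000
filenames = ["filename%04d" % n for n in range(1001)]
for filename in filenames:
    # a new file is opened for every input and never closed
    input = PdfFileReader(file(filename, "rb"))
    pageCount = input.getNumPages()
    for iPage in range(0, pageCount):
        output.addPage(input.getPage(iPage))
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()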

When I run the code above, it fails at around the 500th file, on the line input = PdfFileReader(file(filename, "rb")), with this error message:

IOError: [Errno 24] Too many open files:

I think this is a bug. If not, what should I do?
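
One common way around Errno 24 is to avoid holding a thousand OS file handles open at the same time. A minimal sketch, assuming the reader only needs a seekable file-like object: each file is read fully into memory and its handle is closed straight away. The helper name load_pdf_bytes is hypothetical, and the pyPdf calls appear only in a comment because they are the question's own.

```python
import io
import os
import tempfile


def load_pdf_bytes(path):
    """Read a whole file into memory and close the OS handle right away.

    Hypothetical helper: the returned BytesIO is seekable, so it can stand in
    for an open file wherever a library only needs read()/seek()/tell().
    """
    with open(path, "rb") as handle:   # handle is closed when the block ends
        return io.BytesIO(handle.read())


# Assumed usage with the question's own names (not executed here):
#     for filename in filenames:
#         reader = PdfFileReader(load_pdf_bytes(filename))
#         for i in range(reader.getNumPages()):
#             output.addPage(reader.getPage(i))

if __name__ == "__main__":
    # Demonstrate on a throwaway file that the content survives the close.
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(b"%PDF-1.4 dummy")
    buffered = load_pdf_bytes(path)
    os.remove(path)                    # the file on disk is no longer needed
    print(buffered.read(8))
```

The trade-off is memory: every input stays in RAM until the output is written, so for very large inputs merging in smaller batches may be the better fit.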

Source: (StackOverflow)

PDF bleed detection

I'm currently writing a little tool (Python + pyPdf) to test PDFs for printer conformity.

Alas I already get confused at the first task: Detecting if the PDF has at least 3mm 'bleed' (border around the pages where nothing is printed). I already got that I can't detect the bleed for the complete document, since there doesn't seem to be a global one. On the pages however I can detect a total of five different boxes:

mediaBox
bleedBox
trimBox
cropBox
artBox

I read the pyPdf documentation concerning those boxes, but the only one I understood is the mediaBox which seems to represent the overall page size (i.e. the paper).

The bleedBox pretty obviously ought to define the bleed, but that doesn't always seem to be the case.

Another thing I noted was that, for instance, with the PDF, all those boxes have the exact same size (implying no bleed at all) on each page, but when I open it there's a huge amount of bleed. This leads me to think that the individual text elements have their own offset.

So, obviously, just calculating the bleed from mediaBox and bleedBox is not a viable option.

I would be more than delighted if anyone could shed some light on what those boxes actually are and what I can conclude from that (e.g. is one box always smaller than another one).

Bonus question: Can someone tell me what exactly the "default user space unit" mentioned in the documentation is? I'm pretty sure this refers to mm on my machine, but I'd like to enforce mm everywhere.
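
For reference, the PDF specification defines the default user space unit as 1/72 inch (one point), independent of the machine, so box coordinates can be converted to millimetres directly. The sketch below assumes a box is a plain (llx, lly, urx, ury) tuple in points and that the bleed of interest is the gap between the bleed box and the trim box. All function names are hypothetical, and it does not help with files whose boxes are all identical.

```python
MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0  # default user space: 1 unit = 1/72 inch


def points_to_mm(points):
    return points * MM_PER_INCH / POINTS_PER_INCH


def mm_to_points(mm):
    return mm * POINTS_PER_INCH / MM_PER_INCH


def bleed_mm(bleed_box, trim_box):
    """Return the (left, bottom, right, top) gaps in mm between two boxes.

    Each box is (llx, lly, urx, ury) in default user space units (points).
    """
    b_left, b_bottom, b_right, b_top = bleed_box
    t_left, t_bottom, t_right, t_top = trim_box
    return (
        points_to_mm(t_left - b_left),
        points_to_mm(t_bottom - b_bottom),
        points_to_mm(b_right - t_right),
        points_to_mm(b_top - t_top),
    )


def has_min_bleed(bleed_box, trim_box, minimum_mm=3.0):
    """True if every side has at least minimum_mm of bleed."""
    tolerance = 1e-9  # absorb floating-point rounding from the conversions
    return all(gap >= minimum_mm - tolerance for gap in bleed_mm(bleed_box, trim_box))


if __name__ == "__main__":
    # A4 trim size (210 x 297 mm) centred in a bleed box 3 mm larger per side.
    bleed_box = (0.0, 0.0, mm_to_points(216), mm_to_points(303))
    trim_box = (mm_to_points(3), mm_to_points(3), mm_to_points(213), mm_to_points(300))
    print(bleed_mm(bleed_box, trim_box))
    print(has_min_bleed(bleed_box, trim_box))
```

Pages that set their own /UserUnit scale the unit, so a stricter check would read that entry as well.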

Source: (StackOverflow)

Split PDF files in python - ValueError: invalid literal for int() with base 10: ''

I am trying to split a huge PDF file into several small PDFs using pyPdf. I tried this oversimplified code:

from pyPdf import PdfFileWriter, PdfFileReader 
inputpdf = PdfFileReader(file("document.pdf", "rb"))

for i in xrange(inputpdf.numPages):
  output = PdfFileWriter()
  output.addPage(inputpdf.getPage(i))
  outputStream = file("document-page%s.pdf" % i, "wb")
  output.write(outputStream)
  outputStream.close()

but I got the following error message:

Traceback (most recent call last):
File "./hltShortSummary.py", line 24, in <module>
  for i in xrange(inputpdf.numPages):
File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 342, in <lambda>
  numPages = property(lambda self: self.getNumPages(), None, None)
File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 334, in getNumPages
  self._flatten()
File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 500, in _flatten
  pages = catalog["/Pages"].getObject()
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 466, in __getitem__
  return dict.__getitem__(self, key).getObject()
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 165, in getObject
  return self.pdf.getObject(self).getObject()
File "/usr/lib/pymodules/python2.7/pyPdf/pdf.py", line 549, in getObject
  retval = readObject(self.stream, self)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 67, in readObject
  return DictionaryObject.readFromStream(stream, pdf)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 517, in readFromStream
  value = readObject(stream, pdf)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 58, in readObject
  return ArrayObject.readFromStream(stream, pdf)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 153, in readFromStream
  arr.append(readObject(stream, pdf))
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 87, in readObject
  return NumberObject.readFromStream(stream)
File "/usr/lib/pymodules/python2.7/pyPdf/generic.py", line 232, in readFromStream
  return NumberObject(name)
ValueError: invalid literal for int() with base 10: ''

Any ideas?

Source: (StackOverflow)

how to insert a string to pdf using pypdf?

Sorry, I'm a noob in Python.

I need to create a PDF file from scratch, without using any existing PDF file.
I have been googling, and most of the results are about merging two PDFs or creating a new file from pages copied out of another file. What I want to achieve is a report page (with a chart), but as a first, simple step: how do I insert a string (maybe "hello world") into my PDF file?

This is my code to make a new PDF file with a single blank page:

from pyPdf import PdfFileReader, PdfFileWriter
op = PdfFileWriter()  

# here to add blank page
op.addBlankPage(200,200)  

#how to add string here, and insert it to my blank page ?

ops = file("document-output.pdf", "wb")  
op.write(ops)  
ops.close()
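
As far as I know, pyPdf only rearranges existing pages and offers no call for drawing text, so the usual suggestion is a generator library such as reportlab. Purely as a dependency-free illustration of what a "hello world" page consists of, the sketch below writes a one-page PDF by hand. build_minimal_pdf is a hypothetical helper, not part of pyPdf; it assumes Python 3.5+ and handles only Latin-1 text in the built-in Helvetica font.

```python
def build_minimal_pdf(text, width=200, height=200):
    """Return the bytes of a one-page PDF showing `text` in Helvetica."""
    # Backslashes and parentheses must be escaped inside a PDF string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    stream = ("BT /F1 12 Tf 20 100 Td (%s) Tj ET" % escaped).encode("latin-1")

    page = ("<< /Type /Page /Parent 2 0 R /MediaBox [0 0 %d %d] "
            "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>"
            % (width, height)).encode("ascii")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",                      # object 1
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",              # object 2
        page,                                                      # object 3
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",  # object 5
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))            # byte offset for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"         # mandatory free entry for object 0
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)


if __name__ == "__main__":
    with open("hello-world.pdf", "wb") as handle:
        handle.write(build_minimal_pdf("Hello world"))
```

For charts and real reports a generator library is the practical route; this only shows that a text page is a small content stream plus a font reference.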

Source: (StackOverflow)

Adding information to pdf, PyPDF2 merging too slow

I want to add a text to each page of a PDF. The text is HTML such as <p style="color: #ff0000">blabla</p>, so that it appears red in the final document. I convert it to PDF (html2pdf lib), then merge it (PyPDF2 lib) onto each page of my PDF... but the merging is very slow!

My question: is there a faster way to merge PDFs than the page.mergePage method of PyPDF2? (Or is there a faster way to add my text to this PDF?)

Thanks! (Using Python 2.7.5 on Windows 8.)

Source: (StackOverflow)

python encoding for turkish characters

I have to read PDF books that are Turkish stories. I found a library called pyPdf. My test function, shown below, doesn't encode the text correctly. I think I need a Turkish codec package. Am I wrong? If I am, how can I solve this problem; otherwise, where can I find this Turkish codec package?

from StringIO import StringIO
import os

import pyPdf


def getPDFContent(path):
    content = ""
    num_pages = 10  # only the first 10 pages are read
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + "\n"
    p.close()
    # replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content


if __name__ == '__main__':
    pdfContent = StringIO(getPDFContent(os.path.abspath("adiaylin-aysekulin.pdf")).encode("utf-8", "ignore"))
    for line in pdfContent:
        print line.strip()
    # raw_input, because input() would eval() the reply in Python 2
    raw_input("Press Enter to continue...")
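
Python needs no extra Turkish codec package: UTF-8 covers every Turkish letter, and the standard library also ships the legacy Turkish code pages cp1254 and iso8859_9. The short check below illustrates that. It suggests, as an assumption since the PDF itself is not available here, that the garbled output comes from how the text is extracted from the PDF rather than from a missing codec. The round_trips helper is hypothetical.

```python
# -*- coding: utf-8 -*-
import codecs

# The twelve letters specific to the Turkish alphabet, written as escapes
# so the source file itself stays plain ASCII.
TURKISH_LETTERS = (u"\u00e7\u011f\u0131\u00f6\u015f\u00fc"
                   u"\u00c7\u011e\u0130\u00d6\u015e\u00dc")


def round_trips(text, encoding):
    """True if text survives encode/decode unchanged in the given codec."""
    return text.encode(encoding).decode(encoding) == text


if __name__ == "__main__":
    for name in ("utf-8", "cp1254", "iso8859_9"):
        codecs.lookup(name)  # raises LookupError if the codec were missing
        print(name, round_trips(TURKISH_LETTERS, name))
```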

Source: (StackOverflow)

reading/writing xmp metadata on pdf files through pypdf

I can read XMP metadata through pyPdf with this code:

a = pyPdf.PdfFileReader(open(self.fileName, "rb"))
b = a.getXmpMetadata()
c = b.pdf_keywords

But is this the best way?

And what if I don't use the pdf_keywords property?

And is there any way to set this metadata with pyPdf?

Source: (StackOverflow)

python and pyPdf - how to extract text from the pages so that there are spaces between lines

Currently, if I make a page object from a PDF page with pyPdf and call extractText(), the lines are concatenated together. For example, if line 1 of the page says "hello" and line 2 says "world", the text returned from extractText() is "helloworld" instead of "hello world". Does anyone know how to fix this, or have suggestions for a workaround? I really need spaces between the lines because I'm doing text mining on this PDF text, and not having spaces between lines kills it.

Source: (StackOverflow)

Merging two PDFs

import PyPDF2 
import glob
import os
from fpdf import FPDF
import shutil

class MyPDF(FPDF): # adding a footer, containing the page number
    def footer (self):
        self.set_y(-15)
        self.set_font("Arial", Style="I", size=8)
        pageNum = "page %s/{nb}" % self.page_no()
        self.cell(0,10, pageNum, align="C")


if __name__ == "__main__":
    os.chdir("pathtolocation/docs/") # docs location
    os.system("libreoffice --headless --invisible --convert-to pdf *") # this converts everything to pdf
    for file in glob.glob("*"):
        if file not in glob.glob("*.pdf"):
            shutil.move(file,"/newlocation") # moving files we don't need to another folder

    # adding the cover and footer
    path = open(file, 'wb')
    path2 = open ('/pathtocover/cover.pdf')
    merger = PyPDF2.PdfFileMerger()
    pdf = MyPDF()

    for file in glob.glob("*.pdf"):
        pdf.footer()
        merger.merge(position=0, fileobj=path2)
        merger.merge(position=0, fileobj=path)
        merger.write(open(file, 'wb'))

This script converts documents to PDF, then adds a cover and a footer containing the page number to each PDF. I fixed a few things and am now running it one last time to see if it works, but it is taking too long, with no error. Did I do something wrong, or does it really need that long to merge and add footers? I'm working with 3 files, and the conversion itself was fast.

Exception output

convert /home/projects/convert-pdf/docs/sample (1).doc ->
/home/projects/convert-pdf/docs/sample (1).pdf using writer_pdf_Export

So it is converting and moving; I think the problem is somewhere here:

   for file in glob.glob("*.pdf"):
        pdf.footer()
        merger.merge(position=0, fileobj=path2)
        merger.merge(position=0, fileobj=path)
        merger.write(open(file, 'wb'))

since I'm trying to merge position=0 with position=0, though I'm not sure about that.

Source: (StackOverflow)

How to get bookmark's page number

from pyPdf import PdfFileReader
f = open('document.pdf', 'rb')
p = PdfFileReader(f)
o = p.getOutlines()

The list o consists of pyPdf.pdf.Destination dictionary objects (bookmarks), which have many properties, but I can't find any that refers to the bookmark's page number.

How can I get the page number of, let's say, the o[1] bookmark?

For example, o[1].page.idnum returns a number roughly 3 times bigger than the referenced page number in the PDF document. I assume it references some object smaller than a page, because running .page.idnum over the whole outline returns numbers that are not even linearly correlated with the "real" destination page numbers; they are only roughly 3 times larger.

Update: This question is the same as "split a pdf based on outline", although I don't understand what the author did in his self-answer there. It seems too complicated to be usable.
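
The idnum is the number of the PDF indirect object that holds the page, not a page index, which would explain why it does not track the visible page numbers. One way to translate it, sketched here with plain integers instead of real pyPdf objects, is to record the object number of every page in document order and look the bookmark's idnum up in that table. The function name and the sample numbers are made up for illustration; how the per-page object numbers are collected from pyPdf is left open.

```python
def build_page_lookup(page_object_ids):
    """Map a page's indirect-object number to its 1-based page number.

    page_object_ids: object numbers of the pages, in document order.
    """
    return {obj_id: number
            for number, obj_id in enumerate(page_object_ids, start=1)}


if __name__ == "__main__":
    # Made-up object numbers: pages are interleaved with fonts, streams, etc.,
    # so consecutive pages rarely have consecutive object numbers.
    lookup = build_page_lookup([4, 7, 10, 15])
    bookmark_idnum = 10          # stands in for o[1].page.idnum
    print(lookup[bookmark_idnum])
```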

Source: (StackOverflow)

Fast PDF splitter library

pyPdf is a great library for splitting and merging PDF files. I'm using it to split PDF documents into 1-page documents. pyPdf is pure Python and spends quite a lot of time in the _sweepIndirectReferences() method of the PdfFileWriter object when saving the extracted page. I need something with better performance. I've tried multi-threading, but since most of the time is spent in Python code there was no speed gain because of the GIL (it actually ran slower).

Is there any library written in C that provides the same functionality? Or does anyone have a good idea for improving performance (other than spawning a new process for each PDF file that I want to split)?

Thank you in advance.

Follow-up: links to a couple of command-line solutions that can sometimes prove faster than pyPdf:

I modified pyPdf's PdfFileWriter class to keep track of how much time has been spent in the _sweepIndirectReferences() method. If it has taken too long (right now I use the magic value of 3 seconds), I fall back to Ghostscript by calling it from Python.

Thanks for all your answers. (codelogic's xpdf reference is the one that made me look for a different approach)
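
A Ghostscript fallback of the kind described above can be driven from Python with subprocess. The sketch below only builds the command line for extracting one page. The switches shown (-sDEVICE=pdfwrite, -dFirstPage, -dLastPage, -sOutputFile) are standard Ghostscript options, but the function names, the gs executable name and the file names are assumptions, and this is not the asker's actual code.

```python
import subprocess


def ghostscript_extract_command(source, page_number, destination, executable="gs"):
    """Build the argument list asking Ghostscript to copy one page
    (1-based page_number) of source into destination."""
    return [
        executable,
        "-q", "-dBATCH", "-dNOPAUSE",        # quiet, non-interactive run
        "-sDEVICE=pdfwrite",                 # write PDF output
        "-dFirstPage=%d" % page_number,
        "-dLastPage=%d" % page_number,
        "-sOutputFile=%s" % destination,
        source,
    ]


def extract_page(source, page_number, destination):
    """Run Ghostscript; raises CalledProcessError if it exits non-zero."""
    subprocess.check_call(
        ghostscript_extract_command(source, page_number, destination))


if __name__ == "__main__":
    # Only print the command; running it requires Ghostscript to be installed.
    print(" ".join(ghostscript_extract_command("input.pdf", 3, "page-3.pdf")))
```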

Source: (StackOverflow)

Parsing a PDF with no /Root object using PDFMiner

I'm trying to extract text from a large number of PDFs using PDFMiner python bindings. The module I wrote works for many PDFs, but I get this somewhat cryptic error for a subset of PDFs:

ipython stack trace:

/usr/lib/python2.7/dist-packages/pdfminer/pdfparser.pyc in set_parser(self, parser)
    331                 break
    332         else:
--> 333             raise PDFSyntaxError('No /Root object! - Is this really a PDF?')
    334         if self.catalog.get('Type') is not LITERAL_CATALOG:
    335             if STRICT:

PDFSyntaxError: No /Root object! - Is this really a PDF?

Of course, I immediately checked to see whether or not these PDFs were corrupted, but they can be read just fine.

Is there any way to read these PDFs despite the absence of a root object? I'm not too sure where to go from here.

Many thanks!

Edit:

I tried using PyPDF in an attempt to get some differential diagnostics. The stack trace is below:

In [50]: pdf = pyPdf.PdfFileReader(file(fail, "rb"))
---------------------------------------------------------------------------
PdfReadError                              Traceback (most recent call last)
/home/louist/Desktop/pdfs/indir/<ipython-input-50-b7171105c81f> in <module>()
----> 1 pdf = pyPdf.PdfFileReader(file(fail, "rb"))

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in __init__(self, stream)
    372         self.flattenedPages = None
    373         self.resolvedObjects = {}
--> 374         self.read(stream)
    375         self.stream = stream
    376         self._override_encryption = False

/usr/lib/pymodules/python2.7/pyPdf/pdf.pyc in read(self, stream)
    708             line = self.readNextEndLine(stream)
    709         if line[:5] != "%%EOF":
--> 710             raise utils.PdfReadError, "EOF marker not found"
    711 
    712         # find startxref entry - the location of the xref table


PdfReadError: EOF marker not found

Quonux suggested that perhaps PDFMiner stopped parsing after reaching the first EOF character. This would seem to suggest otherwise, but I'm very much clueless. Any thoughts?
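
Both errors appear to point at the file's framing rather than its page content: PDFMiner finds no /Root, and pyPdf finds no %%EOF marker where it expects one. A quick, library-free look at the raw bytes can show whether the files are truncated or carry data after the marker. The sketch assumes the usual convention that %PDF- sits near the start and %%EOF near the end; the 1024-byte window and the function name are arbitrary choices.

```python
def inspect_pdf_framing(data, window=1024):
    """Report where the header and the last %%EOF marker sit in raw bytes."""
    marker = b"%%EOF"
    header_at = data[:window].find(b"%PDF-")
    eof_at = data.rfind(marker)
    if eof_at == -1:
        trailing = None
    else:
        trailing = len(data) - (eof_at + len(marker))
    return {
        "header_offset": header_at,     # -1 if no header in the first window
        "eof_offset": eof_at,           # -1 if the marker is missing entirely
        "bytes_after_eof": trailing,    # large values hint at appended data
    }


if __name__ == "__main__":
    sample = b"%PDF-1.4\n...objects...\n%%EOF\n"
    print(inspect_pdf_framing(sample))
```

To apply it to one of the failing files, read the file in binary mode and pass the bytes, for example inspect_pdf_framing(open(path, "rb").read()).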

Source: (StackOverflow)

How to open a generated PDF file in browser?

I have written a Pdf merger which merges an original file with a watermark.

What I want to do now is open the 'document-output.pdf' file in the browser from a Django view. I already checked Django's related articles, but my approach is somewhat different: I don't create the PDF object directly using the response object as its "file", so I am kind of lost.

So, how can I do this in a Django view?

from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.pdfgen.canvas import Canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont

output = PdfFileWriter()
input = PdfFileReader(file('file.pdf', 'rb'))

# get number of pages
num_pages = input.getNumPages()

# register new chinese font
pdfmetrics.registerFont(TTFont('chinese_font','/usr/share/fonts/truetype/mac/LiHeiPro.ttf'))

# generate watermark on the fly
pdf = Canvas("watermark.pdf")
pdf.setFont("chinese_font", 12)
pdf.setStrokeColorRGB(0.5, 1, 0)
pdf.drawString(10, 830, "你好")
pdf.save()

# put on watermark
watermark = PdfFileReader(file('watermark.pdf', 'rb'))
page1 = input.getPage(0)

page1.mergePage(watermark.getPage(0))

# add processed pdf page
output.addPage(page1)

# then, add rest of pages
for num in range(1, num_pages):
    output.addPage(input.getPage(num))

outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()
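
Since the merged document ends up as ordinary bytes on disk, one option is to return those bytes with a PDF content type and an inline disposition, which asks the browser to display the file rather than download it. The sketch keeps Django out of the runnable part: pdf_headers is a hypothetical helper, and the view in the comment assumes Django's HttpResponse with its content_type argument.

```python
def pdf_headers(filename, inline=True):
    """Headers that tell a browser the response body is a PDF.

    inline=True asks for display in the browser; False asks for a download.
    """
    disposition = "inline" if inline else "attachment"
    return {
        "Content-Type": "application/pdf",
        "Content-Disposition": '%s; filename="%s"' % (disposition, filename),
    }


# Assumed use inside a Django view (not executed here):
#     def show_pdf(request):
#         headers = pdf_headers("document-output.pdf")
#         with open("document-output.pdf", "rb") as f:
#             response = HttpResponse(f.read(), content_type=headers["Content-Type"])
#         response["Content-Disposition"] = headers["Content-Disposition"]
#         return response

if __name__ == "__main__":
    print(pdf_headers("document-output.pdf"))
```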

Source: (StackOverflow)

How do you shift all pages of a PDF document right by one inch?

I want to shift all the pages of an existing pdf document right one inch so they can be three hole punched without hitting the content. The pdf documents will be already generated so changing the way they are generated is not possible.

According to a previous question, iText can do this.

What is an equivalent library (or way to do this) for C++ or Python?

If it is platform dependent I need one that would work on Linux.

Update: Figured I would post a little script I wrote to do this in case anyone else finds this page and needs it.

Working code thanks to Scott Anderson's suggestion:

rightshift.py

#!/usr/bin/python2
import sys
import os
from  pyPdf import PdfFileReader, PdfFileWriter

#not sure what default user space units are. 
# just guessed until current document i was looking at worked
uToShift = 50;

if (len(sys.argv) < 3):
  print "Usage rightshift [in_file] [out_file]"
  sys.exit()

if not os.path.exists(sys.argv[1]):
  print "%s does not exist." % sys.argv[1]
  sys.exit()

pdfInput = PdfFileReader(file( sys.argv[1], "rb"))
pdfOutput = PdfFileWriter()

pages=pdfInput.getNumPages()

for i in range(0,pages):
  p = pdfInput.getPage(i)
  for box in (p.mediaBox, p.cropBox, p.bleedBox, p.trimBox, p.artBox):
    box.lowerLeft = (box.getLowerLeft_x() - uToShift, box.getLowerLeft_y())
    box.upperRight = (box.getUpperRight_x() - uToShift, box.getUpperRight_y())
  pdfOutput.addPage( p )

outputStream = file(sys.argv[2], "wb")
pdfOutput.write(outputStream)
outputStream.close()
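
On the unit question in the script's comment: default user space is 1/72 inch, so a one-inch shift corresponds to 72 units rather than the guessed 50, unless a page sets its own scaling. The sketch below shows the arithmetic on a plain (llx, lly, urx, ury) tuple; shift_box_left is a hypothetical helper mirroring what the loop does to each box.

```python
UNITS_PER_INCH = 72  # default user space unit is 1/72 inch


def inches_to_units(inches):
    return inches * UNITS_PER_INCH


def shift_box_left(box, units):
    """Move a (llx, lly, urx, ury) box left by `units`.

    Sliding the page box left leaves the content further from the left
    edge, so the content appears shifted right.
    """
    llx, lly, urx, ury = box
    return (llx - units, lly, urx - units, ury)


if __name__ == "__main__":
    letter = (0, 0, 612, 792)             # US Letter in points
    print(shift_box_left(letter, inches_to_units(1)))
```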

Source: (StackOverflow)