Reading the "clean" text from PDF with PHP
Portable Document Format (PDF) is a file format created for document exchange. Each PDF file encapsulates a complete description of a fixed-layout 2D document (and, with Acrobat 3D, embedded 3D documents), including the text, fonts, images, and 2D vector graphics that make up the document.
PDF file structure
First, let's look inside a PDF file.
PDF implements documents as a hierarchy of tagged objects organized into trees and/or linked lists. The objects can encapsulate various types of content, or attributes, or pointers to external resources.
There are eight basic kinds of objects in PDF: Booleans, numbers, names, strings, arrays, dictionaries, streams and the null object. Let's take a look at the ones we need to work with.
In PDF, a string consists of a series of 8-bit bytes surrounded by parentheses. A string can be split across several lines by placing a backslash (\) at the end of each line. The backslash itself is not considered part of the string. For example:
( This is a string. )
( This is a longer \
Within a string, any 8-bit value can be written as a backslash followed by its octal code (\ddd, where ddd is up to three octal digits). Alternatively, a whole string can be written in hexadecimal, two digits per byte, enclosed in angle brackets (for example, <48656C6C6F>). Later we will search these strings for text data.
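The escaping rules above can be sketched in PHP as follows. This is only an illustration, not the code the article links to: the function names decodePdfLiteralString and decodePdfHexString are made up for this example, and the input is assumed to be the text between the delimiters, without the surrounding parentheses or angle brackets.

```php
<?php
// Hypothetical helper: decode the body of a literal string (the text between the parentheses).
function decodePdfLiteralString(string $body): string
{
    // Single-character escapes and the bytes they stand for.
    $map = ['n' => "\n", 'r' => "\r", 't' => "\t", 'b' => "\x08", 'f' => "\f",
            '(' => '(', ')' => ')', '\\' => '\\'];

    $decoded = preg_replace_callback(
        '/\\\\(\r\n|\r|\n|[0-7]{1,3}|.)/s',
        function (array $m) use ($map): string {
            $esc = $m[1];
            if ($esc === "\n" || $esc === "\r" || $esc === "\r\n") {
                return '';                        // backslash + line break: line continuation
            }
            if (preg_match('/^[0-7]{1,3}$/', $esc)) {
                return chr(octdec($esc) & 0xFF);  // \ddd: one byte given in octal
            }
            return $map[$esc] ?? $esc;            // unknown escape: keep the character only
        },
        $body
    );

    return $decoded ?? $body;                     // fall back to the input on a regex error
}

// Hypothetical helper: decode the body of a hex string (the text between < and >).
function decodePdfHexString(string $hex): string
{
    $hex = preg_replace('/\s+/', '', $hex);       // whitespace between digits is ignored
    if (strlen($hex) % 2 === 1) {
        $hex .= '0';                              // an odd final digit is padded with 0
    }
    return pack('H*', $hex);
}
```

For instance, decodePdfLiteralString('A\\053B') yields "A+B", because octal 053 is the plus sign.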
An array is a sequence of PDF objects, enclosed in square brackets. For example:
A dictionary is a set of key/value pairs, enclosed between two left angle brackets (<<) at the beginning and two right angle brackets (>>) at the end:
<< /Length 4 0 R
A dictionary is used to assign properties to an object. We will use this data to determine how to decode the stream, find its length, or, for example, skip the current object (if it is an image).
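As a rough sketch of how such properties could be read in PHP, the function below looks up a single key in the text of a dictionary. The name getDictionaryValue is hypothetical, and the sketch only handles values that are names, numbers, or indirect references such as 4 0 R; nested dictionaries and arrays are not covered.

```php
<?php
// Hypothetical helper: return the value stored under $key in a dictionary, or null.
// Supported value kinds: a name (/FlateDecode), an indirect reference (4 0 R), a number (42).
function getDictionaryValue(string $dict, string $key): ?string
{
    $pattern = '~/' . preg_quote($key, '~')
             . '(?=[\s/])\s*'                  // the key must end here (so /Length1 is not /Length)
             . '(/[^\s/<>\[\]()]+'             // a name
             . '|\d+\s+\d+\s+R'                // an indirect reference
             . '|-?\d+(?:\.\d+)?)~';           // a number

    return preg_match($pattern, $dict, $m) ? $m[1] : null;
}

$dict = '<< /Length 4 0 R /Filter /FlateDecode >>';
$length  = getDictionaryValue($dict, 'Length');   // "4 0 R"
$filter  = getDictionaryValue($dict, 'Filter');   // "/FlateDecode"
$subtype = getDictionaryValue($dict, 'Subtype');  // null: key is absent

// An object whose /Subtype is /Image could be skipped at this point.
$isImage = ($subtype === '/Image');
```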
A stream is a sequence of 8-bit bytes between the keywords stream and endstream. Any type of content made up of raw binary data is represented by a stream.
Streams are represented as objects (see below), which also means the stream will be bracketed by obj and endobj keywords. Before the stream keyword there must be a stream attribute dictionary, giving information about stream length (/Length key) and, often, the kind of compression employed (/Filter key).
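A minimal PHP sketch of pulling the stream out of an object body might look like this. The function name extractStream is hypothetical. The sketch assumes the only filter in use is /FlateDecode (zlib compression, handled here with gzuncompress from PHP's zlib extension) and locates the data by the stream and endstream keywords rather than by /Length, which is less robust than a real parser.

```php
<?php
// Hypothetical helper: return the decoded stream of one object body,
// or null if there is no stream or it cannot be decompressed.
function extractStream(string $objectBody): ?string
{
    // Raw bytes sit between "stream" + end-of-line and "endstream".
    if (!preg_match('/\bstream\r?\n(.*?)\r?\n?endstream/s', $objectBody, $m)) {
        return null;
    }
    $data = $m[1];

    // Assumption: a /FlateDecode entry anywhere in the body means zlib-compressed data.
    if (strpos($objectBody, '/FlateDecode') !== false) {
        $inflated = @gzuncompress($data);
        return $inflated === false ? null : $inflated;
    }

    return $data; // no filter: the bytes are already readable
}
```

Other filters (for example /ASCIIHexDecode or /ASCII85Decode) would need their own decoding step before the text can be read.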
As an example, a small text stream might look like:
2 0 obj
/F1 12 Tf
72 712 Td (A short text stream.) Tj
In this example, the text itself is given as a string followed by the display text operator Tj.
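To illustrate, here is a small PHP sketch that collects the string operands of Tj from an already decoded content stream. The function name extractTjText is hypothetical. The sketch handles escaped parentheses inside the string but not unescaped nested ones, ignores other text-showing operators, and returns the strings with their escape sequences still in place.

```php
<?php
// Hypothetical helper: gather every "(...) Tj" operand in a decoded content stream.
function extractTjText(string $content): string
{
    // \( ... \) followed by Tj; inside, allow "\x" escapes and any byte
    // except a bare backslash or parenthesis.
    preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)\s*Tj/s', $content, $matches);

    return implode("\n", $matches[1]);
}

$content = "/F1 12 Tf\n72 712 Td (A short text stream.) Tj\n";
$text = extractTjText($content);
```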
An object can enclose content of any PDF data type (Boolean, number, name, string, etc.), bracketed between the obj and endobj keywords. We are primarily interested in objects that contain streams.
How to get "clean" text?
So, where should we look for text objects in a PDF-document? The answer is simple: we look for objects that contain streams.
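One way to sketch this search in PHP is shown below. The function name findStreamObjects is hypothetical, and the regular expression is a simplification: it assumes that the keywords obj and endobj never appear inside binary stream data.

```php
<?php
// Hypothetical helper: return the bodies of all objects that contain a stream,
// keyed by "objectNumber generationNumber" (for example "2 0").
function findStreamObjects(string $pdf): array
{
    $found = [];

    // Every object looks like: <number> <generation> obj ... endobj
    preg_match_all('/(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj\b/s', $pdf, $objects, PREG_SET_ORDER);

    foreach ($objects as $object) {
        // Keep the object only if its body contains the stream keyword.
        if (preg_match('/\bstream\r?\n/', $object[3])) {
            $found[$object[1] . ' ' . $object[2]] = $object[3];
        }
    }

    return $found;
}
```

In a real script the input would typically come from file_get_contents() on the PDF file, and each returned body would then be decoded and scanned for text.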
Another few things we need to consider:
Now we have obtained enough theoretical knowledge to read our first PDF file. Below you can find the most interesting code parts with comments and the link to the source code.
You can find the source code HERE.
Note that this code correctly parses only simple PDF files. You can use it as a starting point and improve it according to your needs.
To read a PDF file (e.g. sample.pdf) and display the extracted plain text in the browser window, add the following code to the source code before the first function.
Do not forget to replace sample.pdf with your PDF file name.
Copyright © 2005-2007 www.WebCheatSheet.com All Rights Reserved.