Table of Contents
Portable Document Format (PDF) is a file format created by Adobe Systems for document exchange. Each Adobe PDF file encapsulates a complete description of a fixed-layout 2D document that includes the text, fonts, images, and 2D vector graphics which compose the documents. PDF is an ISO standard, published under title Document management—Portable document format—Part 1: PDF 1.7.
PDFLeo allows users to post-process PDF files generated by other programs, such as PDF printers, word processors (such as Word 2007 and OpenOffice Writer), and other programs such as FOP and iText. Many of these programs lack some features. For example, Some does not provide encryption. Some produce files with quite large sizes. And some leave metadata fields empty. Most of them do not produce web optimized files. Those shortcomings can be fixed with pdfleo.
PDFLeo supports the following kinds of PDF processing:
Encryption. Encrypt a PDF document with either password security or public key security. Remove PDF encryption if you are able to open the file. Retain PDF encryption and permission settings in the new PDF with other part of document modified.
Linearization. Optimize PDF documents to be viewed over slow connection from a capable web server.
Size Optimization. Reduce the file size by removing redundant contents, compressing streams and moving objects to streams.
Query document information, such as meta data, security, document permission and font information.
Insert and modify predefined or custom document information entries.
Insert, view and modify XMP metadata.
A PDF file consists primarily of objects, of which there are eight types:
Boolean values, representing true or false
Numbers
Strings
Names
Arrays, ordered collections of objects
Dictionaries, collections of objects indexed by Names
Streams, usually containing large amounts of data
The null object
Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a generation number. An index table called the xref table gives the byte offset of each indirect object from the start of the file. This design allows for efficient random access to the objects in the file, and also allows for small changes to be made without rewriting the entire file (incremental update).
Beginning with PDF version 1.5, indirect objects may also be located in special streams known as object streams. This technique reduces the size of files that have large numbers of small indirect objects.
There are two layouts to the PDF files: normal and linearized. Non-linearized PDF files consume less disk space than their linear counterparts, though they are slower to access because portions of the data required to assemble pages of the document are scattered throughout the PDF file. Linear PDF files (also called "optimized" or "web optimized" PDF files) are constructed in a manner that enables them to be read in a Web browser plugin without waiting for the entire file to download, since they are written to disk in a linear (as in page order) fashion.
Metadata is specified in two ways: the "native" info dictionary, and the metadata stream. The info dictionary method suffers several shortcomings: Unless the indexing software understands the format, it is not possible to locate the metadata. Adobe only published a few metadata entries, and there is no standard method on the names and contents. Furthermore, only a string per entry is supported in the info dictionary. If the file is encrypted, the index software requires the key to access the entries.
PDF metadata stream stores metadata in a standard format called XMP. XMP is serialized and stored using W3C RDP format (which is based on XML), and can be placed into a variety of multimedia file such JPEG. When stored as plain text in the file, XMP contains a special header indicating that it is XMP as well as the encoding format. Therefore, an indexing software can search the file for this special mark, and loads the content without understanding the PDF file format.
PDFLeo allows user to insert XMP metadata into existing PDF files, as well as modifying existing info dictionary entries. When info dictionary is modified, the change is synced back to the metada.
PDF allows all string and stream objects (except metadata stream) to be encrypted to prevent unauthorized people to access the content. The encryption method employed can be either RC4 (with key length from 40 bits to 128 bits) and AES (128 bits). Before decrypting the content, a master key must be obtained. Security handler is to calculate the master key. PDF standard defines two kinds of security handlers, and pdfleo supports both.
Password security handler. Called standard handler in PDF specification, this security handler allows user to specify two passwords: the user password, and the owner password. Through some data stored in the PDF file, the user password can be calculated once the owner password is known. The user password is used to decrypt the master key.
Public key security handler. This security handler uses X509 public certificate to encrypt the master key. Multiple receipts can be specified in this manner. The user supplies his private key when he views the file. In this manner no password exchange is required. The creator only needs the public certificate of the recipient.
It is possible that a PDF uses a non-documented security handler, or uses a live server for authentication. pdfleo can't decrypt these files, as most other PDF programs.
If the source document is encrypted, credentials are required to gain access to the document. PDF standard defines two types of securities - password and public key. There are other types of possible securities, but they are not standard therefore can not be read by pdfleo.
Password security allows two passwords - the owner password and the user password. Users with owner passwords have unrestricted privileges to the document, while those with user passwords have limited permissions defined by the author. It should be noted that the permission is enforced by the application. Users with user password effectively have unrestricted access to the document with the help of the application. For example, a PDF document with empty user password can have the encryption settings removed with pdfleo.
Passwords are specified through --password
switch. If password is empty, there is no need to specify as pdfleo will try using empty password if other
attempts failed.
Multiple passwords, up to 10 can be specified at the command line. pdfleo will try them in the order specified. pdfleo does not honor the permission flag, specifying either password will work in most cases.
C:\>pdfleo --password=oooo134 --password=sjusl \ source.pdf target.pdf
PDFLeo will try passwords in the order that they are specified. Some action, such as copying owner password to the new file, requires the file opened by an owner password. Therefore, if both passwords are known, owner password should be place before user password.
Public key encrypted PDF files require private key file and the
password to access the private key file. A colon (:) separates the two
parts. The private key file must be in standard pkcs#12 format and
usually has extension of pvk
or p12
. If password is required to access the private
key, it must be specified. For example, the following parameter identify
private key file c:\company.pvk
and the
password is password
.
C:\>pdfleo --digital-id=c:\company.pvk:mypass source.pdf target.pdf
Multiple digital ID switches, up to 10 can exist at the command line. PDFLeo will try them in the order that they are specified.
The detailed information on how to encrypt, decrypt or change permissions is covered in Chapter 3, PDF Security. This section gives a quick summary and some examples.
Security settings are displayed using --info switch.
C:\>pdfleo -i test2.pdf ... =============== Document Security ============================== Security Method: Password SecurityAuthorized by: User Password
Print: None
Accessibility: No Modify: Assembly Encryption Level: AES (128-bit)
...
Document is encrypted with passwords | |
Document is opened using user password (which is empty) | |
Print permission: no print allowed | |
Encryption method is 128-bit AES. |
For public key encrypted documents, the program prints list of recipients:
C:\>pdfleo --digital-id=tester.pfx:1234demo pubkey.pdf ... ================ Document Security ============================== Security Method: Certificate Security Print: High Resolution Accessibility: No Modify: Assembly Clear Metadata: No Encryption Level: AES (128-bit) List of Recipients: C=US, O=Equifax, OU=Equifax Secure Certificate AuthorityCN=Test Group, O=Morovia, OU=Morovia/emailAddress=tester@morovia.com, C=CA
...
PDF documents can be encrypted with password security or public key security. A number of parameters can be specified such as encryption level, bit length and so on.
C:\>pdfleo --encrypt=new;AES-128;;yes \--password-security=demo;;print=high;modify=none
Encryption is removed by specifying --encrypt=discard
without additional parameters.
C:\>pdfleo --password=Monster --encrypt=discard \ source.pdf target.pdf
By default, the encryption is preserved in the output PDF. All security setting remain intact. For example, the below command creates a linearized PDF file with encryption copied from the source PDF:
C:\>pdfleo --password=Monster \ --linearize source.pdf target.pdf
The preserve mode can be explicitly specified though --encrypt=preserve
parameter. The command below
achieves the same results:
C:\>pdfleo --password=Monster --encrypt=preserve\ --linearize source.pdf target.pdf
A non-linearized PDF file must be fully downloaded to the client computer before a viewer can display the pages. Linearization ( called Fast Web View in Acrobat and Adobe Reader) transforms PDF into such a format that a capable viewer can find out which byte range to ask for when display the page requested with only a few K bytes downloaded. It then asks web server for bytes within that rage. The capability of “byte-serving” is required by HTTP 1.1 protocol and is supported by most web servers.
In order to take advantage this feature, you need:
A web server that supports byteserving. As the protocol is part of http 1.1, most current web servers already are capable.
The viewer must support this feature. In Acrobat or Adobe Viewer, make sure that the option Allow fast web view is checked. This option is enabled by default.
The PDF file is linearized, which can be achieved by pdfleo.
Linearization provides better user experience when serving large PDF files (measured in number of pages or file size in MB) over web or other slow connections. Linearization will generally increase the file size. In some cases it increases the file size significantly if you have a large PDF file and many objects are compressed into an object stream.
Linearization feature is applied through -linearize
switch. It can be used in conjunction with other transforms, such as
encryption and compression.
C:\>pdfleo --linearize source.pdf target.pdf
You can verify the result using pdfleo --info. As illustrated below:
C:\>pdfleo --info target.pdf ... Number of Pages: 30 Tagged PDF: No Linearized: YesPage Size: 8.50x11.00 in ...
You can also verify it through Adobe
Reader or Acrobat. Open the
document and select → (Ctrl+D). The Fast Web View entry will show Yes
, which indicates that the document is linearized.
Note
Linearization is not preserved during
transformation, unless --linearize
option is specified.
If the source PDF is linearized but no --linearize
is
specified, the resulted PDF will not be linearized.
Due to the method that pdfleo utilizes to read the source document, some optimization techniques are always applied. Additional steps can be taken to further reduce file size, such as compressing stream objects and placing objects into streams. The techniques involved include:
Removing unused objects. Unused objects will be discarded. If a PDF is produced through incremental update, many objects are not needed. Incremental update is a feature to allow a processing application to append changes at the file end without removing prior object definitions. This technique reduces the memory usage at the cost of bigger file size.
Writing objects in a compact syntax. PDFLeo writes output using compact syntax Extra white spaces are removed. Hexadecimal strings are written with more compact binary representations.
Compressed streams. When specified, pdfleo compresses all streams except those who must be kept intact.
Object streams. Non stream objects can be placed into a special object stream and compressed.
Optimization is controlled through two switches: --stream-data
and --object-stream
. The --stream-data
option controls how individual stream is
compressed. --object-stream
controls whether or not
object streams should be generated.
PDFLeo does not apply any optimization steps which could result loss of information such as image quality degradation, loss of fonts etc.
Contents of stream objects can be encoded with various techniques, such as LZW compression or Flate compression. The techniques are referred as filters. An application program that produces a PDF file can encode certain information (for example, data for sampled images) to compress it or to convert it to a portable ASCII representation. Then an application that reads (consumes) the PDF file can invoke the corresponding decoding filter to convert the information back to its original form.
stream compression options is specified through --stream-data
switches and can be one of three values: none
, preserve
or compress
.
Table 2.1. Stream Compression Options
Value | Description |
---|---|
none | With this option all streams are uncompressed. This is useful if you want to look at the plain text content. |
preserve | Preserve the state of current streams. i.e., if the original stream is compressed, the output PDF is also compressed. |
compress | Apply deflate compression on all data streams, unless application could cause adverse effects. |
PDFLeo understands many filters. However, in terms of compression, Flate filter is utilized exclusively (another compression filter is LZW, which does not offer better performance). However, pdfleo will preserve stream data when it encounters an unknown filter or a loss filter such as DCTDecode. For streams with predictor, it will decompress as required, but will preserve them at compression.
Metadata streams are handled differently in pdfleo. In order for
metadata is search able, the stream should be in clear text
(uncompressed and unencrypted). Therefore, PDFLeo does not compress
metadata streams even --stream-data=compress
is
specified. This behavior happens when the PDF document is either
unencrypted, or encrypted with clear text metadata. If metadata is
encrypted, metadata streams will also be compressed.
The command below compresses data streams to reduce file size:
C:\>pdfleo --data-streams=compress source.pdf target.pdf
Occasionally some users may want the data streams in clear text, so that they can see the drawing commands in each page. The following command line produces such output.
C:\>pdfleo --data-streams=uncompress source.pdf target.pdf
PDF file size can be substantially reduced by placing uncompressed
PDF objects into a stream and compressing the stream. This type of
stream is called Object Stream. Object streams
option is specified through --object-streams
, and can
be one of the three values below:
Table 2.2. Object Stream Options
Value | Description |
---|---|
preserve | Original object streams are preserved. If the source document does not contain object streams they are not generated. |
disable | Removing object streams. This often results a PDF file with bigger size. |
generate | Generating object streams whenever possible. |
The following command compresses all data streams, and placing object into object streams whenever possible, with the goal to minimize the file size.
C:\>pdfleo --data-streams=compress \ --object-streams=generate source.pdf target.pdf
The default option is preserve
.
You can obtain various of metadata of a PDF document by using option --info
, or abbreviated option -i
. The
output first lists all information dictionary entries, followed by
document security attributes such as security method and permission. The
last section lists all the fonts required by the document (either embedded
or required to be present in the system).
Below is the sample output:
C:\pdfleo --info Brother_HL_4050_CDN_Manual.pdf Morovia (R) pdfleo 32-bit Professional Version 1.0 File: Brother_HL_4050_CDN_Manual.pdf Title: HL4040CN_HL4050CDN_HL4070CDW.bookAuthor: ZZPZ3635 Subject: N/A Keywords: N/A Created: 06/29/2007 10:38:30 AM Modified: 06/29/2007 04:05:36 PM Application: FrameMaker 7.0 PDF Producer: Acrobat Distiller 6.0 (Windows) PDF Version: 1.5 (Acrobat 6.x) Number of Pages: 211 Tagged PDF: No Linearized: Yes Page Size: 8.50x11.00 in ================ Document Security ============================== Security Method: Password Security
Authorized by: User Password Print: Allowed Modify: Not Allowed Extract: Allowed Annotate: Not Allowed Encryption Level: RC4 (40-bit) ================ Fonts Info ===================================== Font Name Encoding Type ---------------------------------------- ------------ ------------ TT9A1o00(embedded subset) WinAnsi Type 1C
TT9A2o00(embedded subset) WinAnsi Type 1C TT9A3o00(embedded subset) WinAnsi Type 1C TT9A4o00(embedded subset) WinAnsi Type 1C TT9A5o00(embedded subset) WinAnsi Type 1C ... (remaining skipped)
PDF stores metadata entries in a special dictionary, called information dictionary. PDF defines several standard keys, and authors can define custom entries. Often info dictionary is called native. Because the limitations of the dictionary, multiple language values are not supported.
Despite the efforts to push XMP adoption, many PDF software read the info dictionary for metadata. Therefore, it is necessary to keep the info dictionary and XMP metadata in synchronization.
PDFLeo allows users to insert, modify or remove entries in the information dictionary. Changes will synchronize to XMP metadata.
The following command snippet changes the document title to PDFLeo User Manual
, document producer to Morovia PDF Writer
. If the specified key does not
exist, it will be added.
--info-dict="Title=PDFLeo User Manual;Producer=Morovia PDF Writer"
By specifying empty value you can remove an entry. For example,
--info-dict="Title=;Producer:Morovia PDF Writer"
removes the Title
entry.
Standard entries in the dictionary are listed as below:
Table 2.3. Entries in the document information dictionary
Key | Value |
---|---|
Title | The document's title |
Author | The name of the person who created the document. |
Subject | The subject of the document |
Keywords | Keywords associated with the document. |
Creator | The name of the application that created the original text (such as Microsoft Word). |
Producer | The name of the application that converted the original document into the PDF format (such as PDF printer) |
CreationDate | The date and time that the document is created. |
ModDate | The date and time that the document is modified. |
Trapped | A value indicating the document has been modified to include trapping information. Valid values include: True, False and Unknown. |
Note that the keys are case sensitive. For entries requiring a
Date (such as ModDate
and CreationDate
), the
value must conform to the date format defined in the PDF standard, which
you can find in Section A.1, “PDF Date Format”.
XMP (Extensible Metadata Platform1) is an XML framework with many predefined properties. However, as the name implies, XMP can be extended to satisfy specific requirements using custom extension schema. XMP is much more powerful than document information dictionary, and is for example required in the PDF/A standard. Many industry groups have published standards based on XMP for various vertical applications, e.g. digital imaging or prepress data exchange.
With pdfleo you can extract duocument level XMP metadata, replace or merge existing metadata with contents from an external file.
XMP can be extracted through --dump-xmp
switch.
The dump content is expressed in UTF-8 format.
Figure below shows the results of running pdfleo with --dump-xmp
option on the xmpguide.pdf
. Note that the metadata contains five
schema: PDF schema (http://ns.adobe.com/pdf/1.3/", XMP Basic Schema
(http://ns.adobe.com/xap/1.0/"), Dublic Core Schema
(http://purl.org/dc/elements/1.1/"), XMP PDFX Schema
(http://ns.adobe.com/pdfx/1.3/") and XMP Media Management Schema
(http://ns.adobe.com/xap/1.0/mm/").

PDFLeo provides another switch, --dump-xmp-text
to display metadata in a more readable format:
c:\>pdfleo --dump-xmp-text xmpspec3.pdf dumping metadata:Dumping XMPMeta object "" (0x0) pdf: http://ns.adobe.com/pdf/1.3/ (0x80000000 : schema) pdf:Copyright = "Copyright 2010, Adobe Systems ... pdf:Marked = "True" pdf:Producer = "Acrobat Distiller 8.1.0 (Windows)" pdf:Keywords = "XMP metadata EXIF TIFF IPTC PSIR" ...
You can replace existing XMP metadata with contents of an external file
by using --replace-xmp
switch. The file must be a valid
XML file with UTF-8 encoding.
The following example replaces metadata with contents from a file named ticks.xmp
:
C:\pdfleo --replace-xmp=ticks.xmp source.pdf output.pdf
Note that you should make sure that the content is correct. The file should be coded with UTF-8. And the content of metadata is not synchronized to the info dictionaries.
PDFLeo provides another switch, --merge-xmp
to allow
users to merge contents of an external file with the existing metadata. New
properties are added, and old properties are replaced.
The following example updates metadata with contents from a file named ticks.xmp
:
C:\pdfleo --merge-xmp=ticks.xmp source.pdf output.pdf
Note that you should make sure that the content is correct. The file should be coded with UTF-8. And the content of metadata is not synchronized to the info dictionaries.