pmdunggh/pdftotext
==================

Extract text, image from PDF file

INTRODUCTION
============

The PdfToText class has been designed to extract textual contents from a PDF file.

It's pretty simple to use :

```
include ( 'PdfToText.phpclass' ) ;

$pdf 	=  new PdfToText ( 'sample.pdf' ) ;
echo $pdf -> Text ; 		// or you could also write : echo ( string ) $pdf ;

```

The same PdfToText object can be reused to process additional files :

```
$pdf -> Load ( 'sample2.pdf' ) ;
echo $pdf -> Text ;

```

Additionally, the **PdfToText** class provides support methods for getting the page number of any text in the underlying PDF file.

Look at the class' blog for an overview of the underlying mechanics involved in extracting text contents from pdf files.

Examples are also provided in the **examples/** directory. Please have a look at the [examples/README.md](examples/README.md "README.md") file for a brief explanation on their structure.

**IMPORTANT** : the **PdfToText** class generates UTF8-encoded text. If your default character set is not UTF-8, you may need to add the following **meta** tag in the &lt;head&gt; part of your HTML page :

```
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
```

FEATURES
========

Text rendering in a PDF file is made using an obscure language which provides multiple ways to position the same text at the same location on a page. You could say for example :

```
. Goto coordinates (x,y)
. Render text ( "Mail : someone@somewhere.com" )

```

Or :

```
. Goto next line
. Goto (x1,y)
. Render text ( "Mail" )
. Goto (x2, y)
. Render text ( ":" )
. Goto (x3, y)
. Render text ( "someone@somewhere.com" )

```

(note that I'm using a pseudo-language here). Both pieces of code would display the same text at the same position, although they get there in rather different ways.

This is why the **PdfToText** class tracks the following information from the drawing-instruction stream to provide more accurate text rendering (even if the output is only pure text) :

- The currently selected font is tracked. This is important because :

    - Each font in a PDF file can have its own character map. This means in this case that characters to be drawn using the Adobe language do not specify actual character codes, but an index into the font's character map.
    - The current font size is memorized ; this helps to estimate the current y-coordinate when relative positioning instructions are used (such as "goto next line"). Although approximate, this works in the great majority of cases
- If multiple strings are rendered at the same y-coordinate, they will be grouped onto the same line. Note that they must appear sequentially in the instruction flow for this trick to work
- Sub/super-scripted text is usually written at a slightly different y-coordinate than the line it appears in. Such a situation is detected, and the sub/super-scripted text will correctly appear on the same line

These symptoms will not appear if the *PDFOPT\_BASIC\_LAYOUT* option is specified.
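The line-grouping idea described above can be illustrated with a small standalone sketch. It is not the code used by the **PdfToText** class : the fragment structure, the function name and the tolerance of 2 units are all assumptions made for the example. It only shows how consecutive fragments with nearly identical y-coordinates (including a slightly shifted superscript) can end up on the same output line.

```php
<?php
// Hypothetical sketch, NOT the PdfToText implementation.
// Each fragment is [ 'y' => float, 'text' => string ], listed in the order
// in which it appears in the drawing-instruction stream.
function groupFragmentsIntoLines(array $fragments, float $tolerance = 2.0): array
{
    $lines    = [];
    $current  = null;   // text accumulated for the line being built
    $currentY = null;   // y-coordinate of the line being built

    foreach ($fragments as $fragment) {
        // A fragment joins the current line when its y-coordinate is within
        // $tolerance of that line; this also absorbs sub/superscripts.
        if ($current !== null && abs($fragment['y'] - $currentY) <= $tolerance) {
            $current .= ' ' . $fragment['text'];
        } else {
            if ($current !== null) {
                $lines[] = $current;
            }
            $current  = $fragment['text'];
            $currentY = $fragment['y'];
        }
    }

    if ($current !== null) {
        $lines[] = $current;
    }

    return $lines;
}

$lines = groupFragmentsIntoLines([
    ['y' => 700.0, 'text' => 'Mail'],
    ['y' => 700.0, 'text' => ':'],
    ['y' => 700.0, 'text' => 'someone@somewhere.com'],
    ['y' => 680.0, 'text' => 'E = mc'],
    ['y' => 681.5, 'text' => '2'],      // superscript, slightly higher
]);
print_r($lines);
```

Because only consecutive fragments are compared, the sketch shares the limitation mentioned above : strings must appear sequentially in the instruction flow to be grouped.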

ADVANCED FEATURES
=================

The class is able to :

- Render basic page layout (ie, the text is drawn in the same order that Acrobat Reader renders it) using the *PDFOPT\_BASIC\_LAYOUT* option.
- Retrieve form data as a standalone object, using the *GetFormData()* method.
- Capture areas of text within a page

KNOWN ISSUES
============

Here is a list of known issues for the **PdfToText** class ; I'm working on solving them, so I hope this whole paragraph will soon completely disappear !

- Unwanted line breaks may occur within text lines. This happens because the pdf file contains drawing instructions that use relative positioning, and is especially true for files created with generators such as **PdfCreator**. However, some provisions have been made to try to put text with roughly the same y-coordinates onto the same line. This limitation does not apply if the *PDFOPT\_BASIC\_LAYOUT* option is specified.
- Encrypted PDF files are not supported

A NOTE FOR WINDOWS USERS
========================

An Apache server on Linux platforms allocates a default stack size of 8Mb for its threads. This value is set to 1Mb on Windows platforms.

However, some regular expressions used by the **PdfToText** class may cause the PHP PCRE extension to require a little bit more than 1Mb of stack space when processing certain PDF files.

Such a situation will cause your Windows Apache server to crash and your browser to display a message such as : *Connection reset*. This behavior affects several products such as EasyPHP, XAMPP or Wamp.

To solve this issue, you will have to enable the **mpm** module in your *httpd.conf* file and define a new stack size, as in the following example, given for a Wamp server :

```
	Include conf/extra/httpd-mpm.conf
	ThreadStackSize 8388608

```

TESTING
=======

I have tested this class against dozens of documents from various origins, and tested the output generated from each sample document by the *PdfCreator*, *PrimoPdf* and *PDF Pro* tools.

I also compared the output of the **PdfToText** class with that of *Acrobat Reader*, when you choose the *Save as...Text* option. In many situations, the class performs better in positioning the final text than *Acrobat Reader* does.

However, all of that will not guarantee that it will work in every situation ; so, if you find something weird or not functioning properly using the **PdfToText** class, feel free to contact me on this class' blog, and/or send me a sample PDF file at the following email address :

```
	christian.vigh@wuthering-bytes.com

```

OTHER LINKS
===========

This class can also be found here :

[http://www.phpclasses.org/package/9732-PHP-Extract-text-contents-from-PDF-files.html](http://www.phpclasses.org/package/9732-PHP-Extract-text-contents-from-PDF-files.html "http://www.phpclasses.org/package/9732-PHP-Extract-text-contents-from-PDF-files.html")

and here, where you will also find a FAQ section and be able to upload your PDF file samples for live testing :

[http://www.pdftotext.eu](http://www.pdftotext.eu "http://www.pdftotext.eu")

and also here :

REFERENCE
=========

METHODS
-------

### Constructor

```
$pdf 	=  new PdfToText ( $filename = null, $options = self::PDFOPT_NONE, $user_password = false, $owner_password = false ) ;

```

Instantiates a **PdfToText** object. If a filename has been specified, its text contents will be loaded and made available in the *Text* property (otherwise, you will have to call the *Load()* method for that).

See the *Options* property for a description of the *$options* parameter.

The *$user\_password* and *$owner\_password* parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).

In the current version, decryption of password-protected files is not yet supported.

### Load ( $filename, $user\_password = false, $owner\_password = false )

Loads the text contents of the specified filename.

The *$user\_password* and *$owner\_password* parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).

In the current version, decryption of password-protected files is not yet supported.

The method returns the decoded text contents, which are also available through the *Text* property.
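As a minimal sketch of how the return value of *Load()* can be used, the helper below reuses a single object to process several files. The function name *extractTextFromFiles* is hypothetical ; the only thing it assumes about *$pdf* is the *Load()* method described above.

```php
<?php
// Hypothetical helper, not part of the PdfToText class.
// $pdf is expected to be a PdfToText object (or anything with a compatible
// Load() method that returns the decoded text).
function extractTextFromFiles($pdf, array $filenames): array
{
    $texts = [];

    foreach ($filenames as $filename) {
        // Load() returns the decoded text, which is also available
        // afterwards through the Text property.
        $texts[$filename] = $pdf->Load($filename);
    }

    return $texts;
}

// Usage with the real class (assumes PdfToText.phpclass and both files exist) :
//   include ( 'PdfToText.phpclass' ) ;
//   $texts = extractTextFromFiles ( new PdfToText ( ), [ 'sample.pdf', 'sample2.pdf' ] ) ;
```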

### LoadFromString ( $contents, $user\_password = false, $owner\_password = false )

Loads the text contents of the specified PDF contents.

The *$user\_password* and *$owner\_password* parameters specify the user/owner password to be used for decrypting a password-protected file (note that this class is not a password cracker !).

In the current version, decryption of password-protected files is not yet supported.

The method returns the decoded text contents, which are also available through the *Text* property.

### AddAdobeExtraMappings ( $mappings )

Adobe supports 4 predefined fonts (Standard, Mac, WinAnsi and PDF). All the characters in these fonts are identified by a character name, a little bit like HTML entities ; for example, 'one' will be the character '1', 'acircumflex' will be 'â', etc.

There are thousands of character names defined by Adobe (see [https://mupdf.com/docs/browse/source/pdf/pdf-glyphlist.h.html](https://mupdf.com/docs/browse/source/pdf/pdf-glyphlist.h.html "https://mupdf.com/docs/browse/source/pdf/pdf-glyphlist.h.html")).

Some of them are not in this list ; this is the case for example of the 'ax' character names, where 'x' is a decimal number. When such a character is specified in a /Differences array, then there is somewhere a CharProc\[\] array giving an object id for each of those characters.

The referenced object(s) in turn contain drawing instructions to draw the glyph. At no point could you guess the corresponding Unicode character for this glyph, since the information is not contained in the PDF file.

The *AddAdobeExtraMappings()* method allows you to specify such correspondences. Specify an array as the *$mappings* parameter, whose keys are the Adobe character name (for example, "a127") and values the corresponding Unicode values.

The *$mappings* parameter is an associative array whose keys are Adobe character names. The array values can take several forms :

- A character
- An integer value
- An array of up to four character or integer values

Internally, every specified value is converted to an array of four integer values, one for each of the standard Adobe character sets (Standard, Mac, WinAnsi and PDF). The following rules apply :

- If the input value is a single character, the output array corresponding to the Adobe character name will be a set of 4 elements, each holding the ordinal value of the supplied character
- If the input value is an integer, the output array will be a set of 4 identical values
- If the input value is an array :
    - Arrays with fewer than 4 elements will be padded, using the last array item for padding
    - Arrays with more than 4 elements will be silently truncated
    - Each array value can either be a character or a numeric value

**Note**

In this current implementation, the method applies the mappings to ALL Adobe default fonts. That is, you cannot have one mapping for one Adobe font referenced in the PDF file, then a second mapping for a second Adobe font, etc.
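The conversion rules listed above can be sketched as follows. This is an illustration only, not the code used inside the class : the function name *normalizeMappingValue* is hypothetical, strings are always treated as characters, and *ord()* is used for simplicity, so the sketch only handles single-byte characters correctly.

```php
<?php
// Hypothetical helper illustrating the documented rules; NOT the PdfToText code.
// Assumption: ord() is used, so only single-byte characters are handled.
function normalizeMappingValue($value): array
{
    // A single character or integer is treated as a one-element array.
    $items = is_array($value) ? array_values($value) : [$value];

    if ($items === []) {
        throw new InvalidArgumentException('A mapping needs at least one value.');
    }

    // Characters become their ordinal value; integers are kept as they are.
    $codes = array_map(
        function ($item) {
            return is_int($item) ? $item : ord((string) $item);
        },
        $items
    );

    // More than 4 elements: silently truncated.
    $codes = array_slice($codes, 0, 4);

    // Fewer than 4 elements: padded with the last item.
    $last = $codes[count($codes) - 1];

    return array_pad($codes, 4, $last);
}

// Illustrative values only: what 'a127' really means depends on the PDF file.
$mappings = [
    'a127' => 0x2022,
    'a128' => 'A',
    'a129' => [65, 'B'],
];
print_r(array_map('normalizeMappingValue', $mappings));
```

An array such as *$mappings* is what would be passed to *AddAdobeExtraMappings()* ; the normalization itself is performed by the class.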

### GetCaptures

```
$captures 	=  $pdf -> GetCaptures ( $full = false ) ;

```

Retrieves the areas of text captured by the **PdfToText** class. This assumes that you first specified the *PDFOPT\_CAPTURE* flag to the class constructor, then called either the *SetCaptures()* or *SetCapturesFromString()* method.

When the *$full* parameter is set to *false* (the default), the returned object is a hierarchy of **stdClass** objects which maps capture names to their values.

When set to *true*, the returned object is of type **PdfToTextCaptures**, which holds much more information. It should be useful only when doing internal debugging of the PdfToText class.

**Note** :

Accessing a property value when the *$full* parameter is *false* can be performed like this (here, we are accessing the value of the *Title* capture in the first page) :

```
$captures -> Title [1]

```

When this parameter is *true*, you have to specify the *Text* property to retrieve its contents :

```
$captures -> Title [1] -> Text

```

If you plan to switch between both return types during your development phase, you can use a unified approach that works in both cases :

```
( string ) $captures -> Title [1]

```

See the *Capturing Text* section for more information on capturing text from PDF files.
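To see why the string cast works with both return types, here is a small sketch built on stand-in objects. The *CaptureStub* class and the sample title are hypothetical ; the real objects returned by *GetCaptures()* carry more information, and their exact structure is not shown here.

```php
<?php
// Stand-in for an entry returned when $full is true (hypothetical class;
// it only mimics the Text property and the string conversion).
class CaptureStub
{
    public $Text;

    public function __construct(string $text)
    {
        $this->Text = $text;
    }

    public function __toString(): string
    {
        return $this->Text;
    }
}

// $full = false : capture names map directly to their values, by page number.
$simple        = new stdClass();
$simple->Title = [1 => 'Annual report'];

// $full = true : each entry is an object exposing a Text property.
$full        = new stdClass();
$full->Title = [1 => new CaptureStub('Annual report')];

// The same expression works in both cases.
$fromSimple = (string) $simple->Title[1];
$fromFull   = (string) $full->Title[1];

echo $fromSimple, "\n", $fromFull, "\n";
```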

### GetFormCount

```
$count 		=  $pdf -> GetFormCount ( ) ;

```

Returns the actual number of top-level forms defined in the PDF file.

### GetFormData

```
$object 	=  $pdf -> GetFormData ( $template_xml, $index = 0 ) ;

```

Retrieves the form data for the specified top-level form index. Data is returned as an object inheriting from class **PdfToTextFormData**, which provides only helper functions to its derived classes.

Data retrieval can be based on a template XML file, or, when the *$template\_xml* parameter is null, a default template will be created using the field names defined in the PDF file.

See the *Form templates* section later in this file to get more information on how templates are used and how form data objects are built.

### GetPageFromOffset ( $offset )

Given a byte offset in the Text property, returns its page number in the pdf document.

Page numbers start from 1.
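A possible use of this method is to find out on which pages a given string occurs. The helper below is a sketch : the name *findPagesContaining* is hypothetical, and it relies only on the *Text* property and the *GetPageFromOffset()* method. It uses *strpos()*, which returns byte offsets, to stay consistent with the byte offsets expected by the method.

```php
<?php
// Hypothetical helper, not part of the PdfToText class.
// Returns the list of page numbers on which $needle occurs.
function findPagesContaining($pdf, string $needle): array
{
    $pages = [];

    if ($needle === '') {
        return $pages;
    }

    $offset = 0;

    // strpos() works on bytes, like the offsets GetPageFromOffset() expects.
    while (($position = strpos($pdf->Text, $needle, $offset)) !== false) {
        $pages[] = $pdf->GetPageFromOffset($position);
        $offset  = $position + strlen($needle);
    }

    // Keep each page number once, in order of first appearance.
    return array_values(array_unique($pages));
}

// Usage with the real class (assumes PdfToText.phpclass and sample.pdf exist) :
//   $pdf = new PdfToText ( 'sample.pdf' ) ;
//   print_r ( findPagesContaining ( $pdf, 'Introduction' ) ) ;
```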

### HasFormData

```
$status 	=  $pdf -> HasFormData ( ) ;

```

Returns true if the PDF file contains form data.
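The three form-related methods can be combined to retrieve every top-level form. The sketch below uses a hypothetical helper name, *collectFormData*, and assumes that form indexes start at 0 (as the default value of the *$index* parameter suggests). Passing *null* as the template lets the class build a default template from the field names.

```php
<?php
// Hypothetical helper, not part of the PdfToText class.
// Assumption: form indexes run from 0 to GetFormCount() - 1.
function collectFormData($pdf): array
{
    $forms = [];

    if (!$pdf->HasFormData()) {
        return $forms;
    }

    $count = $pdf->GetFormCount();

    for ($index = 0; $index < $count; $index++) {
        // A null template means : use a default template built from the
        // field names defined in the PDF file.
        $forms[$index] = $pdf->GetFormData(null, $index);
    }

    return $forms;
}

// Usage with the real class (assumes PdfToText.phpclass and form.pdf exist) :
//   $pdf   = new PdfToText ( 'form.pdf' ) ;
//   $forms = collectFormData ( $pdf ) ;
```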

### MarkTextLike ( $regex, $mark\_start, $mark\_end )

Sometimes it may be convenient, when you want to extract only a portion of text, to say : "I want to extract text between this title and this title". The MarkTextLike() method provides some support for such a task. Imagine you have documents that have the same structure, all starting with an "Introduction" title :

```
Introduction
	...
	some text
	...
Some other title
	...

```

By calling the MarkTextLike() method such as in the example below :

```
$pdf -> MarkTextLike ( '/\bIntroduction\b/', '', '