prinsfrank/pdfparser
====================

Maintainable, fast & low-memory; built from scratch

Version v2.7.0, MIT licensed, requires PHP ^8.1.

[Source](https://github.com/PrinsFrank/pdfparser) | [Packagist](https://packagist.org/packages/prinsfrank/pdfparser)

![Banner](https://github.com/PrinsFrank/pdfparser/raw/main/docs/images/banner_light.png)

PDF Parser
==========

[![GitHub](https://camo.githubusercontent.com/db6097fa553eb64f7d4efe2f4a7329e776fd76cf10236506bbd3f28a696c1faa/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f7072696e736672616e6b2f706466706172736572)](https://github.com/PrinsFrank/pdfparser/blob/main/LICENSE)[![PHP Version Support](https://camo.githubusercontent.com/0570fc2130bced0ec51bab91ba66f95fc2eaeeefb634936feedc21566c132403/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f7068702d762f7072696e736672616e6b2f706466706172736572)](https://github.com/PrinsFrank/pdfparser/blob/main/composer.json)[![codecov](https://camo.githubusercontent.com/c88ea8a1532471261233a8551c088327308d5ac1d7e29c61f20c6f947502c4fa/68747470733a2f2f636f6465636f762e696f2f67682f5072696e734672616e6b2f7064667061727365722f6272616e63682f6d61696e2f67726170682f62616467652e7376673f746f6b656e3d324b584f34334d434943)](https://codecov.io/gh/PrinsFrank/pdfparser)[![PHPStan Level](https://camo.githubusercontent.com/2b1732baa25914ee5ccbeaf42980d671de29700b49e0639e1edc8e66181f6905/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048505374616e2d6c6576656c25323031302d627269676874677265656e2e7376673f7374796c653d666c6174)](https://github.com/PrinsFrank/pdfparser/blob/main/phpstan.neon)[![](https://camo.githubusercontent.com/d2432db86e1cfca636933728a664bf886cff501efab549a0af6abab26fe8851b/68747470733a2f2f696d672e736869656c64732e696f2f7374617469632f76313f6c6162656c3d53706f6e736f72266d6573736167653d254532253944254134266c6f676f3d47697448756226636f6c6f723d253233666538653836)](https://github.com/sponsors/PrinsFrank)

Maintainable, fast & low-memory; built from scratch

Why this library?
-----------------

Previously, there was no PDF parsing library that was open source, MIT licensed, and under active development. The PDFParser by smalot, while very useful over the years, is no longer under active development. The Setasign parser is neither MIT licensed nor open source. Several other packages rely on Java, JavaScript, or Python dependencies that PHP calls behind the scenes, losing type information and the underlying document structure.

Instead, this package allows for parsing of a wide variety of PDF files while not relying on external dependencies, all while being MIT licensed!

Setup
-----

To start right away, run the following command in your Composer project:

```
composer require prinsfrank/pdfparser
```

Installation without Composer: if you don't want to install this package using Composer, or cannot due to some constraints, you can still download the contents of the latest release and use this package directly.

As you don't have Composer to handle autoloading for you, you'll need to register the custom autoloader from this project. To do so, add the following line at the top of your custom bootstrap script or the file you want to parse PDFs in:

`require 'path/to/package/directory/.al-custom.php';`

This needs to point to the `.al-custom.php` file in the directory that contains the contents of this package.

The most common use case, extracting text from a document, is then as simple as this:

```
use PrinsFrank\PdfParser\PdfParser;

$document = (new PdfParser())
    ->parseFile($path);

$text = $document->getText();
```

Support
-------

This is one of the biggest projects I've ever worked on, and over the past few months I have spent hundreds of hours on it. Please consider [Sponsoring me on GitHub](https://github.com/sponsors/PrinsFrank) to support this project. Thanks!

Opening a PDF
-------------

To open a PDF file, you'll first need to load it and retrieve a `Document` object. That can be done by either parsing a file directly, or parsing a PDF from a string variable.

### Parsing a PDF file

Parsing a PDF from a file directly is the easiest option. To do so, simply call the `parseFile` method on a `PdfParser` instance:

```
use PrinsFrank\PdfParser\PdfParser;

$document = (new PdfParser())
    ->parseFile(dirname(__DIR__, 3) . '/path/to/file.pdf');
```

By default, this loads the file into memory while parsing. This greatly improves parsing speed, at the cost of a bigger memory footprint. If you want to reduce the base memory footprint and use a file handle instead, you can set the second parameter `useInMemoryStream` to `false`.
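As a minimal sketch of the low-memory variant described above, the second argument can be passed as `false`. The helper name `extractTextWithFileHandle` is hypothetical and not part of the library, the `string` return type of `getText()` is an assumption, and the snippet assumes Composer autoloading is already set up.

```php
<?php

use PrinsFrank\PdfParser\PdfParser;

/**
 * Hypothetical helper (not part of the library): extracts all text from a
 * PDF while keeping the base memory footprint low.
 */
function extractTextWithFileHandle(string $path): string
{
    // The second argument is `useInMemoryStream`. Passing `false` makes the
    // parser read through a file handle instead of loading the whole file
    // into memory, trading parsing speed for a smaller footprint.
    $document = (new PdfParser())->parseFile($path, false);

    return $document->getText();
}
```

It could then be called as `echo extractTextWithFileHandle('/path/to/file.pdf');`.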

### Parsing PDF from string

It is also possible to parse a PDF from a string in a variable. To do so, pass the string as an argument to the `parseString` method on a `PdfParser` instance:

```
use PrinsFrank\PdfParser\PdfParser;

$pdfAsString = file_get_contents(dirname(__DIR__, 3) . '/path/to/file.pdf');

$document = (new PdfParser())
    ->parseString($pdfAsString);
```

If you want to decrease the average memory footprint, you can also do so here, by setting the `useFileCache` parameter to `true`. This will result in the file content being written to a temporary file and the parser using the handle to that file from then on. This will be at the cost of speed.
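The file-cache option might be used roughly as follows. This is a sketch only: it relies on a PHP 8 named argument and therefore assumes the parameter is literally named `$useFileCache`, since its position in the signature is not stated here. The helper name `parsePdfStringWithFileCache` is hypothetical.

```php
<?php

use PrinsFrank\PdfParser\Document\Document;
use PrinsFrank\PdfParser\PdfParser;

/**
 * Hypothetical helper (not part of the library): parses a PDF held in a
 * string while letting the parser work from a temporary file.
 */
function parsePdfStringWithFileCache(string $pdfAsString): Document
{
    // Named argument: assumes the parameter is called `$useFileCache`.
    // The content is written to a temporary file and parsed via its handle,
    // which lowers average memory use at the cost of speed.
    return (new PdfParser())->parseString($pdfAsString, useFileCache: true);
}
```

This is mostly useful when the PDF arrives as a string (for example from an HTTP response or a database column) and is too large to comfortably keep in memory twice.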

The `Document`
--------------

Once you have opened a file from the filesystem with `parseFile` or from a string variable using `parseString`, you'll get back an instance of a `Document`.

While initially parsing the document, a small number of properties are populated on the `Document` instance to allow further access to that document. These include:

- The public `$stream` property: a PHP stream handle to the file in memory or on the filesystem.
- The public `$version` property: information about the PDF version of the file.
- The public `$crossReferenceSource` property: a parsed cross-reference table or stream, containing several cross-reference (sub)sections that describe the objects stored in the document and where to find them.
- The private `$pages` property: a cache of any pages that have already been retrieved. This property is only set once the pages are actually retrieved using the `getPages` method (see below).

The document also contains several methods to retrieve specific objects from it. Those are discussed below.

If you want to quickly retrieve all text from a document, you can use the `getText` method.

Objects in a `Document` and their decorators
--------------------------------------------

A PDF is organized in objects. Not all objects are created equal: some objects might be a Page, others a Font, and some might be generic, without a specific type. There are currently 18 specific types and one generic object type. Some of those are described below.

Code specific to a certain object type lives in that object type's decorator. Retrieving text from a Page makes sense, while retrieving text from a Font does not, so the Page decorator contains the `getText` method. Below you'll find some documentation for specific object decorators.

If you want to retrieve an object by its number, you can call the `$document->getObject($objectNumber)` method. If you know that the object with that number is supposed to be of a specific type, you can supply the second argument. For example, if you want to get object 42 which you know is of type Page, you can call the method like this:

```
$page = $document->getObject(42, Page::class);
```

If the object is not of the correct type, this will result in an exception. If you don't care about the object type, pass null as the second argument or don't supply the second argument at all.
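If a wrong type should not abort the whole run, the typed lookup can be wrapped. The following is a sketch under assumptions: the exact exception class is not named here, so it catches `\Throwable` (which also hides unrelated failures, so narrow it in real code), and `findObjectOfType` is a hypothetical helper rather than a library function.

```php
<?php

use PrinsFrank\PdfParser\Document\Document;

/**
 * Hypothetical helper (not part of the library): returns the object with the
 * given number if it can be retrieved as the expected type, null otherwise.
 *
 * @template T of object
 * @param class-string<T> $expectedType
 * @return T|null
 */
function findObjectOfType(Document $document, int $objectNumber, string $expectedType): ?object
{
    try {
        // Throws when the object is not of the requested type.
        return $document->getObject($objectNumber, $expectedType);
    } catch (\Throwable $exception) {
        // Assumption: any failure is treated as "not available as this type".
        return null;
    }
}
```

With this in place, `findObjectOfType($document, 42, Page::class)` would yield either the decorated page or `null`.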

### Decorated `InformationDictionary` objects

If a PDF has a title, producer, author, creator, creationDate or modificationDate, it is stored in an InformationDictionary.

If a PDF has an InformationDictionary, it can be retrieved using the `$document->getInformationDictionary()` method. Not all PDFs have one, so this method might return `null`.

To access information from the InformationDictionary, there are several methods available:

```
/** @var \PrinsFrank\PdfParser\Document\Document $document */
$title = $document->getInformationDictionary()?->getTitle();
$producer = $document->getInformationDictionary()?->getProducer();
$author = $document->getInformationDictionary()?->getAuthor();
$creator = $document->getInformationDictionary()?->getCreator();
$creationDate = $document->getInformationDictionary()?->getCreationDate();
$modificationDate = $document->getInformationDictionary()?->getModificationDate();
```
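When several fields are needed, it may be tidier to fetch the dictionary once and handle the `null` case in one place. The sketch below assumes the getters return `null` for missing entries. `summarizeInformationDictionary` is a hypothetical helper, and its parameter is typed loosely as `?object` because the decorator's namespace is not shown here.

```php
<?php

/**
 * Hypothetical helper (not part of the library): collects the standard
 * metadata fields that are actually present.
 *
 * @param object|null $informationDictionary result of $document->getInformationDictionary()
 * @return array<string, mixed>
 */
function summarizeInformationDictionary(?object $informationDictionary): array
{
    // Not every PDF has an information dictionary.
    if ($informationDictionary === null) {
        return [];
    }

    $fields = [
        'title' => $informationDictionary->getTitle(),
        'author' => $informationDictionary->getAuthor(),
        'producer' => $informationDictionary->getProducer(),
        'creator' => $informationDictionary->getCreator(),
        'creationDate' => $informationDictionary->getCreationDate(),
        'modificationDate' => $informationDictionary->getModificationDate(),
    ];

    // Keep only the entries that have a value (assumes null means "absent").
    return array_filter($fields, static fn (mixed $value): bool => $value !== null);
}
```

A call would look like `summarizeInformationDictionary($document->getInformationDictionary())`.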

If you want to access non-standard data from the information dictionary, you can also retrieve the entire dictionary from the object:

```
/** @var \PrinsFrank\PdfParser\Document\Document $document */
$dictionary = $document->getInformationDictionary()?->getDictionary();
```

### Decorated `Page` objects

Page objects can be retrieved from a document by calling the `$document->getPage($pageNumber)` method for a single page, or `$document->getPages()` for all pages. Note that `$pageNumber` is zero-indexed: regardless of the page numbers (in whatever format) printed at the bottom of the pages, the first page in a document is page 0, the second is page 1, and so on.
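To avoid off-by-one mistakes, the conversion from "the n-th page" to the zero-based index can be made explicit. Both functions below are hypothetical helpers, not part of the library. Whether `getPage()` returns `null` or throws for a page that does not exist is not stated here, so the nullsafe operator is used defensively.

```php
<?php

use PrinsFrank\PdfParser\Document\Document;

/**
 * Hypothetical helper: converts a 1-based position ("the 3rd page") into the
 * zero-based index expected by getPage().
 */
function toZeroBasedPageIndex(int $position): int
{
    if ($position < 1) {
        throw new InvalidArgumentException('Page positions start at 1.');
    }

    return $position - 1;
}

/**
 * Hypothetical helper: returns the text of the n-th page, or null if
 * getPage() yields null for that index (an assumption, see above).
 */
function getTextOfNthPage(Document $document, int $position): ?string
{
    return $document->getPage(toZeroBasedPageIndex($position))?->getText();
}
```

For example, `getTextOfNthPage($document, 1)` asks for page index 0, the first page.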

Once you have a `Page` object, there are several methods available to retrieve information from that page. The main method of interest here is the `$page->getText()` method. To retrieve all text from all pages, you could do something like this:

```
use PrinsFrank\PdfParser\PdfParser;

$document = (new PdfParser())->parseFile('/path/to/file.pdf');

foreach ($document->getPages() as $index => $page) {
    echo 'Text on page ' . $index . ' : ' . $page->getText();
}
```

There is also a `getText` method available on the Document to retrieve all text at once without even having to retrieve pages.

There are also methods available to get the underlying positioned Text Elements using `$page->getPositionedTextElements()`, the resource dictionary for a page using `$page->getResourceDictionary()` and the font dictionary using `$page->getFontDictionary()`.

### Decorated `XObject` objects

Images and forms are stored in XObjects. These can be retrieved on a page-by-page basis using `$page->getXObjects()`. If you are only interested in images, you can retrieve all image XObjects by calling `$page->getImages()`.

For XObjects, there are some additional methods: `getWidth()` returns the width of the object in pixels if available, `getHeight()` the height in pixels, and `getLength()` the length in bytes. To determine the subtype, there are two methods available: `isImage()` and `isForm()`. If the XObject is an image, `getImageType()` will return the image type.
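The methods above can be combined into a quick overview of what a page contains. This is only a sketch: `summarizeXObjects` is a hypothetical helper, and it accepts any iterable of objects exposing the listed methods, because the XObject decorator's namespace is not shown here.

```php
<?php

/**
 * Hypothetical helper (not part of the library): builds a plain-array
 * overview of XObjects, e.g. the result of $page->getXObjects().
 *
 * @param iterable<object> $xObjects
 * @return list<array{kind: string, width: mixed, height: mixed, bytes: mixed}>
 */
function summarizeXObjects(iterable $xObjects): array
{
    $summary = [];
    foreach ($xObjects as $xObject) {
        $summary[] = [
            // isImage() and isForm() distinguish the two known subtypes.
            'kind' => match (true) {
                $xObject->isImage() => 'image',
                $xObject->isForm() => 'form',
                default => 'other',
            },
            'width' => $xObject->getWidth(),   // pixels, if available
            'height' => $xObject->getHeight(), // pixels, if available
            'bytes' => $xObject->getLength(),  // length in bytes
        ];
    }

    return $summary;
}
```

A call such as `print_r(summarizeXObjects($page->getXObjects()));` would then list every image and form on that page.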

To extract all images on a page and store them on your machine, you could do something like this:

```
/** @var \PrinsFrank\PdfParser\Document\Document $document */
foreach ($document->getImages() as $index => $image) {
    if (($imageFileExtension = $image->getImageType()?->getFileExtension()) === null) {
        continue; // You could still save the file with a default file extension like 'jpg', but it is not clear what kind of image this is.
    }

    file_put_contents(sprintf('%s/image_%d.%s', __DIR__, $index, $imageFileExtension), $image->getContent());
}
```
