


lecodeurdudimanche/document-data-extractor
==========================================

A simple library to extract data from documents with a known structure

Version 0.1.1 · MIT license · [Source](https://github.com/LeCodeurDuDimanche/document-data-extractor) · [Packagist](https://packagist.org/packages/lecodeurdudimanche/document-data-extractor)

Document Data Extractor
=======================


A simple PHP library to automate data extraction from documents with known formats.

Requirements
------------


This library uses Tesseract to read text from documents and Imagick to manipulate the images.

It relies on Ghostscript (`gs`) to convert PDF files to images.
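Before installing, it can help to confirm that these prerequisites are present. The following is a minimal sketch, not part of the library: `checkRequirements()` is a hypothetical helper, and it assumes a Unix-like system where the binaries are named `tesseract` and `gs` and are reachable through `PATH`.

```php
<?php
// Hypothetical helper, not part of the library: reports which prerequisites are present.
function checkRequirements(): array
{
    // Look a binary up on PATH by probing each directory (no shell call needed).
    $onPath = function (string $binary): bool {
        $dirs = explode(PATH_SEPARATOR, (string) getenv('PATH'));
        foreach ($dirs as $dir) {
            if ($dir !== '' && is_executable($dir . DIRECTORY_SEPARATOR . $binary)) {
                return true;
            }
        }
        return false;
    };

    return [
        'imagick'   => extension_loaded('imagick'), // PHP extension
        'tesseract' => $onPath('tesseract'),        // OCR engine
        'gs'        => $onPath('gs'),               // Ghostscript, used for PDF input
    ];
}

foreach (checkRequirements() as $name => $available) {
    echo str_pad($name, 10), $available ? 'found' : 'MISSING', PHP_EOL;
}
```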

Installation
------------


Install the required PHP extension, `php-imagick`. On Ubuntu:

```
apt install php-imagick
```

Then install the package with Composer:

```
composer require lecodeurdudimanche/document-data-extractor
```

Usage
-----


First, define which data you want to extract and where it is located on the image:

```
    $extractor = new Extractor();
    $regionsOfInterest = [
        // The company name lies in the rectangle whose top-left corner is (700, 180) and whose size is 1080 x 160
        (new ROI('Name of the company'))->setRect(700, 180, 1080, 160),
        // The second constructor argument is the expected data type
        (new ROI('Total', 'integer'))->setRect(1980, 1572, 58, 52),
    ];
```
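Image editors often report a selection as two corner points rather than as an origin and a size. Assuming `setRect()` takes `(x, y, width, height)`, as the comment in the example above suggests, a small conversion can be sketched as follows. `rectFromCorners()` is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical helper: converts two opposite corners into [x, y, width, height].
function rectFromCorners(int $x1, int $y1, int $x2, int $y2): array
{
    return [min($x1, $x2), min($y1, $y2), abs($x2 - $x1), abs($y2 - $y1)];
}

// Corners of the "Total" field as measured in an image editor.
[$x, $y, $width, $height] = rectFromCorners(1980, 1572, 2038, 1624);
printf("setRect(%d, %d, %d, %d)\n", $x, $y, $width, $height);
// prints: setRect(1980, 1572, 58, 52)
```

The corners can be given in any order, since the helper normalizes them to the top-left origin.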

Next, you can set options that are forwarded to Tesseract to get more accurate results:

```
    $tesseractConfiguration = [
        'psm' => 8, // Page segmentation mode 8: treat the image as a single word
        'tessdataDir' => '/usr/share/tessdata' // Other tesseract options ...
    ];
    $config = Configuration::fromArray(compact('regionsOfInterest', 'tesseractConfiguration'));
    $extractor->setConfig($config);
```
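To make the idea of forwarded options concrete, here is a rough sketch of how such an array could map onto Tesseract's command-line flags `--psm` and `--tessdata-dir`. How the library actually forwards options is not documented here, so the mapping table and `buildTesseractArguments()` are assumptions made for illustration only.

```php
<?php
// Hypothetical mapping from option names to Tesseract CLI flags.
// Only 'psm' and 'tessdataDir' appear in this README; the way the library
// forwards them internally is an assumption.
function buildTesseractArguments(array $options): array
{
    $flags = ['psm' => '--psm', 'tessdataDir' => '--tessdata-dir'];
    $arguments = [];
    foreach ($options as $name => $value) {
        if (!isset($flags[$name])) {
            throw new InvalidArgumentException("Unsupported option: $name");
        }
        $arguments[] = $flags[$name];
        // Quote the value so paths containing spaces survive the shell.
        $arguments[] = escapeshellarg((string) $value);
    }
    return $arguments;
}

$arguments = buildTesseractArguments([
    'psm' => 8,
    'tessdataDir' => '/usr/share/tessdata',
]);
echo implode(' ', $arguments), PHP_EOL;
```

Rejecting unknown option names early makes typos in the configuration visible instead of silently ignoring them.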

Then set the document to extract data from, using one of the following:

```
    $extractor->loadImage('/path/to/image.png'); // or
    $extractor->loadPDF('/path/to/document.pdf'); // or
    $extractor->setImage($imageData); // could be an Imagick or GD image or raw image data
```

Finally, call the `run()` method to extract the data:

```
    $data = $extractor->run();
    /*
    * $data = [
    * ['label' => 'Name of the company', 'type' => 'text', 'data' => 'Company Limited'],
    * ['label' => 'Total', 'type' => 'integer', 'data' => '55']
    * ];
    */
```
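In the example above, the `data` value of the `integer` field is still a string. The sketch below shows one way to post-process a result of that shape into a label-indexed array with typed values. `indexByLabel()` is a hypothetical helper, and the sample `$data` simply repeats the structure from the comment above.

```php
<?php
// Sample result in the shape shown above.
$data = [
    ['label' => 'Name of the company', 'type' => 'text', 'data' => 'Company Limited'],
    ['label' => 'Total', 'type' => 'integer', 'data' => '55'],
];

// Hypothetical helper: index results by label and cast 'integer' fields.
function indexByLabel(array $results): array
{
    $indexed = [];
    foreach ($results as $result) {
        $value = $result['data'];
        if ($result['type'] === 'integer') {
            // OCR output may contain stray characters; store null when it is not a clean integer.
            $value = filter_var(trim($value), FILTER_VALIDATE_INT);
            $value = $value === false ? null : $value;
        }
        $indexed[$result['label']] = $value;
    }
    return $indexed;
}

$fields = indexByLabel($data);
var_dump($fields['Total']); // int(55)
```

Storing `null` for unparseable values keeps OCR misreads (for example `5S` instead of `55`) from being silently turned into a wrong number.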

You can save and load a `Configuration` object with the `toFile` and `fromFile` methods. The file format is pretty-printed JSON.
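As an illustration of what a pretty-printed JSON round trip looks like in plain PHP, here is a sketch that does not use the library. The array layout is hypothetical, and the schema that `toFile` actually writes may differ.

```php
<?php
// Hypothetical configuration layout; the library's real schema may differ.
$config = [
    'regionsOfInterest' => [
        ['label' => 'Total', 'type' => 'integer', 'rect' => [1980, 1572, 58, 52]],
    ],
    'tesseractConfiguration' => ['psm' => 8],
];

// Write human-readable JSON to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'dde');
file_put_contents($path, json_encode($config, JSON_PRETTY_PRINT | JSON_THROW_ON_ERROR));

// Reading the file back yields the same array.
$loaded = json_decode(file_get_contents($path), true, 512, JSON_THROW_ON_ERROR);
unlink($path);

var_dump($loaded === $config); // bool(true)
```

A pretty-printed file is easy to edit by hand and to review in version control, which suits region coordinates that need occasional tuning.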

