darlanschmeller/doc-ocr-php
===========================

Document OCR and ingestion pipeline for PHP applications, powered by Mistral AI.

v1.0.4 (3mo ago) · MIT · PHP ^8.1 · CI passing · Since Jan 20 · Pushed 3mo ago

[Source](https://github.com/DarlanSchmeller/doc-ocr-php) · [Packagist](https://packagist.org/packages/darlanschmeller/doc-ocr-php) · [Docs](https://github.com/darlan/doc-ocr-php)

🧾 Doc Ocr PHP
==============

![Doc OCR PHP](https://camo.githubusercontent.com/0ddf7a8e4d4aaa6318803a799587f40aef873bb344e78ebb0ec45dd57e618bec/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4f4352253230446f63756d656e74732d5048502d2532333838393946463f7374796c653d666f722d7468652d6261646765266c6162656c436f6c6f723d334333433343266c6f676f3d66696c6573266c6f676f436f6c6f723d7768697465)

[![PHP](https://camo.githubusercontent.com/0fab92c36bf4f5ecf36a89b9e0406349c55f9c76baef96272873e5cd5e72333a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048502d253345253344382e302d626c75653f6c6f676f3d706870266c6f676f436f6c6f723d7768697465)](https://www.php.net/) [![Packagist](https://camo.githubusercontent.com/058314cf350d50e7688b7014bbf3d3dd409cae394cfef4d084e96c72e5fcb626/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f6461726c616e7363686d656c6c65722f646f632d6f63722d706870)](https://packagist.org/packages/darlanschmeller/doc-ocr-php) [![License](https://camo.githubusercontent.com/f8df3091bbe1149f398a5369b2c39e896766f9f6efba3477c63e9b4aa940ef14/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d677265656e)](LICENSE) [![PHP Composer](https://github.com/DarlanSchmeller/doc-ocr-php/actions/workflows/php.yml/badge.svg)](https://github.com/DarlanSchmeller/doc-ocr-php/actions/workflows/php.yml/badge.svg)

DocOcr is a lightweight, pipeline-based PHP library that turns documents like **PDF, CSV, and XLSX** into structured data using **Mistral's OCR API**.

### Designed for:

- Ingestion pipelines
- AI preprocessing
- Finance / accounting docs
- Backend automation

Features
--------

- Supports **PDF, CSV, XLSX**
- Normalization layer for OCR-friendly input
- OCR powered by **Mistral AI**
- Extracts structured content (pages, text, tables)
- Fluent pipeline API (`normalize → ocr → toArray`)
- PHPUnit automated testing
- Custom OCR client injection

Why DocOcr?
-----------

Most OCR libraries return raw text blobs. DocOcr focuses on **pipeline-friendly, structured extraction** designed for backend systems, AI preprocessing, and financial workflows.

It handles:

- File normalization (CSV/XLSX → OCR-friendly layout)
- OCR execution
- Predictable output for downstream processing

Installation
------------

### Via Composer (recommended)

```
composer require darlanschmeller/doc-ocr-php
```

Include in your project:

```
require __DIR__ . '/vendor/autoload.php';

use DocOcr\Document;
```

### From source (for development only)

```
git clone https://github.com/DarlanSchmeller/doc-ocr-php.git
```

Include in your project:

```
require __DIR__ . '/src/Document.php';

use DocOcr\Document;
```

Configuration
-------------

Set your Mistral API key in your `.env` file:

```
MISTRAL_API_KEY=your_api_key_here
MISTRAL_OCR_ENDPOINT=mistral_ocr_endpoint_here # (OPTIONAL) default included
```
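Before building a pipeline, it can help to fail fast when the key is missing. The following is a minimal sketch, assuming the `.env` values have already been loaded into the process environment (for example by a dotenv loader or the server configuration). `requireEnv` is a hypothetical helper for illustration and is not part of DocOcr.

```php
<?php

/**
 * Hypothetical helper (not part of DocOcr): return a required environment
 * variable, or throw if it is missing or empty.
 */
function requireEnv(string $name): string
{
    // Check $_ENV first (some dotenv loaders only populate it),
    // then fall back to the process environment via getenv().
    $value = $_ENV[$name] ?? getenv($name);

    if (!is_string($value) || trim($value) === '') {
        throw new RuntimeException("Environment variable {$name} is not set.");
    }

    return $value;
}

// Usage: call once during bootstrap, before running any OCR pipeline.
// $apiKey = requireEnv('MISTRAL_API_KEY');
```

`MISTRAL_OCR_ENDPOINT` is optional, so it would not need the same check.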

Usage
-----

### Basic Usage

```
$ocr = Document::from(__DIR__ . '')
    ->normalize()
    ->ocr()
    ->toArray();

$ocrResult = $ocr->getResult();
```

### Injecting your own client instance

To use a different API key or a custom OCR client, inject your own client instance:

```
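// The empty string is where your API key goes.
$client = new MistralOcrClient(new OcrClient(''));

// $fixture is the document path relative to __DIR__.
$ocr = Document::fromWithClient(__DIR__ . $fixture, $client)
    ->normalize()
    ->ocr()
    ->toArray();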
```

Pipeline Stages
---------------

1. **`normalize()`**

    - Converts **CSV** and **XLSX** files into OCR-friendly layouts
    - Reads **PDFs** and **images** as-is
2. **`ocr()`**

    - Sends the document to **Mistral OCR**
    - Stores the **raw OCR response**
3. **`toArray()`**

    - Decodes the OCR JSON response into a **PHP array**

> All pipeline stages are idempotent and safe to call multiple times.
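To illustrate what idempotent stages can look like in a fluent pipeline, here is a toy sketch. It is not DocOcr's implementation: `SketchPipeline` and the closure-based client are hypothetical, and the "normalization" is just a `trim()`. Each stage stores its result and becomes a no-op on repeated calls.

```php
<?php

/**
 * Toy pipeline (NOT DocOcr's implementation): each stage stores its result
 * and does nothing on repeated calls, which is one way to get idempotency.
 */
final class SketchPipeline
{
    private ?string $normalized = null;
    private ?string $rawResponse = null;
    private ?array $result = null;

    // $ocrClient is a hypothetical stand-in: it takes text and returns JSON.
    public function __construct(private string $input, private Closure $ocrClient)
    {
    }

    public function normalize(): self
    {
        // Normalize only once; later calls keep the stored value.
        $this->normalized ??= trim($this->input);

        return $this;
    }

    public function ocr(): self
    {
        // Call the client only if no raw response has been stored yet.
        if ($this->rawResponse === null) {
            $this->rawResponse = ($this->ocrClient)($this->normalized ?? $this->input);
        }

        return $this;
    }

    public function toArray(): self
    {
        // Decode the stored JSON once; skip if ocr() has not run.
        if ($this->result === null && $this->rawResponse !== null) {
            $this->result = json_decode($this->rawResponse, true, 512, JSON_THROW_ON_ERROR);
        }

        return $this;
    }

    public function getResult(): ?array
    {
        return $this->result;
    }
}

// Fake client that counts how many times it is invoked.
$calls = 0;
$client = function (string $document) use (&$calls): string {
    $calls++;

    return json_encode(['pages' => [['index' => 0, 'markdown' => $document]]]);
};

// Calling every stage twice still triggers a single OCR request.
$pipeline = (new SketchPipeline('  Total: 39.60  ', $client))
    ->normalize()->normalize()
    ->ocr()->ocr()
    ->toArray()->toArray();
```

Guarding each stage on its stored result keeps repeated calls cheap, which matters most for `ocr()`, where a second call would otherwise mean a second API request.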

Output Example
--------------

```
[
  'pages' => [
    [
      'index' => 0,
      'markdown' => '
        Invoice Number: #20130304
        ATTENTION TO: Denny Gunawan
        221 Queen St, Melbourne 3000
        Total: $39.60
      ',
      'images' => [],
      'tables' => [
        [
          'id' => 'tbl-0.html',
          'format' => 'html',
          'content' => '
            Organic Items | Price/kg | Quantity | Subtotal
            Apple         | $5.00    | 1        | $5.00
            Orange        | $1.99    | 2        | $3.98
          '
        ]
      ]
    ]
  ]
]
```
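A result shaped like the example above can be walked with plain array functions. The sketch below uses a shortened copy of that example; `pageTexts` and `parsePipeTable` are hypothetical helpers, not part of DocOcr. It assumes table `content` is pipe-delimited text as shown in the example. Since `format` is `html`, real responses may contain HTML markup, which would need an HTML parser instead.

```php
<?php

// Shortened copy of the output example, used as sample input.
$result = [
    'pages' => [
        [
            'index' => 0,
            'markdown' => 'Invoice Number: #20130304' . "\n" . 'Total: $39.60',
            'images' => [],
            'tables' => [
                [
                    'id' => 'tbl-0.html',
                    'format' => 'html',
                    'content' => 'Organic Items | Price/kg | Quantity | Subtotal' . "\n"
                        . 'Apple         | $5.00    | 1        | $5.00' . "\n"
                        . 'Orange        | $1.99    | 2        | $3.98',
                ],
            ],
        ],
    ],
];

/** Hypothetical helper: collect each page's text, keyed by page index. */
function pageTexts(array $result): array
{
    $texts = [];
    foreach ($result['pages'] ?? [] as $page) {
        $texts[$page['index']] = trim($page['markdown'] ?? '');
    }

    return $texts;
}

/** Hypothetical helper: split pipe-delimited table text into rows of trimmed cells. */
function parsePipeTable(string $content): array
{
    $rows = [];
    foreach (preg_split('/\R/', trim($content)) as $line) {
        if (trim($line) === '') {
            continue; // skip blank lines
        }
        $rows[] = array_map('trim', explode('|', $line));
    }

    return $rows;
}

$texts = pageTexts($result);
$rows = parsePipeTable($result['pages'][0]['tables'][0]['content']);
```

Here `$rows[0]` holds the header cells and the remaining rows hold the line items, which makes it straightforward to map them into associative arrays downstream.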

📂 Supported Formats
-------------------

| Format | Normalized | OCR |
|---|---|---|
| PDF | ✅ | ✅ |
| CSV | ✅ | ✅ |
| XLSX | ✅ | ✅ |
| Images (png, jpg, webp) | ⏭ skipped | ✅ |

Run automated tests
-------------------

```
./vendor/bin/phpunit tests
```

> OCR tests are **skipped** automatically if `MISTRAL_API_KEY` is not set.

