content-extract/content-processor
=================================

Robust PHP library for batch document processing. Extracts content from PDFs/text and generates structured JSON according to user-defined schemas. Now with semantic structuring, OCR support for scanned PDFs, text normalization, and alias-driven field matching. Production-ready, secure, zero unnecessary dependencies.

Version 1.5.0 · MIT · PHP >= 8.1

[Source](https://github.com/saul9809/content_extract-library) · [Packagist](https://packagist.org/packages/content-extract/content-processor)

Content Processor
=================

**Production-ready PHP library for batch document processing with intelligent content extraction and structuring.**

Framework-agnostic, scalable, and optimized for real-world document pipelines from day one.

🎯 Purpose
---------

Process multiple documents (PDFs, scanned PDFs via OCR, and plain-text files), extract their content, and convert it into configurable JSON structures ready for bulk loading into databases or services.

### Quick Example

```
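use ContentProcessor\Core\ContentProcessor;
use ContentProcessor\Extractors\PdfTextExtractor;
use ContentProcessor\Structurers\SchemaAwareStructurer;

// $schema is an ArraySchema; see "Define Your Schema" under Quick Start
$result = ContentProcessor::make()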
    ->withSchema($schema)
    ->withExtractor(new PdfTextExtractor())
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/documents')
    ->processFinal();  // Returns FinalResult with clean API
```

📦 Installation
--------------

```
composer require content-extract/content-processor:^1.4.0
```

Or add to your `composer.json`:

```
{
  "require": {
    "content-extract/content-processor": "^1.4.0"
  }
}
```

🏗️ Project Structure
--------------------

```
src/
├── Contracts/              # Interfaces defining the contract
│   ├── ExtractorInterface.php
│   ├── StructurerInterface.php
│   └── SchemaInterface.php
├── Core/                   # Main classes
│   └── ContentProcessor.php
├── Extractors/             # Extractor implementations
│   ├── PdfTextExtractor.php
│   ├── TextFileExtractor.php
│   ├── PdfOcrExtractor.php (v1.5.0+)
│   └── CompositePdfExtractor.php (v1.5.0+)
├── Schemas/                # Schema implementations
│   └── ArraySchema.php
├── Structurers/            # Structurer implementations
│   ├── SimpleLineStructurer.php
│   ├── RuleBasedStructurer.php
│   └── SchemaAwareStructurer.php
├── Utils/                  # Utilities
│   ├── TextNormalizer.php
│   └── TextSegmenter.php
└── Models/                 # Domain models
    ├── Warning.php
    ├── Error.php
    └── FinalResult.php

examples/
├── example_basic.php
├── example_semantic_structuring.php
└── sample_cv_*.txt

```

⚡ Quick Start
-------------

### 1. Define Your Schema

```
use ContentProcessor\Schemas\ArraySchema;

$schema = new ArraySchema([
    'name' => [
        'type' => 'string',
        'required' => true,
        'aliases' => ['name', 'full name', 'applicant name'],
    ],
    'email' => [
        'type' => 'string',
        'required' => true,
        'aliases' => ['email', 'email address'],
    ],
    'experience_years' => [
        'type' => 'integer',
        'required' => false,
        'aliases' => ['years of experience', 'experience'],
    ],
]);
```
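
To make the role of `aliases` concrete, here is a minimal sketch of label-to-field matching. It is not the library's `SchemaAwareStructurer` implementation: `matchFieldByAlias()` is a hypothetical helper, and exact case-insensitive comparison is an assumed matching rule chosen only for illustration.

```php
<?php
/**
 * Hypothetical helper: returns the schema field whose alias list contains
 * the given label (compared case-insensitively), or null if none matches.
 */
function matchFieldByAlias(string $label, array $definition): ?string
{
    $needle = strtolower(trim($label));
    foreach ($definition as $field => $rules) {
        foreach ($rules['aliases'] ?? [] as $alias) {
            if (strtolower(trim($alias)) === $needle) {
                return $field;
            }
        }
    }
    return null;
}

// Same shape as the array passed to ArraySchema above.
$definition = [
    'name'  => ['type' => 'string', 'aliases' => ['name', 'full name', 'applicant name']],
    'email' => ['type' => 'string', 'aliases' => ['email', 'email address']],
];

// Split a "Label: value" line and resolve the label to a schema field.
[$label, $value] = array_map('trim', explode(':', 'Full Name: Jane Doe', 2));
$field = matchFieldByAlias($label, $definition);

echo $field . ' => ' . $value . PHP_EOL; // prints "name => Jane Doe"
```

Listing several aliases per field lets documents that label the same value differently ("Full Name", "Applicant Name") map to one output key.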

### 2. Configure the Processor

```
use ContentProcessor\Core\ContentProcessor;
use ContentProcessor\Extractors\PdfTextExtractor;
use ContentProcessor\Structurers\SchemaAwareStructurer;

$result = ContentProcessor::make()
    ->withSchema($schema)
    ->withExtractor(new PdfTextExtractor())
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/path/to/documents', '*.pdf')
    ->processFinal();
```

### 3. Consume Results

```
// Report failures (isSuccessful() only means at least one document succeeded)
if ($result->hasErrors()) {
    echo "Some documents failed:\n";
    foreach ($result->errors() as $error) {
        echo "  - " . $error->getMessage() . "\n";
    }
}

// Process successful data
foreach ($result->data() as $item) {
    echo "Processed: " . $item['document'] . "\n";
    // $item['data'] contains the structured data
    var_dump($item['data']);
}

// Inspect quality warnings
if ($result->hasWarnings()) {
    foreach ($result->warnings() as $warning) {
        echo "⚠️ Field '{$warning->getField()}': {$warning->getMessage()}\n";
    }
}

// Export to JSON
echo $result->toJSONPretty();
```

🧪 Testing
---------

### Run Examples

```
cd examples
php example_basic.php
php example_semantic_structuring.php
```

### Full Test Suite

```
composer test
```

### Code Quality

```
composer lint
```

🔌 Available Interfaces
----------------------

### ExtractorInterface

```
interface ExtractorInterface {
    public function extract(string $source): array;
    public function canHandle(string $source): bool;
    public function getName(): string;
}
```
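
As an illustration of the contract, the sketch below implements the three methods for Markdown files. The interface is redeclared locally so the example runs on its own; in a real project you would import it from the library (presumably under `ContentProcessor\Contracts`). `MarkdownFileExtractor` is hypothetical, and the returned array shape (`source`, `text`) is an assumption, since this README does not document what `extract()` must return.

```php
<?php
// Local copy of the interface shown above, so the sketch is self-contained.
interface ExtractorInterface
{
    public function extract(string $source): array;
    public function canHandle(string $source): bool;
    public function getName(): string;
}

// Hypothetical extractor for Markdown files.
final class MarkdownFileExtractor implements ExtractorInterface
{
    public function extract(string $source): array
    {
        if (!$this->canHandle($source)) {
            throw new InvalidArgumentException("Unsupported source: {$source}");
        }
        $text = file_get_contents($source);
        if ($text === false) {
            throw new RuntimeException("Could not read: {$source}");
        }
        // Assumed return shape; adapt it to what your structurer expects.
        return ['source' => basename($source), 'text' => $text];
    }

    public function canHandle(string $source): bool
    {
        return is_file($source)
            && is_readable($source)
            && strtolower(pathinfo($source, PATHINFO_EXTENSION)) === 'md';
    }

    public function getName(): string
    {
        return 'markdown-file';
    }
}

// Usage: write a temporary Markdown file, extract it, then clean up.
$path = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'cp_demo_' . uniqid() . '.md';
file_put_contents($path, "Name: Jane Doe\n");

$extractor = new MarkdownFileExtractor();
$content = $extractor->extract($path);
unlink($path);

echo $extractor->getName() . ': ' . trim($content['text']) . PHP_EOL;
```

Keeping `canHandle()` separate from `extract()` lets a caller probe several extractors and pick the first one that accepts a file.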

### StructurerInterface

```
interface StructurerInterface {
    public function structure(array $content, SchemaInterface $schema): array;
    public function getName(): string;
}
```

### SchemaInterface

```
interface SchemaInterface {
    public function getDefinition(): array;
    public function validate(array $data): array;
    public function getName(): string;
}
```
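
A custom schema might look like the following sketch. The interface is redeclared locally so the code runs standalone. `RequiredFieldsSchema` is hypothetical, and the sketch assumes that `validate()` returns a list of error messages with an empty list meaning "valid"; the README does not specify the return format, so check `ArraySchema` before relying on this.

```php
<?php
// Local copy of the interface shown above, so the sketch is self-contained.
interface SchemaInterface
{
    public function getDefinition(): array;
    public function validate(array $data): array;
    public function getName(): string;
}

// Hypothetical schema that checks required fields and basic types.
final class RequiredFieldsSchema implements SchemaInterface
{
    public function __construct(private array $definition)
    {
    }

    public function getDefinition(): array
    {
        return $this->definition;
    }

    // Assumption: returns a list of error messages; empty means valid.
    public function validate(array $data): array
    {
        $errors = [];
        foreach ($this->definition as $field => $rules) {
            $value = $data[$field] ?? null;
            if ($value === null || $value === '') {
                if ($rules['required'] ?? false) {
                    $errors[] = "Missing required field: {$field}";
                }
                continue;
            }
            $type = $rules['type'] ?? 'string';
            $ok = match ($type) {
                'integer' => is_int($value),
                'string'  => is_string($value),
                default   => true, // unknown types are not checked here
            };
            if (!$ok) {
                $errors[] = "Field {$field} must be of type {$type}";
            }
        }
        return $errors;
    }

    public function getName(): string
    {
        return 'required-fields';
    }
}

$schema = new RequiredFieldsSchema([
    'name'             => ['type' => 'string', 'required' => true],
    'experience_years' => ['type' => 'integer', 'required' => false],
]);

// 'name' is missing and 'experience_years' is a string, so two errors result.
$errors = $schema->validate(['experience_years' => 'five']);
print_r($errors);
```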

📋 Processor Options
-------------------

```
$processor->withOptions([
    'skip_invalid' => true,    // Skip documents that fail validation
    'preserve_empty' => false, // Preserve empty fields in result
]);
```

✅ Implemented Features (Blocks 1-6)
-----------------------------------

### Block 1: Core ✅

- Framework-agnostic design with clean interfaces
- Extractor/Structurer pattern
- JSON schema validation
- Batch processing

### Block 2: PDF Support ✅

- PdfTextExtractor with smalot/pdfparser
- Batch processing with multiple PDFs
- Robust error handling

### Block 3: Semantic Structuring ✅

- SchemaAwareStructurer for intelligent extraction
- Field aliases for semantic guidance
- Text normalization and segmentation
- Advanced warning system
- Type conversion and validation

### Block 4: Final Result API ✅

- Unified FinalResult object
- Error and warning normalization
- Summary with statistics
- JSON export and serialization

### Block 5: Security & Hardening ✅

- File size limits (10 MB default)
- Batch document limits (50 documents default)
- Path traversal protection
- Configurable security validation
- Production-ready defaults
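
The kind of check that path traversal protection and file size limits involve can be sketched as follows. This is not the library's implementation: `assertSafeDocument()` is a hypothetical function, and only the 10 MB figure is taken from the defaults listed above.

```php
<?php
// Hypothetical guard: rejects files that resolve outside $baseDir
// or exceed $maxBytes. Returns the canonical path when accepted.
function assertSafeDocument(string $baseDir, string $path, int $maxBytes = 10 * 1024 * 1024): string
{
    $base = realpath($baseDir);
    $real = realpath($path); // resolves "..", "." and symlinks
    if ($base === false || $real === false) {
        throw new InvalidArgumentException('Path does not exist.');
    }
    // Compare against the base plus a trailing separator so that
    // "/docs-evil" is not treated as being inside "/docs".
    $prefix = rtrim($base, DIRECTORY_SEPARATOR) . DIRECTORY_SEPARATOR;
    if (!str_starts_with($real, $prefix)) {
        throw new InvalidArgumentException('Path escapes the base directory.');
    }
    if (!is_file($real) || filesize($real) > $maxBytes) {
        throw new InvalidArgumentException('Not a regular file, or file too large.');
    }
    return $real;
}

// Usage with a temporary directory.
$base = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'cp_demo_' . uniqid();
mkdir($base);
file_put_contents($base . '/cv.txt', 'Name: Jane Doe');

$safe = assertSafeDocument($base, $base . '/cv.txt'); // accepted

try {
    assertSafeDocument($base, $base . '/..'); // resolves to the parent directory
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}

unlink($base . '/cv.txt');
rmdir($base);
```

Resolving both paths with `realpath()` before comparing them is what defeats `../` sequences and symlinks that point outside the base directory.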

### Block 6: OCR Support (v1.5.0+) 🚀

- PdfOcrExtractor for scanned PDFs using Tesseract
- Automatic fallback when digital extraction fails
- Transparent OCR processing without code changes
- Preserves semantic structuring pipeline

🔍 OCR Support (Optional)
------------------------

This library supports OCR for scanned PDFs using **Tesseract OCR**.

### Requirements

- Tesseract OCR installed on the system
- Language data files (e.g., `eng` for English)
- Installation is handled by the operating system, not Composer

### Automatic Fallback

OCR is automatically used when:

- Digital text extraction returns insufficient text
- Extracted text is empty or below threshold (default: 50 characters)
- Extracted text contains no alphabetic characters
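
The conditions above can be expressed as a small predicate. The sketch below is illustrative only: `needsOcrFallback()` is a hypothetical name, the library's internal check may differ, and length is measured in bytes via `strlen()` to avoid depending on the mbstring extension.

```php
<?php
// Hypothetical predicate mirroring the fallback conditions listed above.
function needsOcrFallback(string $text, int $minChars = 50): bool
{
    $trimmed = trim($text);

    // Empty or shorter than the threshold: treat extraction as insufficient.
    if (strlen($trimmed) < $minChars) {
        return true;
    }

    // \p{L} matches any Unicode letter; a result other than 1 means no letter
    // was found (or the text was not valid UTF-8), so fall back to OCR.
    return preg_match('/\p{L}/u', $trimmed) !== 1;
}

$empty   = needsOcrFallback('');                                   // true
$digits  = needsOcrFallback(str_repeat('1234567890 ', 10));        // true: no letters
$digital = needsOcrFallback(str_repeat('Curriculum vitae ', 5));   // false: enough text
```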

### Example with OCR

```
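use ContentProcessor\Core\ContentProcessor;
use ContentProcessor\Extractors\CompositePdfExtractor;
use ContentProcessor\Structurers\SchemaAwareStructurer;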

// Automatically tries digital extraction first, then OCR if needed
$result = ContentProcessor::make()
    ->withSchema($schema)
    ->withExtractor(new CompositePdfExtractor())  // Tries PDF text first, then OCR
    ->withStructurer(new SchemaAwareStructurer())
    ->fromDirectory('/documents')
    ->processFinal();
```

### Important Notes

- OCR is **optional** - the library works fine with digital PDFs
- OCR is **NOT** installed by Composer
- OCR support does **not** change schema behavior
- Aliases are still defined by your application
- If Tesseract is not available, clear error messages are provided

📚 Documentation
---------------

- [ARCHITECTURE.md](ARCHITECTURE.md) - Complete architectural design
- [SECURITY.md](SECURITY.md) - Security policy and configurable limits
- [SEMANTIC\_STRUCTURING\_GUIDE.md](SEMANTIC_STRUCTURING_GUIDE.md) - Schema aliases and matching
- [QUICK\_START\_V1.4.0.md](QUICK_START_V1.4.0.md) - Quick reference for v1.4.0+

🔌 API Reference
---------------

### FinalResult

```
$result = ContentProcessor::make()->...->processFinal();

// Access data
$result->data();           // Array of successful documents
$result->errors();         // Array of normalized errors
$result->warnings();       // Array of semantic warnings
$result->summary();        // Summary with statistics

// Status checks
$result->isSuccessful();   // bool - At least 1 successful?
$result->isPerfect();      // bool - No errors or warnings?
$result->hasErrors();      // bool
$result->hasWarnings();    // bool

// Filtering
$result->errorsByType('validation');
$result->warningsByField('email');
$result->warningsByCategory('missing_value');

// Serialization
$result->toArray();        // array
$result->toJSON();         // string (compact)
$result->toJSONPretty();   // string (formatted)
$result->fullResults();    // array (complete audit trail)
```

🚀 Production Ready
------------------

The library is tested and ready for production deployment. See [SECURITY.md](SECURITY.md) for deployment recommendations.

📋 Requirements
--------------

- PHP >= 8.1
- Composer
- (Optional) Tesseract OCR for scanned PDF support

📄 License
---------

MIT

