jcfrane/pdf-text-extractor
==========================

A Laravel PDF text extraction package with multiple strategies (PdfParser, XObject, AWS Textract, Tesseract OCR). Handles Canva-generated PDFs, scanned documents, and other edge cases with automatic fallback.

v0.0.3 · MIT · PHP ^8.1

[Source](https://github.com/jcfrane/pdf-text-extractor) · [Packagist](https://packagist.org/packages/jcfrane/pdf-text-extractor)

PDF Text Extractor (Laravel)
============================

Laravel-first PDF text extraction with fallback strategies for:

- standard PDFs
- Canva/XObject-based PDFs
- scanned PDFs (via OCR)

Installation
------------

```
composer require jcfrane/pdf-text-extractor
```

Optional OCR dependencies:

```
# AWS Textract support
composer require aws/aws-sdk-php

# Tesseract support (system packages)
# Ubuntu/Debian:
apt-get install tesseract-ocr ghostscript
# macOS:
brew install tesseract ghostscript
```

Laravel Setup
-------------

The package uses Laravel auto-discovery, so no manual provider registration is needed.
To customize settings, publish the config file:

```
php artisan vendor:publish --tag=pdf-text-extractor-config
```

This creates:

- `config/pdf-text-extractor.php`

Quick Start (Laravel)
---------------------

### Dependency Injection

```
use JCFrane\PdfTextExtractor\PdfTextExtractor;

class ParseResumeAction
{
    public function __invoke(PdfTextExtractor $extractor, string $path): string
    {
        $result = $extractor->extract($path);

        if (! $result->isSuccessful()) {
            return '';
        }

        return $result->getText();
    }
}
```

### Facade

A facade is already included and auto-aliased as `PdfTextExtractor`.

```
use JCFrane\PdfTextExtractor\Facades\PdfTextExtractor;

$result = PdfTextExtractor::extract(storage_path('app/resumes/candidate.pdf'));

if ($result->isSuccessful()) {
    $text = $result->getText();
    $strategyUsed = $result->getStrategy(); // pdf_parser, xobject, textract, tesseract
}
```

Configuration
-------------

Publish the config file:

```
php artisan vendor:publish --tag=pdf-text-extractor-config
```

This creates `config/pdf-text-extractor.php` with the following options:

### Minimum Text Length

```
'min_text_length' => env('PDF_EXTRACTOR_MIN_TEXT_LENGTH', 20),
```

The minimum number of characters an extraction must produce to be considered successful. If a strategy returns fewer characters than this threshold, the next strategy in the list will be tried. Increase this if short garbage output is being accepted; decrease it if your PDFs legitimately contain very little text.

### Strategies

```
'strategies' => [
    JCFrane\PdfTextExtractor\Strategies\PdfParserStrategy::class,
    JCFrane\PdfTextExtractor\Strategies\XObjectStrategy::class,
    // JCFrane\PdfTextExtractor\Strategies\TextractStrategy::class,
    // JCFrane\PdfTextExtractor\Strategies\TesseractStrategy::class,
],
```

An ordered list of extraction strategies. Each strategy is attempted in sequence until one produces text meeting the `min_text_length` threshold. You can reorder, add, or remove strategies to suit your needs.

| Strategy | Best for | Requirements |
| --- | --- | --- |
| `PdfParserStrategy` | Standard text-based PDFs | None (included) |
| `XObjectStrategy` | Canva / XObject-based PDFs | None (included) |
| `TextractStrategy` | Scanned PDFs (cloud OCR) | `aws/aws-sdk-php`, AWS credentials |
| `TesseractStrategy` | Scanned PDFs (local OCR) | `tesseract-ocr`, `ghostscript` binaries |

**Example: enable all strategies**

```
'strategies' => [
    JCFrane\PdfTextExtractor\Strategies\PdfParserStrategy::class,
    JCFrane\PdfTextExtractor\Strategies\XObjectStrategy::class,
    JCFrane\PdfTextExtractor\Strategies\TextractStrategy::class,
    JCFrane\PdfTextExtractor\Strategies\TesseractStrategy::class,
],
```
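
The fallback behaviour described in this section can be illustrated with a small, self-contained sketch. It is not the package's implementation: `extractWithFallback()` is a hypothetical helper, the strategies are plain callables standing in for the strategy classes, and the assumptions that a throwing strategy is skipped and that length is measured with `strlen()` on trimmed text are made only for illustration.

```php
<?php

/**
 * Try each strategy in order and return the first result whose text
 * meets the minimum length. Strategies are plain callables here
 * (hypothetical stand-ins for the package's strategy classes).
 *
 * @param array<string, callable(string): string> $strategies
 * @return array{strategy: ?string, text: string, successful: bool}
 */
function extractWithFallback(array $strategies, string $path, int $minTextLength = 20): array
{
    foreach ($strategies as $name => $strategy) {
        try {
            $text = trim($strategy($path));
        } catch (Throwable $e) {
            // Assumption: a failing strategy is skipped rather than fatal.
            continue;
        }

        // Assumption: byte length; the package may count characters differently.
        if (strlen($text) >= $minTextLength) {
            return ['strategy' => $name, 'text' => $text, 'successful' => true];
        }
    }

    return ['strategy' => null, 'text' => '', 'successful' => false];
}

// Usage with two fake strategies: the first returns too little text,
// so the second one is tried and accepted.
$result = extractWithFallback(
    [
        'pdf_parser' => fn (string $path): string => 'n/a',
        'xobject'    => fn (string $path): string => 'Jane Doe, Senior PHP Developer, Manila',
    ],
    'resume.pdf'
);

echo $result['strategy'], PHP_EOL;
```

Ordering matters in a chain like this: cheap local strategies placed first keep slower or billable OCR strategies as a last resort.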

### AWS Textract

Only required if `TextractStrategy` is in your strategies list. Install the SDK first with `composer require aws/aws-sdk-php`.

```
'textract' => [
    'region'  => env('PDF_EXTRACTOR_AWS_REGION', 'us-east-1'),
    'key'     => env('PDF_EXTRACTOR_AWS_KEY'),
    'secret'  => env('PDF_EXTRACTOR_AWS_SECRET'),
    'version' => env('PDF_EXTRACTOR_AWS_VERSION', 'latest'),

    // Required for multi-page PDFs (async API uploads the PDF to S3)
    's3_bucket' => env('PDF_EXTRACTOR_AWS_S3_BUCKET'),
    's3_prefix' => env('PDF_EXTRACTOR_AWS_S3_PREFIX', 'pdf-text-extractor'),

    // Async job polling
    'async_poll_interval_ms'  => (int) env('PDF_EXTRACTOR_AWS_ASYNC_POLL_INTERVAL_MS', 1000),
    'async_max_attempts'      => (int) env('PDF_EXTRACTOR_AWS_ASYNC_MAX_ATTEMPTS', 20),
    'async_delete_uploaded'   => (bool) env('PDF_EXTRACTOR_AWS_ASYNC_DELETE_UPLOADED', true),
],
```

| Key | Env Variable | Default | Description |
| --- | --- | --- | --- |
| `region` | `PDF_EXTRACTOR_AWS_REGION` | `us-east-1` | AWS region for Textract and S3 |
| `key` | `PDF_EXTRACTOR_AWS_KEY` | (none) | AWS access key ID |
| `secret` | `PDF_EXTRACTOR_AWS_SECRET` | (none) | AWS secret access key |
| `version` | `PDF_EXTRACTOR_AWS_VERSION` | `latest` | AWS SDK version |
| `s3_bucket` | `PDF_EXTRACTOR_AWS_S3_BUCKET` | (none) | S3 bucket for multi-page PDF processing |
| `s3_prefix` | `PDF_EXTRACTOR_AWS_S3_PREFIX` | `pdf-text-extractor` | Key prefix for uploaded PDFs in S3 |
| `async_poll_interval_ms` | `PDF_EXTRACTOR_AWS_ASYNC_POLL_INTERVAL_MS` | `1000` | Milliseconds between polling attempts for async jobs |
| `async_max_attempts` | `PDF_EXTRACTOR_AWS_ASYNC_MAX_ATTEMPTS` | `20` | Maximum number of polling attempts before giving up |
| `async_delete_uploaded` | `PDF_EXTRACTOR_AWS_ASYNC_DELETE_UPLOADED` | `true` | Delete the uploaded PDF from S3 after processing |

**How Textract works:**

- **Single-page PDFs** use the synchronous `DetectDocumentText` API — no S3 required.
- **Multi-page PDFs** use the async flow: the PDF is uploaded to S3, `StartDocumentTextDetection` is called, and the result is polled via `GetDocumentTextDetection`.
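
To make the two polling settings concrete, here is a minimal sketch of a bounded polling loop. It does not call AWS and is not taken from the package: `pollUntilDone()` is a hypothetical helper, and the status callback stands in for a `GetDocumentTextDetection` call. With the defaults (`1000` ms, `20` attempts) a loop like this gives up after roughly 20 seconds of waiting, not counting request time.

```php
<?php

/**
 * Call $checkStatus until it returns a non-null value or attempts run out.
 * Returns that value, or null when the job did not finish in time.
 */
function pollUntilDone(callable $checkStatus, int $intervalMs = 1000, int $maxAttempts = 20): mixed
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $outcome = $checkStatus($attempt);

        if ($outcome !== null) {
            return $outcome;
        }

        // Sleep between attempts, but not after the final one.
        if ($attempt < $maxAttempts) {
            usleep($intervalMs * 1000);
        }
    }

    return null;
}

// Fake job that finishes on the third check.
$text = pollUntilDone(
    fn (int $attempt): ?string => $attempt >= 3 ? 'extracted text' : null,
    intervalMs: 10,
    maxAttempts: 20
);

echo $text, PHP_EOL;
```

For long documents, raising `async_max_attempts` or `async_poll_interval_ms` widens this window.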

Add these env values to your `.env`:

```
PDF_EXTRACTOR_AWS_REGION=eu-west-2
PDF_EXTRACTOR_AWS_KEY=your_key
PDF_EXTRACTOR_AWS_SECRET=your_secret

# Required for multi-page PDFs
PDF_EXTRACTOR_AWS_S3_BUCKET=your_bucket
```

**Required IAM permissions:**

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "TextractApis",
      "Effect": "Allow",
      "Action": [
        "textract:DetectDocumentText",
        "textract:StartDocumentTextDetection",
        "textract:GetDocumentTextDetection"
      ],
      "Resource": "*"
    },
    {
      "Sid": "TextractStagingObjectAccess",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME/pdf-text-extractor/*"
    },
    {
      "Sid": "TextractStagingBucketList",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": "arn:aws:s3:::YOUR_BUCKET_NAME"
    }
  ]
}
```

### Tesseract OCR

Only required if `TesseractStrategy` is in your strategies list. Requires `tesseract` and `ghostscript` installed on the system.

```
'tesseract' => [
    'binary'              => env('PDF_EXTRACTOR_TESSERACT_BINARY', 'tesseract'),
    'ghostscript_binary'  => env('PDF_EXTRACTOR_GHOSTSCRIPT_BINARY', 'gs'),
    'language'            => env('PDF_EXTRACTOR_TESSERACT_LANGUAGE', 'eng'),
    'dpi'                 => (int) env('PDF_EXTRACTOR_TESSERACT_DPI', 300),
],
```
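
As a rough illustration of what these four settings control, the sketch below builds command lines for a rasterize-then-OCR pipeline. This is an assumption about the general approach, not the package's actual code: `buildOcrCommands()` and the `page-%03d.png` naming are hypothetical, and the flags are standard Ghostscript and Tesseract CLI options. The commands are only built, not executed.

```php
<?php

/**
 * Build shell commands for converting a PDF to PNG pages (Ghostscript)
 * and running OCR on the first page (Tesseract).
 *
 * @param array{binary: string, ghostscript_binary: string, language: string, dpi: int} $config
 * @return array{rasterize: string, ocr: string}
 */
function buildOcrCommands(array $config, string $pdfPath, string $workDir): array
{
    // Ghostscript expands %03d to the page number (page-001.png, ...).
    $pagePattern = $workDir . '/page-%03d.png';

    $rasterize = sprintf(
        '%s -dNOPAUSE -dBATCH -dQUIET -sDEVICE=png16m -r%d -sOutputFile=%s %s',
        escapeshellarg($config['ghostscript_binary']),
        $config['dpi'],
        escapeshellarg($pagePattern),
        escapeshellarg($pdfPath)
    );

    // "stdout" as the output base makes Tesseract print the recognized text.
    $ocr = sprintf(
        '%s %s stdout -l %s --dpi %d',
        escapeshellarg($config['binary']),
        escapeshellarg($workDir . '/page-001.png'),
        escapeshellarg($config['language']),
        $config['dpi']
    );

    return ['rasterize' => $rasterize, 'ocr' => $ocr];
}

$commands = buildOcrCommands(
    ['binary' => 'tesseract', 'ghostscript_binary' => 'gs', 'language' => 'eng', 'dpi' => 300],
    '/tmp/scan.pdf',
    '/tmp/ocr'
);

echo $commands['rasterize'], PHP_EOL;
echo $commands['ocr'], PHP_EOL;
```

A higher `dpi` generally improves recognition of small print at the cost of larger intermediate images and slower OCR.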

| Key | Env Variable | Default | Description |
| --- | --- | --- | --- |
| `binary` | `PDF_EXTRACTOR_TESSERACT_BINARY` | `tesseract` | Path to the Tesseract binary |
| `ghostscript_binary` | `PDF_EXTRACTOR_GHOSTSCRIPT_BINARY` | `gs` | Path to the Ghostscript binary |
| `language` | `PDF_EXTRACTOR_TESSERACT_LANGUAGE` | `eng` | Tesseract language code (e.g. `eng`, `fra`, `deu`) |
| `dpi` | `PDF_EXTRACTOR_TESSERACT_DPI` | `300` | DPI used when converting PDF pages to images |

### Environment Variables Reference

All env variables at a glance:

```
# General
PDF_EXTRACTOR_MIN_TEXT_LENGTH=20

# AWS Textract
PDF_EXTRACTOR_AWS_REGION=us-east-1
PDF_EXTRACTOR_AWS_KEY=
PDF_EXTRACTOR_AWS_SECRET=
PDF_EXTRACTOR_AWS_VERSION=latest
PDF_EXTRACTOR_AWS_S3_BUCKET=
PDF_EXTRACTOR_AWS_S3_PREFIX=pdf-text-extractor
PDF_EXTRACTOR_AWS_ASYNC_POLL_INTERVAL_MS=1000
PDF_EXTRACTOR_AWS_ASYNC_MAX_ATTEMPTS=20
PDF_EXTRACTOR_AWS_ASYNC_DELETE_UPLOADED=true

# Tesseract
PDF_EXTRACTOR_TESSERACT_BINARY=tesseract
PDF_EXTRACTOR_GHOSTSCRIPT_BINARY=gs
PDF_EXTRACTOR_TESSERACT_LANGUAGE=eng
PDF_EXTRACTOR_TESSERACT_DPI=300
```

Result Object
-------------

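`extract()` and `extractFromString()` return an `ExtractionResult` with these methods:

- `getText()`: the extracted text
- `isSuccessful()`: whether extraction succeeded
- `getStrategy()`: the strategy that produced the text (`pdf_parser`, `xobject`, `textract`, or `tesseract`)
- `getTextLength()`: the length of the extracted text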
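
One way to consume the four methods listed above is to reduce a result to a small array for logging. The sketch below is illustrative only: `summarizeExtraction()` is a hypothetical helper, and the anonymous class is a stand-in so the example runs without the package installed. In an application, the object would be the `ExtractionResult` returned by `extract()`.

```php
<?php

/**
 * Turn an extraction result into a small array suitable for logging.
 * $result is duck-typed: any object exposing the four result methods.
 */
function summarizeExtraction(object $result, int $previewLength = 40): array
{
    if (! $result->isSuccessful()) {
        // The other getters are not called on a failed result, since
        // their behaviour in that case is not documented here.
        return ['ok' => false, 'strategy' => null, 'length' => 0, 'preview' => ''];
    }

    return [
        'ok'       => true,
        'strategy' => $result->getStrategy(),
        'length'   => $result->getTextLength(),
        'preview'  => substr($result->getText(), 0, $previewLength),
    ];
}

// Stand-in object for demonstration purposes only.
$fake = new class {
    public function isSuccessful(): bool { return true; }
    public function getStrategy(): string { return 'pdf_parser'; }
    public function getText(): string { return 'Invoice #1001 for ACME Corp'; }
    public function getTextLength(): int { return strlen($this->getText()); }
};

$summary = summarizeExtraction($fake);
print_r($summary);
```

Logging the strategy name alongside the text length makes it easier to spot documents that regularly fall through to OCR.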

License
-------

MIT

