ecourty/text-chunker
====================

A framework-agnostic PHP library for chunking text and files using pluggable strategies and post-processors.

Version 1.1.0 · MIT license · PHP >= 8.3 · [Source](https://github.com/EdouardCourty/PHPTextChunker) · [Packagist](https://packagist.org/packages/ecourty/text-chunker)

php-text-chunker
================


[![PHP CI](https://github.com/EdouardCourty/PHPTextChunker/actions/workflows/ci.yml/badge.svg)](https://github.com/EdouardCourty/PHPTextChunker/actions/workflows/ci.yml)

A framework-agnostic PHP library for splitting text and files into meaningful chunks, using pluggable strategies and a composable post-processing pipeline.

Table of Contents
-----------------


- [Installation](#installation)
- [Core Features](#core-features)
- [Quick Start](#quick-start)
- [Chunking Strategies](#chunking-strategies)
- [Post-Processors](#post-processors)
- [Configuration Reference](#configuration-reference)
- [Custom Readers](#custom-readers)
- [Performance](#performance)
- [Datasets](#datasets)
- [Development](#development)

---

Installation
------------


```
composer require ecourty/text-chunker
```

**Requirements**: PHP >= 8.3

---

Core Features
-------------


- **9 built-in strategies**: paragraph, sentence, fixed-size, dialogue, markdown, word count, regex, line, recursive
- **8 built-in post-processors**: overlap, token limit, metadata enrichment, filtering, chunk merger, text normalization, deduplication, regex replace
- **Streaming architecture**: processes large files in 8 KB buffers, keeping memory usage minimal
- **Works with files and strings**: `setFile()` or `setText()`
- **Fully extensible**: implement your own strategies and post-processors
- **Zero framework dependencies**
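
To make the streaming idea concrete, here is a minimal sketch of buffered reading with carry-over, assuming a plain PHP stream and paragraph boundaries marked by a blank line. The function name `streamParagraphs()` is hypothetical and this is not the library's implementation; it only illustrates why memory stays bounded by the buffer size plus the current unfinished paragraph.

```php
<?php

declare(strict_types=1);

/**
 * Reads a stream in fixed-size buffers and yields complete paragraphs.
 * Illustrative only: a hypothetical helper, not part of the library.
 *
 * @param resource $stream
 * @return \Generator<int, string>
 */
function streamParagraphs($stream, int $bufferSize = 8192): \Generator
{
    $carry = ''; // unfinished paragraph left over from earlier buffers

    while (!feof($stream)) {
        $data = fread($stream, $bufferSize);
        if ($data === false) {
            break;
        }
        $carry .= $data;

        // Emit every paragraph that is already terminated by a blank line.
        while (($pos = strpos($carry, "\n\n")) !== false) {
            $paragraph = trim(substr($carry, 0, $pos));
            $carry = substr($carry, $pos + 2);
            if ($paragraph !== '') {
                yield $paragraph;
            }
        }
    }

    // Whatever remains after the last separator is the final paragraph.
    $last = trim($carry);
    if ($last !== '') {
        yield $last;
    }
}

// Usage: an in-memory stream stands in for a large file.
$stream = fopen('php://memory', 'r+');
fwrite($stream, "First paragraph.\n\nSecond paragraph,\nstill second.\n\nThird.");
rewind($stream);

// A tiny 4-byte buffer forces paragraphs to span several reads.
$paragraphs = iterator_to_array(streamParagraphs($stream, 4), false);
fclose($stream);

print_r($paragraphs);
```

Because the separator search runs on the accumulated `$carry`, a blank line that straddles two buffers is still detected.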

---

Quick Start
-----------


```
use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;

$chunker = new TextChunker();

foreach ($chunker->setFile('document.txt')->chunk(new ParagraphChunkingStrategy()) as $chunk) {
    echo $chunk->getText();       // chunk content
    echo $chunk->getPosition();   // index in the sequence
    print_r($chunk->getMetadata()); // strategy, length, etc.
}
```

Chunk from a string:

```
use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\SentenceChunkingStrategy;

$chunker = new TextChunker();

foreach ($chunker->setText($myText)->chunk(new SentenceChunkingStrategy()) as $chunk) {
    // ...
}
```

---

Chunking Strategies
-------------------


| Strategy | Splits on | Key options |
|----------|-----------|-------------|
| `ParagraphChunkingStrategy` | Double newlines (`\n\n`) | - |
| `SentenceChunkingStrategy` | Sentence-ending punctuation (`. ! ?`) | - |
| `FixedSizeChunkingStrategy` | Fixed character count | `chunkSize` (default: 1000) |
| `DialogueChunkingStrategy` | Dialogue lines, context-aware grouping | `targetChunkSize`, `minChunkSize` |
| `MarkdownChunkingStrategy` | Markdown headers (`#` to `######`) | `minHeadingLevel`, `maxHeadingLevel` |
| `WordCountChunkingStrategy` | Fixed word count, respects word boundaries | `wordCount` (default: 200) |
| `RegexChunkingStrategy` | Configurable regex pattern | `pattern`, `delimiterPosition` (`None`, `Prefix`, or `Suffix`) |
| `LineChunkingStrategy` | N consecutive lines per chunk | `linesPerChunk` (default: 10) |
| `RecursiveChunkingStrategy` | Cascade of strategies with a size limit | `strategies[]`, `maxChunkSize` |

`RecursiveChunkingStrategy` applies `strategies[0]` to the stream and immediately re-splits any chunk exceeding `maxChunkSize` using `strategies[1]`, then `strategies[2]`, and so on. It is streaming-safe: it never buffers more than one chunk at a time.
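
The cascade can be pictured with a small standalone sketch. It assumes plain callables in place of strategy objects and an in-memory string instead of a stream, so `recursiveSplit()` and the three splitter closures are hypothetical names for illustration, not the library's API or implementation.

```php
<?php

declare(strict_types=1);

/**
 * Re-splits oversized pieces with the next splitter in the cascade.
 * Each splitter is a callable(string): array<string>; these callables are
 * stand-ins for strategies and do not use the library's interfaces.
 *
 * @param list<callable(string): array<int, string>> $splitters
 * @return \Generator<int, string>
 */
function recursiveSplit(string $text, array $splitters, int $maxChunkSize, int $level = 0): \Generator
{
    if (!isset($splitters[$level])) {
        // No splitter left: emit the piece even if it is still too long.
        yield $text;
        return;
    }

    foreach ($splitters[$level]($text) as $piece) {
        if ($piece === '') {
            continue;
        }
        // strlen() counts bytes; use mb_strlen() for multibyte text.
        if (strlen($piece) <= $maxChunkSize) {
            yield $piece;
        } else {
            // Too long: hand only this piece to the next splitter.
            yield from recursiveSplit($piece, $splitters, $maxChunkSize, $level + 1);
        }
    }
}

// Hypothetical splitters: paragraphs, then sentences, then fixed size.
$byParagraph = static fn (string $t): array => preg_split('/\n{2,}/', $t, -1, PREG_SPLIT_NO_EMPTY) ?: [];
$bySentence  = static fn (string $t): array => preg_split('/(?<=[.!?])\s+/', $t, -1, PREG_SPLIT_NO_EMPTY) ?: [];
$byFixedSize = static fn (string $t): array => str_split($t, 20);

$text = "Short one.\n\nThis paragraph is longer. It has two sentences.\n\nAnextremelylongtokenwithoutanyspaces!";

// Keys repeat across nested generators, so discard them when collecting.
$chunks = iterator_to_array(
    recursiveSplit($text, [$byParagraph, $bySentence, $byFixedSize], 30),
    false
);

print_r($chunks);
```

Only the oversized piece is passed down a level, which is what keeps the approach compatible with processing one chunk at a time.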

---

Post-Processors
---------------


Post-processors are applied in sequence after chunking. Chain them with `withPostProcessor()`.
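
As a rough mental model of "applied in sequence", the sketch below chains three generator functions over plain strings. The function names (`dropShort()`, `deduplicate()`, `addOverlap()`) are hypothetical, and the real post-processors work on `Chunk` objects registered through `withPostProcessor()`; this only illustrates how ordering affects the result.

```php
<?php

declare(strict_types=1);

/** Drops chunks whose trimmed length is below $minLength. */
function dropShort(iterable $chunks, int $minLength): \Generator
{
    foreach ($chunks as $chunk) {
        if (strlen(trim($chunk)) >= $minLength) {
            yield $chunk;
        }
    }
}

/** Keeps only the first occurrence of each chunk, compared by md5 hash. */
function deduplicate(iterable $chunks): \Generator
{
    $seen = [];
    foreach ($chunks as $chunk) {
        $hash = md5($chunk);
        if (!isset($seen[$hash])) {
            $seen[$hash] = true;
            yield $chunk;
        }
    }
}

/** Prepends the last $overlapSize bytes of the previous chunk. */
function addOverlap(iterable $chunks, int $overlapSize): \Generator
{
    $previous = null;
    foreach ($chunks as $chunk) {
        // substr($s, -0) would return the whole string, so guard zero and negatives.
        $tail = ($previous === null || $overlapSize <= 0)
            ? ''
            : substr($previous, -$overlapSize);
        yield $tail . $chunk;
        $previous = $chunk;
    }
}

$input = ['AAAA', 'BBBB', 'AAAA', 'CC', 'DDDD'];

// Filter first, then deduplicate, then add overlap. Running the overlap
// step earlier would change the text and hide the duplicate.
$processed = iterator_to_array(
    addOverlap(deduplicate(dropShort($input, 3)), 2),
    false
);

print_r($processed);
```

Each stage pulls from the previous one lazily, so no stage needs the full list of chunks in memory.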

| Post-processor | Description | Key options |
|----------------|-------------|-------------|
| `OverlappingChunkPostProcessor` | Prepends the tail of the previous chunk for context continuity | `overlapSize` (default: 200) |
| `TokenLimitPostProcessor` | Splits chunks exceeding a token budget | `maxTokens`, `charactersPerToken` |
| `MetadataEnricherPostProcessor` | Adds `chunk_index`, `total_chunks`, `word_count`, `char_count`, `source` | - |
| `ChunkFilterPostProcessor` | Removes empty or too-short chunks | `minLength`, `removeEmpty` |
| `ChunkMergerPostProcessor` | Merges consecutive small chunks until `minChunkSize` is reached | `minChunkSize` (default: 200), `separator` |
| `TextNormalizationPostProcessor` | Collapses whitespace, trims lines, strips control characters | `collapseWhitespace`, `trimLines`, `stripControlChars` |
| `DeduplicationPostProcessor` | Removes duplicate chunks by md5 content hash; adds `content_hash` metadata | - |
| `RegexReplacePostProcessor` | Applies ordered `[pattern => replacement]` substitutions to each chunk's text | `replacements[]` |

---

Configuration Reference
-----------------------


### TextChunker


| Method | Description |
|--------|-------------|
| `setFile(string $path)` | Set source file (streamed) |
| `setText(string $text)` | Set source string |
| `withMetadata(array $meta)` | Attach global metadata to every chunk |
| `withPostProcessor(...)` | Add a post-processor to the pipeline |
| `withPostProcessors(...)` | Add multiple post-processors at once (variadic) |
| `withReader(ReaderInterface)` | Inject a custom reader (see below) |
| `chunk(ChunkingStrategyInterface)` | Returns a `Generator` |

### Chunk


| Method | Returns |
|--------|---------|
| `getText()` | `string`: the chunk content |
| `getPosition()` | `int`: index in the sequence |
| `getMetadata()` | `array`: associated metadata |
| `getLength()` | `int`: character count |
| `withMetadata(array)` | New `Chunk` with merged metadata |

---

Custom Readers
--------------


By default, `setFile()` reads from the local filesystem via `LocalFileReader`. To read from a remote source (S3, Azure Blob, SFTP, etc.), implement `ReaderInterface` and inject it via `withReader()`.

`ReaderInterface` has a single method: `readChunks(string $path, int $bufferSize): \Generator`. Yield string chunks of arbitrary size — the chunking strategies handle the rest. The `$path` passed to `readChunks()` is whatever string you gave to `setFile()`, so it can be an S3 key, a URI, or any identifier your reader understands.
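
A reader does not have to touch a filesystem at all. The sketch below serves content from an in-memory map, which can be handy in unit tests. To stay runnable without the package installed it declares a local stand-in, `DemoReaderInterface`, that mirrors the documented method signature; both that interface and the `ArrayReader` class are hypothetical names. In real code the class would implement `Ecourty\TextChunker\Contract\ReaderInterface` instead.

```php
<?php

declare(strict_types=1);

// Local stand-in mirroring the documented signature (an assumption made so
// the sketch runs on its own; it is not the library's interface).
interface DemoReaderInterface
{
    public function readChunks(string $path, int $bufferSize): \Generator;
}

/** Serves "files" from an in-memory map of identifier => content. */
final class ArrayReader implements DemoReaderInterface
{
    /** @param array<string, string> $files */
    public function __construct(private array $files) {}

    public function readChunks(string $path, int $bufferSize): \Generator
    {
        if (!isset($this->files[$path])) {
            throw new \InvalidArgumentException("Unknown identifier: {$path}");
        }
        if ($bufferSize < 1) {
            throw new \InvalidArgumentException('Buffer size must be at least 1.');
        }

        $content = $this->files[$path];
        $length = strlen($content);

        // Yield consecutive slices of at most $bufferSize bytes.
        for ($offset = 0; $offset < $length; $offset += $bufferSize) {
            yield substr($content, $offset, $bufferSize);
        }
    }
}

// The "path" is just an identifier the reader understands.
$reader = new ArrayReader(['memo://greeting' => 'Hello, chunked world!']);
$pieces = iterator_to_array($reader->readChunks('memo://greeting', 8), false);

print_r($pieces);
```

Because the method body is a generator, the validation runs when iteration starts rather than when `readChunks()` is called.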

**Example with [Flysystem](https://flysystem.thephpleague.com/) (works with S3, Azure, SFTP, GCS, and more):**

```
use League\Flysystem\Filesystem;
use Ecourty\TextChunker\Contract\ReaderInterface;
use Ecourty\TextChunker\TextChunker;
use Ecourty\TextChunker\Strategy\ParagraphChunkingStrategy;

class FlysystemReader implements ReaderInterface
{
    public function __construct(private Filesystem $filesystem) {}

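    public function readChunks(string $path, int $bufferSize): \Generator
    {
        // readStream() returns a PHP stream resource for the given path.
        $stream = $this->filesystem->readStream($path);

        try {
            while (!feof($stream)) {
                $data = fread($stream, $bufferSize);
                if ($data === false) {
                    break; // read error: stop iterating
                }
                if ($data !== '') {
                    yield $data; // skip empty reads at end of stream
                }
            }
        } finally {
            // Runs even if the consumer stops iterating early.
            fclose($stream);
        }
    }
}

// S3 example ($s3Client is an Aws\S3\S3Client instance configured elsewhere)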
$adapter = new \League\Flysystem\AwsS3V3\AwsS3V3Adapter($s3Client, 'my-bucket');
$filesystem = new Filesystem($adapter);

foreach (
    (new TextChunker())
        ->withReader(new FlysystemReader($filesystem))
        ->setFile('documents/report.txt')  // S3 key
        ->chunk(new ParagraphChunkingStrategy())
    as $chunk
) {
    echo $chunk->getText();
}
```

---

Performance
-----------


Benchmarked with PHPBench on real-world datasets (Bible KJV, Les Misérables, Encyclopaedia Britannica 11th Ed.). See [BENCHMARKS.md](BENCHMARKS.md) for the full results.

**Strategy throughput** (Bible KJV, 4.26 MB):

| Strategy | Time | Throughput |
|----------|------|------------|
| `SentenceChunkingStrategy` | 43 ms | ~98 MB/s |
| `FixedSizeChunkingStrategy` | 44 ms | ~98 MB/s |
| `LineChunkingStrategy` | 46 ms | ~93 MB/s |
| `ParagraphChunkingStrategy` | 293 ms | ~15 MB/s |
| `WordCountChunkingStrategy` | 377 ms | ~11 MB/s |

**Post-processor overhead** (50 KB excerpt): all 8 processors run in **< 3 ms**. Chain freely.

The library is **streaming-first** — most strategies hold only ~2 MB in memory regardless of input file size.

---

Datasets
--------


The `datasets/` directory contains large text corpora used for benchmarking chunking strategies. All texts are in the public domain and were sourced from [Project Gutenberg](https://www.gutenberg.org/).

| File | Source | Size | Notes |
|------|--------|------|-------|
| `bible_kjv.txt` | [King James Bible (PG #10)](https://www.gutenberg.org/ebooks/10) | ~4.5 MB | Great for sentence and paragraph benchmarks |
| `les_miserables.txt` | [Les Misérables by Victor Hugo (PG #17489–17496)](https://www.gutenberg.org/ebooks/17489) | ~2.6 MB | All 5 tomes in French, ideal for paragraph chunking |
| `britannica/` | [Encyclopaedia Britannica, 11th Edition](https://www.gutenberg.org/ebooks/search/?query=encyclopaedia+britannica+11th) | ~118 MB | 92 volumes of dense encyclopaedic text |

> Headers and Project Gutenberg license preambles can be stripped before benchmarking to work with clean content only.

---

Development
-----------


```
# Install dependencies
composer install

# Run tests
composer test

# Run PHPStan (level max)
composer phpstan

# Run CS fixer
composer cs-fix

# Run all checks
composer qa
```

### Extending the library


Implement `ChunkingStrategyInterface` to create a custom strategy, or `ChunkPostProcessorInterface` for a custom post-processor. See `AGENTS.md` for detailed guidelines.

