labrodev/document-sampler
=========================

Extracts a structured representative sample from long documents for downstream AI processing.

Version 1.0.0 · MIT · PHP ^8.5

[Source](https://github.com/labrodev/document-sampler) · [Packagist](https://packagist.org/packages/labrodev/document-sampler)


DocumentSampler
===============

Pure PHP library that extracts a structured, representative sample from a document of any length. No framework dependency, no HTTP calls, no AI — just text processing.

Designed as the input layer for downstream AI-powered packages such as relevance checkers, prompt injection detectors, and depersonalisation services.

---

Requirements
------------

- PHP `^8.5`

---

Installation
------------

```
composer require labrodev/document-sampler
```

---

Basic usage
-----------

```
use Labrodev\DocumentSampler\DocumentSampler;

$result = (new DocumentSampler())->sample($rawText);

$result->intro;             // opening chars: title and introduction
$result->outline;           // section headings extracted from anywhere in the document
$result->middle;            // fixed window centred on the document midpoint
$result->tail;              // closing chars: conclusion and sign-off
$result->text;              // all samples joined with separators
$result->charCount;         // character count of the combined sample
$result->originalCharCount; // character count of the original document
```

---

Custom window sizes
-------------------

By default each zone uses the window defined on the `DocumentPart` enum. Pass any subset to the constructor to override:

```
// Override specific zones — unset zones use the enum defaults
$sampler = new DocumentSampler(
    intro:   2000,
    middle:  300,
);

$result = $sampler->sample($rawText);
```
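
The fallback described above (unset zones use the enum defaults) can be pictured with nullable named arguments. This is a minimal sketch, not the package's source: `SketchPart` and `SketchWindows` are hypothetical stand-ins, and only the zone names and default sizes come from this README.

```php
<?php

// Hypothetical stand-in for the package's DocumentPart enum.
// Case names and default sizes mirror the README; the rest is assumed.
enum SketchPart: string
{
    case Intro = 'intro';
    case Outline = 'outline';
    case Middle = 'middle';
    case Tail = 'tail';

    // Default window size in characters for this zone.
    public function chars(): int
    {
        return match ($this) {
            self::Intro => 1000,
            self::Outline, self::Middle, self::Tail => 500,
        };
    }
}

// Hypothetical configuration holder: null means "use the enum default".
final class SketchWindows
{
    /** @var array<string, int> zone name => window size in characters */
    public readonly array $sizes;

    public function __construct(
        ?int $intro = null,
        ?int $outline = null,
        ?int $middle = null,
        ?int $tail = null,
    ) {
        $this->sizes = [
            'intro'   => $intro   ?? SketchPart::Intro->chars(),
            'outline' => $outline ?? SketchPart::Outline->chars(),
            'middle'  => $middle  ?? SketchPart::Middle->chars(),
            'tail'    => $tail    ?? SketchPart::Tail->chars(),
        ];
    }
}

// Same call shape as the README example: only intro and middle are overridden.
$windows = new SketchWindows(intro: 2000, middle: 300);
```

Here `outline` and `tail` keep their 500-character defaults because they were not passed.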

---

How it works
------------

The sampler extracts four fixed-size windows from every document, regardless of document length:

| Zone | Default window | What it captures |
|---|---|---|
| `intro` | 1000 chars | Title, abstract, opening paragraphs |
| `outline` | 500 chars | Section headings (`# Markdown`, `1.1 Numbered`, `ALL-CAPS` lines) |
| `middle` | 500 chars | Window centred on the document midpoint |
| `tail` | 500 chars | Closing paragraphs, conclusion, signature |

Windows are fixed: a 400-page PDF gets the same sized sample as a one-page memo. The goal is a compact, representative fingerprint of the document, not a summary.
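
To make the zones concrete, here is a rough sketch of how such windows could be cut with `mb_substr` and a few regular expressions. It is an illustration under assumptions, not the package's implementation: `sketchWindows` and `sketchOutline` are hypothetical names, and the heading heuristics are guesses based only on the three heading styles listed above.

```php
<?php

/**
 * Sketch: take fixed-size intro, middle and tail windows from a string.
 *
 * @return array{intro: string, middle: string, tail: string}
 */
function sketchWindows(string $text, int $intro = 1000, int $middle = 500, int $tail = 500): array
{
    $length = mb_strlen($text);

    // Start half a window before the midpoint, clamped to the document start.
    $middleStart = max(0, intdiv($length, 2) - intdiv($middle, 2));

    return [
        'intro'  => mb_substr($text, 0, $intro),
        'middle' => mb_substr($text, $middleStart, $middle),
        // A negative offset counts from the end; clamp it for short documents.
        'tail'   => mb_substr($text, -min($tail, $length)),
    ];
}

/**
 * Sketch: collect heading-like lines and cap the result at $limit characters.
 * The three checks below are assumed heuristics, not the package's rules.
 */
function sketchOutline(string $text, int $limit = 500): string
{
    $headings = [];

    foreach (preg_split('/\R/u', $text) ?: [] as $line) {
        $line = trim($line);

        // "# Markdown" style: one to six hashes followed by text.
        $isMarkdown = preg_match('/^#{1,6}\s+\S/', $line) === 1;
        // "1.1 Numbered" style: dotted section number followed by text.
        $isNumbered = preg_match('/^\d+(?:\.\d+)+\s+\S/', $line) === 1;
        // "ALL-CAPS" style: contains a letter and has no lowercase letters.
        $isAllCaps = preg_match('/\p{L}/u', $line) === 1
            && mb_strtoupper($line) === $line;

        if ($isMarkdown || $isNumbered || $isAllCaps) {
            $headings[] = $line;
        }
    }

    return mb_substr(implode("\n", $headings), 0, $limit);
}
```

Because every window has a fixed size, the combined sample has an upper bound that does not depend on the length of the input, which is what makes it predictable to fit into a context window.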

---

Exporting results
-----------------

### JSON

```
$result->toJson();
```

```
{
    "meta": {
        "originalCharCount": 50000,
        "sampledCharCount": 2300
    },
    "samples": {
        "intro": "...",
        "outline": "...",
        "middle": "...",
        "tail": "..."
    }
}
```

### Markdown

```
$result->toMd();
```

```
## Document Sample

**Original size:** 50,000 chars
**Sampled size:** 2,300 chars

### Intro
...

### Outline
...

### Middle
...

### Tail
...
```

Empty zones are omitted from both outputs.
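
As an illustration of the "empty zones are omitted" behaviour, the JSON shape shown above could be assembled roughly like this. The function `sketchToJson` and its parameters are hypothetical; the package's own `toJson()` may be built differently.

```php
<?php

/**
 * Sketch: build the JSON export shape from the README, dropping empty zones.
 *
 * @param array<string, string> $samples zone name => sampled text
 */
function sketchToJson(array $samples, int $originalCharCount, int $sampledCharCount): string
{
    // Keep only zones whose sample is a non-empty string.
    $nonEmpty = array_filter($samples, static fn (string $sample): bool => $sample !== '');

    return json_encode(
        [
            'meta' => [
                'originalCharCount' => $originalCharCount,
                'sampledCharCount'  => $sampledCharCount,
            ],
            // Cast to object so an all-empty result still encodes as {} and not [].
            'samples' => (object) $nonEmpty,
        ],
        JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR
    );
}

// The outline zone is empty here, so it does not appear in the output.
$json = sketchToJson(
    ['intro' => 'Hello', 'outline' => '', 'middle' => 'mid', 'tail' => 'bye'],
    50000,
    11
);
```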

---

Default window sizes
--------------------

Window sizes are defined on the `DocumentPart` enum and can be read at runtime:

```
use Labrodev\DocumentSampler\Enums\DocumentPart;

DocumentPart::Intro->chars();   // 1000
DocumentPart::Outline->chars(); // 500
DocumentPart::Middle->chars();  // 500
DocumentPart::Tail->chars();    // 500
```

---

When to use this
----------------

- **Before calling an AI API** — reduce a large document to a structured excerpt that fits in a context window without losing structural information.
- **Relevance checking** — feed `$result->text` to a classifier to decide whether a document is relevant before processing it in full.
- **Prompt injection detection** — scan a compact sample for malicious instructions before passing untrusted documents to an LLM.
- **Depersonalisation** — run PII detection over a representative sample before deciding whether to redact the full document.
- **Document classification** — use the outline and intro zones to classify document type without reading the entire file.
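
The gating pattern behind these use cases (inspect the small sample first, then decide whether the full document is worth processing) might look like the sketch below. The keyword counter `sketchIsRelevant` is a hypothetical, deliberately naive stand-in for a real classifier or AI call; nothing like it ships with this package.

```php
<?php

/**
 * Hypothetical stand-in for a real classifier: counts how many of the given
 * keywords occur in the sample text (case-insensitive).
 *
 * @param list<string> $keywords
 */
function sketchIsRelevant(string $sampleText, array $keywords, int $minHits = 2): bool
{
    $hits = 0;

    foreach ($keywords as $keyword) {
        if ($keyword !== '' && mb_stripos($sampleText, $keyword) !== false) {
            $hits++;
        }
    }

    return $hits >= $minHits;
}

// In real use the first argument would be $result->text from the sampler.
$sample = "Invoice 2024-17\n\nPAYMENT TERMS\n\nTotal due within 30 days.";

// Only hand the full document to the expensive step if the sample passes.
$shouldProcess = sketchIsRelevant($sample, ['invoice', 'payment', 'contract']);
```

The design point is that the expensive step only ever sees documents whose compact sample already passed a cheap check.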

---

Testing
-------

```
composer test
```

Static analysis
---------------

```
composer analyse
```

---

Author
------

**Petro Lashyn**

---

License
-------

MIT

