benbjurstrom/markdown-object
============================

Structure-aware, token-smart chunking for Markdown documents

Latest release: v0.6.0 · License: MIT · Requires PHP ^8.3

[Source](https://github.com/benbjurstrom/markdown-object) · [Packagist](https://packagist.org/packages/benbjurstrom/markdown-object)

Markdown Object
===============

Markdown chunking that preserves document structure and semantic relationships. It creates token-aware chunks sized for embedding model context windows. Built on [League CommonMark](https://github.com/thephpleague/commonmark) and [Yethee\Tiktoken](https://github.com/yethee/tiktoken-php).

Try It Out
----------

Clone the **[Interactive Demo](https://github.com/benbjurstrom/markdown-object-demo)** to experiment with chunking in real time. Paste your Markdown, adjust the parameters, and see how the content is split into semantic chunks.

Basic Usage
-----------

```
use League\CommonMark\Environment\Environment;
use League\CommonMark\Parser\MarkdownParser;
use League\CommonMark\Extension\CommonMark\CommonMarkCoreExtension;
use League\CommonMark\Extension\Table\TableExtension;
use BenBjurstrom\MarkdownObject\Build\MarkdownObjectBuilder;
use BenBjurstrom\MarkdownObject\Tokenizer\TikTokenizer;

// 1) Parse Markdown with CommonMark
$env = new Environment();
$env->addExtension(new CommonMarkCoreExtension());
$env->addExtension(new TableExtension());

$parser   = new MarkdownParser($env);
$filename = 'guide.md';
$markdown = file_get_contents($filename);
$doc      = $parser->parse($markdown);

// 2) Build the structured model
$builder   = new MarkdownObjectBuilder();
$tokenizer = TikTokenizer::forModel('gpt-3.5-turbo');
$mdObj     = $builder->build($doc, $filename, $markdown, $tokenizer);

// 3) Emit hierarchically-packed chunks
$chunks = $mdObj->toMarkdownChunks(target: 512, hardCap: 1024);

foreach ($chunks as $chunk) {
    echo "---\n";
    echo "Chunk: {$chunk->id} | {$chunk->tokenCount} tokens";

    // Source position tracking for finding chunks in original document
    $pos = $chunk->sourcePosition;
    if ($pos->lines !== null) {
        echo " | Line: {$pos->lines->startLine}";
    }
    echo "\n";
    echo implode(' › ', $chunk->breadcrumb) . "\n";
    echo "---\n\n";
    echo $chunk->markdown . "\n\n";
}

/* Example output for an input file named demo.md:
---
Chunk: 1 | 163 tokens | Line: 1
demo.md › Getting Started
---

# Getting Started

Welcome to the Markdown Object demo! This tool helps you visualize how markdown is parsed and chunked.

## Features

### Real-time Processing

Type or paste markdown in the left pane and see the results instantly.

### Hierarchical Chunking

Content is automatically organized into semantic chunks that keep related information together…

---
Chunk: 2 | 287 tokens | Line: 18
demo.md › Getting Started › Advanced Options
---

## Advanced Options

Configure chunking parameters to see how different settings affect the output.

### Token Limits

Adjust the target and hard cap values to control chunk sizes…
*/
```

Installation
------------

You can install the package via Composer:

```
composer require benbjurstrom/markdown-object
```

Advanced Usage
--------------

### JSON Serialization

```
// Serialize to JSON
$json = $mdObj->toJson(JSON_PRETTY_PRINT);

// Deserialize from JSON
$copy = \BenBjurstrom\MarkdownObject\Model\MarkdownObject::fromJson($json);
```

### Custom Tokenizer

```
use BenBjurstrom\MarkdownObject\Tokenizer\TikTokenizer;

// Use a different model
$tokenizer = TikTokenizer::forModel('gpt-4');

// Or use a specific encoding
$tokenizer = TikTokenizer::forEncoding('p50k_base');

// Pass to both build() and toMarkdownChunks()
$mdObj = $builder->build($doc, $filename, $markdown, $tokenizer);
$chunks = $mdObj->toMarkdownChunks(
    target: 512,
    hardCap: 1024,
    tok: $tokenizer
);
```

### Custom Chunking Parameters

```
$chunks = $mdObj->toMarkdownChunks(
    target: 256,                // Smaller target for content splitting
    hardCap: 512,               // Smaller hard cap for hierarchy
    tok: $customTokenizer,      // Optional: use different tokenizer
    repeatTableHeaders: false   // Optional: don't repeat headers in split tables
);
```
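To make the effect of `repeatTableHeaders` concrete, here is a minimal, self-contained sketch of splitting a table with and without repeated headers. The `splitTable()` helper is hypothetical and is not part of the package. It splits by row count for simplicity, whereas the package splits by token budget.

```php
<?php

/**
 * Split a Markdown table into parts of at most $rowsPerPart body rows.
 * Hypothetical helper for illustration only (not the package's code).
 *
 * @return list<string>
 */
function splitTable(string $table, int $rowsPerPart, bool $repeatHeaders): array
{
    $lines  = explode("\n", trim($table));
    $header = array_slice($lines, 0, 2); // header row + delimiter row
    $body   = array_slice($lines, 2);

    $parts = [];
    foreach (array_chunk($body, max(1, $rowsPerPart)) as $index => $rows) {
        // The first part always carries the header; later parts only on request.
        $withHeader = $index === 0 || $repeatHeaders;
        $parts[]    = implode("\n", $withHeader ? array_merge($header, $rows) : $rows);
    }

    return $parts;
}

$table = <<<MD
| Name | Role |
|------|------|
| Ada  | Dev  |
| Alan | Ops  |
| Bob  | QA   |
MD;

$repeated = splitTable($table, 2, true);
$plain    = splitTable($table, 2, false);

// Second part with repeated headers, then without.
echo $repeated[1], "\n---\n", $plain[1], "\n";
```

With headers repeated, the second part is still a valid table on its own, which tends to matter when each chunk is embedded or retrieved in isolation.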

### A Note on Token Counts

Chunk token counts include separator tokens (`\n\n`) added when joining content pieces, so they may be slightly higher than the sum of individual node tokens. This is expected and ensures the count accurately reflects what will be embedded.

```
// Build-time: sum of nodes (no separators)
echo $mdObj->tokenCount;  // e.g., 155

// Chunk: includes \n\n separators between elements
echo $chunks[0]->tokenCount;  // e.g., 163 (8 tokens higher)
```
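The difference can be reproduced with a toy example. The sketch below does not use the package or a real tokenizer. It assumes a made-up counting rule (one token per word, one token per `\n\n` separator) purely to show why the joined text counts higher than the sum of its pieces.

```php
<?php

// Toy tokenizer (an assumption for illustration only): one token per
// whitespace-delimited word, plus one token per "\n\n" separator.
function countToyTokens(string $text): int
{
    $separators = substr_count($text, "\n\n");
    $words      = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    return count($words) + $separators;
}

$pieces = [
    '# Getting Started',
    'Welcome to the demo.',
    '## Features',
];

// Sum of the pieces counted one by one (no separators involved).
$sumOfPieces = array_sum(array_map('countToyTokens', $pieces));

// Count of the joined text, which is what would actually be embedded.
$joined = countToyTokens(implode("\n\n", $pieces));

echo "pieces: {$sumOfPieces}, joined: {$joined}\n"; // pieces: 9, joined: 11
```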

Chunking Strategy
-----------------

The package uses **hierarchical greedy packing** to create semantically coherent chunks that respect your document's natural structure.

### Algorithm Overview

The chunker intelligently splits content using a two-threshold system:

- **`target`** - Soft limit for splitting large content blocks (paragraphs, code, tables)
- **`hardCap`** - Hard limit for hierarchical decisions (when to split vs. keep sections together)
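
As a rough illustration of a soft limit, the sketch below splits a long paragraph at sentence boundaries so that each piece stays near a target size. It is an assumption-laden toy: `splitAtTarget()` is a hypothetical helper, tokens are approximated by word count, and the package's actual splitting rules may differ.

```php
<?php

/**
 * Split a paragraph at sentence boundaries so each piece stays near $target.
 * Token counting is a toy word count (an assumption for illustration).
 *
 * @return list<string>
 */
function splitAtTarget(string $paragraph, int $target): array
{
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($paragraph), -1, PREG_SPLIT_NO_EMPTY);
    $pieces    = [];
    $current   = '';

    foreach ($sentences as $sentence) {
        $candidate = $current === '' ? $sentence : $current . ' ' . $sentence;

        // Start a new piece once adding the sentence would exceed the target.
        if ($current !== '' && str_word_count($candidate) > $target) {
            $pieces[] = $current;
            $current  = $sentence;
        } else {
            $current = $candidate;
        }
    }

    if ($current !== '') {
        $pieces[] = $current;
    }

    return $pieces;
}

$text = 'One two three. Four five six seven. Eight nine. Ten eleven twelve thirteen.';
print_r(splitAtTarget($text, 6));
```

Because the limit is soft, a single sentence longer than the target is kept whole in this sketch rather than cut mid-sentence.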

### How It Works

1. **Start whole** – If the entire document fits within `hardCap`, return as a single chunk
2. **Split hierarchically** – When too large, split at the highest heading level (H1, then H2, etc.)
3. **Pack greedily** – Combine sibling sections that fit together within `hardCap`
4. **Recurse deeply** – Sections that don't fit are processed recursively with updated breadcrumbs
5. **Minimize fragments** – After recursion, continue packing remaining siblings to avoid orphaned content
6. **Split smartly** – Long paragraphs, code blocks, and tables break at `target` boundaries while preserving readability
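
The first five steps can be sketched on a toy section tree. This is a simplified illustration under stated assumptions, not the package's implementation: sections are plain arrays with hypothetical `title`, `tokens` (the section's own content) and `children` keys, separator tokens are ignored, and content splitting at `target` (step 6) is left out.

```php
<?php

/** Total tokens of a section including all nested children. */
function totalTokens(array $section): int
{
    $sum = $section['tokens'];
    foreach ($section['children'] as $child) {
        $sum += totalTokens($child);
    }

    return $sum;
}

/** Simplified hierarchical greedy packing (illustrative only). */
function packSection(array $section, int $hardCap, array $breadcrumb = []): array
{
    $path  = [...$breadcrumb, $section['title']];
    $total = totalTokens($section);

    // Start whole: the section and everything below it fits in one chunk.
    if ($total <= $hardCap) {
        return [['breadcrumb' => $path, 'sections' => [$section['title']], 'tokens' => $total]];
    }

    $chunks = [];
    // Open a group with the section's own content, if it has any.
    $group = $section['tokens'] > 0
        ? ['breadcrumb' => $path, 'sections' => [$section['title']], 'tokens' => $section['tokens']]
        : null;

    foreach ($section['children'] as $child) {
        $childTotal = totalTokens($child);

        if ($childTotal > $hardCap) {
            // Recurse: flush the open group, then split the oversized child.
            if ($group !== null) {
                $chunks[] = $group;
                $group    = null;
            }
            $chunks = array_merge($chunks, packSection($child, $hardCap, $path));
        } elseif ($group !== null && $group['tokens'] + $childTotal <= $hardCap) {
            // Pack greedily: the sibling still fits in the open group.
            $group['sections'][] = $child['title'];
            $group['tokens']    += $childTotal;
        } else {
            // The group is full (or absent): flush it and start a new one.
            if ($group !== null) {
                $chunks[] = $group;
            }
            $group = ['breadcrumb' => $path, 'sections' => [$child['title']], 'tokens' => $childTotal];
        }
    }

    if ($group !== null) {
        $chunks[] = $group;
    }

    return $chunks;
}

$leaf = fn (string $title, int $tokens): array => ['title' => $title, 'tokens' => $tokens, 'children' => []];

$doc = ['title' => 'guide.md', 'tokens' => 0, 'children' => [
    $leaf('Intro', 100),
    $leaf('Setup', 150),
    ['title' => 'Reference', 'tokens' => 50, 'children' => [$leaf('API', 300), $leaf('CLI', 250)]],
    $leaf('FAQ', 80),
]];

$chunks = packSection($doc, 400);

foreach ($chunks as $chunk) {
    echo implode(' › ', $chunk['breadcrumb']), ': ', implode(', ', $chunk['sections']), " ({$chunk['tokens']} tokens)\n";
}
```

With a `hardCap` of 400, `Intro` and `Setup` are packed together, the oversized `Reference` section is split recursively under its own breadcrumb, and `FAQ` starts a new group after the recursion instead of being dropped or merged across the boundary.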

Testing
-------

Run the tests with:

```
composer test
```

Documentation
-------------

For detailed architecture documentation, see [ARCHITECTURE.md](ARCHITECTURE.md).

For examples of hierarchical packing behavior, see [EXAMPLES.md](EXAMPLES.md).

Changelog
---------

Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed recently.

Contributing
------------

Contributions are welcome! Please feel free to submit a Pull Request.

Security Vulnerabilities
------------------------

Please review [our security policy](../../security/policy) on how to report security vulnerabilities.

Credits
-------

- [Ben Bjurstrom](https://github.com/benbjurstrom)
- [All Contributors](../../contributors)

License
-------

The MIT License (MIT). Please see [License File](LICENSE.md) for more information.
