

llm-html-extractor/symfony-bundle
=================================

Symfony bundle for extracting structured data from HTML using LLM providers

Version 0.1 · MIT · PHP >=8.2

[Source](https://github.com/michalboryczko/llm-html-extractor-symfony-bundle) · [Packagist](https://packagist.org/packages/llm-html-extractor/symfony-bundle)

LLM HTML Extractor Symfony Bundle
=================================


A Symfony bundle for extracting structured data from HTML using LLM (Large Language Model) providers, built on a plugin architecture.

Features
--------


- **LLM-Based Extraction**: Uses LLM providers (starting with Jina Reader) to extract structured data from HTML
- **Type-Safe DTOs**: Define extraction schemas using PHP attributes on your DTOs
- **Hybrid Extraction**: Combine LLM extraction with code-based extraction: use the LLM for complex fields and DomCrawler/XPath for simple structured data
- **Extensible**: Plugin architecture allows custom extractors for specific use cases
- **Cacheable**: Built-in caching support for LLM responses
- **Logging**: Optional logging for LLM requests/responses and cache operations
- **Configurable**: Flexible configuration for different LLM providers and caching strategies

Installation
------------


```
composer require llm-html-extractor/symfony-bundle

```

Configuration
-------------


Create or update `config/packages/llm_html_extractor.yaml`:

```
llm_html_extractor:
    llm_client:
        client: jina_reader  # or any service ID for custom client
        jina_reader:
            model: 'jinaai/readerlm-v2'  # or 'jinaai/readerlm-v1.5'
            default_temperature: 0.0  # Default temperature for LLM requests (0.0 = deterministic)
            default_max_tokens: 64000  # Default max tokens for LLM responses
            http_client:
                base_uri: 'https://r.jina.ai'
                api_key: '%env(JINA_API_KEY)%'
                timeout: 600
                headers:
                    X-Custom-Header: 'value'
    cache:
        enabled: true
        ttl: 36000  # 10 hours
        pool: 'cache.app'
    logs:
        enabled: true  # default: false
        logger: 'logger'  # service ID of the logger to use

```

Alternatively, you can use an existing HTTP client service:

```
llm_html_extractor:
    llm_client:
        client: jina_reader
        jina_reader:
            model: 'jinaai/readerlm-v2'
            http_client: 'my_custom_http_client_service'  # Service ID implementing HttpClientInterface

```

### Using a Custom LLM Client


To use your own LLM client implementation, just set the `client` parameter to your service ID:

```
llm_html_extractor:
    llm_client:
        client: 'app.my_custom_llm_client'  # Your service ID
    cache:
        enabled: true  # Will automatically wrap your client with caching

```

Your custom client must implement `LlmHtmlExtractor\SymfonyBundle\Client\LlmClientInterface`. The bundle will validate this during container compilation and throw a clear error if the interface is not implemented.
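The kind of check described above can be pictured with a small sketch. The interface and class names below are placeholders defined in the snippet itself, not the bundle's real classes, and the bundle's actual validation may be implemented differently. The point is only that a class name can be tested against an interface before any instance is created.

```php
<?php

declare(strict_types=1);

// Placeholder standing in for the bundle's LlmClientInterface
// (its real methods are not shown in this README).
interface SketchLlmClientInterface
{
}

final class ValidSketchClient implements SketchLlmClientInterface
{
}

final class InvalidSketchClient
{
}

/**
 * Throws if $class does not implement the expected interface.
 * is_a() with its third argument set to true accepts a class name,
 * so no instance has to be created for the check.
 */
function assertImplementsLlmClient(string $serviceId, string $class): void
{
    if (!is_a($class, SketchLlmClientInterface::class, true)) {
        throw new InvalidArgumentException(sprintf(
            'Service "%s" (class %s) must implement %s.',
            $serviceId,
            $class,
            SketchLlmClientInterface::class
        ));
    }
}

// Passes silently.
assertImplementsLlmClient('app.my_custom_llm_client', ValidSketchClient::class);
```

Calling the helper with `InvalidSketchClient::class` raises an `InvalidArgumentException` that names the offending service ID.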

### Logging


The bundle provides comprehensive logging for debugging and monitoring:

- **Request/Response Logging**: When `logs.enabled: true`, all LLM requests and responses are logged at info level
- **Cache Operations**: Cache hits and misses are logged when both caching and logging are enabled
- **Error Logging**: Failed LLM requests are logged at error level with exception details

The decorators are applied in this order, from innermost to outermost:

1. Base LLM Client (e.g., JinaReaderLlmClient)
2. LoggingLlmClient (if logs enabled) - logs requests/responses
3. CacheableLlmClient (if cache enabled) - logs cache hits/misses

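Because the cache decorator wraps the logging decorator, request/response log entries record only actual LLM calls (cache misses). Responses served from the cache never reach the logging client.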
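The effect of this ordering can be reproduced with a minimal, self-contained sketch. The `SketchClient` interface, its `complete()` method, and the three classes are invented for illustration and are much simpler than the bundle's real clients. Only the wrapping order is taken from the list above.

```php
<?php

declare(strict_types=1);

// Hypothetical, simplified interface for this sketch only; the bundle's real
// LlmClientInterface may use a different method name and signature.
interface SketchClient
{
    public function complete(string $prompt): string;
}

// 1. Innermost: stands in for a real client such as JinaReaderLlmClient.
final class BaseClient implements SketchClient
{
    public function complete(string $prompt): string
    {
        return 'response to: ' . $prompt;
    }
}

// 2. Logging decorator: records every call that reaches it.
final class LoggingClient implements SketchClient
{
    /** @var string[] */
    public array $log = [];

    public function __construct(private SketchClient $inner)
    {
    }

    public function complete(string $prompt): string
    {
        $this->log[] = 'request: ' . $prompt;

        return $this->inner->complete($prompt);
    }
}

// 3. Outermost: cache decorator; on a hit the inner clients are never called.
final class CachingClient implements SketchClient
{
    /** @var array<string, string> */
    private array $cache = [];

    public function __construct(private SketchClient $inner)
    {
    }

    public function complete(string $prompt): string
    {
        $key = hash('sha256', $prompt);

        // The right-hand side runs only when the key is not cached yet.
        return $this->cache[$key] ??= $this->inner->complete($prompt);
    }
}

$logging = new LoggingClient(new BaseClient());
$client = new CachingClient($logging);

$client->complete('extract title'); // cache miss: reaches the logging client
$client->complete('extract title'); // cache hit: answered by the cache decorator

echo count($logging->log), PHP_EOL; // prints 1
```

Swapping the two decorators would log every request, including those answered from the cache.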

Usage
-----


### 1. Define Your Extraction DTO


```
use LlmHtmlExtractor\SymfonyBundle\Attribute\AsLlmExtractableProperty;

class ArticleExtractionResult
{
    public function __construct(
        #[AsLlmExtractableProperty('Extract the article title')]
        public string $title,

        #[AsLlmExtractableProperty('Extract the author name')]
        public string $author,

        #[AsLlmExtractableProperty('Extract publication date in YYYY-MM-DD format')]
        public string $publishedAt,

        #[AsLlmExtractableProperty('Extract the main article content')]
        public string $content,
    ) {}
}

```
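The `publishedAt` prompt asks for a `YYYY-MM-DD` string, but the value is still model output and this README does not say whether the bundle validates it. A minimal sketch of a check you might run on the extracted value follows; `isIsoDate()` is a hypothetical helper, not part of the bundle.

```php
<?php

declare(strict_types=1);

/**
 * Returns true only if $value is a real calendar date written as YYYY-MM-DD.
 * The round-trip comparison rejects overflowing dates such as 2024-02-31,
 * which createFromFormat() would otherwise roll over into March.
 */
function isIsoDate(string $value): bool
{
    $date = DateTimeImmutable::createFromFormat('!Y-m-d', $value);

    return $date !== false && $date->format('Y-m-d') === $value;
}

var_dump(isIsoDate('2024-03-15'));     // bool(true)
var_dump(isIsoDate('2024-02-31'));     // bool(false)
var_dump(isIsoDate('March 15, 2024')); // bool(false)
```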

### 2. Use the Extraction Handler


```
use LlmHtmlExtractor\SymfonyBundle\Extractor\ExtractionHandler;

class ArticleScraper
{
    public function __construct(
        private ExtractionHandler $extractionHandler,
    ) {}

    public function scrape(string $html): ArticleExtractionResult
    {
        return $this->extractionHandler->handle(
            ArticleExtractionResult::class,
            $html
        );
    }
}

```

### 3. Create Custom Extractors (Optional)


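For specific extraction needs, implement the `FromHtmlExtractorInterface`. The example targets a `pdfUrls` property, which is not part of the `ArticleExtractionResult` DTO shown in step 1 and would have to be added to it: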

```
use LlmHtmlExtractor\SymfonyBundle\Extractor\FromHtmlExtractorInterface;
use Symfony\Component\DomCrawler\Crawler;
use Symfony\Component\DependencyInjection\Attribute\AutoconfigureTag;

#[AutoconfigureTag('llm_extractor.extractor', ['priority' => 50])]
class CustomPdfUrlExtractor implements FromHtmlExtractorInterface
{
    public function extract(string $html, array $context = []): mixed
    {
        $crawler = new Crawler($html);
        return $crawler->filterXPath('//a[contains(@href, ".pdf")]')
            ->each(fn($node) => $node->attr('href'));
    }

    public function supports(string $className, string $propertyName): bool
    {
        return $className === ArticleExtractionResult::class
            && $propertyName === 'pdfUrls';
    }
}

```
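To see what the XPath expression in this extractor selects, the same query can be run without any dependencies. This sketch uses PHP's built-in DOM extension instead of Symfony's DomCrawler, and the sample HTML is made up for the demonstration.

```php
<?php

declare(strict_types=1);

$html = <<<'HTML'
<html><body>
  <a href="/files/report.pdf">Report</a>
  <a href="/about">About</a>
  <a href="https://example.com/paper.pdf?download=1">Paper</a>
</body></html>
HTML;

$document = new DOMDocument();
// Real-world HTML is rarely valid; collect parser warnings instead of emitting them.
libxml_use_internal_errors(true);
$document->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($document);

$pdfUrls = [];
// Same expression as in the extractor: every <a> whose href contains ".pdf".
foreach ($xpath->query('//a[contains(@href, ".pdf")]') as $anchor) {
    $pdfUrls[] = $anchor->getAttribute('href');
}

print_r($pdfUrls); // the two links whose href contains ".pdf"
```

Note that `contains()` is a plain substring match, so it also accepts URLs where `.pdf` appears in the middle of the path or query string. Stricter matching needs an additional check in PHP code.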

Supported LLM Providers
-----------------------


Currently supported:

- **Jina Reader** (jinaai/readerlm-v2, jinaai/readerlm-v1.5)
    - Uses the OpenAI-compatible chat completions endpoint exposed by vLLM (`/openai/v1/chat/completions`)
    - Tested with Runpod serverless deployments
    - Compatible with any vLLM deployment following the OpenAI API standard

License
-------


MIT

Contributing
------------


Contributions are welcome! Please feel free to submit a Pull Request.


