
atoolo/crawler-teaser-indexer
=============================

Automated crawler-based teaser generation and indexing to Solr.

Version 1.0.0, proprietary license, PHP >=8.2.

[Source](https://github.com/sitepark/atoolo-crawler-teaser-indexer) | [Packagist](https://packagist.org/packages/atoolo/crawler-teaser-indexer)

Atoolo-Modul: Teaser-Crawler
============================

[![codecov](https://camo.githubusercontent.com/ec5bb486c2db7da98a474a098b10990fdd72acd5d555d945f790c9885b55c2ba/68747470733a2f2f636f6465636f762e696f2f67682f736974657061726b2f61746f6f6c6f2d637261776c65722d7465617365722d696e64657865722f67726170682f62616467652e7376673f746f6b656e3d716d496f556255733368)](https://codecov.io/gh/sitepark/atoolo-crawler-teaser-indexer)[![phpstan](https://camo.githubusercontent.com/b72adb1f27170ecf486459c4b07e920bb3db2b464444bce8277e018270665646/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048505374616e2d6c6576656c253230392d627269676874677265656e)](https://camo.githubusercontent.com/b72adb1f27170ecf486459c4b07e920bb3db2b464444bce8277e018270665646/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048505374616e2d6c6576656c253230392d627269676874677265656e)[![php](https://camo.githubusercontent.com/91cb649feb797bb507583f5f4e88d0727695b25c0a7265b8b1c7448848b13ef9/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048502d382e322d626c7565)](https://camo.githubusercontent.com/91cb649feb797bb507583f5f4e88d0727695b25c0a7265b8b1c7448848b13ef9/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048502d382e322d626c7565)[![php](https://camo.githubusercontent.com/44471375dfd3bcd4fd091f24ddc5326e7d42abfb024cc764490d83ed6ef31b85/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048502d382e332d626c7565)](https://camo.githubusercontent.com/44471375dfd3bcd4fd091f24ddc5326e7d42abfb024cc764490d83ed6ef31b85/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048502d382e332d626c7565)[![php](https://camo.githubusercontent.com/b37db47746bb49d291c47c3cc8fabd15219dc271ef4a998933fcd59e950d22b3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048502d382e342d626c7565)](https://camo.githubusercontent.com/b37db47746bb49d291c47c3cc8fabd15219dc271ef4a998933fcd59e950d22b3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5048502d382e342d626c7565)

1 Overview
----------

The crawler automates the collection of teaser content (title, intro text, date, link) from a specific website.
It filters this data and passes the final processed information to the Apache Solr index in order to make the teaser content searchable.

The architecture is modular and follows the principles of the Symfony framework.
The project uses the Pipes-and-Filters architectural pattern.
This pattern was chosen to ensure loose coupling between the individual processing steps.
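
The Pipes-and-Filters idea can be illustrated with a minimal sketch. The `pipeline()` helper and the two filters below are hypothetical and are not taken from the bundle; they only show how independent steps can be chained so that each one receives the output of the previous one.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of the bundle): chains filters so that
 * each filter receives the output of the one before it.
 */
function pipeline(callable ...$filters): callable
{
    return static function (mixed $input) use ($filters): mixed {
        foreach ($filters as $filter) {
            $input = $filter($input);
        }
        return $input;
    };
}

// Illustrative filters only; the real steps are URLCollector, Fetcher, etc.
$trimAll = static fn(array $titles): array => array_map('trim', $titles);
$dropEmpty = static fn(array $titles): array => array_values(
    array_filter($titles, static fn(string $title): bool => $title !== '')
);

// The order of the arguments is the order of execution.
$run = pipeline($trimAll, $dropEmpty);
$result = $run(['  News A ', '', ' News B']);
```

Because every filter only depends on its input, a step can be replaced or reordered without touching the others, which is the loose coupling the pattern is meant to provide.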

### 1.1 Core Processing Steps

1. `Schedule` → `CrawlerManager` →
2. `CrawlerManager` → `URLCollector` →
3. `CrawlerManager` → `Fetcher` →
4. `CrawlerManager` → `Parser` →
5. `CrawlerManager` → `Processor` →
6. `CrawlerManager` → `Indexer`

- **`Schedule`**: A scheduled task that invokes the `CrawlerManager` via a Symfony command. This enables time-controlled execution of the crawler.
- **`CrawlerManager`**: The central coordinator that calls the components `URLCollector`, `Fetcher`, `Parser`, `Processor`, and `Indexer` in the correct order.
- **`URLCollector`**: Collects URLs to crawl by parsing the `sitemap.xml` and filtering them based on predefined patterns.
- **`Fetcher`**: Sends HTTP requests to retrieve the HTML content of a URL.
- **`Parser`**: Specialized in data extraction. Uses `symfony/dom-crawler` to extract teaser data via CSS selectors or OpenGraph tags from the HTML content.
- **`Processor`**: Responsible for data transformation. Raw data is cleaned, trimmed to a maximum length of 120 characters, and transformed into the data model required for indexing.
- **`Indexer`**: Provides the interface to Apache Solr. Receives the processed data and submits it for indexing via the `atoolo-search-bundle`.
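
Three of the steps above can be sketched in plain PHP. This is a rough illustration under stated assumptions, not the bundle's implementation: the function names `collectUrls`, `extractOpenGraph`, and `truncateText` are hypothetical, the sketch uses PHP's built-in DOM extension instead of `symfony/dom-crawler` to stay dependency-free, and it assumes that the 120-character limit is a hard cut without an ellipsis. It requires `ext-dom` and `ext-mbstring`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical URLCollector step: read <loc> entries from sitemap XML and
 * keep only URLs that match at least one PCRE pattern (e.g. '#/news/#').
 *
 * @param list<string> $patterns
 * @return list<string>
 */
function collectUrls(string $sitemapXml, array $patterns): array
{
    $doc = new DOMDocument();
    // Empty or malformed XML yields no URLs instead of a warning.
    if ($sitemapXml === '' || !@$doc->loadXML($sitemapXml)) {
        return [];
    }
    $urls = [];
    foreach ($doc->getElementsByTagName('loc') as $loc) {
        $url = trim($loc->textContent);
        foreach ($patterns as $pattern) {
            if (preg_match($pattern, $url) === 1) {
                $urls[] = $url;
                break;
            }
        }
    }
    return $urls;
}

/**
 * Hypothetical Parser step: collect OpenGraph meta tags, keyed by the
 * property name without the "og:" prefix.
 *
 * @return array<string, string>
 */
function extractOpenGraph(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate imperfect HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $teaser = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        if ($meta instanceof DOMElement) {
            $teaser[substr($meta->getAttribute('property'), 3)] = $meta->getAttribute('content');
        }
    }
    return $teaser;
}

/**
 * Hypothetical Processor step: strip tags, collapse whitespace, and cut the
 * text to $maxLength characters (assumption: hard cut, no ellipsis).
 */
function truncateText(string $text, int $maxLength = 120): string
{
    $clean = trim(preg_replace('/\s+/u', ' ', strip_tags($text)) ?? '');
    return mb_strlen($clean) <= $maxLength ? $clean : mb_substr($clean, 0, $maxLength);
}

// Usage with a made-up sitemap; example.org is a placeholder host.
$sitemap = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/news/1</loc></url>
  <url><loc>https://example.org/contact</loc></url>
</urlset>
XML;

$urls = collectUrls($sitemap, ['#/news/#']);
```

In the bundle itself these responsibilities belong to separate components that the `CrawlerManager` calls in order; the functions here only mirror the described behavior on a small scale.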

---

2 Installation and Operation
----------------------------

### 2.1 Installation

The application was developed as a Symfony bundle and is distributed as a Composer package.

1. Install the module via Composer:
    `composer require atoolo/crawler-teaser-indexer`
2. Run `composer update` to resolve all dependencies.

- Run tests:
    `vendor/bin/phpunit`
- Run the application inside the project:
    `docker compose exec -u ${UID} fpm /var/www/-->Projectname`
