
nadar/crawler
=============

A highly extendible, dependency-free crawler for HTML, PDFs, or any other type of document.


Website Crawler for PHP
=======================

[![Tests](https://github.com/nadar/php-page-crawler/workflows/Tests/badge.svg)](https://github.com/nadar/php-page-crawler/workflows/Tests/badge.svg)[![Test Coverage](https://camo.githubusercontent.com/313ff9a1600701b7fbf00edd9cf4106f31a34b107a9faa70d32cf2109b252cce/68747470733a2f2f6170692e636f6465636c696d6174652e636f6d2f76312f6261646765732f37356165353831313561393131656466623137382f746573745f636f766572616765)](https://codeclimate.com/github/nadar/crawler/test_coverage)[![Maintainability](https://camo.githubusercontent.com/5e6e6e3caa9d0e3b6bf3b9ae8bab557bd6aefa2d10c0f2396b20f000163b6d8d/68747470733a2f2f6170692e636f6465636c696d6174652e636f6d2f76312f6261646765732f37356165353831313561393131656466623137382f6d61696e7461696e6162696c697479)](https://codeclimate.com/github/nadar/crawler/maintainability)[![Packagist Downloads](https://camo.githubusercontent.com/d13ae26be0f9f223f977dcd4544052aaf3a8adc5abe5ea61e5c8d15860716ef9/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f64742f6e616461722f637261776c6572)](https://packagist.org/packages/nadar/crawler)

**Why another Page Crawler?** There are already very good crawlers around, so these were my goals for this one:

- **Dependency Free** - No HTTP client library is used; the crawler relies on as much "native" PHP code as possible to keep the overhead small. It only requires the cURL extension.
- **Memory Efficient** - As memory efficient as possible, with little overhead and full control over the code.
- **Extendible** - Attach your own parsers to determine how HTML or any other format is parsed. Parsers for HTML and PDF are included out of the box, and it is very easy to build a parser for your own data type.
- **Runtime Storage** - While the crawler runs, certain information must be stored. This is extendible to suit your use case: either use your database or take the built-in array or file storage system.
- **Async** - It is possible to start the crawler and process every further run cycle as an asynchronous process, for example with a PHP queue system like [Yii2 Queue](https://github.com/yiisoft/yii2-queue).

Installation
------------

Composer is required to install this library:

```
composer require nadar/crawler
```

To use the PDF parser, the optional library `smalot/pdfparser` must also be installed:

```
composer require smalot/pdfparser
```

Usage
-----

1. First, tell the crawler what to do with the results of a crawl run:

Create your handler. Handlers are the classes that interact with the crawler in order to store your content or results somewhere. The `afterRun()` method runs whenever a URL has been crawled and receives the results:

```
use Nadar\Crawler\Crawler;
use Nadar\Crawler\Interfaces\HandlerInterface;
use Nadar\Crawler\Result;

class MyCrawlHandler implements HandlerInterface
{
    // called after every crawled URL with the parsed result
    public function afterRun(Result $result)
    {
        echo $result->title . " with content " . $result->content . " for url " . $result->url->getNormalized();
    }

    public function onSetup(Crawler $crawler)
    {
        // do some stuff before the crawler runs, maybe truncate your temporary table where the results should be stored.
    }

    public function onEnd(Crawler $crawler)
    {
        // runs when the crawler is finished, maybe synchronize your temporary index table with the "real" site index.
    }
}
```
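
The comments in `onSetup()` and `onEnd()` hint at a staging pattern: clear a temporary index before the run, fill it from `afterRun()`, and swap it into the live index once the crawl has ended. The following is a minimal sketch of that pattern using plain arrays. `StagingIndex` and its methods are hypothetical names for illustration and are not part of nadar/crawler; a real handler would delegate to something similar and would more likely write to a database table.

```php
<?php

declare(strict_types=1);

// Hypothetical helper for illustration, not part of nadar/crawler.
final class StagingIndex
{
    /** @var array<string, array{title: string, content: string}> */
    private array $staging = [];

    /** @var array<string, array{title: string, content: string}> */
    private array $live = [];

    // Call from onSetup(): start every crawl with an empty staging area.
    public function reset(): void
    {
        $this->staging = [];
    }

    // Call from afterRun(): keyed by URL, so a page crawled twice is stored once.
    public function add(string $url, string $title, string $content): void
    {
        $this->staging[$url] = ['title' => $title, 'content' => $content];
    }

    // Call from onEnd(): replace the live index in a single step.
    public function publish(): void
    {
        $this->live = $this->staging;
        $this->staging = [];
    }

    public function live(): array
    {
        return $this->live;
    }
}

$index = new StagingIndex();
$index->reset();
$index->add('https://example.com/', 'Home', 'Welcome');
$index->add('https://example.com/', 'Home', 'Welcome back');
$index->publish();

echo count($index->live()), " page(s) indexed\n";
```

Because the live index is only replaced in `publish()`, a crawl that aborts halfway leaves the previous index untouched.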

2. Then attach the handler and set up everything the crawler requires:

```
// The namespaces of the storage and runner classes are assumed from the package layout; adjust them if they differ.
use Nadar\Crawler\Crawler;
use Nadar\Crawler\Runners\LoopRunner;
use Nadar\Crawler\Storage\ArrayStorage;

$crawler = new Crawler('https://luya.io', new ArrayStorage, new LoopRunner);

// what kind of document types would you like to parse?
$crawler->addParser(new Nadar\Crawler\Parsers\Html);

// adding the PDF parser increases memory consumption
// $crawler->addParser(new Nadar\Crawler\Parsers\Pdf);

// register your handler in order to interact with the results, maybe store them in a database?
$crawler->addHandler(new MyCrawlHandler);

// setup and start the crawl process
$crawler->setup();
$crawler->run();
```

> Attention: Keep in mind that enabling the PDF parser together with multiple concurrent requests can drastically increase memory usage, especially if there are large PDFs. It is therefore recommended to lower the number of concurrent requests when enabling the PDF parser.

Benchmark
---------

These benchmarks will of course vary with internet connection, bandwidth, and servers, but all tests were run under the same conditions. The memory peak varies strongly when the PDF parser is used, so the tests use only the HTML parser:

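| Index Size | Concurrent Requests | Memory Peak | Time | Storage |
|------------|---------------------|-------------|------|---------|
| 308 | 30 | 6MB | 19s | ArrayStorage |
| 308 | 30 | 6MB | 20s | FileStorage |

> Still looking for a good website to use for benchmarking. See the `benchmark.php` file for the test setup.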
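
As a minimal sketch, the two measured values (elapsed time and memory peak) can be captured around any piece of PHP code as shown below. The `measure()` helper is a hypothetical name and is not taken from the project's `benchmark.php`; in a real benchmark the callable would wrap `$crawler->setup()` and `$crawler->run()` instead of the stand-in workload.

```php
<?php

declare(strict_types=1);

/**
 * Run a callable and report wall-clock time and the process memory peak.
 * Hypothetical helper for illustration.
 *
 * @return array{seconds: float, peak_mb: float|int}
 */
function measure(callable $work): array
{
    $start = microtime(true);
    $work();
    $seconds = microtime(true) - $start;

    return [
        'seconds' => $seconds,
        // true = report memory allocated from the system, not only what the script uses
        'peak_mb' => memory_get_peak_usage(true) / 1024 / 1024,
    ];
}

$stats = measure(function (): void {
    // Stand-in workload; replace with the crawler run.
    $data = range(1, 100000);
    array_sum($data);
});

printf("%.3fs, peak %.1fMB\n", $stats['seconds'], $stats['peak_mb']);
```

Note that `memory_get_peak_usage()` reports the peak for the whole process so far, so each storage variant is best measured in its own PHP process.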

Developer Information
---------------------

For a better understanding, here is an explanation of how the classes are encapsulated and what they are used for.

- Crawler: The Crawler is the main program; it starts, runs, and ends the crawl.
- Job: The Job contains the URL logic for the next cURL download job.
- Parsers: The parsers take the job information in combination with the RequestResponse in order to generate a ParserResult.
- ParserResult: The ParserResult represents the result from a Parser.
- QueueItem: The QueueItem is extracted from the Job and is only used to store that information through the StorageInterface.

**Lifecycle**

Crawler -> Job -> (QueueItem -> Storage) -> RequestResponse -> Parser -> ParserResult -> Result
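
To make the lifecycle more concrete, here is a toy, self-contained sketch of the same loop: a queue of URLs, a fetch step, and a parser that produces a result and discovers further links. It is not the library's implementation. The in-memory `$pages` array stands in for HTTP requests, and the function names `parseHtml()` and `crawl()` are assumptions made for illustration.

```php
<?php

declare(strict_types=1);

// Stand-in for the network: URL => HTML body.
$pages = [
    '/'      => '<title>Home</title><a href="/about">About</a><a href="/docs">Docs</a>',
    '/about' => '<title>About</title><a href="/">Home</a>',
    '/docs'  => '<title>Docs</title><a href="/about">About</a>',
];

/** Parser step: turn a response body into a title and a list of links. */
function parseHtml(string $html): array
{
    preg_match('~<title>(.*?)</title>~s', $html, $title);
    preg_match_all('~href="([^"]+)"~', $html, $links);

    return ['title' => $title[1] ?? '', 'links' => $links[1]];
}

/** Crawl loop: queue -> fetch -> parse -> result, enqueueing unseen links. */
function crawl(string $start, array $pages): array
{
    $queue = [$start];          // pending jobs
    $seen = [$start => true];   // runtime storage of known URLs
    $results = [];

    while ($queue !== []) {
        $url = array_shift($queue);
        if (!isset($pages[$url])) {
            continue; // simulated failed request
        }

        $parsed = parseHtml($pages[$url]);
        $results[$url] = $parsed['title'];

        foreach ($parsed['links'] as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }

    return $results;
}

$results = crawl('/', $pages);
print_r($results);
```

The `$seen` array plays the role of the runtime storage: it is what prevents the loop from visiting `/` and `/about` more than once.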

