
tallesairan/crawler
===================

Most powerful, popular and production crawling/scraping package for PHP

Version 1.0 (2y ago) · MIT license · PHP >= 7.0

[Source](https://github.com/tallesairan/PHPCrawler) · [Packagist](https://packagist.org/packages/tallesairan/crawler)

PHPCrawler
==========


Most powerful, popular and production crawling/scraping package for PHP, happy hacking :)

Features:

- Server-side DOM with automatic DomParser injection via Symfony\\Component\\DomCrawler
- Configurable pool size and retries
- Configurable rate limit
- forceUTF8 mode that handles charset detection and conversion for you
- Compatible with PHP 7.2
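The idea behind a forceUTF8-style mode can be sketched in plain PHP (an illustration only, not PHPCrawler's actual implementation; the `toUtf8` helper is hypothetical):

```php
<?php
// Hypothetical helper sketching what a forceUTF8-style mode does:
// detect the source charset, then convert the body to UTF-8.
function toUtf8(string $body, ?string $hint = null): string
{
    $charset = $hint
        ?: mb_detect_encoding($body, ['UTF-8', 'ISO-8859-1', 'Windows-1252'], true);
    if ($charset === false) {
        return $body; // detection failed; leave the body untouched
    }
    return mb_convert_encoding($body, 'UTF-8', $charset);
}

// "café" encoded as ISO-8859-1 becomes valid UTF-8:
var_dump(toUtf8("caf\xE9", 'ISO-8859-1') === "caf\xC3\xA9"); // bool(true)
```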

Thanks to:

- [Amp](https://amphp.org/), a non-blocking concurrency framework for PHP
- [Artax](https://amphp.org/artax/), an asynchronous HTTP client for PHP
- [node-crawler](https://github.com/bda-research/node-crawler), the most powerful, popular and production crawling/scraping package for Node

node-crawler is a great crawler, and PHPCrawler does its best to stay similar to it.

[Chinese documentation](README_zh.md)

Table of Contents
=================


- [Get started](#get-started)
    - [Install](#install)
    - [Basic usage](#basic-usage)
    - [Slow down](#slow-down)
    - [Custom parameters](#custom-parameters)
    - [Raw body](#raw-body)
- [Events](#events)
    - [Event: response](#eventresponse)
    - [Event: drain](#eventdrain)
- [Advanced](#advanced)
    - [Encoding](#encoding)
    - [Logger](#logger)
    - [Coroutine](#coroutine)
- [Other](#other)
    - [API reference](/docs/api.md)
    - [Configuration](/docs/configuration.md)
- [Work with DomParser](#work-with-domparser)

Get started
===========


Install
-------


```
$ composer require tallesairan/crawler
```

Basic usage
-----------


```
use PHPCrawler\PHPCrawler;
use PHPCrawler\Response;
use Symfony\Component\DomCrawler\Crawler;

$logger = new Monolog\Logger("fox");
try {
    $logger->pushHandler(new \Monolog\Handler\StreamHandler(STDOUT, \Monolog\Logger::INFO));
} catch (\Exception $e) {
}

$crawler = new PHPCrawler([
    'maxConnections' => 2,
    'domParser' => true,
    'timeout' => 3000,
    'retries' => 3,
    'logger' => $logger,
]);

$crawler->on('response', function (Response $res) {
    if (!$res->success) {
        return;
    }

    $title = $res->dom->filter("title")->html();
    echo ">>> title: {$title}\n";
    $res->dom
        ->filter('.related-item a')
        ->each(function (Crawler $crawler) {
            echo ">>> links: ", $crawler->text(), "\n";
        });
});

$crawler->queue('https://www.foxnews.com/');
$crawler->run();
```

Slow down
---------

Use `rateLimit` to slow down when visiting websites.

```
$crawler = new PHPCrawler([
    'maxConnections' => 10,
    'rateLimit' => 2,   // reqs per second
    'domParser' => true,
    'timeout' => 30000,
    'retries' => 3,
    'logger' => $logger,
]);

for ($page = 1; $page <= 100; $page++) { // illustrative upper bound
    $crawler->queue([
        'uri' => "http://www.qbaobei.com/jiaoyu/gshb/List_{$page}.html",
        'type' => 'list',
    ]);
}

$crawler->run(); // between two task starts, the average gap is 1000 / 2 (ms)
```
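The even spacing such a rate limit implies can be sketched as follows (an illustration of the scheduling math only; PHPCrawler's real scheduler is asynchronous and Amp-based, and the `EvenSpacer` class is hypothetical):

```php
<?php
// Hypothetical sketch of even request spacing: with rateLimit = 2,
// consecutive task starts are kept 1000 / 2 = 500 ms apart.
final class EvenSpacer
{
    private float $nextAt = 0.0; // earliest time (ms) the next task may start
    private float $gapMs;

    public function __construct(int $rateLimit)
    {
        $this->gapMs = 1000 / $rateLimit;
    }

    // Returns how many ms the caller should wait before starting a task now.
    public function reserve(float $nowMs): float
    {
        $wait = max(0.0, $this->nextAt - $nowMs);
        $this->nextAt = max($nowMs, $this->nextAt) + $this->gapMs;
        return $wait;
    }
}

$spacer = new EvenSpacer(2);
echo $spacer->reserve(0.0), "\n"; // 0   (first task fires immediately)
echo $spacer->reserve(0.0), "\n"; // 500 (second task waits half a second)
```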

Custom parameters
-----------------

Sometimes you need to access variables from the request/response session. To do so, pass custom parameters in the same array as the options:

```
$crawler->queue([
    'uri' => 'http://www.google.com',
    'parameter1' => 'value1',
    'parameter2' => 'value2',
]);
```

then access them in the callback via `$res->task['parameter1']`, `$res->task['parameter2']`, and so on.
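Conceptually, the whole array given to `queue()` travels with the request and is handed back on the task (a plain-PHP sketch, not the library's internals; the `$handleResponse` closure is hypothetical):

```php
<?php
// Hypothetical sketch: everything given to queue() is carried on the task
// array and handed back to the response callback unchanged.
$task = [
    'uri'        => 'http://www.google.com',
    'parameter1' => 'value1',
    'parameter2' => 'value2',
];

$handleResponse = function (array $task): string {
    // In a real callback this would be $res->task['parameter1'], etc.
    return $task['parameter1'] . '/' . $task['parameter2'];
};

echo $handleResponse($task); // value1/value2
```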

Raw body
--------

If you are downloading files such as images, PDFs, or Word documents, you need to save the raw response body, which means the crawler should not convert it to a string. To make that happen, set `encoding` to null.

```
$crawler = new PHPCrawler([
    'maxConnections' => 10,
    'rateLimit' => 2,   // req per second
    'domParser' => false,
    'timeout' => 30000,
    'retries' => 3,
    'logger' => $logger,
]);

$crawler->on('response', function (Response $res, PHPCrawler $crawler) {
    if (!$res->success) {
        return;
    }

    echo "write ".$res->task['fileName']."\n";
    file_put_contents($res->task['fileName'], $res->body);
});

$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60881.txt.utf-8",
    'fileName' => '60881.txt',
]);

$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60882.txt.utf-8",
    'fileName' => '60882.txt',
]);

$crawler->queue([
    'uri' => "http://www.gutenberg.org/ebooks/60883.txt.utf-8",
    'fileName' => '60883.txt',
]);

$crawler->run();
```

Events
------

### Event: response

Triggered when a request completes.

```
$crawler->on('response', function (Response $res, PHPCrawler $crawler) {
    if (!$res->success) {
        return;
    }
});
```

### Event: drain

Triggered when the queue is empty.

```
$crawler->on('drain', function () {
    echo "queue is drained\n";
});
```
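Both events follow the familiar emitter pattern: listeners registered with `on()` are invoked when the library emits the event. A minimal emitter sketch (illustrative only; the `Emitter` class is hypothetical, not PHPCrawler's internal implementation):

```php
<?php
// Hypothetical minimal emitter showing the on()/emit() contract used above.
final class Emitter
{
    /** @var array<string, callable[]> */
    private array $listeners = [];

    public function on(string $event, callable $fn): void
    {
        $this->listeners[$event][] = $fn;
    }

    public function emit(string $event, mixed ...$args): void
    {
        foreach ($this->listeners[$event] ?? [] as $fn) {
            $fn(...$args);
        }
    }
}

$em = new Emitter();
$em->on('drain', function () {
    echo "queue is drained\n";
});
$em->emit('drain'); // prints "queue is drained"
```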

Advanced
--------


### Encoding

The HTTP body will be converted to UTF-8 from the configured encoding.

```
$crawler = new PHPCrawler([
    'encoding' => 'gbk',
]);
```

### Logger

Any PSR-3 logger instance can be used.

```
$logger = new Monolog\Logger("fox");
$logger->pushHandler(new \Monolog\Handler\StreamHandler(STDOUT, \Monolog\Logger::INFO));

$crawler = new PHPCrawler([
    'logger' => $logger,
]);
```

See [Monolog Reference](https://github.com/Seldaek/monolog).

### Coroutine

PHPCrawler is built on the Amp non-blocking concurrency framework, so it works with coroutines for excellent performance. [Amp async packages](https://amphp.org/packages) should be used in callbacks; in other words, neither the native PHP MySQL client nor native PHP file I/O is recommended. The `yield` keyword, like `await` in ES6, introduces non-blocking I/O.

```
$crawler->on('response', function (Response $res) use ($cli) {
    // $cli is an Artax HTTP client created beforehand
    /** @var \Amp\Artax\Response $res */
    $res = yield $cli->request("https://www.foxnews.com/politics/lindsey-graham-adam-schiff-is-doing-a-lot-of-damage-to-the-country-and-he-needs-to-stop");
    $body = yield $res->getBody();
    echo "=======> body " . strlen($body) . " bytes \n";
});
```
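How `yield` hands control to a scheduler can be illustrated with a plain PHP generator (a toy driver only; Amp's event loop is far more sophisticated, and `task()`/`run()` here are hypothetical):

```php
<?php
// Toy coroutine driver: the generator yields a pending operation's name,
// and the driver resumes it with that operation's result.
function task(): Generator
{
    $a = yield 'fetch-1'; // like `await`: suspend until the result arrives
    $b = yield 'fetch-2';
    return $a + $b;
}

function run(Generator $gen, array $results): mixed
{
    while ($gen->valid()) {
        $op = $gen->current();     // which operation is pending
        $gen->send($results[$op]); // resume the coroutine with its result
    }
    return $gen->getReturn();
}

echo run(task(), ['fetch-1' => 1, 'fetch-2' => 2]); // 3
```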

Work with DomParser
-------------------


[Symfony\\Component\\DomCrawler](https://packagist.org/packages/symfony/dom-crawler) is a handy tool for crawling pages. Response::dom will be injected with an instance of Symfony\\Component\\DomCrawler\\Crawler.

```
$crawler->on('response', function (Response $res) {
    if (!$res->success) {
        return;
    }

    $title = $res->dom->filter("title")->html();
    echo ">>> title: {$title}\n";
    $res->dom
        ->filter('.related-item a')
        ->each(function (Crawler $crawler) {
            echo ">>> links: ", $crawler->text(), "\n";
        });
});
```

See [DomCrawler Reference](https://symfony.com/doc/current/components/dom_crawler.html).

Other
-----


[API reference](/docs/api.md)

[Configuration](/docs/configuration.md)


### Community

Maintainer: [tallesairan](/maintainers/tallesairan)

Tags: crawler, amp, coroutine, spider, node-crawler


### Alternatives

- [vdb/php-spider](/packages/vdb-php-spider): a configurable and extensible PHP web spider
- [wa72/htmlpagedom](/packages/wa72-htmlpagedom): jQuery-inspired DOM manipulation extension for Symfony's Crawler
- [jaybizzle/laravel-crawler-detect](/packages/jaybizzle-laravel-crawler-detect): a Laravel package to detect web crawlers via the user agent
- [crwlr/crawler](/packages/crwlr-crawler): web crawling and scraping library
- [nucleos/lastfm](/packages/nucleos-lastfm): Last.fm webservice client for PHP

PHPackages © 2026

