ddliu/spider
============

Lightweight spider for the web.

Latest release: v0.2.9. License: MIT. Language: PHP. [Source](https://github.com/ddliu/spider), [Packagist](https://packagist.org/packages/ddliu/spider)

Spider [![Build Status](https://camo.githubusercontent.com/0e15a96ad43bae61c209d897eba117d293eae08186ba782e0702ac1702056290/68747470733a2f2f7472617669732d63692e6f72672f64646c69752f7370696465722e737667)](https://travis-ci.org/ddliu/spider)
==============================================================================================================================================================================================================================================

A flexible spider in PHP.

Concepts
--------

A spider contains a chain of processors called `pipes`. You can pass as many tasks as you like to the spider; each task goes through these `pipes` in order and gets processed.
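
The idea can be sketched without the library. The following is a toy model only: `runPipeline()` and the array-based tasks are hypothetical stand-ins chosen for illustration, not how ddliu/spider is implemented.

```php
<?php
// Toy model only: hypothetical names, not the ddliu/spider implementation.

/**
 * Run every task through every pipe, in order.
 *
 * @param callable[] $pipes Each pipe takes a task array and returns the updated task.
 * @param array[]    $tasks Initial tasks.
 * @return array[] Processed tasks.
 */
function runPipeline(array $pipes, array $tasks): array
{
    $done = [];
    foreach ($tasks as $task) {
        foreach ($pipes as $pipe) {
            $task = $pipe($task);
        }
        $done[] = $task;
    }
    return $done;
}

$pipes = [
    // Pipe 1: trim whitespace around the URL.
    function (array $task): array {
        $task['url'] = trim($task['url']);
        return $task;
    },
    // Pipe 2: record the URL length as a stand-in for real processing.
    function (array $task): array {
        $task['length'] = strlen($task['url']);
        return $task;
    },
];

$results = runPipeline($pipes, [
    ['url' => ' http://example.com '],
    ['url' => 'http://example.org/a'],
]);
```

Each pipe sees the result of the previous one, which is why the order of the `pipe()` calls matters in the usage example.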

Installation
------------

```
composer require ddliu/spider
```

Requirements
------------

- PHP 5.3+
- curl (used by `RequestPipe`)

Dependencies
------------

See `composer.json`.

Usage
-----

```
use ddliu\spider\Spider;
use ddliu\spider\Pipe\NormalizeUrlPipe;
use ddliu\spider\Pipe\RequestPipe;
use ddliu\spider\Pipe\DomCrawlerPipe;

(new Spider())
    ->pipe(new NormalizeUrlPipe())
    ->pipe(new RequestPipe())
    ->pipe(new DomCrawlerPipe())
    ->pipe(function($spider, $task) {
        // queue every link found on the page as a new sub task
        $task['$dom']->filter('a')->each(function($a) use ($task) {
            $href = $a->attr('href');
            $task->fork($href);
        });
    })
    // the entry task
    ->addTask('http://example.com')
    ->run()
    ->report();
```

Find more examples in the `examples` folder.

Spider
------

The `Spider` class.

### Options

- `limit`: maximum number of tasks to run

### Methods

- `pipe($pipe)`: add a pipe
- `addTask($task)`: add a task
- `run()`: run the spider
- `report()`: write report to log

Task
----

A task contains a data array and some helper methods.

The `Task` class implements the `ArrayAccess` interface, so you can access its data like an array.
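
For readers unfamiliar with `ArrayAccess`, here is a minimal sketch of how a class can expose its data through array syntax. `ToyTask` is a hypothetical stand-in and not the library's `Task` class; only the `ArrayAccess` interface itself is standard PHP. The typed signatures assume PHP 8.0 or later.

```php
<?php
// Hypothetical stand-in for a task object; only ArrayAccess is real PHP API.
class ToyTask implements ArrayAccess
{
    /** @var array Task data, e.g. ['url' => ...] */
    private array $data;

    public function __construct(array $data = [])
    {
        $this->data = $data;
    }

    public function offsetExists(mixed $offset): bool
    {
        return isset($this->data[$offset]);
    }

    public function offsetGet(mixed $offset): mixed
    {
        return $this->data[$offset] ?? null;
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        if ($offset === null) {
            $this->data[] = $value; // supports $task[] = $value
        } else {
            $this->data[$offset] = $value;
        }
    }

    public function offsetUnset(mixed $offset): void
    {
        unset($this->data[$offset]);
    }
}

$task = new ToyTask(['url' => 'http://example.com']);
$task['content'] = '<html></html>';    // calls offsetSet()
$hasContent = isset($task['content']); // calls offsetExists()
unset($task['content']);               // calls offsetUnset()
```

This is the mechanism that makes expressions such as `$task['url']` and `$task['$dom']` work on an object.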

### Methods

- `fork($task)`: add a sub task to the spider
- `ignore()`: ignore the task

Pipes
-----

Pipes define how each task is processed.

A pipe can be a function:

```
function($spider, $task) {}
```

Or a class that extends `BasePipe`:

```
use ddliu\spider\Pipe\BasePipe;

class MyPipe extends BasePipe {
    public function run($spider, $task) {
        // process the task...
    }
}
```

Useful Pipes
------------

### NormalizeUrlPipe

Normalize `$task['url']`.

```
new NormalizeUrlPipe()
```

### RequestPipe

Send an HTTP request to `$task['url']` and save the result in `$task['content']`.

```
new RequestPipe(array(
    'useragent' => 'myspider',
    'timeout' => 10
));
```

### FileCachePipe

Cache the output of another pipe (e.g. `RequestPipe`).

```
$requestPipe = new RequestPipe();
$cacheForReqPipe = new FileCachePipe($requestPipe, [
    'input' => 'url',
    'output' => 'content',
    'root' => '/path/to/cache/root',
]);
```
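
Wrapping one operation in a file cache can be sketched in plain PHP. This is an illustration of the general technique and not the library's `FileCachePipe`: `withFileCache()` is a hypothetical helper, and the reading that `input` names the cache key and `output` names the cached value is an assumption based on the option names above.

```php
<?php
// Sketch only: withFileCache() is a hypothetical helper, not the library's FileCachePipe.

/**
 * Wrap a string-returning callable with a file cache keyed by its input.
 */
function withFileCache(callable $producer, string $root): callable
{
    return function (string $input) use ($producer, $root): string {
        if (!is_dir($root)) {
            mkdir($root, 0777, true);
        }
        // Hash the input so that any URL maps to a safe file name.
        $file = $root . DIRECTORY_SEPARATOR . sha1($input);
        if (is_file($file)) {
            return file_get_contents($file); // cache hit
        }
        $output = $producer($input);         // cache miss: do the real work
        file_put_contents($file, $output);
        return $output;
    };
}

$hits = 0;
$slowFetch = function (string $url) use (&$hits): string {
    $hits++; // count how often the wrapped callable really runs
    return "content of $url";
};

$root = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'toy-spider-cache-' . uniqid();
$cached = withFileCache($slowFetch, $root);
$first = $cached('http://example.com');
$second = $cached('http://example.com'); // served from the cache file
```

The second call returns the same content without invoking `$slowFetch` again, which is the behaviour you would want when re-running a crawl against pages that were already downloaded.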

### RetryPipe

Retry the wrapped pipe on failure.

```
$requestPipe = new RequestPipe();
$retryForReqPipe = new RetryPipe($requestPipe, [
    'count' => 10,
]);
```
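
The retry-wrapper pattern itself can be sketched in a few lines. This is not the library's `RetryPipe`: `withRetry()` is a hypothetical helper, and it assumes that a failure is signalled by a thrown exception, which the README does not specify.

```php
<?php
// Sketch only: withRetry() is a hypothetical helper, not the library's RetryPipe.

/**
 * Wrap a callable so that it is retried when it throws.
 *
 * @param callable $pipe  The operation to attempt.
 * @param int      $count Maximum number of attempts (must be >= 1).
 */
function withRetry(callable $pipe, int $count): callable
{
    return function (...$args) use ($pipe, $count) {
        $lastError = null;
        for ($attempt = 1; $attempt <= $count; $attempt++) {
            try {
                return $pipe(...$args);
            } catch (Exception $e) {
                $lastError = $e; // remember the failure and try again
            }
        }
        // All attempts failed: surface the last error.
        throw $lastError ?? new InvalidArgumentException('count must be at least 1');
    };
}

// A flaky operation that fails twice before succeeding.
$calls = 0;
$flaky = function (string $url) use (&$calls): string {
    $calls++;
    if ($calls < 3) {
        throw new RuntimeException("temporary failure for $url");
    }
    return "content of $url";
};

$fetch = withRetry($flaky, 10);
$content = $fetch('http://example.com');
```

Here the wrapped callable succeeds on the third attempt, so only three of the ten allowed attempts are used.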

### DomCrawlerPipe

Create a [DomCrawler](https://github.com/symfony/DomCrawler) from `$task['content']`. Access it with `$task['$dom']` in subsequent pipes.

### ReportPipe

Write a report at a fixed interval, here every 10 minutes (600 seconds).

```
new ReportPipe(array(
    'seconds' => 600
))
```

Logging
-------

`$spider->logger` is an instance of `Monolog\Logger`. You can add logging handlers to it before starting the spider:

```
use Monolog\Logger;
use Monolog\Handler\StreamHandler;

// Logger::WARNING requires the Monolog\Logger import above
$spider->logger->pushHandler(new StreamHandler('path/to/your.log', Logger::WARNING));
```

TODO/Ideas
----------

- Real world examples.
- Running tasks concurrently (with pthread?).

Alternate
---------

Use the [Go version](http://github.com/ddliu/go-spider) for better performance!

