
vdb/php-spider
==============

A configurable and extensible PHP web spider

Version v0.7.6. License: MIT. Requires PHP >=8.0. [Source](https://github.com/mvdbos/php-spider), [Packagist](https://packagist.org/packages/vdb/php-spider), [issues](https://github.com/mvdbos/php-spider/issues), [pull requests](https://github.com/mvdbos/php-spider/pulls).

[![Build Status](https://github.com/mvdbos/php-spider/actions/workflows/php.yml/badge.svg?branch=master)](https://github.com/mvdbos/php-spider/actions/workflows/php.yml)[![Latest Stable Version](https://camo.githubusercontent.com/cf8c885f71e5e0c98b9a67308df74415e5685640920cb14d058154058277d0d4/68747470733a2f2f706f7365722e707567782e6f72672f7664622f7068702d7370696465722f76)](https://packagist.org/packages/vdb/php-spider)[![Total Downloads](https://camo.githubusercontent.com/306078613dcb43d45f58e35a862e540ec0f4550b8f6f8a0f03796f237c021421/68747470733a2f2f706f7365722e707567782e6f72672f7664622f7068702d7370696465722f646f776e6c6f616473)](https://packagist.org/packages/vdb/php-spider)[![License](https://camo.githubusercontent.com/21a55deec6f195cfadf46e35da596fe21221c76db4ad808086a3fee6773f1748/68747470733a2f2f706f7365722e707567782e6f72672f7664622f7068702d7370696465722f6c6963656e7365)](https://packagist.org/packages/vdb/php-spider)

PHP-Spider Features
===================

- supports two traversal algorithms: breadth-first and depth-first
- supports crawl depth limiting, queue size limiting and max downloads limiting
- supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
- comes with a useful set of URI filters, such as robots.txt and Domain limiting
- supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
- supports caching downloaded resources with configurable max age (see [example](example/example_cache.php) and [documentation](docs/filters/CachedResourceFilter.md))
- supports custom request handling logic
- supports Basic, Digest and NTLM HTTP authentication. See [example](example/example_basic_auth.php).
- comes with a useful set of persistence handlers (memory, file)
- supports custom persistence handlers
- collects statistics about the crawl for reporting
- dispatches useful events, allowing developers to add even more custom behavior
- supports a politeness policy

This spider does not support JavaScript.
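To make the difference between the two traversal algorithms and the depth limit concrete, here is a minimal sketch in plain PHP. It does not use PHP-Spider: the `$links` graph and the `traverse()` function are hypothetical, invented for illustration, and the library's own queue handling may differ.

```php
<?php
// Hypothetical link graph: page => pages it links to (not part of PHP-Spider).
$links = [
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1', '/a/2'],
    '/b'   => ['/b/1'],
    '/a/1' => [],
    '/a/2' => [],
    '/b/1' => [],
];

/**
 * Visit pages starting at $seed, following links at most $maxDepth hops deep.
 * Breadth-first takes the oldest queued page next (FIFO);
 * depth-first takes the newest one (LIFO).
 */
function traverse(array $links, string $seed, bool $breadthFirst, int $maxDepth): array
{
    $queue = [[$seed, 0]];
    $seen = [$seed => true];
    $order = [];

    while ($queue !== []) {
        [$page, $depth] = $breadthFirst ? array_shift($queue) : array_pop($queue);
        $order[] = $page;

        if ($depth >= $maxDepth) {
            continue; // depth limit reached: do not follow links from this page
        }
        foreach ($links[$page] ?? [] as $next) {
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = [$next, $depth + 1];
            }
        }
    }

    return $order;
}

echo implode(' ', traverse($links, '/', true, 2)), "\n";  // / /a /b /a/1 /a/2 /b/1
echo implode(' ', traverse($links, '/', false, 2)), "\n"; // / /b /b/1 /a /a/2 /a/1
```

With a depth limit of 1, the same graph yields only the start page and the pages it links to directly.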

Installation
------------

The easiest way to install PHP-Spider is with [Composer](https://getcomposer.org/). Find it on [Packagist](https://packagist.org/packages/vdb/php-spider).

```
$ composer require vdb/php-spider
```

Usage
-----

The following is a simple example. The code can be found in [example/example\_simple.php](example/example_simple.php). For a more complete, real-world example with logging, caching and filters, see [example/example\_complex.php](example/example_complex.php).

> Note that by default, the spider stops processing when it encounters a 4XX or 5XX error response. To set the spider up to keep processing, please see [the link checker example](https://github.com/mvdbos/php-spider/blob/master/example/example_link_check.php). It uses a custom request handler that configures the default Guzzle request handler not to fail on 4XX and 5XX responses.

First create the spider

```
$spider = new Spider('http://www.dmoz.org');
```

Add a URI discoverer. Without it, the spider does nothing. In this case, we want all `<a>` nodes from a certain `<div>`

```
$spider->addDiscoverer(new XPathExpressionDiscoverer("//div[@id='catalogs']//a"));
```
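To see which links an XPath expression like the one above matches, it can be tried on a small document using PHP's built-in DOM extension. This is only a sketch: the markup below is invented for illustration, and it runs the expression directly with `DOMXPath` rather than through PHP-Spider.

```php
<?php
// Hypothetical page markup, invented for illustration only.
$html = <<<'HTML'
<html><body>
  <div id="nav"><a href="/about">About</a></div>
  <div id="catalogs">
    <ul>
      <li><a href="/Arts/">Arts</a></li>
      <li><a href="/Science/">Science</a></li>
    </ul>
  </div>
</body></html>
HTML;

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);

$document = new DOMDocument();
$document->loadHTML($html);

// Same expression as in the discoverer above: every <a> nested anywhere
// inside the <div> whose id is "catalogs".
$xpath = new DOMXPath($document);
$hrefs = [];
foreach ($xpath->query("//div[@id='catalogs']//a") as $anchor) {
    $hrefs[] = $anchor->getAttribute('href');
}

print_r($hrefs); // contains /Arts/ and /Science/, but not /about
```

The link in the `nav` block is skipped because it is not a descendant of the `catalogs` element.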

Set some sane options for this example. In this case, we only get the first 10 items from the start page.

```
$spider->setMaxDepth(1);
$spider->setMaxQueueSize(10);
```

Add a listener to collect stats from the Spider and the QueueManager. There are more components that dispatch events you can use.

```
$statsHandler = new StatsHandler();
$spider->getQueueManager()->getDispatcher()->addSubscriber($statsHandler);
$spider->getDispatcher()->addSubscriber($statsHandler);
```

Execute the crawl

```
$spider->crawl();
```

When crawling is done, we could get some info about the crawl

```
echo "\n  ENQUEUED:  " . count($statsHandler->getQueued());
echo "\n  SKIPPED:   " . count($statsHandler->getFiltered());
echo "\n  FAILED:    " . count($statsHandler->getFailed());
echo "\n  PERSISTED: " . count($statsHandler->getPersisted());
```

Finally we could do some processing on the downloaded resources. In this example, we will echo the title of all resources

```
echo "\n\nDOWNLOADED RESOURCES: ";
foreach ($spider->getDownloader()->getPersistenceHandler() as $resource) {
    echo "\n - " . $resource->getCrawler()->filterXPath('//title')->text();
}
```

### Fluent Configuration

For most common settings, you can configure the spider fluently via convenience methods on `Spider` and keep related configuration in one place.

```
use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;
use VDB\Spider\Filter\Prefetch\AllowedHostsFilter;
use VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler;
use VDB\Spider\QueueManager\QueueManagerInterface;

$spider = new Spider('https://example.com');

// Configure limits and traversal in one place
$spider
    ->setDownloadLimit(50)                         // Max resources to download
    ->setTraversalAlgorithm(QueueManagerInterface::ALGORITHM_BREADTH_FIRST)
    ->setMaxDepth(2)                               // Max discovery depth
    ->setMaxQueueSize(500)                         // Max URIs in queue
    ->setPersistenceHandler(new FileSerializedResourcePersistenceHandler(__DIR__.'/results'))
    ->addDiscoverer(new XPathExpressionDiscoverer('//a')) // Add discoverers
    ->addFilter(new AllowedHostsFilter(['example.com'])); // Add prefetch filters

// Optional: enable politeness policy (delay between requests to same domain)
$spider->enablePolitenessPolicy(100);

$spider->crawl();
```
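The comment in the example describes the politeness policy as a delay between requests to the same domain. The sketch below illustrates that idea in isolation. It is not the library's implementation: the `PolitenessTimer` class is hypothetical, and treating the delay as milliseconds is an assumption.

```php
<?php
// Hypothetical helper, not part of PHP-Spider.
final class PolitenessTimer
{
    /** @var array<string, float> host => time the last request went out, in ms */
    private array $lastRequestAt = [];

    // Assumption: the delay is expressed in milliseconds.
    public function __construct(private int $delayMs)
    {
    }

    /** Milliseconds to wait before requesting $uri at time $nowMs. */
    public function waitFor(string $uri, float $nowMs): float
    {
        $host = (string) parse_url($uri, PHP_URL_HOST);
        $wait = 0.0;

        if (isset($this->lastRequestAt[$host])) {
            $elapsed = $nowMs - $this->lastRequestAt[$host];
            $wait = max(0.0, $this->delayMs - $elapsed);
        }

        // Record when the request will actually go out.
        $this->lastRequestAt[$host] = $nowMs + $wait;

        return $wait;
    }
}

$timer = new PolitenessTimer(100);
var_dump($timer->waitFor('https://example.com/a', 0.0));  // no wait: first request to this host
var_dump($timer->waitFor('https://example.com/b', 30.0)); // waits 70 ms: same host, 30 ms later
var_dump($timer->waitFor('https://example.org/', 40.0));  // no wait: different host
```

Passing the current time in as an argument keeps the sketch deterministic; a real crawler would read the clock and sleep for the returned duration.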

### Using Cache to Skip Already Downloaded Resources

To avoid re-downloading resources that are already cached (useful for incremental crawls):

```
use VDB\Spider\Spider;
use VDB\Spider\Filter\Prefetch\CachedResourceFilter;
use VDB\Spider\PersistenceHandler\FileSerializedResourcePersistenceHandler;

// Use a fixed spider ID to share cache across runs
$spiderId = 'my-spider-cache';
$spider = new Spider('http://example.com', null, null, null, $spiderId);

// Set up file persistence
$resultsPath = __DIR__ . '/cache';
$spider->getDownloader()->setPersistenceHandler(
    new FileSerializedResourcePersistenceHandler($resultsPath)
);

// Add cache filter - skip resources downloaded within the last hour
$maxAgeSeconds = 3600; // 1 hour (set to 0 to always use cache)
$cacheFilter = new CachedResourceFilter($resultsPath, $spiderId, $maxAgeSeconds);
$spider->getDiscovererSet()->addFilter($cacheFilter);

$spider->crawl();
```

For more details, see the [CachedResourceFilter documentation](docs/filters/CachedResourceFilter.md) and [example](example/example_cache.php).
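The max-age rule used in the cache example can be pictured as a simple freshness check. The sketch below is not the `CachedResourceFilter` implementation: `isCacheFresh()` is a hypothetical function, the "0 means always use cache" behavior is taken from the comment in the example, and treating a copy that is exactly `$maxAgeSeconds` old as still fresh is an assumption.

```php
<?php
/**
 * Hypothetical helper, not part of PHP-Spider: decide whether a cached copy
 * saved at $savedAt (Unix timestamp) can still be reused at time $now.
 */
function isCacheFresh(int $savedAt, int $maxAgeSeconds, int $now): bool
{
    // Assumption, based on the example's comment: 0 means the cache never expires.
    if ($maxAgeSeconds === 0) {
        return true;
    }

    return ($now - $savedAt) <= $maxAgeSeconds;
}

$now = 1_700_000_000; // fixed timestamp so the result is reproducible

var_dump(isCacheFresh($now - 600, 3600, $now));  // true: saved 10 minutes ago
var_dump(isCacheFresh($now - 7200, 3600, $now)); // false: saved 2 hours ago
var_dump(isCacheFresh($now - 7200, 0, $now));    // true: max age 0, always reuse
```

A fresh resource would be skipped by the crawl; a stale one would be downloaded again.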

Contributing
------------

Contributing to PHP-Spider is as easy as forking the repository on GitHub and submitting a pull request. The Symfony documentation contains an excellent guide on how to do that properly: [Submitting a Patch](http://symfony.com/doc/current/contributing/code/patches.html#step-1-setup-your-environment).

There are a few requirements for a pull request to be accepted:

- Follow the coding standards: PHP-Spider follows the coding standards defined in the [PSR-0](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-0.md), [PSR-1](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-1-basic-coding-standard.md) and [PSR-2](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md) Coding Style Guides;
- Prove that the code works with unit tests and that coverage remains 100%;

> Note: An easy way to check if your code conforms to PHP-Spider is by running the script `bin/static-analysis`, which is part of this repo. This will run the following tools, configured for PHP-Spider: PHP CodeSniffer, PHP Mess Detector and PHP Copy/Paste Detector.

> Note: To run PHPUnit with coverage, and to check that coverage == 100%, you can run `bin/coverage-enforce`.

### Local Testing with GitHub Actions

You can run the full CI pipeline locally using [nektos/act](https://nektosact.com/):

```
# Fast path: run the full workflow with PHP 8.0 (recommended)
./bin/check
```

Or use the underlying act wrapper directly:

```
# Run all tests locally
./bin/act

# Run specific PHP version locally
./bin/act --matrix php-versions:8.0

# Run specific job or view available workflows
./bin/act -l
```

For more details, see [.github/LOCAL\_TESTING.md](.github/LOCAL_TESTING.md).

Support
-------

To report bugs or request features, it is best to create an [issue](https://github.com/mvdbos/php-spider/issues) on GitHub. It is even better to accompany it with a pull request. ;-)

License
-------

PHP-Spider is licensed under the MIT license.



