
brittainmedia/phpcrawl
======================

PHPCrawl is a webcrawler/webspider-library written in PHP. It supports filters, limiters, cookie-handling, robots.txt-handling, multiprocessing and much more.

Version 0.10.1 · GPL-2.0 · PHP · [Source](https://github.com/crispy-computing-machine/phpcrawl) · [Packagist](https://packagist.org/packages/brittainmedia/phpcrawl)

Now archived due to fundamental issues. Replaced by [SuperSimpleCrawler](https://github.com/crispy-computing-machine/SuperSimpleCrawler)
----------------------------------------------------------------------------------------------------------------------------------------

phpcrawl
========

[![Latest Stable Version](https://camo.githubusercontent.com/fb3eed9f14edb48bd824e09b6b86317c2d6c48c70e4f64541437f5b0cfa9c6e1/68747470733a2f2f706f7365722e707567782e6f72672f627269747461696e6d656469612f706870637261776c2f762f737461626c65)](https://packagist.org/packages/brittainmedia/phpcrawl) [![Total Downloads](https://camo.githubusercontent.com/bc35b0f71946123651bbef93e8a62738690440f4c121d9831fdb9877adfc96f2/68747470733a2f2f706f7365722e707567782e6f72672f627269747461696e6d656469612f706870637261776c2f646f776e6c6f616473)](https://packagist.org/packages/brittainmedia/phpcrawl) [![License](https://camo.githubusercontent.com/337ce0fc6193d4bc74c53271ac5ca2d70971ea018317521ce8931a03a0a85df6/68747470733a2f2f706f7365722e707567782e6f72672f627269747461696e6d656469612f706870637261776c2f6c6963656e7365)](https://packagist.org/packages/brittainmedia/phpcrawl)

```
composer require brittainmedia/phpcrawl
```

```
use PHPCrawl\Enums\PHPCrawlerAbortReasons;
use PHPCrawl\Enums\PHPCrawlerMultiProcessModes;
use PHPCrawl\Enums\PHPCrawlerUrlCacheTypes;
use PHPCrawl\PHPCrawler;
use PHPCrawl\PHPCrawlerDocumentInfo;

// New custom crawler
$crawler = new class() extends PHPCrawler {

    /**
     * @param $PageInfo
     * @return int
     */
    function handleDocumentInfo($PageInfo): int
    {
        // Print the URL of the document
        echo "URL: " . $PageInfo->url . PHP_EOL;

        // Print the http-status-code
        echo "HTTP-statuscode: " . $PageInfo->http_status_code . PHP_EOL;

        // Print the number of found links in this document
        echo "Links found: " . count($PageInfo->links_found_url_descriptors) . PHP_EOL;

        // ..

        // continue crawling
        return 1;
    }
};

$crawler->setURL($url = 'https://bbc.co.uk/news');

// Optional
//$crawler->setProxy($proxy_host, $proxy_port, $proxy_username, $proxy_password);

// Only receive content of files with content-type "text/html"
$crawler->addContentTypeReceiveRule('#text/html#');

// Ignore links to ads...
$advertFilterRule = "/\bads\b|2o7|a1\.yimg|ad(brite|click|farm|revolver|server|tech|vert)|at(dmt|wola)|banner|bizrate|blogads|bluestreak|burstnet|casalemedia|coremetrics|(double|fast)click|falkag|(feedster|right)media|googlesyndication|hitbox|httpads|imiclk|intellitxt|js\.overture|kanoodle|kontera|mediaplex|nextag|pointroll|qksrv|speedera|statcounter|tribalfusion|webtrends/";
$crawler->addURLFilterRule($advertFilterRule);

// Store and send cookie-data like a browser does
$crawler->enableCookieHandling(true);

// Limit the total number of requests (here: stop after 1 request)
$crawler->setRequestLimit(1);

/**
 * 3 - The crawler only follows links to pages or files located in or under the same path as the root URL.
 * E.g. if the root-url is
 * "http://www.foo.com/bar/index.html",
 * the crawler will follow links to "http://www.foo.com/bar/page.html" and "http://www.foo.com/bar/path/index.html",
 * but not links to "http://www.foo.com/page.html".
 *
 */
$crawler->setFollowMode(3);

// Keep going until resolved
$crawler->setFollowRedirectsTillContent(TRUE);

// tmp directory
$crawler->setWorkingDirectory(sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'phpcrawl' .DIRECTORY_SEPARATOR);

// Cache
$crawler->setUrlCacheType(PHPCrawlerUrlCacheTypes::URLCACHE_MEMORY);

// File crawling - Store to file or set limit for large files
#$crawler->addStreamToFileContentType('##');
#$crawler->setContentSizeLimit(500000); // Google only crawls pages 500kb and below?

//Decides whether the crawler should obey "nofollow"-tags, we will obey
$crawler->obeyNoFollowTags(true);

//Decides whether the crawler should obey robots.txt, we will not obey!
$crawler->obeyRobotsTxt(false);

// Delay between requests (in seconds) to avoid being blocked
$crawler->setRequestDelay(0.5);

// Send a browser-like user-agent string (or use a bot-style one instead)
$crawler->setUserAgentString('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0');

// Multiprocess (optional) - Forces PHPCrawlerUrlCacheTypes::URLCACHE_SQLITE use, must have link priorities!
$crawler->addLinkPriority("/news/", 10);
$crawler->addLinkPriority("/\.jpeg/", 5);
$crawler->goMultiProcessed(PHPCrawlerMultiProcessModes::MPMODE_CHILDS_EXECUTES_USERCODE);

// That's enough, now here we go
$crawler->go();

// At the end, after the process is finished, we print a short
// report (see method getProcessReport() for more information)
$report = $crawler->getProcessReport();

echo 'Finished crawling site: ' . $url . PHP_EOL;
echo 'Summary:' . PHP_EOL;
echo 'Links followed: ' . $report->links_followed . PHP_EOL;
echo 'Documents received: ' . $report->files_received . PHP_EOL;
echo 'Bytes received: ' . $report->bytes_received . ' bytes' . PHP_EOL;
echo 'Process runtime: ' . $report->process_runtime . ' sec' . PHP_EOL;
echo 'Process memory: ' . $report->memory_peak_usage . ' bytes' . PHP_EOL;
echo 'Server connect time: ' . $report->avg_server_connect_time . ' sec' . PHP_EOL;
echo 'Server response time: ' . $report->avg_server_response_time . ' sec' . PHP_EOL;
echo 'Server transfer rate: ' . $report->avg_proc_data_transfer_rate . ' bytes/sec' . PHP_EOL;

$abortReason = $report->abort_reason;
switch ($abortReason) {
    case PHPCrawlerAbortReasons::ABORTREASON_PASSEDTHROUGH:
        echo 'Crawling-process aborted because everything is done/passed through.' . PHP_EOL;
        break;
    case PHPCrawlerAbortReasons::ABORTREASON_TRAFFICLIMIT_REACHED:
        echo 'Crawling-process aborted because the traffic limit set by user was reached.' . PHP_EOL;
        break;
    case PHPCrawlerAbortReasons::ABORTREASON_FILELIMIT_REACHED:
        echo 'Crawling-process aborted because the file limit set by user was reached.' . PHP_EOL;
        break;
    case PHPCrawlerAbortReasons::ABORTREASON_USERABORT:
        echo 'Crawling-process aborted because the handleDocumentInfo-method returned a negative value.' . PHP_EOL;
        break;
    default:
        echo 'Unknown abort reason.' . PHP_EOL;
        break;

}
```
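
The follow-mode 3 rule described in the example's comment can be illustrated with a small standalone check. This is only a sketch of the rule as stated there, not PHPCrawl's actual implementation; `isInOrUnderRootPath()` is a hypothetical helper that does not exist in the library, and the URLs are the ones from the comment.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (NOT part of PHPCrawl): returns true when $candidate
 * is on the same host as $root and located in or under the directory of
 * the root URL's path.
 */
function isInOrUnderRootPath(string $root, string $candidate): bool
{
    $r = parse_url($root);
    $c = parse_url($candidate);

    // Both URLs must parse and carry a host.
    if (!is_array($r) || !is_array($c) || !isset($r['host'], $c['host'])) {
        return false;
    }

    // Hosts are compared case-insensitively.
    if (strcasecmp($r['host'], $c['host']) !== 0) {
        return false;
    }

    // Directory part of the root URL, e.g. "/bar/index.html" -> "/bar/".
    $rootPath = $r['path'] ?? '/';
    $lastSlash = strrpos($rootPath, '/');
    $rootDir = $lastSlash === false ? '/' : substr($rootPath, 0, $lastSlash + 1);

    $candidatePath = $c['path'] ?? '/';

    // The candidate qualifies if its path starts with the root directory.
    return strncmp($candidatePath, $rootDir, strlen($rootDir)) === 0;
}

$root = 'http://www.foo.com/bar/index.html';

var_dump(isInOrUnderRootPath($root, 'http://www.foo.com/bar/page.html'));       // bool(true)
var_dump(isInOrUnderRootPath($root, 'http://www.foo.com/bar/path/index.html')); // bool(true)
var_dump(isInOrUnderRootPath($root, 'http://www.foo.com/page.html'));           // bool(false)
```

The sketch deliberately ignores scheme, port and query string; the library may treat those differently.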

Initially just a copy forked from [mmerian](https://github.com/mmerian/phpcrawl) for use with Composer.

*Due to the [main project](https://sourceforge.net/projects/phpcrawl/files/PHPCrawl/) now seemingly being abandoned (having no updates for 4 years) I am going to proceed to make any changes/fixes in this repository.*

### Latest updates

- 0.9: compatible with PHP 7 only.
- 0.10: compatible with PHP 8. ([Submit issues](https://github.com/crispy-computing-machine/phpcrawl/issues))
- Introduced namespaces
- Lots of bug fixes
- Refactored various class sections

Now archived...

