Active · Library · [Utility & Helpers](/categories/utility)

heimrichhannot/crawler
======================

Crawl all internal links found on a website

6.0.0 (5y ago) · 12.8k · MIT · PHP · PHP ^7.4|^8.0

Since Nov 2 · Pushed 2y ago

[Source](https://github.com/heimrichhannot/crawler) · [Packagist](https://packagist.org/packages/heimrichhannot/crawler) · [Docs](https://github.com/spatie/crawler)

H&H Crawler
===========

A fork of spatie/crawler v2 with some adjustments. Only used for an internal project.

Crawl links on a website
========================

[![Latest Version on Packagist](https://camo.githubusercontent.com/5920a25c35e56ff898c9f879853569a34014e8733ecdcc9db76fc1d27db9f4fe/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f7370617469652f637261776c65722e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/spatie/crawler)[![Software License](https://camo.githubusercontent.com/55c0218c8f8009f06ad4ddae837ddd05301481fcf0dff8e0ed9dadda8780713e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d627269676874677265656e2e7376673f7374796c653d666c61742d737175617265)](LICENSE.md)[![Build Status](https://camo.githubusercontent.com/fd045cd475bd89ae41f95b7f5c49b5f983554c982a480cfdbf44346fcb909bbb/68747470733a2f2f696d672e736869656c64732e696f2f7472617669732f7370617469652f637261776c65722f6d61737465722e7376673f7374796c653d666c61742d737175617265)](https://travis-ci.org/spatie/crawler)[![Quality Score](https://camo.githubusercontent.com/2878ec21445dcf356a4ca91b2cac00466b539756dcd1acf9aa23c48d30689182/68747470733a2f2f696d672e736869656c64732e696f2f7363727574696e697a65722f672f7370617469652f637261776c65722e7376673f7374796c653d666c61742d737175617265)](https://scrutinizer-ci.com/g/spatie/crawler)[![StyleCI](https://camo.githubusercontent.com/6971520f550395890ea33486fbb4a22de4a4e59b4157ad1cb40f80ffe0ffc2fb/68747470733a2f2f7374796c6563692e696f2f7265706f732f34353430363333382f736869656c64)](https://styleci.io/repos/45406338)[![Total Downloads](https://camo.githubusercontent.com/f0de3723819e89d47e5df65d7374aacf237e7ac08f5de198ee54d8234b1e5d23/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f64742f7370617469652f637261776c65722e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/spatie/crawler)

This package provides a class to crawl links on a website. Under the hood, Guzzle promises are used to [crawl multiple urls concurrently](http://docs.guzzlephp.org/en/latest/quickstart.html?highlight=pool#concurrent-requests).

Because the crawler can execute JavaScript, it can crawl JavaScript-rendered sites. Under the hood, [headless Chrome](https://github.com/spatie/browsershot) is used to power this feature.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects [on our website](https://spatie.be/opensource).

Installation
------------

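This package can be installed via Composer. Note that the command below installs the upstream `spatie/crawler` package; this fork is published on Packagist as `heimrichhannot/crawler`: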

```
composer require spatie/crawler
```

Usage
-----

The crawler can be instantiated like this:

```
use Spatie\Crawler\Crawler;

// $observer must implement \Spatie\Crawler\CrawlObserver
Crawler::create()
    ->setCrawlObserver($observer)
    ->startCrawling($url);
```

The argument passed to `setCrawlObserver` must be an object that implements the `\Spatie\Crawler\CrawlObserver` interface:

```
/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Spatie\Crawler\Url $url
 */
public function willCrawl(Url $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Spatie\Crawler\Url $url
 * @param \Psr\Http\Message\ResponseInterface $response
 * @param \Spatie\Crawler\Url $foundOn
 */
public function hasBeenCrawled(Url $url, $response, Url $foundOn = null);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();
```
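As an illustration, an observer that records the status code of every crawled url might look like the sketch below. The `Url` class and `CrawlObserver` interface here are stand-ins that mirror the signatures shown above so the example runs on its own; `StatusCollectingObserver` is a hypothetical name, and both the string cast on `Url` and the possibility of a `null` response are assumptions rather than documented behaviour.

```php
<?php

// Stand-ins so the sketch runs without the package installed. In a real
// project, remove them and import \Spatie\Crawler\Url and
// \Spatie\Crawler\CrawlObserver instead.
final class Url
{
    private $url;

    public function __construct(string $url)
    {
        $this->url = $url;
    }

    // Assumption: the package's Url can be cast to a string.
    public function __toString(): string
    {
        return $this->url;
    }
}

interface CrawlObserver
{
    public function willCrawl(Url $url);

    public function hasBeenCrawled(Url $url, $response, ?Url $foundOn = null);

    public function finishedCrawling();
}

// Hypothetical observer: records the status code seen for every crawled url.
final class StatusCollectingObserver implements CrawlObserver
{
    /** @var array<string, int|null> status code per url, null when no response */
    public $statuses = [];

    public function willCrawl(Url $url)
    {
        // Nothing to do before the request in this sketch.
    }

    public function hasBeenCrawled(Url $url, $response, ?Url $foundOn = null)
    {
        // Guard against a missing response (assumed possible on failed requests).
        $this->statuses[(string) $url] = $response === null
            ? null
            : $response->getStatusCode();
    }

    public function finishedCrawling()
    {
        // Print a short report once the crawl is over.
        foreach ($this->statuses as $url => $status) {
            echo $url, ' => ', $status === null ? 'no response' : $status, PHP_EOL;
        }
    }
}
```

An instance of such a class would then be passed to `setCrawlObserver()` before calling `startCrawling($url)`.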

### Executing JavaScript

By default the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:

```
Crawler::create()
    ->executeJavaScript()
    ...
```

Under the hood [headless Chrome](https://github.com/spatie/browsershot) is used to execute JavaScript. Here are some pointers on [how to install it on your system](https://github.com/spatie/browsershot#requirements).

The package will make an educated guess as to where Chrome is installed on your system. You can also manually pass the location of the Chrome binary to `executeJavaScript()`:

```
Crawler::create()
    ->executeJavaScript($pathToChrome)
    ...
```

### Filtering certain urls

You can tell the crawler not to visit certain urls by using the `setCrawlProfile` function. That function expects an object that implements the `Spatie\Crawler\CrawlProfile` interface:

```
/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(Url $url): bool;
```

This package comes with three `CrawlProfiles` out of the box:

- `CrawlAllUrls`: this profile will crawl all urls on all pages including urls to an external site.
- `CrawlInternalUrls`: this profile will only crawl the internal urls on the pages of a host.
- `CrawlSubdomainUrls`: this profile will only crawl the internal urls, including those on subdomains, found on the pages of a host.
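
A custom profile could be sketched along these lines. The `Url` class and `CrawlProfile` interface below are stand-ins mirroring the `shouldCrawl` signature shown above, so the example runs without the package; `SkipPathPrefixes` is a hypothetical name, and casting `Url` to a string is an assumption about the package's class.

```php
<?php

// Stand-ins so the sketch runs without the package installed. In a real
// project, remove them and import \Spatie\Crawler\Url and
// \Spatie\Crawler\CrawlProfile instead.
final class Url
{
    private $url;

    public function __construct(string $url)
    {
        $this->url = $url;
    }

    // Assumption: the package's Url can be cast to a string.
    public function __toString(): string
    {
        return $this->url;
    }
}

interface CrawlProfile
{
    public function shouldCrawl(Url $url): bool;
}

// Hypothetical profile: crawl everything except paths under the given prefixes.
final class SkipPathPrefixes implements CrawlProfile
{
    /** @var string[] */
    private $prefixes;

    public function __construct(array $prefixes)
    {
        $this->prefixes = $prefixes;
    }

    public function shouldCrawl(Url $url): bool
    {
        // parse_url() yields null for a missing path and false for a malformed url.
        $path = parse_url((string) $url, PHP_URL_PATH);
        if (!is_string($path)) {
            $path = '/';
        }

        foreach ($this->prefixes as $prefix) {
            if (strpos($path, $prefix) === 0) {
                return false; // the path starts with an excluded prefix
            }
        }

        return true;
    }
}
```

With this in place, something like `new SkipPathPrefixes(['/admin', '/logout'])` would be handed to `setCrawlProfile()`.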

Setting the number of concurrent requests
-----------------------------------------

To improve the speed of the crawl, the package concurrently crawls 10 urls by default. If you want to change that number, you can use the `setConcurrency` method.

```
Crawler::create()
    ->setConcurrency(1) //now all urls will be crawled one by one
```

Setting the maximum crawl count
-------------------------------

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the number of urls the crawler should crawl, you can use the `setMaximumCrawlCount` method.

```
// stop crawling after 5 urls

Crawler::create()
    ->setMaximumCrawlCount(5)
```

Setting the maximum crawl depth
-------------------------------

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler you can use the `setMaximumDepth` method.

```
Crawler::create()
    ->setMaximumDepth(2)
```
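
To make the idea of crawl depth concrete, here is a small standalone sketch of a depth-limited breadth-first walk over an in-memory link graph. It is not the package's implementation: the `crawlToDepth` function and the sample `$site` array are hypothetical, and treating the start page as depth 0 is an assumption made for this illustration only.

```php
<?php

/**
 * Breadth-first walk over an in-memory link graph, stopping at $maxDepth.
 * The start page has depth 0; pages it links to have depth 1, and so on.
 *
 * @param array<string, string[]> $links page => pages it links to
 * @return array<string, int> visited page => depth at which it was found
 */
function crawlToDepth(array $links, string $start, int $maxDepth): array
{
    $depths = [$start => 0];
    $queue = [$start];

    while ($queue !== []) {
        $page = array_shift($queue);

        if ($depths[$page] >= $maxDepth) {
            continue; // do not follow links found on pages at the depth limit
        }

        foreach ($links[$page] ?? [] as $linked) {
            if (!isset($depths[$linked])) {
                $depths[$linked] = $depths[$page] + 1;
                $queue[] = $linked;
            }
        }
    }

    return $depths;
}

// Hypothetical site structure used only for this example.
$site = [
    '/' => ['/about', '/blog'],
    '/blog' => ['/blog/post-1'],
    '/blog/post-1' => ['/blog/post-1/comments'],
];

// With a limit of 2, '/blog/post-1/comments' (depth 3) is never visited.
print_r(crawlToDepth($site, '/', 2));
```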

Using a custom crawl queue
--------------------------

When crawling a site, the crawler will put urls to be crawled in a queue. By default, this queue is stored in memory using the built-in `CollectionCrawlQueue`.

When a site is very large you may want to store that queue elsewhere, maybe a database. In such cases you can write your own crawl queue.

A valid crawl queue is any class that implements the `Spatie\Crawler\CrawlQueue\CrawlQueue` interface. You can pass your custom crawl queue via the `setCrawlQueue` method on the crawler.

```
// $crawlQueue must implement \Spatie\Crawler\CrawlQueue\CrawlQueue
Crawler::create()
    ->setCrawlQueue($crawlQueue)
```
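
The bookkeeping a crawl queue has to do (remember which urls were already seen, hand out pending ones in order) can be sketched as below. This is only an illustration of the idea: `ArrayUrlQueue` and its method names are hypothetical and are not taken from the `CrawlQueue` interface, so check the interface in the package source for the methods a real implementation must provide.

```php
<?php

// Hypothetical queue shape; the method names are NOT taken from the
// Spatie\Crawler\CrawlQueue\CrawlQueue interface.
final class ArrayUrlQueue
{
    /** @var string[] urls waiting to be crawled, in insertion order */
    private $pending = [];

    /** @var array<string, true> every url ever added, for de-duplication */
    private $seen = [];

    /** Adds a url unless it was added before; returns whether it was queued. */
    public function add(string $url): bool
    {
        if (isset($this->seen[$url])) {
            return false;
        }

        $this->seen[$url] = true;
        $this->pending[] = $url;

        return true;
    }

    public function hasPending(): bool
    {
        return $this->pending !== [];
    }

    /** Removes and returns the oldest pending url, or null when empty. */
    public function next(): ?string
    {
        return array_shift($this->pending);
    }
}
```

A database-backed variant would keep the same two pieces of state, the pending list and the set of seen urls, in tables instead of arrays.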

Changelog
---------

Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed recently.

Contributing
------------

Please see [CONTRIBUTING](CONTRIBUTING.md) for details.

Testing
-------

To run the tests, you'll have to start the included Node-based server first in a separate terminal window.

```
cd tests/server
npm install
./start_server.sh
```

With the server running, you can start testing.

```
vendor/bin/phpunit
```

Security
--------

If you discover any security related issues, please email  instead of using the issue tracker.

Postcardware
------------

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Samberstraat 69D, 2060 Antwerp, Belgium.

We publish all received postcards [on our company website](https://spatie.be/en/opensource/postcards).

Credits
-------

- [Freek Van der Herten](https://github.com/freekmurze)
- [All Contributors](../../contributors)

Support us
----------

Spatie is a webdesign agency based in Antwerp, Belgium. You'll find an overview of all our open source projects [on our website](https://spatie.be/opensource).

Does your business depend on our contributions? Reach out and support us on [Patreon](https://www.patreon.com/spatie). All pledges will be dedicated to allocating workforce on maintenance and new awesome stuff.

License
-------

The MIT License (MIT). Please see [License File](LICENSE.md) for more information.

###  Health Score

39 (Low), better than 86% of packages

- Maintenance: 20. Infrequent updates; may be unmaintained.
- Popularity: 21. Limited adoption so far.
- Community: 19. Small or concentrated contributor base.
- Maturity: 84. Battle-tested with a long release history.
- Bus Factor: 1. Top contributor holds 63.5% of commits (single point of failure).

How is this calculated?

- **Maintenance (25%)**: Last commit recency, latest release date, and issue-to-star ratio. Uses a 2-year decay window.
- **Popularity (30%)**: Total and monthly downloads, GitHub stars, and forks. Logarithmic scaling prevents top-heavy scores.
- **Community (15%)**: Contributors, dependents, forks, watchers, and maintainers. Measures real ecosystem engagement.
- **Maturity (30%)**: Project age, version count, PHP version support, and release stability.

###  Release Activity

- Cadence: every ~32 days (recently: every ~292 days)
- Total: 95
- Last Release: 783d ago

Major Versions

- v2.x-dev → 3.0.0 (2017-12-22)
- 3.2.1 → 4.0.0 (2018-03-01)
- 4.7.5 → 5.0.0 (2020-09-29)
- 4.7.6 → 6.0.0 (2020-12-02)
- v4.x-dev → v5.x-dev (2020-12-20)

PHP version history (9 changes)

- 0.0.1: PHP >=5.6.0
- 1.0.2: PHP >=5.5.0
- 2.0.0: PHP ^7.0
- 3.0.0: PHP ^7.1
- 5.0.0: PHP ^7.4
- 5.0.2: PHP ^7.4|^8.0
- 4.7.6: PHP ^7.3|^8.0
- 2.7.5: PHP ^7.0|^8.0
- 2.8.1: PHP ^7.0 || ^8.0

### Community

Maintainers

![](https://www.gravatar.com/avatar/28ad3224d8727b622ebd229840eea6b9dbcb83eb0bd609e6ce65b614830ff538?d=identicon)[digitales@heimrich-hannot.de](/maintainers/digitales@heimrich-hannot.de)

---

Top Contributors

- [freekmurze](https://github.com/freekmurze) (231 commits)
- [brendt](https://github.com/brendt) (35 commits)
- [Redominus](https://github.com/Redominus) (15 commits)
- [denvers](https://github.com/denvers) (9 commits)
- [BenMorel](https://github.com/BenMorel) (7 commits)
- [koertho](https://github.com/koertho) (6 commits)
- [rubenvanassche](https://github.com/rubenvanassche) (6 commits)
- [mattiasgeniar](https://github.com/mattiasgeniar) (5 commits)
- [sebastiandedeyne](https://github.com/sebastiandedeyne) (5 commits)
- [AdrianMrn](https://github.com/AdrianMrn) (4 commits)
- [spekulatius](https://github.com/spekulatius) (4 commits)
- [andrzejkupczyk](https://github.com/andrzejkupczyk) (3 commits)
- [AlexVanderbist](https://github.com/AlexVanderbist) (3 commits)
- [pascalbaljet](https://github.com/pascalbaljet) (3 commits)
- [TVke](https://github.com/TVke) (3 commits)
- [systream](https://github.com/systream) (3 commits)
- [localheinz](https://github.com/localheinz) (3 commits)
- [ericges](https://github.com/ericges) (2 commits)
- [juukie](https://github.com/juukie) (2 commits)
- [barocode](https://github.com/barocode) (2 commits)

---

Tags

spatie, link, crawler, website

###  Code Quality

Tests: PHPUnit

