koffleart/crawler
=================

Crawl all internal links found on a website


🕸 Crawl the web using PHP 🕷
===========================

This package provides a class to crawl links on a website. Under the hood Guzzle promises are used to [crawl multiple urls concurrently](http://docs.guzzlephp.org/en/latest/quickstart.html?highlight=pool#concurrent-requests).

Because the crawler can execute JavaScript, it can crawl JavaScript rendered sites. Under the hood [Chrome and Puppeteer](https://github.com/spatie/browsershot) are used to power this feature.

Support us
----------

[![](https://camo.githubusercontent.com/e9f8370802d8dbac5aa0b08505aa19bfe0a521574a2bf993287250b6f3509adc/68747470733a2f2f6769746875622d6164732e73332e65752d63656e7472616c2d312e616d617a6f6e6177732e636f6d2f637261776c65722e6a70673f743d31)](https://spatie.be/github-ad-click/crawler)

We invest a lot of resources into creating [best in class open source packages](https://spatie.be/open-source). You can support us by [buying one of our paid products](https://spatie.be/open-source/support-us).

We highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using. You'll find our address on [our contact page](https://spatie.be/about-us). We publish all received postcards on [our virtual postcard wall](https://spatie.be/open-source/postcards).

Installation
------------

This package can be installed via Composer:

```
composer require spatie/crawler
```

Usage
-----

The crawler can be instantiated like this:

```
use Spatie\Crawler\Crawler;

Crawler::create()
    ->setCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

The argument passed to `setCrawlObserver` must be an object that extends the `\Spatie\Crawler\CrawlObservers\CrawlObserver` abstract class:

```
namespace Spatie\Crawler\CrawlObservers;

use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;

abstract class CrawlObserver
{
    /*
     * Called when the crawler will crawl the url.
     */
    public function willCrawl(UriInterface $url, ?string $linkText): void
    {
    }

    /*
     * Called when the crawler has crawled the given url successfully.
     */
    abstract public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /*
     * Called when the crawler had a problem crawling the given url.
     */
    abstract public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void;

    /**
     * Called when the crawl has ended.
     */
    public function finishedCrawling(): void
    {
    }
}
```
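For illustration, a minimal observer that just logs each result might look like this (the class name and echo-based logging are assumptions, not part of the package):

```php
use GuzzleHttp\Exception\RequestException;
use Psr\Http\Message\ResponseInterface;
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlObservers\CrawlObserver;

// Illustrative observer: prints each crawled and failed URL.
class LoggingCrawlObserver extends CrawlObserver
{
    public function crawled(
        UriInterface $url,
        ResponseInterface $response,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "Crawled: {$url} ({$response->getStatusCode()})" . PHP_EOL;
    }

    public function crawlFailed(
        UriInterface $url,
        RequestException $requestException,
        ?UriInterface $foundOnUrl = null,
        ?string $linkText = null,
    ): void {
        echo "Failed: {$url} ({$requestException->getMessage()})" . PHP_EOL;
    }
}
```

You would then pass an instance to the crawler: `Crawler::create()->setCrawlObserver(new LoggingCrawlObserver())->startCrawling($url);`.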

### Using multiple observers

You can set multiple observers with `setCrawlObservers`:

```
Crawler::create()
    ->setCrawlObservers([
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        <class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>,
        ...
     ])
    ->startCrawling($url);
```

Alternatively you can set multiple observers one by one with `addCrawlObserver`:

```
Crawler::create()
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->addCrawlObserver(<class that extends \Spatie\Crawler\CrawlObservers\CrawlObserver>)
    ->startCrawling($url);
```

### Executing JavaScript

By default, the crawler will not execute JavaScript. This is how you can enable the execution of JavaScript:

```
Crawler::create()
    ->executeJavaScript()
    ...
```

In order to get the body HTML after the JavaScript has been executed, this package depends on our [Browsershot](https://github.com/spatie/browsershot) package, which uses [Puppeteer](https://github.com/puppeteer/puppeteer) under the hood. Here are some pointers on [how to install it on your system](https://spatie.be/docs/browsershot/v2/requirements).

Browsershot will make an educated guess as to where its dependencies are installed on your system. By default, the crawler will instantiate a new Browsershot instance. If you need to, you can pass a custom instance using the `setBrowsershot(Browsershot $browsershot)` method:

```
Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ...
```

Note that the crawler will still work even if you don't have the system dependencies required by Browsershot. These system dependencies are only required if you're calling `executeJavaScript()`.
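For example, if Node and npm live in non-standard locations, you might configure the Browsershot instance yourself before handing it to the crawler. The binary paths below are illustrative assumptions:

```php
use Spatie\Browsershot\Browsershot;
use Spatie\Crawler\Crawler;

// Point Browsershot at explicit binaries instead of letting it guess.
$browsershot = (new Browsershot())
    ->setNodeBinary('/usr/local/bin/node') // assumed path
    ->setNpmBinary('/usr/local/bin/npm');  // assumed path

Crawler::create()
    ->setBrowsershot($browsershot)
    ->executeJavaScript()
    ->startCrawling($url);
```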

### Filtering certain urls

You can tell the crawler not to visit certain urls by using the `setCrawlProfile`-function. That function expects an object that extends `Spatie\Crawler\CrawlProfiles\CrawlProfile`:

```
/*
 * Determine if the given url should be crawled.
 */
public function shouldCrawl(UriInterface $url): bool;
```

This package comes with three `CrawlProfiles` out of the box:

- `CrawlAllUrls`: this profile will crawl all URLs on all pages, including URLs to external sites.
- `CrawlInternalUrls`: this profile will only crawl the internal URLs on the pages of a host.
- `CrawlSubdomains`: this profile will only crawl the internal URLs of a host and its subdomains.
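As a sketch, a custom profile that restricts the crawl to a single path prefix could look like this (the class name and the `/blog` prefix are illustrative):

```php
use Psr\Http\Message\UriInterface;
use Spatie\Crawler\CrawlProfiles\CrawlProfile;

// Illustrative profile: only crawl URLs whose path starts with /blog.
class CrawlBlogSection extends CrawlProfile
{
    public function shouldCrawl(UriInterface $url): bool
    {
        return strpos($url->getPath(), '/blog') === 0;
    }
}

Crawler::create()
    ->setCrawlProfile(new CrawlBlogSection())
    ->startCrawling($url);
```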

### Ignoring robots.txt and robots meta

By default, the crawler will respect robots data. It is possible to disable these checks like so:

```
Crawler::create()
    ->ignoreRobots()
    ...
```

Robots data can come from either a `robots.txt` file, meta tags, or response headers.

Parsing robots data is done by our package [spatie/robots-txt](https://github.com/spatie/robots-txt).

### Accept links with rel="nofollow" attribute

By default, the crawler will reject all links containing the rel="nofollow" attribute. It is possible to disable these checks like so:

```
Crawler::create()
    ->acceptNofollowLinks()
    ...
```

### Using a custom User Agent

In order to respect robots.txt rules for a custom User Agent, you can specify your own custom User Agent.

```
Crawler::create()
    ->setUserAgent('my-agent')
```

You can add your specific crawl rule group for 'my-agent' in robots.txt. This example disallows crawling the entire site for crawlers identified by 'my-agent'.

```
// Disallow crawling for my-agent
User-agent: my-agent
Disallow: /
```

Setting the number of concurrent requests
-----------------------------------------

To improve the speed of the crawl, the package concurrently crawls 10 URLs by default. If you want to change that number, you can use the `setConcurrency` method.

```
Crawler::create()
    ->setConcurrency(1) // now all urls will be crawled one by one
```

Defining Crawl Limits
---------------------

By default, the crawler continues until it has crawled every page it can find. This behavior might cause issues if you are working in an environment with limitations such as a serverless environment.

The crawl behavior can be controlled with the following two options:

- **Total Crawl Limit** (`setTotalCrawlLimit`): This limit defines the maximal count of URLs to crawl.
- **Current Crawl Limit** (`setCurrentCrawlLimit`): This defines how many URLs are processed during the current crawl.

Let's take a look at some examples to clarify the difference between these two methods.

### Example 1: Using the total crawl limit

The `setTotalCrawlLimit` method lets you limit the total number of URLs to crawl, no matter how often you call the crawler.

```
$queue = <your queue implementation>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(5)
    ->startCrawling($url);
```

### Example 2: Using the current crawl limit

The `setCurrentCrawlLimit` method sets a limit on how many URLs will be crawled per execution. This piece of code will process 5 pages on each execution, without a total limit on pages to crawl.

```
$queue = <your queue implementation>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```

### Example 3: Combining the total and crawl limit

Both limits can be combined to control the crawler:

```
$queue = <your queue implementation>;

// Crawls 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Crawls the next 5 URLs and ends.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);

// Doesn't crawl further as the total limit is reached.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setTotalCrawlLimit(10)
    ->setCurrentCrawlLimit(5)
    ->startCrawling($url);
```

### Example 4: Crawling across requests

You can use `setCurrentCrawlLimit` to break up long-running crawls. The following example demonstrates a (simplified) approach: an initial request and any number of follow-up requests continuing the crawl.

#### Initial Request

To start crawling across different requests, you will need to create a new queue of your selected queue-driver. Start by passing the queue-instance to the crawler. The crawler will start filling the queue as pages are processed and new URLs are discovered. Serialize and store the queue reference after the crawler has finished (using the current crawl limit).

```
// Create a queue using your queue-driver.
$queue = <your queue implementation>;

// Crawl the first set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);
```

#### Subsequent Requests

For any following requests you will need to unserialize your original queue and pass it to the crawler:

```
// Unserialize queue
$queue = unserialize($serializedQueue);

// Crawls the next set of URLs
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

// Serialize and store your queue
$serializedQueue = serialize($queue);
```

The behavior is based on the information in the queue. The limits work as described only if the same queue instance is passed in. When a completely new queue is passed in, the limits of previous crawls, even for the same website, won't apply.

An example with more details can be found [here](https://github.com/spekulatius/spatie-crawler-cached-queue-example).
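Putting the pieces together, a follow-up job could keep rescheduling itself until the queue is exhausted. This sketch assumes your queue implementation exposes the `hasPendingUrls()` method of the `CrawlQueue` interface; the scheduling call is hypothetical:

```php
use Spatie\Crawler\Crawler;

// Restore the queue persisted by the previous run.
$queue = unserialize($serializedQueue);

// Crawl the next batch.
Crawler::create()
    ->setCrawlQueue($queue)
    ->setCurrentCrawlLimit(10)
    ->startCrawling($url);

if ($queue->hasPendingUrls()) {
    // More work to do: persist the queue and schedule another run.
    $serializedQueue = serialize($queue);
    // dispatchNextRun($serializedQueue); // your scheduling mechanism (hypothetical)
}
// Otherwise the crawl is complete.
```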

Setting the maximum crawl depth
-------------------------------

By default, the crawler continues until it has crawled every page of the supplied URL. If you want to limit the depth of the crawler you can use the `setMaximumDepth` method.

```
Crawler::create()
    ->setMaximumDepth(2)
```

Setting the maximum response size
---------------------------------

Most HTML pages are quite small, but the crawler could accidentally pick up large files such as PDFs and MP3s. To keep memory usage low in such cases, the crawler will only use responses smaller than 2 MB. If a response grows beyond 2 MB while being streamed, the crawler will stop streaming it and assume an empty response body.

You can change the maximum response size.

```
// let's use a 3 MB maximum.
Crawler::create()
    ->setMaximumResponseSize(1024 * 1024 * 3)
```

Add a delay between requests
----------------------------

In some cases you might get rate-limited when crawling too aggressively. To circumvent this, you can use the `setDelayBetweenRequests()` method to add a pause between every request. This value is expressed in milliseconds.

```
Crawler::create()
    ->setDelayBetweenRequests(150) // After every page crawled, the crawler will wait for 150ms
```

Limiting which content-types to parse
-------------------------------------

By default, every found page will be downloaded (up to `setMaximumResponseSize()` in size) and parsed for additional links. You can limit which content-types should be downloaded and parsed by setting the `setParseableMimeTypes()` with an array of allowed types.

```
Crawler::create()
    ->setParseableMimeTypes(['text/html', 'text/plain'])
```

This will prevent downloading the body of pages with other mime types, such as binary files or audio/video files, which are unlikely to have links embedded in them. This feature mostly saves bandwidth.

Using a custom crawl queue
--------------------------

When crawling a site, the crawler will put URLs to be crawled in a queue. By default, this queue is stored in memory using the built-in `ArrayCrawlQueue`.

When a site is very large you may want to store that queue elsewhere, maybe a database. In such cases, you can write your own crawl queue.

A valid crawl queue is any class that implements the `Spatie\Crawler\CrawlQueues\CrawlQueue`-interface. You can pass your custom crawl queue via the `setCrawlQueue` method on the crawler.

```
Crawler::create()
    ->setCrawlQueue(<implementation of \Spatie\Crawler\CrawlQueues\CrawlQueue>)
```

Here are some crawl queue implementations to get you started:

- [ArrayCrawlQueue](https://github.com/spatie/crawler/blob/master/src/CrawlQueues/ArrayCrawlQueue.php)
- [RedisCrawlQueue (third-party package)](https://github.com/repat/spatie-crawler-redis)
- [CacheCrawlQueue for Laravel (third-party package)](https://github.com/spekulatius/spatie-crawler-toolkit-for-laravel)
- [Laravel Model as Queue (third-party example app)](https://github.com/insign/spatie-crawler-queue-with-laravel-model)

Change the default base url scheme
----------------------------------

By default, the crawler will set the base URL scheme to `http` if none is specified. You can change that with `setDefaultScheme`.

```
Crawler::create()
    ->setDefaultScheme('https')
```

Changelog
---------

Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed recently.

Contributing
------------

Please see [CONTRIBUTING](https://github.com/spatie/.github/blob/main/CONTRIBUTING.md) for details.

Testing
-------

First, install the Puppeteer dependency, or your tests will fail.

```
npm install puppeteer
```

To run the tests, you'll first have to start the included Node-based server in a separate terminal window.

```
cd tests/server
npm install
node server.js
```

With the server running, you can start testing.

```
composer test
```

Security
--------

If you've found a bug regarding security, please email us instead of using the issue tracker.

Postcardware
------------

You're free to use this package, but if it makes it to your production environment we highly appreciate you sending us a postcard from your hometown, mentioning which of our package(s) you are using.

Our address is: Spatie, Kruikstraat 22, 2018 Antwerp, Belgium.

We publish all received postcards [on our company website](https://spatie.be/en/opensource/postcards).

Credits
-------

- [Freek Van der Herten](https://github.com/freekmurze)
- [All Contributors](../../contributors)

License
-------

The MIT License (MIT). Please see [License File](LICENSE.md) for more information.
