crawlzone/crawlzone
===================

Crawlzone is a fast, asynchronous internet crawling framework that aims to provide an open-source web search and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing.

Version 4.0.0 · MIT license · PHP >=7.4 · [Source](https://github.com/crawlzone/crawlzone) · [Packagist](https://packagist.org/packages/crawlzone/crawlzone) · [6 issues](https://github.com/crawlzone/crawlzone/issues) · [3 PRs](https://github.com/crawlzone/crawlzone/pulls)

[![Build Status](https://camo.githubusercontent.com/7d0a7cee198bbc201a7308b0161b91beeb70b61abc45cd74848e83bcfcaa1d5c/68747470733a2f2f7472617669732d63692e6f72672f637261776c7a6f6e652f637261776c7a6f6e652e7376673f6272616e63683d6d6173746572)](https://travis-ci.org/crawlzone/crawlzone)[![Coverage Status](https://camo.githubusercontent.com/0990af5ec1a20b3f0ae9fc77984b2213424d8e6f69a43b029e62c1c0b9c71cb7/68747470733a2f2f636f766572616c6c732e696f2f7265706f732f6769746875622f637261776c7a6f6e652f637261776c7a6f6e652f62616467652e7376673f6272616e63683d6d6173746572)](https://coveralls.io/github/crawlzone/crawlzone?branch=master)

Overview
========

CrawlZone is a fast, asynchronous internet crawling framework that aims to provide an open-source web scraping and testing solution. It can be used for a wide range of purposes, from extracting and indexing structured data to monitoring and automated testing. Available for PHP 7.4, 8.0, and 8.1.

Installation
------------

`composer require crawlzone/crawlzone`

Key Features
------------

- Asynchronous crawling with customizable concurrency.
- Automatic throttling of the crawling speed based on the load of the website being crawled.
- If configured, automatically filters out requests forbidden by the `robots.txt` exclusion standard.
- A straightforward middleware system lets you append headers, extract data, filter, or plug in any custom functionality to process the request and response.
- Rich filtering capabilities.
- Ability to set the crawling depth.
- Easy to extend the core by hooking into the crawling process using events.
- Shut down the crawler at any time and start it again without losing progress.
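
To illustrate the middleware idea from the list above, here is a minimal sketch in plain PHP. It is not Crawlzone's actual middleware API: the request is modeled as a simple array, and `applyMiddleware`, the header value, and the fragment-stripping step are all hypothetical names and choices made for this example.

```php
<?php
declare(strict_types=1);

// Hypothetical middleware stack: each entry takes a request (modeled
// here as a plain array) and returns a possibly modified copy.
$stack = [
    // Append a header to every request.
    static function (array $request): array {
        $request['headers']['User-Agent'] = 'example-bot/1.0'; // assumed value
        return $request;
    },
    // Normalize the URI by dropping the fragment part.
    static function (array $request): array {
        $request['uri'] = explode('#', $request['uri'], 2)[0];
        return $request;
    },
];

/**
 * Passes the request through every middleware in order.
 * Hypothetical helper, not part of any library.
 */
function applyMiddleware(array $stack, array $request): array
{
    foreach ($stack as $middleware) {
        $request = $middleware($request);
    }
    return $request;
}

$request = applyMiddleware($stack, [
    'uri'     => 'https://example.com/page#top',
    'headers' => [],
]);
```

Because each step only receives and returns a request, steps can be added, removed, or reordered without touching the rest of the pipeline.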

Architecture
------------

[![Architecture](https://github.com/crawlzone/crawlzone/raw/master/resources/Web%20Crawler%20Architecture.svg)](https://github.com/crawlzone/crawlzone/blob/master/resources/Web%20Crawler%20Architecture.svg)

Here is what's happening for a single request when you run the client:

1. The client queues the initial request (`start_uri`).
2. The engine looks at the queue and checks if there are any requests.
3. The engine gets the request from the queue and emits the `BeforeRequestSent` event. If the depth option is set in the config, the `RequestDepth` extension validates the depth of the request. If the obey robots.txt option is set in the config, the `RobotTxt` extension checks whether the request complies with the rules. If the request doesn't comply, the engine emits the `RequestFailed` event and gets the next request from the queue.
4. The engine uses the request middleware stack to pass the request through it.
5. The engine sends an asynchronous request using the Guzzle HTTP client.
6. The engine emits the `AfterRequestSent` event and stores the request in the history to avoid crawling the same request again.
7. When response headers are received, but the body has not yet begun to download, the engine emits the `ResponseHeadersReceived` event.
8. The engine emits the `TransferStatisticReceived` event. If the autothrottle option is set in the config, then the `AutoThrottle` extension is executed.
9. The engine uses the response middleware stack to pass the response through it.
10. The engine emits the `ResponseReceived` event. Additionally, if the response status code is greater than or equal to 400, the engine emits the `RequestFailed` event.
11. The `ResponseReceived` triggers the `ExtractAndQueueLinks` extension, which extracts and queues the links. The process starts over until the queue is empty.
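
The flow above can be modeled with a short, self-contained sketch. It is a simplified, synchronous illustration rather than Crawlzone's implementation: the `$site` map stands in for real HTTP responses, the `crawl` function and its depth rule are assumptions made for this example, and only a subset of the events (recorded as plain strings) is shown.

```php
<?php
declare(strict_types=1);

// Hypothetical in-memory "site": URI => links found on that page.
// A real crawler would fetch these over HTTP instead.
$site = [
    '/'  => ['/a', '/b'],
    '/a' => ['/b', '/c'],
    '/b' => ['/'],
    '/c' => ['/d'],
    '/d' => [],
];

/**
 * Simplified model of the crawl loop (hypothetical helper).
 * Returns the emitted events as "EventName uri" strings.
 */
function crawl(array $site, string $startUri, int $maxDepth): array
{
    $events  = [];
    $history = [];                        // URIs already requested
    $queue   = new SplQueue();
    $queue->enqueue([$startUri, 0]);      // step 1: queue the initial request

    while (!$queue->isEmpty()) {          // step 2: any requests left?
        [$uri, $depth] = $queue->dequeue();
        if (isset($history[$uri])) {
            continue;                     // already crawled, skip
        }
        $events[] = "BeforeRequestSent $uri";      // step 3
        if ($depth > $maxDepth) {
            $events[] = "RequestFailed $uri";      // rejected by the depth check
            continue;
        }
        $history[$uri] = true;                     // step 6: remember the request
        $events[] = "AfterRequestSent $uri";
        $links = $site[$uri] ?? [];                // stand-in for the response body
        $events[] = "ResponseReceived $uri";       // step 10
        foreach ($links as $link) {                // step 11: extract and queue links
            $queue->enqueue([$link, $depth + 1]);
        }
    }
    return $events;
}

$events = crawl($site, '/', 1);
echo implode(PHP_EOL, $events), PHP_EOL;
```

With a maximum depth of 1, the pages `/`, `/a`, and `/b` are crawled, `/c` is rejected by the depth check, and `/d` is never reached because the page linking to it was not fetched.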

Quick Start
-----------

```