
Scrapy
======


[![Latest Version on Packagist](https://camo.githubusercontent.com/f6f110a2d5dcbc1ea5d5b7e16dfd6648885b23034f91e9accd363d401dd466c9/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f7363726170792f7363726170792e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/scrapy/scrapy)[![Build Status](https://camo.githubusercontent.com/f50b4eb952c34ec9877069f5231d6029e649f755841e461304a8c8519d23c83b/68747470733a2f2f7472617669732d63692e636f6d2f616c656b73612d73756b6f7669632f7363726170792e7376673f746f6b656e3d7a437370413573347a476b524e6971387a7a5231266272616e63683d6d6173746572)](https://travis-ci.com/aleksa-sukovic/scrapy)

PHP web scraping made easy.

Please note: *Documentation is always a work in progress; please excuse any errors.*

Installation
------------


You can install the package via composer:

```
composer require scrapy/scrapy
```

Table of contents
-----------------


- [Basic Usage](#documentation)
- [Parsers](#parsers)

    - [Parser definition](#parser-definition)
    - [Adding parsers](#adding-parsers)
    - [Inline parsers](#inline-parsers)
    - [Passing additional parameters](#passing-additional-parameters-to-parsers)
- [Crawly](#crawly)

    - [Initialisation](#crawler-initialisation)
    - [Methods](#crawling-methods)

        - [Filter](#filter)
        - [First](#first)
        - [Nth](#nth)
        - [Raw](#raw)
        - [Trim](#trim)
        - [Pluck](#pluck)
        - [Count](#count)
        - [Int](#int)
        - [Float](#float)
        - [String](#string)
        - [Html](#html)
        - [Inner HTML](#inner-html)
        - [Exists](#exists)
        - [Reset](#reset)
        - [Map](#map)
        - [Node](#node)
- [Readers](#readers)

    - [Using built in readers](#using-built-in-readers)
    - [Writing custom readers](#writing-custom-readers)
- [User Agents](#user-agents)

    - [Why use custom agents](#why-use-custom-user-agents)
    - [Using built in agents](#using-built-in-agents)
    - [Writing custom user agents](#writing-custom-agents)
- [Build steps precedence](#precedence-of-parameters)
- [Exception Handling](#exception-handling)
- [Testing](#testing)
- [Changelog](#changelog)
- [Credits](#credits)
- [License](#license)

Basic usage
-----------


Scrapy is essentially a reader which can modify the data it reads through a series of tasks. To simply read a URL, you can do the following:

```
    use Scrapy\Builders\ScrapyBuilder;

    $html = ScrapyBuilder::make()
        ->url('https://www.some-url.com')
        ->build()
        ->scrape();
```

### Parsers


Just reading HTML from some source is not a lot of fun. Scrapy allows you to crawl HTML with a simple yet expressive API built on Symfony's DOM crawler.

You can think of parsers as actions meant to extract data valuable to you from HTML.

#### Parser definition


Parsers are meant to be self-contained scraping rules that let you extract data from an HTML string.

```
    use Scrapy\Parsers\Parser;
    use Scrapy\Crawlers\Crawly;

    class ImageParser extends Parser
    {
         public function process(Crawly $crawly, array $output): array
         {
            $output['hello'] = $crawly->filter('h1')->string();

            return $output;
         }
    }
```

#### Adding parsers


Once you have your parsers defined, it's time to add them to Scrapy.

```
    use Scrapy\Builders\ScrapyBuilder;

    // Add by class reference
    ScrapyBuilder::make()
        ->parser(ImageParser::class);

    // Add concrete instance
    ScrapyBuilder::make()
        ->parser(new ImageParser());

    // Add multiple parsers
    ScrapyBuilder::make()
        ->parsers([ImageParser::class, new ImageParser()]);
```

#### Inline parsers


You don't have to write a class for each parser; you can also do inline parsing. Let's see how that would look.

```
    use Scrapy\Crawlers\Crawly;
    use Scrapy\Builders\ScrapyBuilder;

    ScrapyBuilder::make()
        ->parser(function (Crawly $crawly, array $output) {
            $output['count'] = $crawly->filter('li')->count();

            return $output;
        });
```
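Putting the pieces together, an end-to-end run might look like the sketch below. It assumes, as the basic-usage example suggests, that `scrape()` returns the result of the build once parsers are attached; the `title` and `links` keys are illustrative names, not part of the library.

```
    use Scrapy\Crawlers\Crawly;
    use Scrapy\Builders\ScrapyBuilder;

    // Hypothetical end-to-end sketch: read a page and extract
    // data with a single inline parser.
    $result = ScrapyBuilder::make()
        ->url('https://www.some-url.com')
        ->parser(function (Crawly $crawly, array $output) {
            $output['title'] = $crawly->filter('h1')->first()->string();
            $output['links'] = $crawly->filter('a')->count();

            return $output;
        })
        ->build()
        ->scrape();
```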

#### Passing additional parameters to parsers


Sometimes you want to pass extra context to your parsers. With Scrapy, you can pass an associative array of parameters that becomes available to every parser.

```
    use Scrapy\Crawlers\Crawly;
    use Scrapy\Builders\ScrapyBuilder;

    ScrapyBuilder::make()
        ->params(['foo' => 'bar'])
        ->parser(function (Crawly $crawly, array $output) {
                $output['foo'] = $this->param('foo'); // 'bar'
                $output['baz'] = $this->has('baz');   // false
                $output['bar'] = $this->param('baz'); // null
         });
```

The same principle applies no matter if you define parsers as separate classes or inline them with functions.
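For instance, a class-based parser can read the same parameters, assuming the `Parser` base class exposes the `param()` and `has()` helpers shown in the inline example above (`currency` here is a hypothetical parameter name, used purely for illustration):

```
    use Scrapy\Parsers\Parser;
    use Scrapy\Crawlers\Crawly;

    class PriceParser extends Parser
    {
        public function process(Crawly $crawly, array $output): array
        {
            // Fall back to a default when the parameter was not supplied
            $currency = $this->has('currency') ? $this->param('currency') : 'USD';

            $output['price'] = $crawly->filter('.price')->float() . ' ' . $currency;

            return $output;
        }
    }
```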

Crawly
------


You might have noticed that the first argument to a parser's *process* method is an instance of the *Crawly* class.

Crawly is an HTML crawling tool based on [Symfony's DOM Crawler](https://symfony.com/doc/current/components/dom_crawler.html).

#### Crawler initialisation


An instance of Crawly can be made from any string, plain text and HTML alike.

```
    use Scrapy\Crawlers\Crawly;

    $crawly1 = new Crawly('<h1>Hello World!</h1>');
    $crawly2 = new Crawly('Hello World!');

    $crawly1->html(); // '<h1>Hello World!</h1>'
    $crawly2->html(); // 'Hello World!'
```

#### Crawling methods


Crawly provides a few helper methods that make it easier to extract the data you want from HTML.

##### Filter


Allows you to filter elements with a CSS selector, similar to what `document.querySelectorAll('...')` does.

```
    $crawly = new Crawly('<ul><li>Hello World!</li></ul>');

    $crawly->filter('li')->html(); // '<li>Hello World!</li>'
```

##### First


Narrow your selection by taking the first element from it.

```
    $crawly = new Crawly('<ul><li>Hello</li><li>World!</li></ul>');

    $crawly->filter('li')->first()->html(); // '<li>Hello</li>'
```

##### Nth


Narrow your selection by taking the nth element from it. Note that indices are 0-based.

```
    $crawly = new Crawly('<ul><li>Hello</li><li>World!</li></ul>');

    $crawly->filter('li')->nth(1)->html(); // '<li>World!</li>'
```

##### Raw


Get access to Symfony's DOM crawler.

Crawly does not aim to replace Symfony's DOM crawler, but rather to make its usage more pleasant. That's why not all methods are exposed directly through Crawly.

The `raw` method lets you use the underlying Symfony crawler directly.

```
    $crawly = new Crawly('<ul><li>Hello</li><li>World!</li></ul>');

    $crawly->filter('li')->first()->raw()->html(); // 'Hello'
```

##### Trim


Trims the output string.

```
    $crawly = new Crawly('<span>    Hello!     </span>');

    $crawly->filter('span')->trim()->string(); // 'Hello!'
```

##### Pluck


Extracts attributes from the selection.

```
    $crawly = new Crawly('<ul><li attr="1"></li><li attr="2"></li></ul>');
    $crawly->filter('li')->pluck(['attr']); // ["1", "2"]

    $crawly = new Crawly('<div><img width="200" height="300"><img width="400" height="500"></div>');
    $crawly->filter('img')->pluck(['width', 'height']); // [ ["200", "300"], ["400", "500"] ]
```

##### Count


Returns the count of currently selected nodes.

```
    $crawly = new Crawly('<ul><li>1</li><li>2</li></ul>');

    $crawly->filter('li')->count(); // 2
```

##### Int


Returns the integer value of the current selection.

```
    $crawly = new Crawly('<span>123</span>');
    $crawly->filter('span')->int(); // 123

    // Use the default if the selection is not numeric
    $crawly = new Crawly('');
    $crawly->filter('span')->int(55); // 55
```

##### Float


Returns the float value of the current selection.

```
    $crawly = new Crawly('<span>18.5</span>');
    $crawly->filter('span')->float(); // 18.5

    // Use the default if the selection is not numeric
    $crawly = new Crawly('');
    $crawly->filter('span')->float(22.4); // 22.4
```

##### String


Returns the current selection's inner content as a string.

```
    $crawly = new Crawly('<span>Hello World!</span>');
    $crawly->filter('span')->string(); // 'Hello World!'

    // Use the default in case an exception arises
    $crawly = new Crawly('');
    $crawly->filter('non-existing-selection')->string('Hello'); // 'Hello'
```

##### Html


Returns the HTML string representation of the current selection, including the parent element.

```
    $crawly = new Crawly('<span>Hello World!</span>');
    $crawly->filter('span')->html(); // '<span>Hello World!</span>'

    // Use the default in case an exception arises
    $crawly = new Crawly('');
    $crawly->filter('non-existing-selection')->html('Hi'); // 'Hi'
```

##### Inner HTML


Returns the HTML string representation of the current selection, excluding the parent element.

```
    $crawly = new Crawly('<span>Hello World!</span>');
    $crawly->filter('span')->innerHtml(); // 'Hello World!'

    // Use the default to handle exceptional cases
    $crawly = new Crawly('');
    $crawly->filter('non-existing-selection')->innerHtml('Hi'); // 'Hi'
```

##### Exists


Checks if the given selection exists.

You can get a boolean response or have an exception raised.

```
    $crawly = new Crawly('<span>Hello World!</span>');
    $crawly->filter('span')->exists(); // true

    $crawly = new Crawly('');
    $crawly->filter('non-existing-selection')->exists();     // false
    $crawly->filter('non-existing-selection')->exists(true); // throws ScrapeException
```

##### Reset


Resets the crawler back to its original HTML.

```
    $crawly = new Crawly('<ul><li>1</li></ul>');
    $crawly->filter('li')->html(); // '<li>1</li>'

    $crawly->reset()->html(); // '<ul><li>1</li></ul>'
```

##### Map


This method creates a new array populated with the results of calling a provided function on every node in the selection.

For each node, the callback is called with a Crawly instance created from that node. The callback also receives a second argument: the 0-based index of the node.

```
    $crawly = new Crawly('<ul><li>    Hello</li><li>    World  </li></ul>');

    $crawly->filter('li')->map(function (Crawly $crawly, int $index) {
        return $crawly->trim()->string() . ' - ' . $index;
    }); // ['Hello - 0', 'World - 1']

    // Limit the number of nodes mapped
    $crawly->filter('li')->map(function (Crawly $crawly, int $index) {
        return $crawly->trim()->string() . ' - ' . $index;
    }, 1); // ['Hello - 0']
```

##### Node


Returns the first DOMNode of the selection.

```
    $crawly = new Crawly('<ul><li>1</li></ul>');

    $node = $crawly->filter('li')->node(); // DOMNode representing '<li>1</li>'
```

### Readers


Readers are data source classes used by Scrapy to fetch the HTML content.

Scrapy comes with some readers predefined, and you can also write your own if you need to.

#### Using built in readers


Scrapy comes with two built-in readers: `UrlReader` and `FileReader`. Let's see how you might use them.

```
    use Scrapy\Builders\ScrapyBuilder;
    use Scrapy\Readers\UrlReader;
    use Scrapy\Readers\FileReader;

    ScrapyBuilder::make()
        ->reader(new UrlReader('https://www.some-url.com'));
    ScrapyBuilder::make()
        ->reader(new FileReader('path-to-file.html'));
```

As you can see, the built-in readers let you point Scrapy at either a URL or a specific file.

#### Writing custom readers


You are not limited to the built-in readers. Writing your own is a piece of cake.

```
    use Scrapy\Readers\IReader;

    class CustomReader implements IReader
    {
        public function read(): string
        {
            return 'Hello World!';
        }
    }
```

And then use it during the build process.

```
    ScrapyBuilder::make()
        ->reader(new CustomReader());
```

### User agents


A user agent is a computer program representing a person, in this case a Scrapy instance. Scrapy provides several built-in user agents for simulating common crawlers.

#### Why use custom user agents


User agents only make sense in the context of readers that fetch their data over the HTTP protocol; more precisely, in cases where you want to read a web page that creates its content dynamically using JavaScript.

By default, Scrapy cannot execute JavaScript. This is a problem all web crawlers face. There are numerous techniques for overcoming it, usually involving external services like [Prerender](https://prerender.io/) that redirect crawling bots to cached HTML pages.

Several user agents are provided so that Scrapy can present itself as one of the common crawlers. Please note that if a web page implements more advanced crawling checks (for example, an IP check), the provided agents will fail, since they only modify the HTTP request headers.

If you want to find out more, there is a great article on pre-rendering over at [Netlify](https://www.netlify.com/blog/2016/11/22/prerendering-explained/).

#### Using built in agents


Scrapy comes with a few built-in agents you can use.

```
    use Scrapy\Agents\GoogleAgent;
    use Scrapy\Agents\GoogleChromeAgent;
    use Scrapy\Agents\BingUserAgent;
    use Scrapy\Agents\YahooUserAgent;
    use Scrapy\Agents\DuckUserAgent;
    use Scrapy\Builders\ScrapyBuilder;

    ScrapyBuilder::make()
        ->agent(new GoogleAgent());                     // Googlebot
    ScrapyBuilder::make()
        ->agent(new GoogleChromeAgent(81, 0, 4043, 0)); // Google Chrome
    ScrapyBuilder::make()
        ->agent(new BingUserAgent());                   // Bing
    ScrapyBuilder::make()
        ->agent(new YahooUserAgent());                  // Yahoo
    ScrapyBuilder::make()
        ->agent(new DuckUserAgent());                   // DuckDuckGo
```

#### Writing custom agents


Just like with readers, you can write your own custom user agents.

```
    use Scrapy\Agents\IUserAgent;
    use Scrapy\Readers\UrlReader;

    class UserAgent implements IUserAgent
    {
        public function reader(string $url): UrlReader
        {
            $reader = new UrlReader($url);
            $reader->setConfig(['headers' => ['...']]);
            return $reader;
        }
    }
```

And then use it during the build process.

```
    ScrapyBuilder::make()
        ->agent(new UserAgent());
```

### Precedence of parameters


One thing to note is the precedence of the different parameters you may set during the build process.

Setting the URL is the same as setting the reader to a `UrlReader` with that URL. Explicitly setting a reader, on the other hand, takes precedence over explicitly setting the URL and/or the user agent.

```
    use Scrapy\Readers\UrlReader;
    use Scrapy\Agents\GoogleAgent;
    use Scrapy\Builders\ScrapyBuilder;

    ScrapyBuilder::make()
        ->url('https://www.facebook.com')
        ->agent(new GoogleAgent())
        ->reader(new UrlReader('https://www.youtube.com')); // YouTube will be read without GoogleAgent; Facebook will be ignored.
```
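Conversely, when no reader is set explicitly, the URL and user agent are honored together. A sketch, assuming the precedence rules described above:

```
    use Scrapy\Agents\GoogleAgent;
    use Scrapy\Builders\ScrapyBuilder;

    // No explicit reader: the page is fetched from the given URL
    // using the reader produced by GoogleAgent.
    ScrapyBuilder::make()
        ->url('https://www.facebook.com')
        ->agent(new GoogleAgent());
```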

### Exception handling


In general, Scrapy tries to handle all exceptions by wrapping them in its base exception class, *ScrapeException*.

This means you can organize your app around a single exception for general error handling.

A more granular system, allowing you to react to specific parser exceptions, is planned for a future release.

```
    use Scrapy\Builders\ScrapyBuilder;
    use Scrapy\Exceptions\ScrapeException;

    try {
        $html = ScrapyBuilder::make()
            ->url('https://www.invalid-url.com')
            ->build()
            ->scrape();
    } catch (ScrapeException $e) {
        //
    }
```

### Testing


To run the entire suite of unit tests:

```
composer test
```

### Changelog


Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed recently.

Credits
-------


- [Aleksa Sukovic](https://github.com/aleksa-sukovic)

License
-------


The MIT License (MIT). Please see [License File](LICENSE.md) for more information.

