laurentvw/lavacrawler
=====================

A web scraper for PHP to easily extract data from web pages

*This package is abandoned. Its successor is [laurentvw/scrapher](https://github.com/Laurentvw/scrapher).*

[Source](https://github.com/Laurentvw/scrapher) · [Packagist](https://packagist.org/packages/laurentvw/lavacrawler)

Scrapher
========

Scrapher is a PHP library to easily scrape data from web pages.

Getting Started
---------------

### Installation

Add the package to your `composer.json` and run `composer update`.

```
{
    "require": {
        "laurentvw/scrapher": "2.*"
    }
}

```

*For those still using v1.0 ("LavaCrawler"), you can find the documentation here: *

### Basic Usage

In order to start scraping, you need to set the URL(s) or HTML to scrape, and to choose a selector (for example a regex selector) together with the data you wish to match.

```
use \Laurentvw\Scrapher\Scrapher;
use \Laurentvw\Scrapher\Selectors\RegexSelector;

$url = 'https://www.google.com/';
$scrapher = new Scrapher($url);

// Match all links on a page
$regex = '/<a.*?href="(.*?)".*?>(.*?)<\/a>/ms';

$matchConfig = array(
    array(
        'name' => 'url',
        'id' => 1, // the first match (.*?) from the regex
    ),
    array(
        'name' => 'title',
        'id' => 2, // the second match (.*?) from the regex
    ),
);

$matches = $scrapher->with(new RegexSelector($regex, $matchConfig));

$results = $matches->get();
```

This returns a list of arrays based on the match configuration that was set.

```
array(29) {
  [0] =>
  array(2) {
    'url' =>
    string(34) "https://www.google.com/webhp?tab=ww"
    'title' =>
    string(6) "Search"
  }
  ...
}

```

Documentation
-------------

### Instantiating

When creating an instance of Scrapher, you may optionally pass one or more URLs.

Passing multiple URLs can be useful when you want to scrape the same data on different pages. For example when content is separated by pagination.

```
$scrapher = new Scrapher($url);
$scrapher = new Scrapher(array($url, $url2));
```

If you prefer to fetch the page yourself using a dedicated client/library, you may also simply pass the actual content of a page. This can also be handy if you want to scrape other content besides just web pages (e.g. local files).

```
$scrapher = new Scrapher($content);
$scrapher = new Scrapher(array($content, $content2));
```

In some cases, you may want to add (read: append) URLs or contents on the fly.

```
$scrapher->addUrl($url);
$scrapher->addUrls(array($url, $url2));
$scrapher->addContent($content);
$scrapher->addContents(array($content, $content2));
```

### Matching data using a Selector

Before retrieving or sorting the matched data, you need to choose a selector to match the data you want.

At the moment, Scrapher offers one selector out of the box, **RegexSelector**, which lets you select data using regular expressions.

A Selector takes an expression and a match configuration as its arguments.

For example, to match all links and their link name, you could do:

```
$regExpression = '/<a.*?href="(.*?)".*?>(.*?)<\/a>/ms';

$matchConfig = array(
    array(
        // The "name" key lets you name the data you're looking for,
        // and will be used when retrieving the matched data
        'name' => 'url',
        // The "id" key is an identifier used during the regular expression search.
        // The id 1 corresponds to the first match in the regular expression, matching the URL.
        'id' => 1,
    ),
    array(
        'name' => 'title',
        'id' => 2,
    ),
);

$matches = $scrapher->with(new RegexSelector($regExpression, $matchConfig));
```

Note that the kind of value passed to the "id" key may vary depending on the selector you're using, and can be virtually anything. You can think of the "id" key as the glue between the given expression and its selector.

***RegexSelector** uses PHP's [preg\_match\_all](http://php.net/manual/en/function.preg-match-all.php) under the hood.*

For your convenience, when using Regex, a match with `'id' => 0` will return the URL of the crawled page.
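To make the mapping concrete, here is a minimal standalone sketch using plain `preg_match_all` (no Scrapher involved; the sample HTML and pattern are hypothetical): capture group 1 of the pattern corresponds to `'id' => 1`, group 2 to `'id' => 2`.

```
$html = '<a href="https://example.com/">Example</a>';
$pattern = '/<a.*?href="(.*?)".*?>(.*?)<\/a>/ms';

// PREG_SET_ORDER groups results per match rather than per capture group
preg_match_all($pattern, $html, $m, PREG_SET_ORDER);

// $m[0][1] is 'https://example.com/' (the 'id' => 1 match)
// $m[0][2] is 'Example'              (the 'id' => 2 match)
```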

### Retrieving &amp; Sorting

Once you've specified a selector using the **with** method, you can start retrieving and/or sorting the data.

**Retrieving**

```
// Return all matches
$results = $matches->get();

// Return all matches with a subset of the data (either use multiple arguments or an array for more than one column)
$results = $matches->get('title');

// Return the first match
$result = $matches->first();

// Return the last match
$result = $matches->last();

// Count the number of matches
$numberOfMatches = $matches->count();
```

**Offset &amp; limit**

```
// Take the first N matches
$results = $matches->take(5)->get();

// Skip the first N matches
$results = $matches->skip(1)->get();

// Take 5 matches starting from the second one.
$results = $matches->skip(1)->take(5)->get();
```
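Conceptually, `skip()` and `take()` behave like `array_slice` on the result set. A standalone sketch in plain PHP (sample data is hypothetical):

```
$all = range(1, 10);

// Equivalent of skip(1)->take(5): drop the first element, keep the next five
$page = array_slice($all, 1, 5);

// $page is array(2, 3, 4, 5, 6)
```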

**Sorting**

```
// Order by title
$results = $matches->orderBy('title')->get();

// Order by title, then by URL
$results = $matches->orderBy('title')->orderBy('url', 'desc')->get();

// Custom sorting: for values that don't sort well as plain strings, e.g. dates*.
$results = $matches->orderBy('date', 'desc', 'date_create')->get();

// Simply reverse the order of the results
$results = $matches->reverse()->get();
```

\* See [date\_create](http://php.net/manual/en/function.date-create.php)
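The idea behind passing `date_create` can be sketched in plain PHP: map each value through `date_create()` before comparing, so dates sort chronologically rather than alphabetically. This is a standalone illustration with hypothetical data, not Scrapher's internals; the spaceship operator requires PHP 7+.

```
$rows = array(
    array('date' => '1 Jan 2020'),
    array('date' => '15 Mar 2021'),
    array('date' => '2 Feb 2019'),
);

// Descending order: newest first, comparing DateTime objects, not raw strings
usort($rows, function ($a, $b) {
    return date_create($b['date']) <=> date_create($a['date']);
});

// $rows[0]['date'] is '15 Mar 2021'
// $rows[2]['date'] is '2 Feb 2019'
```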

### Filtering

You can filter the matched data to refine your result set. Return `true` to keep the match, `false` to filter it out.

```
$matches->filter(function($match) {
    // Return only matches that contain 'Google' in the link title.
    return stristr($match['title'], 'Google') ? true : false;
});
```
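The same kind of callback works standalone with `array_filter`, which is a convenient way to test a filter before wiring it into Scrapher (sample data is hypothetical):

```
$matches = array(
    array('title' => 'Google Search'),
    array('title' => 'Bing'),
);

// Keep only matches whose title contains 'Google' (case-insensitive)
$kept = array_values(array_filter($matches, function ($match) {
    return stristr($match['title'], 'Google') ? true : false;
}));

// $kept contains a single match: array('title' => 'Google Search')
```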

### Mutating

In order to handle inconsistencies or formatting issues, you can alter the matched values into a more desirable form. Altering happens before the result set is filtered and sorted. To do so, use the `apply` index in the match configuration array with a closure that takes two arguments: the matched value and the URL of the crawled page.

```
$matchConfig = array(
    array(
        'name' => 'url',
        'id' => 1,
        // Add domain to relative URLs
        'apply' => function($match, $sourceUrl)
        {
            if (!stristr($match, 'http')) {
                return $sourceUrl . trim($match, '/');
            }
            return $match;
        },
    ),
    array(
        'name' => 'title',
        'id' => 2,
        // Remove all html tags inside the link title
        'apply' => function($match) {
            return strip_tags($match);
        },
    ),
    ...
);
```
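Because `apply` closures are ordinary callables, you can exercise them in isolation before plugging them into a match configuration. A standalone run of the relative-URL closure (inputs are hypothetical):

```
$absolutize = function ($match, $sourceUrl) {
    // Prefix relative URLs with the source URL; leave absolute URLs untouched
    if (!stristr($match, 'http')) {
        return $sourceUrl . trim($match, '/');
    }
    return $match;
};

// $absolutize('/about/', 'https://example.com/') gives 'https://example.com/about'
// $absolutize('https://other.org/', 'https://example.com/') passes through unchanged
```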

### Validation

You may validate the matched data to ensure that the result set always contains the desired result. Validation happens after optionally mutating the data set with `apply`. To add the validation rules that should be applied to the data, use the `validate` index in the match configuration array with a closure that takes two arguments: the matched value and the URL of the crawled page. The closure should return `true` if validation succeeded and `false` if it failed. Matches that fail validation will be removed from the result set.

```
$matchConfig = array(
    array(
        'name' => 'url',
        'id' => 1,
        // Make sure it is a valid url
        'validate' => function($match) {
            return filter_var($match, FILTER_VALIDATE_URL);
        },
    ),
    array(
        'name' => 'title',
        'id' => 2,
        // We only want titles that are between 1 and 50 characters long.
        'validate' => function($match) {
            return strlen($match) >= 1 && strlen($match) <= 50;
        },
    ),
);
```

Matches that fail validation are also logged; you can retrieve these logs:

```
$logs = $matches->getLogs();
```
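Validation rules can also be checked in isolation. `filter_var` returns the value itself on success and `false` on failure, so it works directly as a truthy validator (standalone sketch, hypothetical inputs):

```
$isUrl = function ($match) {
    // Returns the URL string when valid, false otherwise
    return filter_var($match, FILTER_VALIDATE_URL);
};

// $isUrl('https://example.com/') is truthy (the URL itself)
// $isUrl('not a url') is false
```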

### Did you know?

**All methods are chainable**

```
$scrapher = new Scrapher();
$scrapher->addUrl($url)->with($regexSelector)->filter(...)->orderBy('title')->skip(1)->take(5)->get();
```

Only the methods `get`, `first`, `last`, `count` and `getLogs` end the chain, as each returns a final result rather than the match set.

**You can scrape different data from one page**

Suppose you're scraping a page, and you want to get all H2 titles, as well as all links on the page. You can do so without having to re-instantiate Scrapher.

```
$scrapher = new Scrapher($url);
$h2Titles = $scrapher->with($h2RegexSelector)->get();
$links = $scrapher->with($linksRegexSelector)->get();
```

About
-----

### Author

Laurent Van Winckel

### License

Scrapher is licensed under the MIT License - see the `LICENSE` file for details

### Contributing

Contributions to Laurentvw\\Scrapher are always welcome. You make our lives easier by sending us your contributions through [GitHub pull requests](http://help.github.com/pull-requests).

You may also [create an issue](https://github.com/Laurentvw/scrapher/issues) to report bugs or request new features.
