[![Build Status](https://camo.githubusercontent.com/819d7be60b1ae2de803748ad93245f3dec6d824d3d6b64114017578072faad97/68747470733a2f2f7472617669732d63692e6f72672f6672616e7a69702f736572702d736372617065722e7376673f6272616e63683d6d6173746572)](https://travis-ci.org/franzip/serp-scraper)[![Coverage Status](https://camo.githubusercontent.com/00dddee0703d4c2a1364575cce6b8903294776bdda458534ab3500cc78c7af13/68747470733a2f2f636f766572616c6c732e696f2f7265706f732f6672616e7a69702f736572702d736372617065722f62616467652e737667)](https://coveralls.io/r/franzip/serp-scraper)

SerpScraper
===========

A library to extract, serialize and store data scraped on Search Engine result pages.

Installing via Composer (recommended)
-------------------------------------

Install Composer in your project:

```
curl -sS https://getcomposer.org/installer | php
```

Create a composer.json file in your project root:

```
{
    "require": {
        "franzip/serp-scraper": "0.1.*@dev"
    }
}
```

Install the dependencies with Composer:

```
php composer.phar install
```

Supported Search Engines
------------------------

- Google
- Bing
- Ask
- Yahoo

Supported Serialization Formats
-------------------------------

- JSON
- XML
- YAML

Legal Disclaimer
----------------

Under no circumstances shall I be considered liable to any user for direct, indirect, incidental, consequential, special, or exemplary damages arising from or relating to the user's use or misuse of this software. Consult the following Terms of Service before using SerpScraper:

- [Google](https://www.google.com/accounts/TOS)
- [Bing](http://windows.microsoft.com/en-us/windows/microsoft-services-agreement)
- [Ask](http://about.ask.com/terms-of-service)
- [Yahoo](https://info.yahoo.com/legal/us/yahoo/utos/en-us/)

How it works in a nutshell
--------------------------

[![SerpScraper Diagram](./serp-scraper.jpg?raw=true "SerpScraper Diagram")](./serp-scraper.jpg?raw=true)

Description
-----------

The legal status of scraping is disputed. Regardless, this library tries to avoid unnecessary HTTP traffic by using three strategies:

- Throttling: [an internal object](https://github.com/franzip/throttler) caps the number of allowed HTTP requests at a default of 15 per hour. Once that limit has been reached, no more content can be scraped until the timeframe expires.
- Caching: [the library used to retrieve data](https://github.com/franzip/serp-fetcher) caches every fetched page. The default cache expiration is 24 hours.
- Delaying: a simple and rather naive approach. Consecutive HTTP requests are spaced out by a default delay of 0.5 seconds.
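
To make the throttling idea concrete, here is a minimal sketch of a fixed-window request cap using the 15-per-hour default mentioned above. The `FixedWindowLimiter` class and its `allow()` method are hypothetical names invented for this illustration; they are not the API of the throttler component, whose internals may differ.

```php
<?php
// Hypothetical illustration only: NOT the franzip/throttler API.
// Allows at most $limit requests per window of $windowSeconds seconds.
final class FixedWindowLimiter
{
    private $limit;
    private $windowSeconds;
    private $windowStart = null;
    private $count = 0;

    public function __construct($limit = 15, $windowSeconds = 3600)
    {
        $this->limit = $limit;
        $this->windowSeconds = $windowSeconds;
    }

    // Returns true and counts the request if the cap has not been reached.
    // $now is a Unix timestamp, passed in so the sketch is easy to test.
    public function allow($now)
    {
        // Start a new window on first use or once the current one expired.
        if ($this->windowStart === null
            || $now - $this->windowStart >= $this->windowSeconds) {
            $this->windowStart = $now;
            $this->count = 0;
        }
        if ($this->count >= $this->limit) {
            return false;
        }
        $this->count++;
        return true;
    }
}

$limiter = new FixedWindowLimiter(15, 3600);
$start = 1000;
$allowed = 0;
// Attempt 20 requests, one second apart, inside the same window.
for ($i = 0; $i < 20; $i++) {
    if ($limiter->allow($start + $i)) {
        $allowed++;
    }
}
echo $allowed, "\n"; // 15
// One hour after the window opened, requests are accepted again.
var_dump($limiter->allow($start + 3600)); // bool(true)
```

Passing the timestamp in, rather than calling `time()` inside the class, is a choice made for this sketch so the behavior can be checked without waiting an hour.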

Constructor details
-------------------

This is the abstract constructor, used by all the concrete implementations:

```
SerpScraper($keywords, $outDir = 'out', $fetcherCacheDir = 'fetcher_cache',
            $serializerCacheDir = 'serializer_cache', $cacheTTL = 24,
            $requestDelay = 500);
```

1. `$keywords` - array
    - The keywords you want to scrape. Cannot be an empty array.
2. `$outDir` - string
    - Path to the folder to be used to store serialized pages.
3. `$fetcherCacheDir` - string
    - Path to the folder to be used to store [SerpFetcher](https://github.com/franzip/serp-fetcher) cache.
4. `$serializerCacheDir` - string
    - Path to the folder to be used to store [SerpPageSerializer](https://github.com/franzip/serp-page-serializer) cache.
5. `$cacheTTL` - integer
    - Time expiration of the [SerpFetcher](https://github.com/franzip/serp-fetcher) cache expressed in hours.
6. `$requestDelay` - integer
    - Delay to use between multiple HTTP requests, expressed in milliseconds (the default of 500 corresponds to the 0.5-second default delay described above).

Building a Scraper (using Factory)
----------------------------------

Specify the vendor as the first argument. Custom settings can be passed as an array in the second argument (see the SerpScraper constructor above).

```
use Franzip\SerpScraper\SerpScraperBuilder;

$googleScraper = SerpScraperBuilder::create('Google', array(array('keyword1',
                                                                  'keyword2',
                                                                  ...)));

$askScraper = SerpScraperBuilder::create('Ask', array(array('key1', 'key2')));
$bingScraper = SerpScraperBuilder::create('Bing', array(array('baz', 'foo')));
...
```

Building a Scraper (with explicit constructors)
-----------------------------------------------

```
use Franzip\SerpScraper\Scrapers\GoogleScraper;
use Franzip\SerpScraper\Scrapers\AskScraper;
use Franzip\SerpScraper\Scrapers\BingScraper;
use Franzip\SerpScraper\Scrapers\YahooScraper;

$googleScraper = new GoogleScraper($keywords = array('foo', 'bar'),
                                   $outDir   = 'google_results');
$askScraper = new AskScraper($keywords = array('foo', 'bar'),
                             $outDir = 'ask_results');
...
```

scrape() and scrapeAll()
------------------------

You can scrape a single tracked keyword with `scrape()`, or scrape all the tracked keywords using `scrapeAll()`.

`scrape()` signature:

```
$serpScraper->scrape($keyword, $pagesToScrape = 1, $toRemove = false,
                     $timezone = 'UTC', $throttling = true);
```

Usage example:

```
// Scrape the first 5 pages for the keyword 'foo', remove it from the tracked
// keywords, use the Los Angeles timezone and don't use throttling.
$serpScraper->scrape('foo', 5, true, 'America/Los_Angeles', false);
```

`scrapeAll()` signature:

```
$serpScraper->scrapeAll($pagesToScrape = 1, $toRemove = false, $timezone = 'UTC',
                        $throttling = true);
```

Usage example:

```
// Scrape the first 5 pages for all the tracked keywords, remove them all from
// tracked keywords, use the Berlin timezone and don't use throttling.
$serpScraper->scrapeAll(5, true, 'Europe/Berlin', false);
// keywords array has been emptied
var_dump($serpScraper->getKeywords());
// array()
```
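
The `$timezone` argument in the examples above looks like a standard PHP timezone identifier (`UTC`, `Europe/Berlin`). Assuming that is the case, a small check like the one below can catch a misspelled identifier before scraping. The `isValidTimezone()` helper is a hypothetical name for this sketch and is not part of SerpScraper; it relies only on PHP's built-in `DateTimeZone::listIdentifiers()`.

```php
<?php
// Hypothetical helper, not part of SerpScraper: checks a timezone name
// against the identifiers known to the local PHP installation.
function isValidTimezone($name)
{
    return in_array($name, DateTimeZone::listIdentifiers(), true);
}

var_dump(isValidTimezone('UTC'));                 // bool(true)
var_dump(isValidTimezone('Europe/Berlin'));       // bool(true)
// Identifiers use underscores, never spaces.
var_dump(isValidTimezone('America/Los_Angeles')); // bool(true)
var_dump(isValidTimezone('America/Los Angeles')); // bool(false)
```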

serialize() and getFetchedPages()
---------------------------------

Serialize all the results fetched so far. Supported formats are: JSON, XML and YAML. You can access the fetched array by calling `getFetchedPages()`.

`serialize()` signature:

```
$serpScraper->serialize($format, $toRemove = false);
```

Usage example:

```
// serialize to JSON the stuff retrieved so far
$serpScraper->serialize('json');
// serialize to XML the stuff retrieved so far
$serpScraper->serialize('xml');
// fetched pages are still there
var_dump($serpScraper->getFetchedPages());
// array(
//       object(Franzip\SerpPageSerializer\Models\SerializableSerpPage) (1),
//       ...
// )

// now serialize to YAML the stuff retrieved so far and empty the fetched data
$serpScraper->serialize('yml', true);
// fetched array is now empty
var_dump($serpScraper->getFetchedPages());
// array()
```

save() and getSerializedPages()
-------------------------------

Write the results serialized so far to files. Filenames follow the pattern *vendor\_keyword\_pagenumber\_time.format*, for example *google\_foo\_3\_12032015.json*.
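
As an illustration of that naming pattern, the sketch below builds a filename from its parts. `buildResultFilename()` is a hypothetical helper written for this example, not a function of the library. The `dmY` date format and the lowercasing of the vendor name are assumptions inferred from the sample filename (the date `12032015` could equally be month-first), and how the library treats keywords containing spaces is not covered here.

```php
<?php
// Hypothetical helper: illustrates the naming pattern only, not library code.
function buildResultFilename($vendor, $keyword, $pageNumber, DateTimeInterface $time, $format)
{
    return sprintf(
        '%s_%s_%d_%s.%s',
        strtolower($vendor),   // assumption: vendor is lowercased, as in the sample
        $keyword,
        $pageNumber,
        $time->format('dmY'),  // assumption: day, month, year
        $format
    );
}

$time = new DateTimeImmutable('2015-03-12', new DateTimeZone('UTC'));
echo buildResultFilename('Google', 'foo', 3, $time, 'json'), "\n";
// google_foo_3_12032015.json
```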

`save()` signature:

```
$serpScraper->save($toRemove = false);
```

Usage example:

```
// write serialized results so far to the specified output folder
$serpScraper->save();
// serialized pages are still there
var_dump($serpScraper->getSerializedPages());
// array(
//       object(Franzip\SerpPageSerializer\Models\SerializedSerpPage) (1),
//       ...
// )

// write serialized results so far to the specified output folder and remove
// them from the serialized array
$serpScraper->save(true);
// serialized array is now empty
var_dump($serpScraper->getSerializedPages());
// array()
```

Adding/Removing keywords
------------------------

```
$serpScraper->addKeyword('bar');
$serpScraper->addKeywords(array('foo', 'bar', ...));
$serpScraper->removeKeyword('bar');
```

Cache flushing
--------------

You can call `flushCache()` anytime. This will remove all the cached files used by the `SerpFetcher` component and will also remove all the entries from the fetched and serialized arrays.

```
$serpScraper->flushCache();
var_dump($serpScraper->getFetchedPages());
// array()
var_dump($serpScraper->getSerializedPages());
// array()
```

Basic usage
-----------

```
use Franzip\SerpScraper\SerpScraperBuilder;

$googleScraper = SerpScraperBuilder::create('Google', array(array('keyword1',
                                                                  'keyword2',
                                                                  'keyword3')));
// scrape the first page for 'keyword1'
$googleScraper->scrape('keyword1');
// scrape the first 5 pages for 'keyword2'
$googleScraper->scrape('keyword2', 5);
// serialize to JSON what has been scraped so far
$googleScraper->serialize('json');
//
...
```

Using multiple output folders
-----------------------------

You can use different output folders as you see fit. In the example below, each keyword is scraped only once, but the results are written to different folders according to their serialization format. Since the results are cached, each `serialize()` call reuses the same data.

```
use Franzip\SerpScraper\SerpScraperBuilder;

$googleScraper = SerpScraperBuilder::create('Google',
                                            array(array('foo', 'baz', ...)));

// output folders
$xmlDir  = 'google_results/xml';
$jsonDir = 'google_results/json';
$yamlDir = 'google_results/yaml';

...
// scraping action happens here...

// write xml results first
$googleScraper->serialize('xml');
$googleScraper->setOutDir($xmlDir);
$googleScraper->save();
// now json
$googleScraper->serialize('json');
$googleScraper->setOutDir($jsonDir);
$googleScraper->save();
// write yaml results; the fetched array can now be emptied
$googleScraper->serialize('yml', true);
$googleScraper->setOutDir($yamlDir);
$googleScraper->save();
```
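
The same serialize, set folder, save sequence can also be written as a loop over a format-to-folder map. The sketch below is one possible arrangement, not the library's recommended pattern. `saveInAllFormats()` and `RecordingScraper` are hypothetical names: the latter is a stand-in that only records calls so the sketch runs without network access, and a real scraper would take its place. It passes `true` to `save()` on the assumption that `save()` writes everything currently held in the serialized array, so clearing that array after each write keeps each folder limited to one format.

```php
<?php
// Runs serialize -> setOutDir -> save once for each format => folder pair.
function saveInAllFormats($scraper, array $dirsByFormat)
{
    foreach ($dirsByFormat as $format => $dir) {
        $scraper->serialize($format); // fetched pages are kept for the next format
        $scraper->setOutDir($dir);
        $scraper->save(true);         // clear the serialized pages after writing
    }
}

// Stand-in used only to make this sketch runnable on its own.
// It records the calls it receives instead of scraping or writing files.
class RecordingScraper
{
    public $calls = array();

    public function serialize($format, $toRemove = false)
    {
        $this->calls[] = "serialize($format)";
    }

    public function setOutDir($dir)
    {
        $this->calls[] = "setOutDir($dir)";
    }

    public function save($toRemove = false)
    {
        $this->calls[] = 'save(' . var_export($toRemove, true) . ')';
    }
}

$scraper = new RecordingScraper();
saveInAllFormats($scraper, array(
    'xml'  => 'google_results/xml',
    'json' => 'google_results/json',
    'yml'  => 'google_results/yaml',
));
// Prints the nine recorded calls, one per line.
echo implode("\n", $scraper->calls), "\n";
```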

TODOs
-----

- Avoid request delay on cache hit.
- Validate YAML results in the tests (couldn't find a suitable library so far).
- Improve docs with better organization and more examples.
- Refactor messy tests.

License
-------

[MIT](http://opensource.org/licenses/MIT/ "MIT") Public License.
