juanparati/phpscraper
=====================

A PHP scraper

Version 1.1 · MIT license · PHP

[Source](https://github.com/juanparati/phpscraper) · [Packagist](https://packagist.org/packages/juanparati/phpscraper)

PHPSCRAPER
==========

1. What is it?
--------------

A command-line tool for extracting and formatting content from web pages. It is suitable for extracting content such as:

- product catalogs
- reviews
- lists
- etc

The output is formatted as [JSON lines](http://jsonlines.org/).
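Because each output line is a standalone JSON document, the result can be consumed one record at a time. The following is a minimal sketch of reading such output in Python; the sample records and their field names are invented for illustration and are not taken from a real run of the tool.

```python
import json

# Hypothetical sample output: one JSON object per line.
# Field names and values are invented for illustration.
SAMPLE_OUTPUT = """\
{"product": "Example product", "name": "Alice", "rating": 4.0, "comment": "Works well.", "verified": true}
{"product": "Example product", "name": "Bob", "rating": 2.5, "comment": "Stopped working.", "verified": false}
"""


def parse_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(SAMPLE_OUTPUT)
average_rating = sum(r["rating"] for r in records) / len(records)
print(len(records), average_rating)
```

The same function works on the contents of a file if the scraper output has been saved to disk.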

2. How does it work?
--------------------

1. Create a scraper recipe (see [recipes](recipes))
2. Type:

    ```
     phpscraper config url

    ```

The following example will extract all the reviews with the user name, comment and rating from Amazon:

```
    phpscraper recipes/amazon.yml https://www.amazon.de/product-reviews/B000J34HN4/ref=acr_dpx_hist_3?ie=UTF8

```

To see additional options, type:

```
    phpscraper --help

```

3. Recipes
----------

Recipes are YAML files that describe, in a structured way, how to extract content from pages. Recipes use [XPath](https://www.w3schools.com/xml/xpath_intro.asp) expressions to specify which elements are extracted.

Example of a recipe that extracts comments from Amazon reviews:

```
    project: "Amazon reviews extractor"
    pagination:
      next_xpath: "//li[@class='a-last']/a/@href"
    extraction:
      product:
        xpath: "//h1/a[@class='a-link-normal']"
        extract_as: "product"
        in_memory: true
      comments:
        xpath: "//div[@class='a-section celwidget']"
        subelements:
          product:
            from_memory: "product"
            extract_as: "product"
          name:
            xpath: "//span[@class='a-profile-name']"
            extract_as: "name"
          rate:
            xpath: "//i[contains(@class, 'a-icon-star')]/span"
            extract_as: "rating"
            extract_regex: "/^.{0,3}/"
            cast_as: float
          comment:
            xpath: "//span[@class='a-size-base review-text review-text-content']"
            extract_as: "comment"
          verified:
            xpath: "//span[@class='a-size-mini a-color-state a-text-bold']"
            extract_as: "verified"
            cast_as: boolean

```

### 3.1 The pagination section

It defines where the "next page" button is located. If this element is not found, the scraper finishes the process once the current page has been extracted.
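The pagination behaviour described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the tool's implementation: the pages are a hypothetical in-memory dictionary, and because Python's standard `xml.etree.ElementTree` supports only a subset of XPath (it cannot select `/@href` directly), the sketch selects the `a` element and reads its `href` attribute instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site": URL -> well-formed markup.
# The last page has no link inside the "a-last" item.
PAGES = {
    "/reviews?page=1": "<html><body><ul><li class='a-last'><a href='/reviews?page=2'>Next</a></li></ul></body></html>",
    "/reviews?page=2": "<html><body><ul><li class='a-last'><a href='/reviews?page=3'>Next</a></li></ul></body></html>",
    "/reviews?page=3": "<html><body><ul><li class='a-last'>Next</li></ul></body></html>",
}


def next_page(markup):
    """Return the href of the "next page" link, or None if there is none."""
    root = ET.fromstring(markup)
    link = root.find(".//li[@class='a-last']/a")
    return link.get("href") if link is not None else None


def crawl(start_url):
    """Visit pages in order until no "next page" link is found."""
    visited = []
    url = start_url
    while url is not None and url not in visited:
        visited.append(url)  # the current page would be extracted here
        url = next_page(PAGES[url])
    return visited


print(crawl("/reviews?page=1"))
```

The `url not in visited` check is an extra safeguard added in this sketch so that a page linking back to itself cannot loop forever.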

### 3.2 The extraction section

It defines which elements are going to be extracted. It uses a cascading structure, so it's possible to define parent and child elements.
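The parent and child idea can be sketched in Python as follows. This is only an illustration, assuming a small hypothetical page and hypothetical class names; it uses the standard library's `xml.etree.ElementTree`, which supports a subset of XPath, so the relative paths are written explicitly with a leading `.//`.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical page: two parent elements, each with two child elements.
PAGE = """
<html><body>
  <div class="review"><span class="name">Alice</span><span class="text">Works well.</span></div>
  <div class="review"><span class="name">Bob</span><span class="text">Stopped working.</span></div>
</body></html>
""".strip()

# Output field name -> path evaluated relative to each parent element.
FIELDS = {
    "name": ".//span[@class='name']",
    "comment": ".//span[@class='text']",
}


def extract(page, parent_path, fields):
    """Return one dict per parent element, filled from its child elements."""
    root = ET.fromstring(page)
    rows = []
    for parent in root.findall(parent_path):
        row = {}
        for field, child_path in fields.items():
            node = parent.find(child_path)  # searched inside this parent only
            row[field] = node.text if node is not None else None
        rows.append(row)
    return rows


rows = extract(PAGE, ".//div[@class='review']", FIELDS)
for row in rows:
    print(json.dumps(row))  # one JSON object per line
```

Each parent element yields one output record, and the child paths never match elements that belong to a different parent.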

The possible instructions for the extraction section are:

- xpath: Defines the element to extract using an XPath expression. The expression automatically becomes relative to the parent element, if one exists.
- subelements: Defines the child elements.
- extract\_as: Always required. Sets the field name of the extracted element.
- extract\_regex: Regular expression applied to the extracted content, so that only the matching part is kept.
- cast\_as: Casts the extracted data so that the JSON output reflects the right data type. Possible casts are "boolean", "int", "float" and "string".
- in\_memory: Keeps the extracted value temporarily in memory so that it can be reused, for example inside another extraction (the recipe above reads it back with from\_memory).
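To make the effect of extract\_regex and cast\_as concrete, here is a minimal Python sketch of the post-processing they describe. It is not the tool's code: the function names are hypothetical, the PHP-style delimiters of `/^.{0,3}/` are dropped because Python patterns do not use them, and the rules "non-empty text counts as true" and "keep the value unchanged when nothing matches" are assumptions.

```python
import re


def apply_regex(value, pattern):
    """Return the first match of pattern in value.

    Assumption: if nothing matches, the value is returned unchanged.
    """
    match = re.search(pattern, value)
    return match.group(0) if match else value


# Cast name -> conversion function.
CASTS = {
    "boolean": lambda v: bool(v.strip()),  # assumption: non-empty text is true
    "int": int,
    "float": float,
    "string": str,
}


def extract_field(raw, extract_as, extract_regex=None, cast_as="string"):
    """Apply the optional regex, cast the result, and name the field."""
    value = raw
    if extract_regex is not None:
        value = apply_regex(value, extract_regex)
    return {extract_as: CASTS[cast_as](value)}


# The rating text is cut to its first three characters, then cast to float.
rating = extract_field("4.0 out of 5 stars", "rating",
                       extract_regex=r"^.{0,3}", cast_as="float")
verified = extract_field("Verified Purchase", "verified", cast_as="boolean")
print(rating, verified)
```

With the casts applied, the rating is emitted as a JSON number and the verified flag as a JSON boolean instead of both being strings.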

4. Installation
---------------

PHPscraper can be installed in different ways:

- Download the [latest build from GitHub](https://github.com/juanparati/phpscraper/releases/latest), or
- Install it globally with Composer: `composer global require juanparati/phpscraper`

5. How to build my own package:
-------------------------------

- [Download Caveman](https://github.com/Mamuph/caveman/releases) (The Mamuph Helper Tool)
- Clone this project
- Inside the project directory type:

    ```
      caveman build . -x -r

    ```
