
ppajer/domextractor
===================

Easily extract data from HTML and XML DOM documents. Web scraping made less messy.


PHP-DOM-Extractor
=================


A PHP library for extracting data from an HTML DOM document into any user-defined data structure, based on custom extraction rules.

Usage
-----


### Install


Download the repo and install via Composer, or manually download and `include` the class in your project. Note: this package requires [ivopetkov/html5-dom-document-php](https://github.com/ivopetkov/html5-dom-document-php) to process HTML5 documents. If you're installing manually, you will need to manage this dependency yourself.

### Defining extraction rules


Rules are plain PHP arrays that tell the extractor where to look for each value. Each rule consists of a `key` to store the output under and a CSS `selector` that matches the required element. By default the element's text value is returned, unless you specify an attribute to return instead. All instruction keys for the extractor are prefixed with an `@` and are omitted from the output.

#### Basic query & attributes


The package uses CSS selector syntax for getting values from document nodes, including text and attribute nodes. The most basic rule could be written as:

```
array(
	'exampleKey' => array(
		'@selector' => 'title'
	)
)

// Will return:

array(
	'exampleKey' => 'Example Title'
)

```

If the data you're looking for is inside an element attribute, append the attribute name to the selector after an `@` sign.

```
array(
	'exampleKey' => array(
		'@selector' => 'h1@class'
	)
)

// Will return:

array(
	'exampleKey' => 'h1 green-text site-heading'
)

```
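
To make the `selector@attribute` convention concrete, here is a minimal sketch of how such a string could be split into its CSS selector and attribute name. `splitSelector` is a hypothetical helper written for illustration; it is not part of the library's documented API, and the library's own parsing may differ.

```php
<?php
// Hypothetical helper, not part of DOM_Extractor's documented API.
// Splits a rule string at its last '@' into [selector, attribute].
// The attribute is null when the rule asks for the element's text content.
function splitSelector(string $rule): array
{
    $pos = strrpos($rule, '@');
    if ($pos === false) {
        return [$rule, null];
    }
    return [substr($rule, 0, $pos), substr($rule, $pos + 1)];
}

// 'title' has no attribute part; 'h1@class' targets the class attribute.
[$selector, $attribute] = splitSelector('h1@class');
```

Splitting at the last `@` is an assumption of this sketch; it keeps selectors such as `.carousel-item img@src` intact up to the attribute name.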

#### Lists & nested data


If you need to parse multiple values for a single key, or look for nested data, you can use the `@each` instruction, and nest as many levels of instructions as your memory limit allows:

```
array(
	'exampleKey' => array(
		'@selector' => '.some-list-item',
		'@each' => array(
			'listItemTitle' => array(
				'@selector' => 'h3'
			),
			'listItemLink' => array(
				'@selector' => 'a@href'
			),
			'listItemImages' => array(
				'@selector' => '.carousel-item',
				'@each' => array(
					'src' => array(
						'@selector' => 'img@src'
					)
				)
			)
		)
	)
)

```

This returns an array in which `exampleKey` holds one sub-array per matched list item: in this example, the text content of the item's `h3` tag, the `href` attribute of its `a` element, and, for each `.carousel-item` inside it, the `src` attribute of the `img` element.

```
array(
	'exampleKey' => array(
		array(
			'listItemTitle' => 'Some title',
			'listItemLink' => 'https://...',
			'listItemImages' => array(
				array('src' => 'https://...'),
				array('src' => 'https://...'),
				...
			)
		),
		array(
			'listItemTitle' => 'Some other title',
			'listItemLink' => 'https://...',
			'listItemImages' => array(
				array('src' => 'https://...'),
				array('src' => 'https://...'),
				...
			)
		),
		...
	)
)

```
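
Once extracted, the nested result is an ordinary PHP array and can be walked with standard functions. The sketch below uses a hand-written stand-in for the output shown above; the URLs and the empty image list are placeholder values chosen for illustration, not output from the library.

```php
<?php
// Illustrative stand-in for the array returned by the extractor.
// All values are placeholders.
$data = array(
    'exampleKey' => array(
        array(
            'listItemTitle' => 'Some title',
            'listItemLink' => 'https://example.com/a',
            'listItemImages' => array(
                array('src' => 'https://example.com/a1.jpg'),
                array('src' => 'https://example.com/a2.jpg'),
            ),
        ),
        array(
            'listItemTitle' => 'Some other title',
            'listItemLink' => 'https://example.com/b',
            'listItemImages' => array(),
        ),
    ),
);

// Build a map of item title => flat list of image URLs.
$imagesByTitle = array();
foreach ($data['exampleKey'] as $item) {
    $imagesByTitle[$item['listItemTitle']] = array_column($item['listItemImages'], 'src');
}
```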

### Setting up the rules


Once your rules are ready, you can pass them to the extractor either as the first constructor argument or by calling `setRules` on an instance. For convenience, the rules can be given as a PHP array, as a JSON string, or as the path to an external JSON file.

```
$rules = /* array or JSON string or file path */;

// Constructor
$extractor = new DOM_Extractor($rules);

// OR Instance
$extractor = new DOM_Extractor();
$extractor->setRules($rules);

```
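
As a rough illustration of the JSON option, the sketch below writes the basic attribute rule as a JSON string and checks that decoding it yields the same structure as the PHP array form. The exact JSON layout the library expects is an assumption here: it is taken to mirror the array one-to-one.

```php
<?php
// Assumed JSON form of the basic rule; mirrors the PHP array one-to-one.
$json = <<<'JSON'
{
    "exampleKey": {
        "@selector": "h1@class"
    }
}
JSON;

// Decode to associative arrays so the result is comparable to the array form.
$rulesFromJson = json_decode($json, true);

$rulesAsArray = array(
    'exampleKey' => array(
        '@selector' => 'h1@class'
    )
);

var_dump($rulesFromJson === $rulesAsArray); // bool(true)
```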

### Loading the document


Once everything is set, you are ready to load the document to parse and start extraction. As with passing the rules, here too you have the option of using the constructor's second argument or the dedicated `load` method.

```
$html = file_get_contents('https://...');

// Constructor
$extractor = new DOM_Extractor($rules, $html);

// OR Instance
$extractor = new DOM_Extractor();
$extractor->load($html);

```
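
Note that `file_get_contents` returns `false` when the request fails, so it can be worth checking the result before handing it to the extractor. The wrapper below is a minimal sketch; `fetchHtml` is a hypothetical helper and not part of the library.

```php
<?php
// Hypothetical helper: wraps file_get_contents() so that a failed read
// raises an exception instead of passing `false` on to the extractor.
function fetchHtml(string $url): string
{
    // '@' suppresses the PHP warning; the failure is reported via the exception.
    $html = @file_get_contents($url);
    if ($html === false) {
        throw new RuntimeException("Could not read document from {$url}");
    }
    return $html;
}
```

The returned string can then be passed to the constructor or to `load` exactly as `$html` is above.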

### Complete example


```
$rules = 'some/path/to/rules.json';
$html = file_get_contents('https://...');

// Constructor method
$extractor = new DOM_Extractor($rules, $html);
$data = $extractor->parse();

// Instance method
$extractor = new DOM_Extractor;
$extractor->setRules($rules);
$extractor->load($html);
$data = $extractor->parse();

// Also supports method chaining:
$extractor = new DOM_Extractor;
$data = $extractor->setRules($rules)->load($html)->parse();
```
