duzun/hquery
============

An extremely fast web scraper that parses megabytes of HTML in a blink of an eye. No dependencies. PHP5+

hQuery.php [![Donate](https://camo.githubusercontent.com/604e3db9c8751116b3f765aad0353ec7ded655bbe8aaacbc38d8c4a6b784b3ed/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446f6e6174652d50617950616c2d677265656e2e737667)](https://www.paypal.me/duzuns)
==================================================================================================================================================================================================================================================================

An extremely fast and efficient web scraper that can parse megabytes of invalid HTML in a blink of an eye.

You can use the familiar jQuery/CSS selector syntax to easily find the data you need.

In my unit tests, I demand it be at least 10 times faster than Symfony's DOMCrawler on a 3 MB HTML document. In practice, according to my humble tests, it is two to three orders of magnitude faster than DOMCrawler in some cases, especially when selecting thousands of elements, and on average it uses about half the RAM.

See [tests/README.md](https://github.com/duzun/hQuery.php/blob/master/tests/README.md).

[API Documentation](https://duzun.github.io/hQuery.php/docs/class-hQuery.html)

💡 Features
----------

- Very fast parsing and lookup
- Parses broken HTML
- jQuery-like style of DOM traversal
- Low memory usage
- Can handle big HTML documents (I have tested up to 20 MB, but the limit is the amount of RAM you have)
- Doesn't require cURL to be installed and automatically handles redirects (see [hQuery::fromUrl()](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_fromURL))
- Caches response for multiple processing tasks
- [PSR-7](https://www.php-fig.org/psr/psr-7/) friendly (see hQuery::fromHTML($message))
- PHP 5.3+
- No dependencies

Requirements
------------

- PHP 5.3 or newer (PHP 7.4+ recommended)
- `mbstring` extension is recommended for reliable charset handling and conversions
- Ensure a sufficient `memory_limit` when working with very large documents
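
The requirements above can be checked with a small preflight script. This is a minimal sketch that uses only core PHP functions. `hq_ini_bytes()` is a hypothetical helper and the 256M budget is an assumed figure, neither comes from hQuery itself.

```php
<?php
// hq_ini_bytes() is a hypothetical helper (not part of hQuery): it converts
// php.ini shorthand such as "128M" or "1G" to bytes; -1 means "no limit".
function hq_ini_bytes($value)
{
    $value = trim((string) $value);
    if ($value === '' || $value === '-1') {
        return -1;
    }
    $number = (float) $value;               // numeric prefix, e.g. "128M" gives 128.0
    switch (strtolower(substr($value, -1))) {
        case 'g': $number *= 1024;          // intentional fall-through
        case 'm': $number *= 1024;          // intentional fall-through
        case 'k': $number *= 1024;
    }
    return (int) $number;
}

$hasMbstring = extension_loaded('mbstring');
$limitBytes  = hq_ini_bytes(ini_get('memory_limit'));
$wantedBytes = 256 * 1024 * 1024;           // assumed budget for a large document

if (!$hasMbstring) {
    echo "mbstring is not loaded: charset conversion may be less reliable", PHP_EOL;
}
if ($limitBytes !== -1 && $limitBytes < $wantedBytes) {
    echo "memory_limit is below the assumed 256M budget", PHP_EOL;
}
```

Pick the budget from the size of the documents you actually parse. The right figure depends on your workload.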

🛠 Install
---------

Add the library to your project and include it, or install via Composer/npm.

Using Composer (recommended):

```
composer require duzun/hquery
```

Or include manually:

```
include_once '/path/to/hquery.php/hquery.php';
```

Or via npm:

```
npm install hquery.php
```

Then require the file from `node_modules` if needed.

⚙ Usage
-------

### Basic setup:

```
// Optionally use namespaces
use duzun\hQuery;

// Either use composer, or include this file:
include_once '/path/to/libs/hquery.php';

// Set the cache path - must be a writable folder
// If not set, hQuery::fromURL() would make a new request on each call
hQuery::$cache_path = "/path/to/cache";

// Time to keep request data in cache, seconds
// A value of 0 disables cache
hQuery::$cache_expires = 3600; // default one hour
```
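
Since the cache path must be a writable folder, it can help to create and verify it before assigning it. The following is a sketch under assumptions: the `hquery-cache` folder under the system temp directory is an arbitrary location chosen for illustration, and hQuery is only touched if the class has already been loaded.

```php
<?php
// Assumed location: a dedicated folder under the system temp directory.
$cacheDir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'hquery-cache';

// Create the folder if it is missing (the second is_dir() guards against a race).
if (!is_dir($cacheDir) && !mkdir($cacheDir, 0775, true) && !is_dir($cacheDir)) {
    throw new RuntimeException("Cannot create cache folder: $cacheDir");
}
if (!is_writable($cacheDir)) {
    throw new RuntimeException("Cache folder is not writable: $cacheDir");
}

// Configure hQuery only when it is available (via Composer or include_once).
if (class_exists('duzun\hQuery')) {
    \duzun\hQuery::$cache_path    = $cacheDir;
    \duzun\hQuery::$cache_expires = 3600; // seconds
}
```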

I would recommend using [php-http/cache-plugin](http://docs.php-http.org/en/latest/plugins/cache.html) with a [PSR-7 client](http://docs.php-http.org/en/latest/clients.html) for better flexibility.

### Load HTML from a file

###### [hQuery::fromFile](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_fromFile)( string `$filename`, boolean `$use_include_path` = false, resource `$context` = NULL )

```
// Local
$doc = hQuery::fromFile('/path/to/filesystem/doc.html');

// Remote
$doc = hQuery::fromFile('https://example.com/', false, $context);
```

Where `$context` is created with [stream\_context\_create()](https://secure.php.net/manual/en/function.stream-context-create.php).

For an example of using `$context` to make an HTTP request through a proxy, see [\#26](https://github.com/duzun/hQuery.php/issues/26#issuecomment-351032382).
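
As a rough illustration of such a `$context`, the sketch below builds one with PHP's standard `http` stream-wrapper options. The proxy address is a placeholder, and the timeout and header values are assumptions rather than anything hQuery requires.

```php
<?php
// Placeholder proxy address: replace with your own.
$proxy = 'tcp://127.0.0.1:8080';

$context = stream_context_create(array(
    'http' => array(
        'method'          => 'GET',
        'proxy'           => $proxy,
        'request_fulluri' => true,   // send the full URI in the request line, as many proxies expect
        'timeout'         => 7,      // seconds (assumed value)
        'header'          => "Accept: text/html\r\n",
    ),
));

// With hQuery loaded, the context is passed as the third argument:
// $doc = hQuery::fromFile('https://example.com/', false, $context);
```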

### Load HTML from a string

###### [hQuery::fromHTML](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_fromHTML)( string `$html`, string `$url` = NULL )

```
$doc = hQuery::fromHTML('<html><head><title>Sample HTML Doc</title><body>Contents...</body></html>');

// Set base_url, in case the document is loaded from local source.
// Note: The base_url property is used to retrieve absolute URLs from relative ones.
$doc->base_url = 'http://desired-host.net/path';
```

### Load a remote HTML document

###### [hQuery::fromUrl](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_fromURL)( string `$url`, array `$headers` = NULL, array|string `$body` = NULL, array `$options` = NULL )

```
use duzun\hQuery;

// GET the document
$doc = hQuery::fromUrl('http://example.com/someDoc.html', ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']);

var_dump($doc->headers); // See response headers
var_dump(hQuery::$last_http_result); // See response details of last request

// with POST
$doc = hQuery::fromUrl(
    'http://example.com/someDoc.html', // url
    ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8'], // headers
    ['username' => 'Me', 'fullname' => 'Just Me'], // request body - could be a string as well
    ['method' => 'POST', 'timeout' => 7, 'redirect' => 7, 'decode' => 'gzip'] // options
);
```

For building advanced requests (POST, parameters, etc.), see [hQuery::http\_wr()](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_http_wr), though I recommend using a specialized ([PSR-7](https://www.php-fig.org/psr/psr-7/)?) library for making requests and `hQuery::fromHTML($html, $url=NULL)` for processing results. See [Guzzle](http://docs.guzzlephp.org/en/stable/), for example.

#### [PSR-7](https://www.php-fig.org/psr/psr-7/) example:

```
composer require php-http/message php-http/discovery php-http/curl-client
```

If you don't have [cURL PHP extension](https://secure.php.net/curl), just replace `php-http/curl-client` with `php-http/socket-client` in the above command.

```
use duzun\hQuery;

use Http\Discovery\HttpClientDiscovery;
use Http\Discovery\MessageFactoryDiscovery;

$client = HttpClientDiscovery::find();
$messageFactory = MessageFactoryDiscovery::find();

$request = $messageFactory->createRequest(
  'GET',
  'http://example.com/someDoc.html',
  ['Accept' => 'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8']
);

$response = $client->sendRequest($request);

$doc = hQuery::fromHTML($response, $request->getUri());
```

Another option is to use [stream\_context\_create()](https://secure.php.net/manual/en/function.stream-context-create.php) to create a `$context`, then call `hQuery::fromFile($url, false, $context)`.

### Processing the results

###### [hQuery::find](https://duzun.github.io/hQuery.php/docs/class-hQuery.html#_find)( string `$sel`, array|string `$attr` = NULL, hQuery\\Node `$ctx` = NULL )

```
// Find all banners (images inside anchors)
$banners = $doc->find('a[href] > img[src]:parent');

// Extract links and images
$links  = array();
$images = array();
$titles = array();

// If the result of find() is not empty
// $banners is a collection of elements (hQuery\Element)
if ( $banners ) {

    // Iterate over the result
    foreach($banners as $id => $a) {
        // $a->href property is the resolved $a->attr('href') relative to the
        // document's <base href=...>, if present, or $doc->baseURL.
        $links[$id] = $a->href; // get absolute URL from href property
        $titles[$id] = trim($a->text()); // strip all HTML tags and leave just text

        // Filter the result
        if ( !$a->hasClass('logo') ) {
            // $a->style property is the parsed $a->attr('style'), same as $a->attr('style', true)
            if ( strtolower($a->style['position']) == 'fixed' ) continue;

            $img = $a->find('img')[0]; // ArrayAccess
            if ( $img ) $images[$id] = $img->src; // short for $img->attr('src', true)
        }
    }

    // If at least one element has the class .home
    if ( $banners->hasClass('home') ) {
        echo 'There is .home button!', PHP_EOL;

        // ArrayAccess for elements and properties.
        if ( $banners[0]['href'] == '/' ) {
            echo 'And it is the first one!';
        }
    }
}

// Read charset of the original document (internally it is converted to UTF-8)
$charset = $doc->charset;

// Get the size of the document ( strlen($html) )
$size = $doc->size;

// The URL at which the document was requested
$requestUri = $doc->href;

// <base href=...>, if present, or the origin + dir path part from $doc->href.
// The .href and .src props are resolved using this value.
$baseURL = $doc->baseURL;
```

Charset and positions:

- The document is converted internally to UTF-8 for parsing.
- Element positions (the numeric offsets used internally and returned by APIs that expose byte offsets) refer to the internal UTF-8 string bytes.

Note: In case the charset meta attribute has a wrong value or the internal conversion fails for any other reason, `hQuery` will continue processing with the original HTML, but will register an error message on `$doc->html_errors['convert_encoding']`.
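
To see why byte offsets matter, the sketch below compares byte and character positions in a UTF-8 string using core PHP functions only. It does not call hQuery. It simply illustrates that offsets into a UTF-8 string should be used with byte-oriented functions such as `substr()` and `strlen()`, not with their `mb_*` counterparts.

```php
<?php
// "é" is written as the two bytes \xC3\xA9, so the string is UTF-8
// regardless of how this file itself is encoded.
$html = "<p>Caf\xC3\xA9</p><b>ok</b>";

// Byte offset of the <b> element.
$bytePos = strpos($html, '<b>');

// Character offset of the same element (only if mbstring is available).
$charPos = function_exists('mb_strpos') ? mb_strpos($html, '<b>', 0, 'UTF-8') : null;

// A byte offset must be paired with byte-oriented functions.
$fragment = substr($html, $bytePos, strlen('<b>ok</b>'));

echo "byte offset: $bytePos", PHP_EOL;
if ($charPos !== null) {
    echo "char offset: $charPos", PHP_EOL;
}
echo "fragment: $fragment", PHP_EOL;
```

The two offsets differ by one here because `é` occupies two bytes but counts as a single character.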

🖧 Live Demo
-----------

On [DUzun.Me](https://duzun.me/playground/hquery#sel=%20a%20%3E%20img%3Aparent&url=https%3A%2F%2Fgithub.com%2Fduzun)

A lot of people ask for sources of my **Live Demo** page. Here we go:

[view-source:https://duzun.me/playground/hquery](https://github.com/duzun/hQuery.php/blob/master/examples/duzun.me_playground_hquery.php)

### 🏃 Run the playground

You can easily run any of the `examples/` on your local machine. All you need is PHP installed on your system. After you clone the repo with `git clone https://github.com/duzun/hQuery.php.git`, you have several options for starting a web server.

###### Option 1:

```
cd hQuery.php/examples
php -S localhost:8000

# open browser http://localhost:8000/
```

###### Option 2 (browser-sync):

This option starts a live-reload server and is good for playing with the code.

```
npm install
gulp

# open browser http://localhost:8080/
```

###### Option 3 (VSCode):

If you are using VSCode, simply open the project and run the debugger (`F5`).

🔧 TODO
------

- Unit test everything
- Document everything
- Cookie support (implemented in mem for redirects)
- Improve selectors to be able to select by attributes
- Add more selectors
- Use [HTTPlug](http://httplug.io/) internally

💖 Support my projects
---------------------

I love Open Source. Whenever possible I share cool things with the world (check out [NPM](https://duzun.me/npm) and [GitHub](https://github.com/duzun/)).

If you like what I'm doing and this project helps you reduce development time, please consider:

- ★ Star and Share the projects you like (and use)
- ☕ Give me a cup of coffee - [PayPal.me/duzuns](https://www.paypal.me/duzuns) (contact at duzun.me)
- ₿ Send me some **Bitcoin** at this address: `bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa` (or using the QR below) [![bitcoin:3MVaNQocuyRUzUNsTbmzQC8rPUQMC9qafa](https://camo.githubusercontent.com/24a9610427b2ce0ef5f7d953c1e94e8f7dbb7263a9020e2bed83a5031c59afcb/68747470733a2f2f63646e2e64757a756e2e6d652f66696c65732f71725f626974636f696e2d334d56614e516f63757952557a554e7354626d7a514338725055514d4339716166612e706e67)](https://camo.githubusercontent.com/24a9610427b2ce0ef5f7d953c1e94e8f7dbb7263a9020e2bed83a5031c59afcb/68747470733a2f2f63646e2e64757a756e2e6d652f66696c65732f71725f626974636f696e2d334d56614e516f63757952557a554e7354626d7a514338725055514d4339716166612e706e67)
