sebastiansulinski/path-extractor
================================

Parses an html document and extracts paths from images, anchors and other tags.

Latest version: v2.0.0 · License: MIT · Requires PHP ^8.1

[Source](https://github.com/sebastiansulinski/path-extractor) · [Packagist](https://packagist.org/packages/sebastiansulinski/path-extractor)

Path extractor
==============

A package that extracts paths and attributes from image, anchor and other tags in the provided html.

### Installation

```
composer require sebastiansulinski/path-extractor
```

### Basic usage

#### Instantiating

You can instantiate `Extractor` either with the `new` keyword or with the static `make` method. The constructor takes an optional argument, which is the string to be parsed.

```
use SSD\PathExtractor\Extractor;

$extractor = new Extractor;

$extractor = new Extractor($html);

$extractor = Extractor::make();

$extractor = Extractor::make($html);
```

#### Specifying input html

Apart from being able to pass your string via constructor, you can also use the `Extractor::for` method to set it on the instance.

```
$extractor = new Extractor;
$extractor->for($html);
```

#### Extracting images

To extract all images use the `Extractor::extract(Image::class)` method.

```
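use SSD\PathExtractor\Extractor;
use SSD\PathExtractor\Tags\Image;

// Illustrative markup: any html containing img tags will do.
$html = '<img src="/media/image/one.jpg" alt="Image one">';
$html .= '<img src="/media/image/two.jpg" alt="Image two">';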

$images = Extractor::make($html)->extract(Image::class);
```

The above returns an array of `\SSD\PathExtractor\Tags\Image` instances, with properties such as `src` and `alt` available.

#### Extracting anchors

To extract all anchors use the `Extractor::extract(Anchor::class)` method.

```
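use SSD\PathExtractor\Extractor;
use SSD\PathExtractor\Tags\Anchor;

// Illustrative markup: the href values are examples only.
$html = '<a href="/media/files/one.pdf">Document one</a>';
$html .= '<a href="/media/files/two.docx">Word document</a>';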

$anchors = Extractor::make($html)->extract(Anchor::class);
```

The above returns an array of `\SSD\PathExtractor\Tags\Anchor` instances, with properties such as `href`, `target`, `title` and `nodeValue` available.

#### Extracting scripts

To extract all scripts use the `Extractor::extract(Script::class)` method.

```
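use SSD\PathExtractor\Extractor;
use SSD\PathExtractor\Tags\Script;

// Illustrative markup: the src values are examples only.
$html = '<script src="/media/script/one.js"></script>';
$html .= '<script src="/media/script/two.js" async></script>';
$html .= '<script src="/media/script/three.js" defer></script>';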

$scripts = Extractor::make($html)->extract(Script::class);
```

The above returns an array of `\SSD\PathExtractor\Tags\Script` instances, with properties such as `src`, `async` and `defer` available. The last two are booleans, set to `true` or `false` depending on whether the attribute is present on the tag.
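To make the idea of presence-based boolean attributes concrete, here is a rough sketch built on PHP's `DOMDocument`. It is not the package's implementation, and `sketchExtractScripts` is a hypothetical helper introduced only for illustration.

```php
<?php

// Hypothetical helper, not part of the package: collects selected attributes
// from every <script> tag using PHP's built-in DOM extension.
function sketchExtractScripts(string $html): array
{
    $document = new DOMDocument();

    // Collect parser warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $scripts = [];

    foreach ($document->getElementsByTagName('script') as $node) {
        $scripts[] = [
            'src' => $node->getAttribute('src'),
            // Boolean attributes carry no value, so test for presence only.
            'async' => $node->hasAttribute('async'),
            'defer' => $node->hasAttribute('defer'),
        ];
    }

    return $scripts;
}

$html = '<script src="/js/one.js"></script>'
    . '<script src="/js/two.js" async></script>'
    . '<script src="/js/three.js" defer></script>';

$scripts = sketchExtractScripts($html);
```

In this sketch each script becomes a plain array, whereas the package returns `Script` objects.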

#### Limiting extensions

Sometimes you might want to extract only the images or anchors with certain extensions. To do this, use the `Extractor::withExtensions()` method and pass the required extensions either as a single string, as an array, or as separate arguments.

```
$images = Extractor::make($html)->withExtensions('jpg')->extract(Image::class);
$anchors = Extractor::make($html)->withExtensions(['pdf', 'docx'])->extract(Anchor::class);
$anchors = Extractor::make($html)->withExtensions('pdf', 'docx')->extract(Anchor::class);
```
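Filtering by extension can be pictured as a check on the file name part of each path. The sketch below assumes a case-insensitive comparison that ignores query strings. Both behaviours are assumptions rather than documented features of the package, and `sketchHasExtension` is a hypothetical helper.

```php
<?php

// Hypothetical helper, not part of the package: checks whether a path ends
// with one of the allowed extensions (case-insensitive).
function sketchHasExtension(string $path, string ...$extensions): bool
{
    // Drop any query string or fragment before inspecting the file name.
    $cleanPath = (string) parse_url($path, PHP_URL_PATH);
    $extension = strtolower(pathinfo($cleanPath, PATHINFO_EXTENSION));

    return in_array($extension, array_map('strtolower', $extensions), true);
}

$paths = [
    '/media/files/one.pdf',
    '/media/files/two.docx',
    '/media/image/one.JPG?v=2',
];

// Keep only the documents, mirroring withExtensions('pdf', 'docx').
$documents = array_values(array_filter(
    $paths,
    fn (string $path): bool => sketchHasExtension($path, 'pdf', 'docx')
));
```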

#### Pre-pending url

Sometimes you might wish to prepend the protocol, domain name and even a port to the relative paths extracted from your html. To do this, use the `Extractor::withUrl()` method.

```
$html = '<img src="/media/image.jpg">';
$html .= '<img src="https://ssdtutorials.com/media/image2.jpg">';

$images = Extractor::make($html)->withUrl('https://mywebsite.com')->extract(Image::class);
```

The above returns an array containing two instances of `\SSD\PathExtractor\Tags\Image`: one with `src` set to `https://mywebsite.com/media/image.jpg` and the other with `src` set to `https://ssdtutorials.com/media/image2.jpg`. **Please note**: paths that already contain a protocol and domain are left unchanged.
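The rule described above (prefix relative paths, leave absolute ones alone) can be sketched in a few lines. `sketchPrependUrl` is a hypothetical helper, and treating protocol-relative paths (`//host/...`) as absolute is an assumption of this sketch, not documented package behaviour.

```php
<?php

// Hypothetical helper, not part of the package: prefixes a base URL to
// relative paths and leaves absolute URLs untouched.
function sketchPrependUrl(string $path, string $baseUrl): string
{
    // A scheme (https://...) or a protocol-relative prefix (//host/...)
    // means the path is already absolute.
    if (preg_match('#^([a-z][a-z0-9+.-]*:)?//#i', $path) === 1) {
        return $path;
    }

    // Normalise the slashes so exactly one separates base and path.
    return rtrim($baseUrl, '/') . '/' . ltrim($path, '/');
}

$relative = sketchPrependUrl('/media/image.jpg', 'https://mywebsite.com');
$absolute = sketchPrependUrl('https://ssdtutorials.com/media/image2.jpg', 'https://mywebsite.com');
```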

#### Tidying / purifying input

To purify your input before it is parsed, use the `Extractor::withTidy()` method. It takes two optional arguments: `array $config = []`, which lets you overwrite the default `tidy` extension configuration, and `string $encoding = 'utf8'`, should you need to change the encoding.

The default configuration is:

```
[
    'clean' => 'yes',
    'output-html' => 'yes',
    'wrap' => 0,
]
```

More on config options at [HTML Tidy Configuration Options](http://tidy.sourceforge.net/docs/quickref.html).
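As a hedged illustration of how these defaults could be combined with custom options and passed to the `tidy` extension, consider the sketch below. `sketchTidy` is a hypothetical helper. Merging overrides on top of the defaults and falling back to the raw input when the extension is missing are both assumptions of the sketch, not documented package behaviour.

```php
<?php

// Hypothetical helper, not part of the package: merges caller overrides into
// the defaults listed above and runs the tidy extension when it is available.
function sketchTidy(string $html, array $config = [], string $encoding = 'utf8'): string
{
    $defaults = [
        'clean' => 'yes',
        'output-html' => 'yes',
        'wrap' => 0,
    ];

    // Later keys win, so caller-supplied options override the defaults.
    $options = array_merge($defaults, $config);

    if (!function_exists('tidy_repair_string')) {
        // Assumption for this sketch: return the raw input without ext-tidy.
        return $html;
    }

    $repaired = tidy_repair_string($html, $options, $encoding);

    return $repaired === false ? $html : $repaired;
}

$clean = sketchTidy('<p>Unclosed paragraph<img src="/media/image.jpg">');
```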

#### Invalid input exception

If you decide NOT to use `tidy` (for instance because you purify the input yourself before passing it to the constructor or the `for` method) and the provided html contains invalid syntax, `\SSD\PathExtractor\InvalidHtmlException` is thrown, so make sure you catch it and act accordingly.

#### Accessing attributes of the `\SSD\PathExtractor\Tags\Tag` class instance

Each implementation of `\SSD\PathExtractor\Tags\Tag` has its own set of properties available:

```
\SSD\PathExtractor\Tags\Anchor

- href
- target
- title
- rel
- nodeValue (the text between the opening and closing tags)

\SSD\PathExtractor\Tags\Image

- src
- alt
- width
- height

\SSD\PathExtractor\Tags\Script

- src
- type
- charset
- async
- defer

\SSD\PathExtractor\Tags\Link

- href
- type
- rel
```

#### Rendering tag for `\SSD\PathExtractor\Tags\Tag` class instance

Once you have extracted the collection of resources, you can then return an html tag for each one by simply casting it to string or by calling the `tag()` method on it.

```
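// Illustrative markup: any html containing img tags will do.
$html = '<img src="/media/image/one.jpg" alt="Image one">';
$html .= '<img src="/media/image/two.png" alt="Image two">';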

$tag1 = (string)Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0];
$tag2 = Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0]->tag();
```

Both of the above return the same string: the rendered `img` tag for the first matching image.

You can also obtain an array representation of each instance by calling the `Tag::toArray()` method on it:

```
$attributes = Extractor::make($html)->withExtensions('jpg')->extract(Image::class)[0]->toArray();
```

#### Adding more tag types

If you need more tag types, e.g. `link`, simply add a new class that extends `\SSD\PathExtractor\Tags\Tag` and implement the abstract methods it requires.

```
use SSD\PathExtractor\Tags\Tag;
use SSD\PathExtractor\Tags\Type;

class Link extends Tag
{
    /**
     * Get tag name.
     *
     * @return string
     */
    public static function tagName(): string
    {
        return 'link';
    }

    /**
     * Get path attribute.
     *
     * @return string
     */
    public static function pathAttribute(): string
    {
        return 'href';
    }

    /**
     * Get available attributes.
     *
     * @return array
     */
    public static function availableAttributes(): array
    {
        return [
            'href' => Type::STRING,
            'type' => Type::STRING,
            'rel' => Type::STRING,
        ];
    }

    /**
     * Get formatted tag.
     *
     * @return string
     */
    public function tag(): string
    {
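        // Illustrative rendering, assuming attributes are readable as properties.
        return '<link href="'.$this->href.'" type="'.$this->type.'" rel="'.$this->rel.'">';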
    }
}
```

#### Example of extracting only paths

```
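$string = '<img src="/media/image/one.jpg">';
$string .= '<img src="https://mysite.com/media/image/two.jpg">';
$string .= '<a href="/media/files/two.pdf">Document</a>';
$string .= '<script src="/media/script/three.js"></script>';
$string .= '<link href="/media/link/three.css">';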

$extractor = Extractor::make($string);

$images = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Image::class));

$anchors = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Anchor::class));

$scripts = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Script::class));

$links = array_map(function (Tag $tag) {
    return $tag->path();
}, $extractor->extract(Link::class));

$this->assertEquals([
    '/media/image/one.jpg',
    'https://mysite.com/media/image/two.jpg',
    '/media/files/two.pdf',
    '/media/script/three.js',
    '/media/link/three.css',
], array_merge($images, $anchors, $scripts, $links));
```
