
crscheid/php-article-extractor
==============================

An HTML article extractor for PHP

Latest version: 2.5.3 | License: MIT | Requires PHP ~7.2 | [Source](https://github.com/crscheid/php-article-extractor) | [Packagist](https://packagist.org/packages/crscheid/php-article-extractor)

PHP Article extractor
=====================


This is a web article parsing and language detection library for PHP. It reads the article content from a web page, strips all HTML, and returns just the raw text, suitable for text-to-speech or machine learning processes.

While developing a project of my own, I found that many existing open source solutions were good starting points, but each failed in its own way. This library aggregates three different approaches into a single solution and adds language detection.

How To Use
----------


This library is distributed via packagist.org, so you can use Composer to install it:

```
composer require crscheid/php-article-extractor

```

### Calling via URL


This library will attempt to retrieve the HTML for you. Simply create an `ArticleExtractor` object and call the `processURL` function on it, passing in the desired URL.

```
use Cscheide\ArticleExtractor\ArticleExtractor;

$extractor = new ArticleExtractor();

$response = $extractor->processURL("https://www.fastcompany.com/3067246/innovation-agents/the-unexpected-design-challenge-behind-slacks-new-threaded-conversations");
var_dump($response);
```

The function `processURL` returns an array containing the title, text, and meta data associated with the request. If `text` is `null`, parsing failed. The code above should produce the output shown below.

The field `result_url` holds the final page actually retrieved, so it will differ from the URL you passed in if the library followed redirects.

```
array(6) {
  ["parse_method"]=>
  string(11) "readability"
  ["title"]=>
  string(72) "The Unexpected Design Challenge Behind Slack’s New Threaded Conversations"
  ["text"]=>
  string(8013) "At first blush, threaded conversations sound like one of the most thoroughly mundane features a messaging app could introduce.After all, the idea of neatly bundling up a specific message and its replies in ..."
  ["language_method"]=>
  string(7) "service"
  ["language"]=>
  string(2) "en"
  ["result_url"]=>
  string(126) "https://www.fastcompany.com/3067246/innovation-agents/the-unexpected-design-challenge-behind-slacks-new-threaded-conversations"

}

```
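Before using a result, a caller will usually want to check for a failed parse and for redirects. The following is a minimal sketch that relies only on the array shape shown above. The helper name `describeResult` and the sample arrays are hypothetical and not part of the library.

```php
<?php
// Hypothetical helper (not part of the library): summarizes a response
// array shaped like the output shown above.
function describeResult(array $response, string $requestedUrl): string
{
    // A null (or missing) text field indicates that parsing failed.
    if (($response['text'] ?? null) === null) {
        return 'failed';
    }

    // result_url differs from the requested URL when redirects were followed.
    // It is absent for HTML-only processing, so fall back to the request URL.
    $finalUrl = $response['result_url'] ?? $requestedUrl;
    if ($finalUrl !== $requestedUrl) {
        return 'redirected to ' . $finalUrl;
    }

    return 'ok';
}

// Sample arrays standing in for real processURL() responses.
$ok = ['title' => 'Example', 'text' => 'Body text', 'result_url' => 'https://example.com/a'];
$moved = ['title' => 'Example', 'text' => 'Body text', 'result_url' => 'https://example.com/b'];
$failed = ['title' => null, 'text' => null];

echo describeResult($ok, 'https://example.com/a'), "\n";
echo describeResult($moved, 'https://example.com/a'), "\n";
echo describeResult($failed, 'https://example.com/a'), "\n";
```

Using `??` for `result_url` keeps the same helper usable for responses that do not include that field.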

### Calling with HTML


If you already have the HTML, you can pass it to the `processHTML` function, which runs it through the same logic.

```
use Cscheide\ArticleExtractor\ArticleExtractor;

$extractor = new ArticleExtractor();
$myHTML = '<html>...</html>'; // placeholder: replace with your HTML document as a string

$response = $extractor->processHTML($myHTML);
var_dump($response);
```

The function `processHTML` returns an array containing the title, text, and meta data associated with the request. If `text` is `null`, parsing failed. The code above should produce the output shown below.

The field `result_url` is not included in this case, since the library does not fetch the HTML itself during the call.

```
array(5) {
  ["parse_method"]=>
  string(11) "readability"
  ["title"]=>
  string(72) "The Unexpected Design Challenge Behind Slack’s New Threaded Conversations"
  ["text"]=>
  string(8013) "At first blush, threaded conversations sound like one of the most thoroughly mundane features a messaging app could introduce.After all, the idea of neatly bundling up a specific message and its replies in ..."
  ["language_method"]=>
  string(7) "service"
  ["language"]=>
  string(2) "en"
}

```

You can also construct the `ArticleExtractor` object with a key for the language detection service and a custom User-Agent string. See the options below for details.

Options
-------


### Language Detection Methods


Language detection is handled by either looking for language specifiers within the HTML meta data or by utilizing the [Detect Language](http://detectlanguage.com/) service.

If the language of the article can be detected, the language code in [ISO 639-1](http://www.loc.gov/standards/iso639-2/php/code_list.php) format and the detection method are returned in the fields `language` and `language_method` respectively. When detection succeeds, `language_method` is either `html` or `service`.

If language detection fails or is not available, both of these fields will be returned as null.
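To illustrate what an HTML-based language specifier can look like, here is a minimal sketch that reads the `lang` attribute of the `<html>` element using PHP's DOM extension. This is an assumption made for illustration: which specifiers the library actually inspects is not stated here, and the function name `languageFromHtml` is hypothetical.

```php
<?php
// Illustrative only: one way to read a language hint from HTML.
// This is not the library's implementation.
function languageFromHtml(string $html): ?string
{
    if (trim($html) === '') {
        return null; // nothing to parse
    }

    $doc = new DOMDocument();
    // Collect libxml warnings internally; real-world markup is rarely clean.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded || $doc->documentElement === null) {
        return null;
    }

    $lang = trim($doc->documentElement->getAttribute('lang'));
    if ($lang === '') {
        return null; // no specifier present
    }

    // Reduce a tag such as "en-US" to its primary subtag "en".
    return strtolower(preg_split('/[-_]/', $lang)[0]);
}

$sample = '<html lang="en-US"><head><title>t</title></head><body><p>Hi</p></body></html>';
echo languageFromHtml($sample) ?? 'unknown', "\n";
```

A page without such a specifier yields `null` here, which mirrors the case where a fallback such as a detection service would be needed.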

[Detect Language](http://detectlanguage.com/) requires an API key, which you can sign up for. You can also use this library without one. In that case, if the HTML meta data do not contain information about the language of the article, `language` and `language_method` will be returned as null values.

To use this library with the language detection service, create the `ArticleExtractor` object by passing in your API key for [Detect Language](http://detectlanguage.com/).

```
use Cscheide\ArticleExtractor\ArticleExtractor;

$extractor = new ArticleExtractor('your api key');
```

### Setting User Agent


You can set the User-Agent for outgoing requests. To do so, pass the desired User-Agent string to the constructor as the second argument:

```
use Cscheide\ArticleExtractor\ArticleExtractor;

$myUserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36";
$extractor = new ArticleExtractor(null, $myUserAgent);
```

### Force Reading Method


You can force the method used to read the article, for example Readability, Goose, or Goose with our custom processing. This comes in handy when Readability or Goose has issues with a particular website.

To force the method, pass it as the third argument to the constructor. The four valid values are `readability`, `goose`, `goosecustom`, and `custom`.

```
$extractor = new ArticleExtractor(null, null, "goose");
```
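Because the method is passed as a plain string, a typo is easy to make. The sketch below validates the value against the four names listed above before it reaches the constructor. The helper `checkedMethod`, the constant `VALID_METHODS`, and the choice to lower-case and trim the input are hypothetical. How the library itself reacts to an unknown value is not covered here.

```php
<?php
// Hypothetical helper (not part of the library): validates a forced
// parse method against the four values documented above.
const VALID_METHODS = ['readability', 'goose', 'goosecustom', 'custom'];

function checkedMethod(?string $method): ?string
{
    if ($method === null) {
        return null; // no forced method
    }

    // Normalizing case and whitespace is a choice made for this sketch.
    $normalized = strtolower(trim($method));
    if (!in_array($normalized, VALID_METHODS, true)) {
        throw new InvalidArgumentException("Unknown parse method: $method");
    }

    return $normalized;
}

echo checkedMethod('goose'), "\n";

// With the package installed, the checked value can be passed on:
// $extractor = new ArticleExtractor(null, null, checkedMethod('goose'));
```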

Output Format
-------------


As of version 1.0, the output format has been altered to provide newline breaks for headings. This is important especially for natural language processing applications in determining sentence boundaries. If this behavior is not desired, simply strip out the additional newlines where needed.

This change was made because, when heading and paragraph HTML elements are simply stripped out, there is often no separation between a heading and the sentence that follows it.

**Example of Output Format for Text Field**

```
\n
A database containing 250 million Microsoft customer records has been found unsecured and online\n
NurPhoto via Getty Images\n
A new report reveals that 250 million Microsoft customer records, spanning 14 years, have been exposed online without password protection.\n
Microsoft has been in the news for, mostly, the wrong reasons recently. There is the Internet Explorer zero-day vulnerability that Microsoft hasn't issued a patch for, despite it being actively exploited. That came just days after the U.S. Government issued a critical Windows 10 update now alert concerning the "extraordinarily serious" curveball crypto vulnerability. Now a newly published report, has revealed that 250 million Microsoft customer records, spanning an incredible 14 years in all, have been exposed online in a database with no password protection.\n
What Microsoft customer records were exposed online, and where did they come from?\n

```
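If you need a different layout, the newlines can be post-processed with a few lines of PHP. The following is a minimal sketch. The helper names `textBlocks` and `flattenText` and the sample string are hypothetical.

```php
<?php
// Hypothetical helpers for post-processing the text field.

// Split the text into non-empty blocks (headings and paragraphs).
function textBlocks(string $text): array
{
    $blocks = array_map('trim', preg_split('/\R+/', $text));

    return array_values(array_filter($blocks, function (string $block): bool {
        return $block !== '';
    }));
}

// Join the blocks with single spaces for consumers that want one line.
function flattenText(string $text): string
{
    return implode(' ', textBlocks($text));
}

$sample = "\nA heading without a period\nFirst paragraph. Second sentence.\n";

print_r(textBlocks($sample));
echo flattenText($sample), "\n";
```

Note that flattening removes the only boundary between a heading and the sentence that follows it, so sentence splitters may treat them as one sentence. Working with the separate blocks avoids that.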

Running tests
-------------


Unit tests are included in this distribution and can be run with PHPUnit after installing dependencies. The recommended approach is to use Docker, so that you do not need the dependencies installed on your system.

> Note: Please set the environment variable `DETECT_LANGUAGE_KEY` with your [Detect Language](http://detectlanguage.com/) key in order for language detection in unit tests to work properly.

### Installing Dependencies


This uses the Composer Docker image to download the requirements. Note the `--ignore-platform-reqs` flag, which is needed because some of our dependencies do not yet support PHP 8.

```
docker run --rm --interactive --tty --volume $PWD:/app composer --ignore-platform-reqs install

```

### Running Unit Tests


This runs the PHPUnit dependency that we downloaded, inside the PHP 7.4 command line environment.

```
docker run -v $(pwd):/app -w /app -e DETECT_LANGUAGE_KEY="$DETECT_LANGUAGE_KEY" --rm php:7.4-cli ./vendor/phpunit/phpunit/phpunit

```
