
zrashwani/news-scrapper
=======================

Scrapes news data from a webpage using structured data.

Version 1.0.3 · MIT license · PHP >= 5.4.0

[Source](https://github.com/zrashwani/news-scrapper) · [Packagist](https://packagist.org/packages/zrashwani/news-scrapper)

News Scrapper
=============


This library extracts article/news information from a webpage, including title, main image, description, author, keywords, publish date and body (if possible).

This library supports scraping using standard structured metadata, such as [Microdata](http://schema.org/Article), [hAtom Microformat](http://microformats.org/wiki/hatom), etc., along with custom selectors that can be specified to support unstructured webpages.

News-Scrapper requires PHP >= 5.4.

[![Build Status](https://camo.githubusercontent.com/537b783d44e9af10d1fa7a2fe1c0047af82d96ab6d70c90b7ddd2d2bc7bb422d/68747470733a2f2f7472617669732d63692e6f72672f7a7261736877616e692f6e6577732d73637261707065722e7376673f6272616e63683d6d6173746572)](https://travis-ci.org/zrashwani/news-scrapper)[![Code Climate](https://camo.githubusercontent.com/fa05d1b2f72d11f211714c417906af519ecd57a79fc5e4a15034bf02f0fd4fa3/68747470733a2f2f636f6465636c696d6174652e636f6d2f7265706f732f3535666337323430653330626130323032393030613931382f6261646765732f62343165363735366466663964396330653031622f6770612e737667)](https://codeclimate.com/repos/55fc7240e30ba0202900a918/feed)[![codecov.io](https://camo.githubusercontent.com/7ae5ad50aa1d824c33f95c326ae71c2bc10177dcb2ad9e2954399cf79e653db5/687474703a2f2f636f6465636f762e696f2f6769746875622f7a7261736877616e692f6e6577732d73637261707065722f636f7665726167652e7376673f6272616e63683d6d6173746572)](http://codecov.io/github/zrashwani/news-scrapper?branch=master)[![SensioLabsInsight](https://camo.githubusercontent.com/8697c5ff4f76668798dd8f105e49532b7285598be1843b4e360dd512e5c09158/68747470733a2f2f696e73696768742e73656e73696f6c6162732e636f6d2f70726f6a656374732f38396463643165642d623965342d346535362d386462372d6165663638376538643839612f6d696e692e706e67)](https://insight.sensiolabs.com/projects/89dcd1ed-b9e4-4e56-8db7-aef687e8d89a)[![Scrutinizer Code Quality](https://camo.githubusercontent.com/21a1f5f7b0c933604a3d6767ced708dc1e76b3a81019a5302c2a3e0fc123c7ba/68747470733a2f2f7363727574696e697a65722d63692e636f6d2f672f7a7261736877616e692f6e6577732d73637261707065722f6261646765732f7175616c6974792d73636f72652e706e673f623d6d6173746572)](https://scrutinizer-ci.com/g/zrashwani/news-scrapper/?branch=master)

How to Install
--------------


You can install this library with [Composer](http://getcomposer.org/). Add this to your `composer.json` manifest file:

```
{
    "require": {
        "zrashwani/news-scrapper": "1.*"
    }
}

```

Then run `composer install`.

How to Use
----------


Here's a quick example of how to scrape news data from a webpage:

```
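    require 'vendor/autoload.php';

    $url = "http://example.com/your-news-uri";

    // Create the scraper client; with no argument, the adapter is guessed automatically
    $scrap_client = new \Zrashwani\NewsScrapper\Client();
    print_r($scrap_client->getLinkData($url));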

```

By default, the scraper tries to guess the best structured data adapter and applies it.

### Scraping Structured Data


You can select a specific adapter for extracting the data, as follows:

```
    $url = "http://example.com/your-news-uri";
    //use microdata standard for scrapping
    $scrap_client = new \Zrashwani\NewsScrapper\Client('Microdata');
    print_r($scrap_client->getLinkData($url));
```

Here is the list of supported structured data adapters (scraping modes):

- [Microdata](http://schema.org/Article)
- [HAtom](http://microformats.org/wiki/hatom)
- [OpenGraph](http://ogp.me/)
- [JsonLD](http://json-ld.org/)
- [Parsely](https://www.parsely.com/docs/integration/metadata/ppage.html)
- Default
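
To make the idea of a structured data adapter more concrete, here is a minimal, self-contained sketch of the kind of markup an OpenGraph-style adapter would look for. It is not the library's own code: the sample HTML and the `extractOpenGraph` helper are assumptions made up for this illustration, and only PHP's built-in DOM extension is used.

```php
<?php
// Illustrative only: NOT the library's adapter implementation.
// The sample page and helper name below are hypothetical.

$html = <<<'HTML'
<html><head>
  <meta property="og:title" content="Sample headline" />
  <meta property="og:image" content="http://example.com/cover.jpg" />
  <meta property="og:description" content="Short summary of the article." />
</head><body></body></html>
HTML;

/**
 * Collect og:* meta tags from an HTML string into an associative array.
 */
function extractOpenGraph($html)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $data = array();
    foreach ($xpath->query('//meta[starts-with(@property, "og:")]') as $meta) {
        // Drop the "og:" prefix so keys read "title", "image", ...
        $key = substr($meta->getAttribute('property'), 3);
        $data[$key] = $meta->getAttribute('content');
    }
    return $data;
}

$og = extractOpenGraph($html);
print_r($og);
```

The other adapters presumably follow the same pattern with different markup (for example `itemprop` attributes for Microdata), but the README does not describe their internals.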

### Scraping Unstructured Data


If the webpage doesn't follow any structured data standard, you can still scrape news information by specifying an XPath or CSS selector for each article part, such as title, description, image and body, as follows:

```
$scrapClient = new \Zrashwani\NewsScrapper\Client('Custom');

/** @var \Zrashwani\NewsScrapper\Adapters\CustomAdapter $adapter */
$adapter = $scrapClient->getAdapter();
$adapter
        ->setTitleSelector('.single-post h1') // selectors can be either CSS or XPath
        ->setImageSelector('.sidebar img')
        ->setAuthorSelector('//a[@rel="author"]')
        ->setPublishDateSelector('//span[@class="published_data"]')
        ->setBodySelector('//div[@class="contents"]');

$newsData = $scrapClient->getLinkData("http://example.com/your-news-uri");
print_r($newsData);
```

The custom scraping adapter, `CustomAdapter`, supports method chaining for setting the selectors. If a selector is not specified, the adapter falls back to the default selectors from `DefaultAdapter` (an HTML adapter that depends on standard meta tags).
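
Before wiring selectors into `CustomAdapter`, it can help to check what they match on a saved copy of the page. The sketch below does that with PHP's built-in `DOMXPath`, independently of this library. The sample HTML and the `firstText` helper are hypothetical, and the title expression is a hand-written XPath equivalent of the CSS selector `.single-post h1`, not something the library produces.

```php
<?php
// Standalone selector check using only PHP's DOM extension.
// The markup and helper below are made up for this example.

$html = <<<'HTML'
<html><body>
  <div class="single-post">
    <h1>Sample headline</h1>
    <a rel="author" href="/authors/jane">Jane Doe</a>
    <span class="published_data">2015-11-21</span>
    <div class="contents"><p>Body text.</p></div>
  </div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

/**
 * Return the trimmed text of the first node matching an XPath
 * expression, or null when nothing matches or the expression is invalid.
 */
function firstText(DOMXPath $xpath, $expression)
{
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Hand-written XPath equivalent of the CSS selector ".single-post h1".
$titleExpr = '//*[contains(concat(" ", normalize-space(@class), " "), " single-post ")]//h1';

$title     = firstText($xpath, $titleExpr);
$author    = firstText($xpath, '//a[@rel="author"]');
$published = firstText($xpath, '//span[@class="published_data"]');
$body      = firstText($xpath, '//div[@class="contents"]');

print_r(compact('title', 'author', 'published', 'body'));
```

If an expression returns `null` here, the same selector is unlikely to match when passed to the adapter's setter methods.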

### Scraping a Group of Links


To scrape a group of news articles from a page containing news links, use the `scrapLinkGroup` method:

```
$listingPageUrl = 'https://www.readability.com/topreads/'; // URL of the page containing the news listing
$linksSelector = '.entry-title a'; // CSS or XPath selector for news links inside the listing page
$numberOfArticles = 3; // number of links to scrape; use null to get all links matching the selector

$scrapClient = new \Zrashwani\NewsScrapper\Client();
$newsGroupData = $scrapClient->scrapLinkGroup($listingPageUrl, $linksSelector, $numberOfArticles);
foreach($newsGroupData as $singleNews){
    print_r($singleNews);
}
```
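
As a rough picture of the first step such a call has to perform, the following self-contained sketch collects link targets from a listing page and applies an optional limit, where `null` keeps every match as described in the comment above. It is an illustration under stated assumptions, not the library's implementation: the `collectLinks` helper and the sample listing are hypothetical, and details such as resolving relative URLs are left out.

```php
<?php
// Illustrative only: how links on a listing page could be gathered
// and limited. Helper name and sample markup are hypothetical.

$listingHtml = <<<'HTML'
<html><body>
  <h2 class="entry-title"><a href="http://example.com/news/1">First</a></h2>
  <h2 class="entry-title"><a href="http://example.com/news/2">Second</a></h2>
  <h2 class="entry-title"><a href="http://example.com/news/3">Third</a></h2>
  <h2 class="entry-title"><a href="http://example.com/news/4">Fourth</a></h2>
</body></html>
HTML;

/**
 * Return the href of every anchor matching an XPath expression,
 * truncated to $limit entries when $limit is not null.
 */
function collectLinks($html, $expression, $limit = null)
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = array();
    foreach ($xpath->query($expression) as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }

    // A null limit keeps every matching link.
    return $limit === null ? $links : array_slice($links, 0, $limit);
}

// XPath counterpart of the CSS selector ".entry-title a" for this sample.
$expression = '//*[@class="entry-title"]//a';

$firstThree = collectLinks($listingHtml, $expression, 3);
$allLinks   = collectLinks($listingHtml, $expression);

print_r($firstThree);
```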

How to Contribute
-----------------


1. Fork this repository
2. Create a new branch for each feature or improvement
3. Send a pull request from each feature branch

It is very important to separate new features or improvements into separate feature branches, and to send a pull request for each branch. This allows me to review and pull in new features or improvements individually.

All pull requests must adhere to the [PSR-2 standard](https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md).

System Requirements
-------------------


- PHP 5.4.0+

License
-------


MIT Public License
