2dareis2do/newspaper3k-php-wrapper
==================================

PHP wrapper for Newspaper3k article scraping & curation

Version 2.1.2 · GPL-3.0-or-later · PHP >=7.0

[Source](https://github.com/2dareis2do/newspaper3k-php-wrapper) · [Packagist](https://packagist.org/packages/2dareis2do/newspaper3k-php-wrapper)

Newspaper3k PHP Wrapper
=======================

[![Software License](https://camo.githubusercontent.com/e1514dd3f2095dbf68a0008ae62a631142953ad2e86aa94c504343f2c2c191da/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d47504c2d627269676874677265656e2e7376673f7374796c653d666c61742d737175617265)](LICENSE)[![Packagist Version](https://camo.githubusercontent.com/71134fddf8aa1326a5cc4070049ca276331f2ac079067d68f0c04a8f849cc9fe/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f3264617265697332646f2f6e6577737061706572336b2d7068702d777261707065722e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/2dareis2do/newspaper3k-php-wrapper)

Simple PHP wrapper for Newspaper3k/Newspaper4k article scraping and curation.

Now updated to add support for changing the current working directory, enabling you to customise your curation script per job.

Update
------

Version 2.1.0 introduces an additional `$command` parameter that lets the client specify which Python command to run. This is useful where multiple versions of Python (each with its own dependencies) are available on a single server. If no value is passed, the wrapper falls back to the default `python` command. Both absolute and relative paths are supported.

Customising ArticleScraping.py
------------------------------

The wrapper is designed to work with a modified version of the `ArticleScraping.py` script. For example, here is a custom version of `ArticleScraping.py` that uses Playwright to fetch the page. The wrapper can be pointed at it by passing the `cwd` parameter:

```
#!/usr/bin/python
# -*- coding: utf8 -*-

import json, sys, os
import nltk
import newspaper
from newspaper import Article
from datetime import datetime
import lxml, lxml.html
from playwright.sync_api import sync_playwright

sys.stdout = open(os.devnull, "w") #To prevent a function from printing in the batch console in Python

url = sys.argv[1]  # URL to scrape, passed as the first command-line argument

def accept_cookies_and_fetch_article(url):
    # Use Playwright to dismiss the cookie consent banner and fetch the page
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)  # Set headless=False to watch the browser actions

        # create a new incognito browser context
        context = browser.new_context()
        # create a new page inside context.
        page = context.new_page()

        page.goto(url)

        # Automating iframe button click
        page.frame_locator("iframe[title=\"SP Consent Message\"]").get_by_label("Essential cookies only").click()

        content = page.content()
        # dispose context once it is no longer needed.
        context.close()
        browser.close()

    # Using Newspaper4k to parse the page content
    article = newspaper.article(url, input_html=content, language='en')
    article.parse() # Parse the article
    article.nlp() # Keyword extraction wrapper

    return article

article = accept_cookies_and_fetch_article(url)

# article.download() #Downloads the link’s HTML content
# 1 time download of the sentence tokenizer
# perhaps better to run from command line as we don't need to install each time?
#nltk.download('all')
#nltk.download('punkt')

sys.stdout = sys.__stdout__

data = article.__dict__
del data['config']
del data['extractor']

for i in data:
    if type(data[i]) is set:
        data[i] = list(data[i])
    if type(data[i]) is datetime:
        data[i] = data[i].strftime("%Y-%m-%d %H:%M:%S")
    if type(data[i]) is lxml.html.HtmlElement:
        data[i] = lxml.html.tostring(data[i])
    if type(data[i]) is bytes:
        # decode rather than str(), which would produce a "b'...'" literal
        data[i] = data[i].decode("utf-8", errors="replace")

print(json.dumps(data))

```
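
The conversion loop at the end of the script can also be written as a `default` hook for `json.dumps`. The following is a minimal sketch, not part of the wrapper: the helper name `to_jsonable` and the sample `data` dictionary are illustrative assumptions, and the lxml branch is left out so that the snippet runs with the standard library alone.

```python
import json
from datetime import datetime


def to_jsonable(value):
    """Fallback used by json.dumps for types it cannot serialise itself."""
    if isinstance(value, (set, frozenset)):
        return sorted(value, key=str)  # sets are unordered; sort for stable output
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%d %H:%M:%S")
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    raise TypeError(f"Cannot serialise {type(value).__name__}")


# Hypothetical stand-in for the cleaned-up article.__dict__
data = {
    "title": "Example",
    "keywords": {"news", "php"},
    "publish_date": datetime(2024, 2, 27, 9, 30, 0),
    "raw": b"<p>hi</p>",
}

encoded = json.dumps(data, default=to_jsonable)
print(encoded)
```

Unlike the in-place loop, the hook is also applied to values nested inside lists and dictionaries.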

Using Newspaper3kWrapper
------------------------

In this shortened example we simply pass the current working directory to the Newspaper3kWrapper.

```
  use Twodareis2do\Scrape\Newspaper3kWrapper;

      try {

        // initiate the parser
        $this->parser = new Newspaper3kWrapper();

        // If no $cwd then use default 'ArticleScraping.py'
        if (isset($cwd)) {
          $output = $this->parser->scrape($value, $debug, $cwd);
        }
        else {
          $output = $this->parser->scrape($value, $debug);
        }
        // return any scraped output
        return $output;

      }
      catch (\Exception $e) {

        // Logs a notice to channel if we get http error response.
        $this->logger->notice('Newspaper Playwright Failed to get (1) URL @url "@error". @code', [
          '@url' => $value,
          '@error' => $e->getMessage(),
          '@code' => $e->getCode(),
        ]);

        // return empty string
        return '';
      }

```
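
When developing a custom script, it can help to run it the way a caller would, outside PHP. The sketch below is a Python stand-in for that and not the wrapper's actual implementation: it assumes the script is invoked as `<command> ArticleScraping.py <url>` inside the chosen working directory and that it prints a single JSON document. `run_scraper` and its parameter names are hypothetical.

```python
import json
import subprocess


def run_scraper(url, command="python", cwd=None, script="ArticleScraping.py"):
    """Run the scraping script and return its decoded JSON output.

    Assumption: the script takes the URL as its first argument and prints a
    single JSON document to stdout, as the example scripts in this README do.
    """
    result = subprocess.run(
        [command, script, url],
        cwd=cwd,              # directory that contains the script
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(result.stdout)


# Example call (paths are placeholders):
# article = run_scraper(
#     "https://example.com/story",
#     command="/path/to/.venv/bin/python",
#     cwd="/path/to/custom/script",
# )
```

If the script prints anything other than JSON, `json.loads` raises an error, which is a quick way to spot stray output from a library.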

Alternative Article Scraping Script
-----------------------------------

The path to `ArticleScraping.py` can be changed by passing the cwd. Here is an example that uses the Cloudscraper library.

```
#!/usr/bin/python
# -*- coding: utf8 -*-

import json, sys, os
import nltk
from newspaper import Article
from newspaper import Config
from newspaper.article import ArticleException, ArticleDownloadState
from datetime import datetime
import lxml, lxml.html
import cloudscraper

browser={
    'browser': 'chrome',
    'platform': 'android',
    'desktop': False
}

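# pass the dict via the browser keyword; the first positional argument is an existing session
scraper = cloudscraper.create_scraper(browser=browser)  # returns a CloudScraper instance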

sys.stdout = open(os.devnull, "w") #To prevent a function from printing in the batch console in Python

url = sys.argv[1]  # URL to scrape, passed as the first command-line argument

scraped = scraper.get(url).text

article = Article(url)
# Hand the pre-fetched HTML to newspaper; this also updates download_state
article.download(input_html=scraped)

ds = article.download_state

if ds == ArticleDownloadState.SUCCESS:
    article.parse() #Parse the article
    # 1 time download of the sentence tokenizer
    # perhaps better to run from command line as we don't need to install each time?
    #nltk.download('all')
    #nltk.download('punkt')
    article.nlp()#  Keyword extraction wrapper

    sys.stdout = sys.__stdout__

    data = article.__dict__
    del data['config']
    del data['extractor']

    for i in data:
        if type(data[i]) is set:
            data[i] = list(data[i])
        if type(data[i]) is datetime:
            data[i] = data[i].strftime("%Y-%m-%d %H:%M:%S")
        if type(data[i]) is lxml.html.HtmlElement:
            data[i] = lxml.html.tostring(data[i])
        if type(data[i]) is bytes:
            # decode rather than str(), which would produce a "b'...'" literal
            data[i] = data[i].decode("utf-8", errors="replace")

    print(json.dumps(data))

elif ds == ArticleDownloadState.FAILED_RESPONSE:
    pass

```
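
Both example scripts silence library output by reassigning `sys.stdout` and restoring it by hand. A possible alternative, shown here as a sketch only, is `contextlib.redirect_stdout`, which restores stdout automatically. `noisy_parse` is a hypothetical stand-in for the newspaper calls, and `scrape_quietly` is an illustrative name.

```python
import contextlib
import io
import json


def noisy_parse(url):
    """Hypothetical stand-in for library calls that print progress messages."""
    print("downloading", url)
    return {"url": url, "title": "Example"}


def scrape_quietly(url):
    # Anything printed inside the block goes to a throwaway buffer;
    # stdout is restored automatically, even if an exception is raised.
    with contextlib.redirect_stdout(io.StringIO()):
        return noisy_parse(url)


data = scrape_quietly("https://example.com/story")
print(json.dumps(data))  # the only line that reaches the real stdout
```

Note that `redirect_stdout` only affects writes made through Python's `sys.stdout`; output from child processes or native code is not captured.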

Features
--------

- Multi-threaded article download framework
- News url identification
- Text extraction from html
- Top image extraction from html
- All image extraction from html
- Keyword extraction from text
- Summary extraction from text
- Author extraction from text
- Google trending terms extraction
- Works in 10+ languages (English, Chinese, German, Arabic, ...)

```
    >>> import newspaper
    >>> newspaper.languages()

    Your available languages are:
    input code      full name

      ar              Arabic
      be              Belarusian
      bg              Bulgarian
      da              Danish
      de              German
      el              Greek
      en              English
      es              Spanish
      et              Estonian
      fa              Persian
      fi              Finnish
      fr              French
      he              Hebrew
      hi              Hindi
      hr              Croatian
      hu              Hungarian
      id              Indonesian
      it              Italian
      ja              Japanese
      ko              Korean
      lt              Lithuanian
      mk              Macedonian
      nb              Norwegian (Bokmål)
      nl              Dutch
      no              Norwegian
      pl              Polish
      pt              Portuguese
      ro              Romanian
      ru              Russian
      sl              Slovenian
      sr              Serbian
      sv              Swedish
      sw              Swahili
      th              Thai
      tr              Turkish
      uk              Ukrainian
      vi              Vietnamese
      zh              Chinese

```

Install dependencies
--------------------

Run ✅ `pip3 install newspaper3k` ✅

NOT ⛔ `pip3 install newspaper` ⛔

On Python 3 you must install `newspaper3k`, **not** `newspaper`; `newspaper` is the Python 2 library. Although installing newspaper is simple with `pip`, you will run into fixable issues if you are trying to install on Ubuntu.

**If you are on Debian / Ubuntu**, install using the following:

- Install the `pip3` command needed to install the `newspaper3k` package:

    $ sudo apt-get install python3-pip
- Install the Python development headers, needed for `Python.h`:

    $ sudo apt-get install python3-dev
- Install the lxml requirements:

    $ sudo apt-get install libxml2-dev libxslt-dev
- Install the image libraries PIL needs to recognize .jpg images:

    $ sudo apt-get install libjpeg-dev zlib1g-dev libpng12-dev

NOTE: If you have problems installing `libpng12-dev`, try installing `libpng-dev` instead.

- Download the NLP-related corpora:

    $ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3
- Install the distribution via pip:

    $ pip3 install newspaper3k

**If you are on OSX**, install using the following (you may use either Homebrew or MacPorts):

```
$ brew install libxml2 libxslt

$ brew install libtiff libjpeg webp little-cms2

$ pip3 install newspaper3k

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

```

**Otherwise**, install with the following:

NOTE: You will most likely still need to install the following libraries via your package manager:

- PIL: `libjpeg-dev` `zlib1g-dev` `libpng12-dev`
- lxml: `libxml2-dev` `libxslt-dev`
- Python development headers: `python3-dev`

```
$ pip3 install newspaper3k

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python3

```

### Install on venv

> The venv module supports creating lightweight “virtual environments”, each with their own independent set of Python packages installed in their site directories. A virtual environment is created on top of an existing Python installation, known as the virtual environment’s “base” Python, and may optionally be isolated from the packages in the base environment, so only those explicitly installed in the virtual environment are available. [Python](https://docs.python.org/3/library/venv.html)

also,

> A common directory location for a virtual environment is .venv. This name keeps the directory typically hidden in your shell and thus out of the way while giving it a name that explains why the directory exists. It also prevents clashing with .env environment variable definition files that some tooling supports.

With this in mind, the recommended way to install your dependencies is:

1. If installing for the first time, you may need to make sure pip is available. On Ubuntu 22.x, first update apt with `apt update`, then run `apt install python3-pip`.
2. If installing for the first time, you may also need to make sure venv is available. On Ubuntu 22.x it can be installed with `apt install python3-venv`.
3. Decide where you want to set up your venv. This can be somewhere on your virtual host. You can use the following syntax: `python -m venv /path/to/new/virtual/.venv`
4. Activate your .venv in your current session, e.g. `source /path/to/new/virtual/.venv/bin/activate`
5. The first time you set up your script you will likely need to download and install any necessary dependencies. You can use pip for this from within your virtual environment session. Once you have installed your dependencies, you can export a list to use for subsequent installs, e.g. `python -m pip freeze > /path/to/requirements.txt`
6. Exit your virtual environment, e.g. `deactivate`
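
The environment from step 3 can also be created from Python with the standard-library `venv` module. This is a minimal sketch under stated assumptions: `venv_python` and `create_env` are illustrative helper names, and the `bin/python` versus `Scripts\python.exe` layout reflects how `venv` lays out environments on POSIX and Windows.

```python
import os
import venv
from pathlib import Path


def venv_python(env_dir):
    """Return the path of the interpreter inside a virtual environment."""
    env_dir = Path(env_dir)
    if os.name == "nt":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def create_env(env_dir, with_pip=True):
    """Create the environment (like `python -m venv env_dir`) and return its interpreter path."""
    venv.create(env_dir, with_pip=with_pip, symlinks=(os.name != "nt"))
    return venv_python(env_dir)


# Example call (path is a placeholder):
# python_path = create_env("/path/to/new/virtual/.venv")
```

The returned path is the interpreter to use when installing packages into the environment without activating it, for example by running it with `-m pip install -r /path/to/requirements.txt`.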

### Subsequent installs

The next time you have to set up your dependencies, you can let pip install them automatically from the exported list, e.g. `python -m pip install -r /path/to/requirements.txt`

Running on the server
---------------------

Chances are your web server does not run inside a virtual environment session. However, we can still specify the path to the python binary in our newly created virtual environment folder, and that interpreter will automatically load the dependencies installed there (unlike the global server version). For example, we can pass the absolute path to the version of python we want to use as the `$command` parameter: `/path/to/python/.venv/bin/python`

This can also be passed as a relative path, which can be more robust across different environments, e.g. `../relative/path/to/python/.venv/bin/python`

If we do not pass a path for `$command`, it will default to the globally installed version of `python`.
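
To check which interpreter a given `$command` value actually resolves to, a small diagnostic script can be run with that command. This is a sketch for troubleshooting only and not part of the wrapper; the function name `interpreter_info` is an assumption.

```python
import json
import sys


def interpreter_info():
    """Describe the interpreter that is executing this script."""
    return {
        "executable": sys.executable,             # path of the running python binary
        "version": sys.version.split()[0],
        "in_venv": sys.prefix != sys.base_prefix,  # the two differ inside a virtual environment
    }


if __name__ == "__main__":
    print(json.dumps(interpreter_info(), indent=2))
```

When run with the virtual environment's `python`, `in_venv` should be `true` and `executable` should point inside the `.venv` folder.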

Installation
------------

```
composer require 2dareis2do/newspaper3k-php-wrapper

```

1 time download of the sentence tokenizer
-----------------------------------------

After installing the NLTK package, install the datasets/models that specific functions need in order to work.

In particular you will need the [Punkt Sentence Tokenizer](https://www.nltk.org/api/nltk.tokenize.punkt.html).

For example, start the Python interpreter:

```
$ python

```

Then, at the interpreter prompt, run:

```
>>> import nltk
>>> nltk.download('all')

```

or

```
>>> nltk.download('punkt')

```

Note that `'all'` is a few gigabytes in size, so bear this in mind: installing it can quickly eat up disk space on the root partition.
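
To avoid repeating the download, a script can first check whether the resource is already installed. The sketch below uses `nltk.data.find` and `nltk.download`; the helper name `ensure_nltk_resource` and its default arguments are assumptions chosen to match the `punkt` example above.

```python
def ensure_nltk_resource(name="punkt", path="tokenizers/punkt"):
    """Download an NLTK resource only if it is not installed yet.

    Returns True when a download was triggered, False otherwise.
    """
    import nltk  # imported here so this module loads even where NLTK is absent

    try:
        nltk.data.find(path)  # raises LookupError when the resource is missing
        return False
    except LookupError:
        nltk.download(name)
        return True
```

Calling `ensure_nltk_resource()` once during deployment keeps the scraping script itself free of download calls.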

Usage
-----

```
use Twodareis2do\Scrape\Newspaper3kWrapper;

$parser = new Newspaper3kWrapper();

$parser->scrape('your url');
```

Read more
---------

[Newspaper](https://github.com/codelucas/newspaper)

[nltk](http://www.nltk.org/install.html)

[Scrape & Summarize News Articles Using Python](https://medium.com/@randerson112358/scrape-summarize-news-articles-using-python-51a48af1b4e2)
