manofstrong/sitescrapper
========================

A Package to Scrape Websites from their Sitemaps and Extract Relevant Content from the Webpage and Upload to a Database

v0.0.1, MIT license, PHP >= 7.0

[Source](https://github.com/manofstrong/sitescrapper) | [Packagist](https://packagist.org/packages/manofstrong/sitescrapper)

Site Scrapper
=============

A PHP library that scrapes websites based on their sitemaps and extracts the relevant content from each webpage. The content is then uploaded to a database for later use.

The [Sitemaps.org](http://www.sitemaps.org/) protocol is the leading standard, supported by Google, Bing, Yahoo, Ask and many others, and has become an accepted way of listing the pages of a website and their relevance. Most modern websites therefore publish sitemaps, which lets a web scraper skip link discovery and go straight to the source. Note that the library parses sitemaps recursively, so only one sitemap per website is needed.
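To make the recursive parsing concrete, here is a minimal sketch using PHP's SimpleXML extension. It is not the library's own implementation: `collectSitemapUrls`, the `$fetch` callable and the `$maxDepth` guard are hypothetical names introduced for illustration, and the fixture URLs are made up.

```php
<?php
declare(strict_types=1);

/**
 * Collect page URLs from a sitemap, following nested sitemap indexes.
 * $fetch is a hypothetical callable that returns the XML body for a URL.
 */
function collectSitemapUrls(string $sitemapUrl, callable $fetch, int $maxDepth = 5): array
{
    if ($maxDepth < 0) {
        return []; // stop runaway or circular sitemap indexes
    }

    $body = $fetch($sitemapUrl);
    if (!is_string($body) || $body === '') {
        return [];
    }

    $previous = libxml_use_internal_errors(true); // silence parse warnings
    $xml = simplexml_load_string($body);
    libxml_use_internal_errors($previous);
    if ($xml === false) {
        return []; // unreadable or non-XML sitemap
    }

    $urls = [];
    if ($xml->getName() === 'sitemapindex') {
        // A sitemap index lists other sitemaps: recurse into each one.
        foreach ($xml->sitemap as $entry) {
            $child = trim((string) $entry->loc);
            $urls = array_merge($urls, collectSitemapUrls($child, $fetch, $maxDepth - 1));
        }
    } else {
        // A plain <urlset> lists the pages themselves.
        foreach ($xml->url as $entry) {
            $urls[] = trim((string) $entry->loc);
        }
    }

    return $urls;
}

// In-memory fixtures stand in for real HTTP responses.
$ns = 'http://www.sitemaps.org/schemas/sitemap/0.9';
$fixtures = [
    'https://example.com/sitemap.xml' =>
        "<sitemapindex xmlns=\"$ns\"><sitemap><loc>https://example.com/posts.xml</loc></sitemap></sitemapindex>",
    'https://example.com/posts.xml' =>
        "<urlset xmlns=\"$ns\"><url><loc>https://example.com/a</loc></url><url><loc>https://example.com/b</loc></url></urlset>",
];
$fetch = function (string $url) use ($fixtures): string {
    return $fixtures[$url] ?? '';
};

print_r(collectSitemapUrls('https://example.com/sitemap.xml', $fetch));
```

Passing the fetcher in as a callable keeps the traversal logic testable without network access; a real run would swap in an HTTP client.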

This library also eliminates the need to explore site-specific HTML tags in search of the relevant content. It strips each page down to its most important parts by removing boilerplate and extracting the rest as full text. The library then obtains the keywords and the word count of the extracted text, both of which are useful for later data analysis. Finally, the library uploads the content into a MySQL database whose schema is provided in the database folder of this project.
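As a rough illustration of the word count and keyword step, the sketch below ranks words by frequency after dropping a few stop words. This is an assumption about how such a step could work, not the library's actual keyword algorithm; `summarizeText` and its stop-word list are hypothetical, and the tokenizer only handles ASCII English text that has already been extracted from the page.

```php
<?php
declare(strict_types=1);

/**
 * Naive text summary: total word count plus the most frequent non-stop words.
 * Hypothetical helper; the library's own keyword extraction may differ.
 */
function summarizeText(string $text, int $topN = 5): array
{
    $stopWords = ['the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it', 'for', 'on', 'with'];

    // Split on anything that is not a lowercase letter, digit or apostrophe.
    $words = preg_split("/[^a-z0-9']+/", strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        $words = [];
    }

    // Keep words longer than two characters that are not stop words.
    $candidates = array_filter($words, function (string $word) use ($stopWords): bool {
        return strlen($word) > 2 && !in_array($word, $stopWords, true);
    });

    $counts = array_count_values($candidates);
    arsort($counts); // most frequent first

    return [
        'word_count' => count($words),
        // array_keys() turns numeric words into ints, so cast back to strings.
        'keywords' => array_map('strval', array_slice(array_keys($counts), 0, $topN)),
    ];
}

$summary = summarizeText('Sitemaps help crawlers. Sitemaps list pages, and crawlers read sitemaps.', 2);
print_r($summary);
```

Frequency ranking is the simplest possible choice here; the order of words with equal counts is not guaranteed on PHP 7.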

Basically, this is a blind bulk scraping tool: provide it with a list of sitemaps and run it. It will run for as long as it takes to scrape the pages of the websites provided and upload them to the database for you.

**This library is designed to be run from the command line rather than a web browser. Please consider it a CLI tool and use it as such.**

Features
--------

- Sitemap parsing (either a single site, or a list of sites)
- Scraping (relevant content extraction)
- Keyword extraction
- Word count of extracted data
- Custom User-Agent string
- Database uploading of extracted content
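
For readers unfamiliar with how a custom User-Agent is usually sent from plain PHP, the sketch below builds an HTTP stream context with the standard `user_agent` option. It does not show this library's configuration API, which is not documented in this section; `buildHttpContext` and the `MyScraper/1.0` string are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Build an HTTP stream context that sends a custom User-Agent header.
 * Hypothetical helper, independent of the library's own settings.
 */
function buildHttpContext(string $userAgent, float $timeoutSeconds = 10.0)
{
    return stream_context_create([
        'http' => [
            'user_agent' => $userAgent,      // sent as the User-Agent header
            'timeout' => $timeoutSeconds,    // read timeout in seconds
            'follow_location' => 1,          // follow HTTP redirects
        ],
    ]);
}

$context = buildHttpContext('MyScraper/1.0');

// Usage (performs a network request, so it is left commented out):
// $html = file_get_contents('https://example.com/', false, $context);
```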

Sitemap Formats supported
-------------------------

- XML `.xml`
- Compressed XML `.xml.gz`
- Robots.txt rule sheet `robots.txt`
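
The two less obvious formats above can be handled along these lines. This is a sketch under assumptions rather than the library's code: `sitemapsFromRobotsTxt` and `decodeSitemapBody` are hypothetical helpers, and `gzdecode` requires PHP's zlib extension.

```php
<?php
declare(strict_types=1);

/** Return the sitemap URLs declared by "Sitemap:" lines in a robots.txt body. */
function sitemapsFromRobotsTxt(string $robotsTxt): array
{
    // The directive name is case-insensitive; the URL runs to the next whitespace.
    preg_match_all('/^\s*sitemap:\s*(\S+)/mi', $robotsTxt, $matches);
    return $matches[1];
}

/** Decompress a sitemap body if it starts with the gzip magic bytes, else return it unchanged. */
function decodeSitemapBody(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        return $decoded === false ? '' : $decoded;
    }
    return $body;
}

$robots = "User-agent: *\nDisallow: /private\nSitemap: https://example.com/sitemap.xml\n";
print_r(sitemapsFromRobotsTxt($robots));
```

Checking the magic bytes instead of the `.xml.gz` extension also covers servers that return gzip data under a plain `.xml` URL.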

Webpage Formats supported
-------------------------

- HTML `text/html`

Sitemap File Formats supported
------------------------------

- Text `text/txt`

Installation
------------

The library can be installed via [Composer](https://getcomposer.org). Add this to your `composer.json` file:

```
{
    "require": {
        "manofstrong/sitescrapper": "^0.0.1"
    }
}
```

Then run `composer update`.

Getting Started
---------------

### Basic display example

Returns the content of a specified number of pages from a single sitemap. Nothing is stored in the database.

```