

juzaweb/crawler
===============

Juzaweb CMS Crawler module

Version 2.0.2 · MIT license · PHP ^8.2

[Source](https://github.com/juzaweb/crawler) · [Packagist](https://packagist.org/packages/juzaweb/crawler) · [Docs](https://juzaweb.com/cms)

Juzaweb Crawler Module
======================


A powerful and extensible crawler module for Juzaweb CMS that automates content aggregation from external websites.

Features
--------


- **Configurable Sources**: Define crawl targets with CSS selectors and regular expressions, directly from the Admin interface.
- **Concurrent Crawling**: Uses Guzzle Pool for high-performance, concurrent requests.
- **Flexible Data Extraction**: Extract data as Text, HTML, or Arrays using precise CSS selectors.
- **Content Cleaning**: Automatically remove unwanted elements (ads, scripts, etc.) from crawled content.
- **Extensible Data Types**: Map crawled data to any model (Posts, Products, etc.) by implementing custom Data Types.
- **Admin Integration**: Full management of Sources, Pages, and Logs via the Juzaweb Admin Panel.

Installation
------------


Install the package via Composer:

```
composer require juzaweb/crawler
```

Run the migrations to create the necessary database tables:

```
php artisan migrate
```

(Optional) Publish the configuration file and assets:

```
php artisan vendor:publish --tag=crawler-config
php artisan vendor:publish --tag=crawler-assets
```

Usage Workflow
--------------


The crawling process follows a three-step workflow: **Discover → Crawl → Process**.

### 1. Create a Crawler Source


Navigate to **Crawler → Sources** in the Admin Panel and create a new Source.

- **Name**: A descriptive name for the source.
- **Data Type**: The type of content to create (e.g., "Post").
- **Link Element**: The CSS selector to find links to individual pages (e.g., `.post-list .post-title a`).
- **Link Regex**: (Optional) A regex pattern to filter the extracted links.
- **Components**: Define what data to extract from the detail page.
    - **Element**: CSS selector for the data (e.g., `h1.title`).
    - **Attribute**: (Optional) Attribute to extract (e.g., `src` for images). Leave empty for text/html.
    - **Format**: `Text`, `HTML`, or `Array`.
- **Removes**: CSS selectors for elements to remove from the extracted content (e.g., `.ad-banner`).
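To make the optional **Link Regex** field concrete, here is a standalone sketch of the filtering idea in plain PHP. The actual selector-based link extraction happens inside the module; the URLs and pattern below are purely illustrative, not the module's internals.

```php
<?php
// Hypothetical links discovered via the "Link Element" selector.
$links = [
    'https://example.com/posts/hello-world',
    'https://example.com/posts/another-article',
    'https://example.com/tags/php',          // not an article detail page
    'https://example.com/posts/third-post?ref=home',
];

// A "Link Regex" that keeps only article detail pages.
$linkRegex = '#^https://example\.com/posts/[a-z0-9-]+#';

// Keep only the links matching the pattern, re-indexed from zero.
$matched = array_values(array_filter(
    $links,
    fn (string $url): bool => (bool) preg_match($linkRegex, $url)
));

print_r($matched); // the three /posts/... URLs survive
```

Only links that pass the filter are queued for the content-crawling step.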

### 2. Add Seed Pages


In the Source edit page, add **Seed Pages**. These are the starting points for the crawler (e.g., a blog category page or a search result page).

- **URL**: The URL to start crawling from.
- **Next Page**: (Optional) Pattern for pagination (e.g., `page/:page`).
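The `page/:page` pattern is a placeholder that the crawler substitutes with successive page numbers to walk through pagination. A rough, standalone illustration of the idea (the real substitution logic lives inside the module; the URL is hypothetical):

```php
<?php
// Hypothetical seed URL and "Next Page" pattern.
$seedUrl  = 'https://example.com/blog';
$nextPage = 'page/:page';

// Build the next few pagination URLs by replacing the :page token.
$urls = [];
for ($page = 2; $page <= 4; $page++) {
    $urls[] = $seedUrl . '/' . str_replace(':page', (string) $page, $nextPage);
}

print_r($urls);
// => .../blog/page/2, .../blog/page/3, .../blog/page/4
```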

### 3. Run the Crawler


You can run the crawler manually with the Artisan commands below, or schedule them to run automatically (see the Scheduling section).

#### Step 1: Discover Links


Visits the Seed Pages and extracts links matching the `Link Element` selector.

```
php artisan crawl:pages
```

#### Step 2: Crawl Content


Visits the discovered links and extracts data based on the Source's `Components`.

```
php artisan crawl:links
```

#### Step 3: Process to Post


Converts the crawled data (Logs) into actual CMS content (e.g., Posts) using the `Data Type` handler.

```
php artisan crawl:content-to-post
```

Scheduling
----------


To automate the crawling process, register the commands in your `app/Console/Kernel.php`, and make sure your server's cron invokes the Laravel scheduler (`php artisan schedule:run`) every minute.

```
// app/Console/Kernel.php

protected function schedule(Schedule $schedule)
{
    // Discover new links every hour
    $schedule->command('crawl:pages')->hourly();

    // Crawl content every 15 minutes
    $schedule->command('crawl:links')->everyFifteenMinutes();

    // Process posts every 30 minutes
    $schedule->command('crawl:content-to-post')->everyThirtyMinutes();
}
```
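If your application runs Laravel 11 or later, the `Kernel.php` scheduling hook no longer exists in a fresh install; schedules are declared in `routes/console.php` instead. The same commands can be registered like this (assuming the standard Laravel 11+ project layout):

```php
// routes/console.php (Laravel 11+)

use Illuminate\Support\Facades\Schedule;

// Discover new links every hour
Schedule::command('crawl:pages')->hourly();

// Crawl content every 15 minutes
Schedule::command('crawl:links')->everyFifteenMinutes();

// Process posts every 30 minutes
Schedule::command('crawl:content-to-post')->everyThirtyMinutes();
```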

Extending (Custom Data Types)
-----------------------------


You can define custom Data Types to save crawled data to different models (e.g., Products, Videos).

### 1. Implement `CrawlerDataType`


Create a class that implements `Juzaweb\Modules\Crawler\Contracts\CrawlerDataType`.

```
namespace App\Crawler;

use Juzaweb\Modules\Crawler\Contracts\CrawlerDataType;
use Juzaweb\Modules\Crawler\Models\CrawlerLog;
use Illuminate\Database\Eloquent\Model;

class ProductDataType implements CrawlerDataType
{
    public function save(CrawlerLog $crawlerLog): Model
    {
        $data = $crawlerLog->content_json;

        // Reuse the model already linked to this log, or create a new Product
        $product = $crawlerLog->post ?: new \App\Models\Product();
        $product->name = $data['title'];
        $product->price = $data['price'];
        $product->description = $data['content'];
        $product->save();

        return $product;
    }

    public function components(): array
    {
        return [
            'title' => [
                'type' => 'text',
                'label' => 'Title',
            ],
            'price' => [
                'type' => 'text',
                'label' => 'Price',
            ],
            'content' => [
                'type' => 'html',
                'label' => 'Description',
            ],
        ];
    }

    public function getLabel(): string
    {
        return 'Products';
    }

    // ... implement other methods
}
```

### 2. Register the Data Type


Register your custom Data Type in a Service Provider (e.g., `AppServiceProvider` or a custom one).

```
use Juzaweb\Modules\Crawler\Facades\Crawler;
use App\Crawler\ProductDataType;

public function boot()
{
    Crawler::registerDataType('product', function () {
        return new ProductDataType();
    });
}
```

Now, "Products" will appear as a Data Type option when creating a Crawler Source.

