PHPackages                             hizpark/crawler - PHPackages - PHPackages  [Skip to content](#main-content)[PHPackages](/)[Directory](/)[Categories](/categories)[Trending](/trending)[Leaderboard](/leaderboard)[Changelog](/changelog)[Analyze](/analyze)[Collections](/collections)[Log in](/login)[Sign up](/register)

1. [Directory](/)
2. /
3. [Utility &amp; Helpers](/categories/utility)
4. /
5. hizpark/crawler

ActiveLibrary[Utility &amp; Helpers](/categories/utility)

hizpark/crawler
===============

Effortless recursive web crawling for comprehensive site indexing

v0.1.0(12mo ago)00MITPHPPHP &gt;=8.2CI passing

Since May 19Pushed 12mo ago1 watchersCompare

[ Source](https://github.com/changhorizon/crawler)[ Packagist](https://packagist.org/packages/hizpark/crawler)[ RSS](/packages/hizpark-crawler/feed)WikiDiscussions main Synced 1mo ago

READMEChangelog (1)Dependencies (4)Versions (2)Used By (0)

Crawler
=======

[](#crawler)

> Effortless recursive web crawling for comprehensive site indexing

[![License](https://camo.githubusercontent.com/68b3e06107ec9471647072392e88dc83d39e0ca672c962628afda19c8a1c18c3/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f68697a7061726b2f637261776c65723f7374796c653d666c61742d737175617265)](https://camo.githubusercontent.com/68b3e06107ec9471647072392e88dc83d39e0ca672c962628afda19c8a1c18c3/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f68697a7061726b2f637261776c65723f7374796c653d666c61742d737175617265)[![Latest Version](https://camo.githubusercontent.com/f34fbd6b65e3a5b8bfaeab760422c0134874bc06267678253d9a03e77f845785/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f68697a7061726b2f637261776c65723f7374796c653d666c61742d737175617265)](https://camo.githubusercontent.com/f34fbd6b65e3a5b8bfaeab760422c0134874bc06267678253d9a03e77f845785/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f68697a7061726b2f637261776c65723f7374796c653d666c61742d737175617265)[![PHP Version](https://camo.githubusercontent.com/10b897c523f00fa3f8f7b54dfe73999190e622480655d5b5f011e31fc32a7111/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7068702d382e322d2d382e342d626c75653f7374796c653d666c61742d737175617265)](https://camo.githubusercontent.com/10b897c523f00fa3f8f7b54dfe73999190e622480655d5b5f011e31fc32a7111/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7068702d382e322d2d382e342d626c75653f7374796c653d666c61742d737175617265)[![Static Analysis](https://camo.githubusercontent.com/1a477f5e7e742a33c1ff5b685167579083a26c342c4e997fe056ea4ef7bea73e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7374617469635f616e616c797369732d5048505374616e2d626c75653f7374796c653d666c61742d737175617265)](https://camo.githubusercontent.com/1a477f5e7e742a33c1ff5b685167579083a26c342c4e997fe056ea4ef7bea73e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7374617469635f616e616c797369732d5048505374616e2d626c75653f7374796c653d666c61742d737175617265)[![Tests](https://camo.githubusercontent.com/72829871c802983bff15745f71c846973b36a6fea0bb5298dadaeeb690463604/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d504850556e69742d627269676874677265656e3f7374796c653d666c61742d737175617265)](https://camo.githubusercontent.com/72829871c802983bff15745f71c846973b36a6fea0bb5298dadaeeb690463604/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f74657374732d504850556e69742d627269676874677265656e3f7374796c653d666c61742d737175617265)[![codecov](https://camo.githubusercontent.com/28fb6604e32e05a365cd1476805d222707d633fb81a9645b2011d4402bff37d5/68747470733a2f2f636f6465636f762e696f2f67682f68697a7061726b2f637261776c65722f6272616e63682f6d61696e2f67726170682f62616467652e737667)](https://codecov.io/gh/hizpark/crawler)[![CI](https://github.com/hizpark/crawler/actions/workflows/ci.yml/badge.svg?style=flat-square)](https://github.com/hizpark/crawler/actions/workflows/ci.yml/badge.svg?style=flat-square)

A high-performance PHP web crawler library designed to recursively scrape all pages of a single website.This library is ideal for building site search, content analysis, SEO data collection, and similar projects. It is designed to be simple, extensible, and serve as a solid foundation for web crawling and data extraction.

✨ 特性
----

[](#-特性)

- **递归爬取**：自动提取页面内链接，广度优先遍历站点所有可访问页面
- **内容解析**：提取页面标题、关键词、描述以及正文文本，过滤无用标签和脚本
- **增量更新**：通过内容哈希值检测页面变更，避免重复写入数据库，提高效率
- **日志记录**：内置 PSR-3 日志接口，方便集成多种日志方案，实时跟踪爬取过程
- **多语言支持**：自动识别页面语言（lang属性），便于多语站点索引和处理
- **持久化存储**：采用 PDO 操作 MySQL，支持页面数据入库，方便后续搜索索引或分析
- **可配置延时**：支持爬取间隔时间设置，降低对目标站点的压力

📦 安装
----

[](#-安装)

```
composer require hizpark/crawler
```

📂 目录结构
------

[](#-目录结构)

```
src
├── Crawler.php
├── Document.php
├── Http
│   ├── CurlHttpClient.php
│   └── HttpClientInterface.php
└── Storage
    ├── DocumentStorageInterface.php
    └── PdoDocumentStorage.php
```

🚀 用法示例
------

[](#-用法示例)

### 示例 1：基础爬取示例

[](#示例-1基础爬取示例)

```
use Hizpark\Crawler\Crawler;
use Hizpark\Crawler\Http\CurlHttpClient;
use Hizpark\Crawler\Storage\PdoDocumentStorage;
use Psr\Log\NullLogger;

$startUrl = 'https://example.com';

$httpClient = new CurlHttpClient();
$storage = new PdoDocumentStorage($pdoConnection); // $pdoConnection 为已初始化的 PDO 实例
$logger = new NullLogger(); // 可替换为任何 PSR-3 实现，如 Monolog

$crawler = new Crawler($startUrl, $httpClient, $storage, $logger);
$crawler->run();
```

### 示例 2：自定义爬取间隔

[](#示例-2自定义爬取间隔)

```
$crawlDelay = 3;

$crawler = new Crawler($startUrl, $httpClient, $storage, $logger, $crawlDelay);
$crawler->run();
```

📐 接口说明
------

[](#-接口说明)

### Crawler::\_\_construct

[](#crawler__construct)

> 构造函数，创建一个爬虫实例。

```
public function __construct(
    string $startUrl,
    HttpClientInterface $httpClient,
    DocumentStorageInterface $storage,
    LoggerInterface $logger,
    int $crawlDelay = 1,
);
```

- **$startUrl**：起始 URL。
- **$httpClient**：实现 `HttpClientInterface` 的 HTTP 客户端
- **$storage**：实现 `DocumentStorageInterface` 的文档存储实现
- **$logger**：实现 `LoggerInterface`（PSR-3 标准）的日志处理器
- **$crawlDelay**：每次抓取之间的间隔，单位为秒

### Crawler::run

[](#crawlerrun)

> 启动爬虫任务。

```
public function run(): void
```

### Crawler::fetchUrlContent

[](#crawlerfetchurlcontent)

> 获取 HTML 页面正文内容，过滤掉响应头，若请求无效返回 null。

```
private function fetchUrlContent(string $url): ?string
```

### Crawler::scrapePage

[](#crawlerscrapepage)

> 抓取并解析页面，存储内容，并递归抓取内部链接。

```
private function scrapePage(string $url): void
```

### Crawler::extractLinks

[](#crawlerextractlinks)

> 从 HTML 中提取合法的绝对链接。

```
private function extractLinks(string $html): array
```

### Crawler::resolveUrl

[](#crawlerresolveurl)

> 将相对链接解析为绝对 URL，仅保留与起始 URL 同源的链接。

```
private function resolveUrl(string $href): ?string
```

### Hizpark\\Crawler\\Storage\\DocumentStorageInterface

[](#hizparkcrawlerstoragedocumentstorageinterface)

> 定义用于文档存储的接口，包含存储与状态检测方法。

```
interface DocumentStorageInterface
{
    /**
     * 保存或更新文档内容。
     *
     * @param string $url
     * @param Document $document
     * @return bool 返回 true 表示内容发生变化已保存
     */
    public function save(string $url, Document $document): bool;

    /**
     * 检查指定 URL 是否已处理过。
     *
     * @param string $url
     * @return bool
     */
    public function isProcessed(string $url): bool;
}
```

### Hizpark\\Crawler\\Http\\HttpClientInterface

[](#hizparkcrawlerhttphttpclientinterface)

> 定义用于发起 HTTP 请求的接口。

```
interface HttpClientInterface
{
    /**
     * 发送 GET 请求。
     *
     * @param string $url
     * @return array|null 包含 statusCode, contentType 和 content，若失败则为 null
     */
    public function get(string $url): ?array;
}
```

**备注：** 所有接口实现均来自包内组件，适合用于扩展和替换。爬虫本体逻辑通过 `DocumentStorageInterface` 和 `HttpClientInterface` 解耦，便于测试与复用。

🔍 静态分析
------

[](#-静态分析)

使用 PHPStan 工具进行静态分析，确保代码的质量和一致性：

```
composer stan
```

🎯 代码风格
------

[](#-代码风格)

使用 PHP-CS-Fixer 工具检查代码风格：

```
composer cs:chk
```

使用 PHP-CS-Fixer 工具自动修复代码风格问题：

```
composer cs:fix
```

✅ 单元测试
------

[](#-单元测试)

执行 PHPUnit 单元测试：

```
composer test
```

执行 PHPUnit 单元测试并生成代码覆盖率报告：

```
composer test:coverage
```

🤝 贡献指南
------

[](#-贡献指南)

欢迎 Issue 与 PR，建议遵循以下流程：

1. Fork 仓库
2. 创建新分支进行开发
3. 提交 PR 前请确保测试通过、风格一致
4. 提交详细描述

📜 License
---------

[](#-license)

MIT License. See the [LICENSE](LICENSE) file for details.

###  Health Score

25

—

LowBetter than 37% of packages

Maintenance50

Moderate activity, may be stable

Popularity0

Limited adoption so far

Community4

Small or concentrated contributor base

Maturity39

Early-stage or recently created project

How is this calculated?**Maintenance (25%)** — Last commit recency, latest release date, and issue-to-star ratio. Uses a 2-year decay window.

**Popularity (30%)** — Total and monthly downloads, GitHub stars, and forks. Logarithmic scaling prevents top-heavy scores.

**Community (15%)** — Contributors, dependents, forks, watchers, and maintainers. Measures real ecosystem engagement.

**Maturity (30%)** — Project age, version count, PHP version support, and release stability.

###  Release Activity

Cadence

Unknown

Total

1

Last Release

364d ago

### Community

Maintainers

![](https://www.gravatar.com/avatar/ebedf625b7a40647c79c0b4788067638b60eb50ed9d0cab47ea0852e03772020?d=identicon)[changhorizon](/maintainers/changhorizon)

###  Code Quality

TestsPHPUnit

Static AnalysisPHPStan

Code StylePHP CS Fixer

Type Coverage Yes

### Embed Badge

![Health badge](/badges/hizpark-crawler/health.svg)

```
[![Health](https://phpackages.com/badges/hizpark-crawler/health.svg)](https://phpackages.com/packages/hizpark-crawler)
```

PHPackages © 2026

[Directory](/)[Categories](/categories)[Trending](/trending)[Changelog](/changelog)[Analyze](/analyze)
