Robots TXT Parser
=================

[![Latest Version on Packagist](https://camo.githubusercontent.com/54d9c7b4cfbfc79e5aefb4f38b1f08faa96eeba3992da2d4e5e2e50a974de58b/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f6c656f706f6c6574746f2f726f626f74732d7478742d7061727365722e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/leopoletto/robots-txt-parser)[![Tests](https://camo.githubusercontent.com/f7a2898d3313be518dd1538fb5e93995ca1abfa8b9f74b449ca1e66d6ef99a13/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f616374696f6e732f776f726b666c6f772f7374617475732f6c656f706f6c6574746f2f726f626f74732d7478742d7061727365722f72756e2d74657374732d706870756e69742e796d6c3f6272616e63683d6d61696e266c6162656c3d7465737473267374796c653d666c61742d737175617265)](https://github.com/leopoletto/robots-txt-parser/actions/workflows/run-tests-phpunit.yml)[![Total Downloads](https://camo.githubusercontent.com/2b9fc2c0e62afe586765370c033bb5399ccd6433d97f6aee0eac67d080851741/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f64742f6c656f706f6c6574746f2f726f626f74732d7478742d7061727365722e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/leopoletto/robots-txt-parser)

A comprehensive PHP package for parsing and analyzing robots.txt files. This library is designed to help you understand the structure and content of robots.txt files, including support for X-Robots-Tag HTTP headers and meta tags from HTML pages.

> **Note**: This library is designed for **parsing and analyzing** robots.txt files to understand their structure. It also provides methods to **check if a user agent is allowed to access a specific path** using the `uaAllowed()` and `isAllowed()` methods.

Installation
------------

Install via Composer:

```
composer require leopoletto/robots-txt-parser
```

Requirements
------------

- PHP 8.2 or higher

### Dependencies

- Guzzle HTTP Client
- Illuminate Collections

Quick Start
-----------

```
use Leopoletto\RobotsTxtParser\RobotsTxtParser;

// Instantiate the parser
$parser = new RobotsTxtParser();

// Configure your bot's user agent (required for parseUrl)
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com/mybot');

// Parse from URL
$response = $parser->parseUrl('https://example.com');
```

Configuration
-------------

Before parsing from a URL, you must configure your bot's user agent. This is used when making HTTP requests.

### Method 1: Using `configureUserAgent()`

```
$parser->configureUserAgent('BotName', '1.0', 'https://example.com/bot');
// Results in: Mozilla/5.0 (compatible; BotName/1.0; https://example.com/bot)
```

### Method 2: Using `setUserAgent()`

```
$parser->setUserAgent('MyCustomUserAgent/1.0');
```

Parsing Methods
---------------

The library provides three methods for parsing robots.txt content:

### 1. Parse from URL (`parseUrl`)

Parses robots.txt from a URL and also extracts:

- **X-Robots-Tag** HTTP headers from the robots.txt response
- **Meta tags** (robots, googlebot, googlebot-news, bingbot) from the HTML page when a non-robots.txt URL is provided

```
$parser = new RobotsTxtParser();
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com');

// Parse from any URL (will automatically fetch /robots.txt)
$response = $parser->parseUrl('https://example.com');
// or
$response = $parser->parseUrl('https://example.com/robots.txt');

$records = $response->records();
```

**What `parseUrl` returns:**

- All robots.txt directives (User-agent, Allow, Disallow, Crawl-delay, Sitemap)
- X-Robots-Tag headers from the robots.txt response
- Meta tags from the HTML page (if parsing a non-robots.txt URL)
- Comments and syntax errors

### 2. Parse from File (`parseFile`)

Parses a robots.txt file from the local filesystem.

```
$parser = new RobotsTxtParser();
$response = $parser->parseFile('/path/to/robots.txt');

$records = $response->records();
```

### 3. Parse from Text (`parseText`)

Parses robots.txt content directly from a string.

```
$parser = new RobotsTxtParser();
$content = "User-agent: *\nDisallow: /admin/";
$response = $parser->parseText($content);

$records = $response->records();
```

Accessing Parsed Data
---------------------

All parsing methods return a `Response` object with the following methods:

### Basic Information

```
$response = $parser->parseUrl('https://example.com');

// Get the size of the parsed content in bytes
$size = $response->size();

// Get all records as a collection
$records = $response->records();

// Get the total number of parsed lines
$totalLines = $records->lines();
```

### User Agents

Get all user agents and their directives:

```
// Get all user agents
$userAgents = $records->userAgents()->toArray();

// Get a specific user agent
$googlebot = $records->userAgents('Googlebot')->toArray();
```

**Example output:**

```
{
    "*": {
        "line": 19,
        "userAgent": "*",
        "description": null,
        "category": null,
        "allow": [
            {
                "line": 20,
                "directive": "allow",
                "path": "/researchtools/ose/$"
            }
        ],
        "disallow": [
            {
                "line": 32,
                "directive": "disallow",
                "path": "/admin/"
            }
        ],
        "crawlDelay": []
    },
    "GPTBot": {
        "line": 11,
        "userAgent": "GPTBot",
        "description": "GPTBot is OpenAI's web crawler that collects data from publicly accessible web pages to improve AI models like ChatGPT, while respecting robots.txt and opt-out requests",
        "category": "AI Data Scraper",
        "allow": [],
        "disallow": [
            {
                "line": 12,
                "directive": "disallow",
                "path": "/blog/"
            }
        ],
        "crawlDelay": []
    }
}
```

**Note:** The `description` and `category` fields are automatically populated for recognized bots from the built-in dataset. Unknown bots will have `null` values for these fields.
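
For example, here is a minimal sketch, relying only on the array shape shown in the example above, that keeps just the user agents matched against the built-in dataset:

```
// Keep only user agents that were recognized (non-null category).
$recognizedBots = array_filter(
    $records->userAgents()->toArray(),
    fn (array $agent) => $agent['category'] !== null
);

foreach ($recognizedBots as $name => $agent) {
    echo "{$name}: {$agent['category']}\n"; // e.g. "GPTBot: AI Data Scraper"
}
```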

### Directives

Get specific directive types:

```
// Get all disallowed paths
$disallowed = $records->disallowed()->toArray();

// Get disallowed paths for a specific user agent
$disallowed = $records->disallowed('Googlebot')->toArray();

// Get all allowed paths
$allowed = $records->allowed()->toArray();

// Get crawl delays
$crawlDelays = $records->crawlDelay()->toArray();
```

**Example output:**

```
[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/"
    },
    {
        "line": 33,
        "directive": "disallow",
        "path": "/private/"
    }
]
```

### Display User Agent Information

When you want to see which user agents apply to each directive:

```
// Show user agents as an array for each directive
$disallowed = $records->displayUserAgent()->disallowed()->toArray();
```

**Example output:**

```
[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": ["*", "GPT-User"]
    }
]
```

When you query a specific user agent with `displayUserAgent()`, each directive is expanded into one entry per user agent in its group:

```
// Expand directives for all user agents in the same group
$disallowed = $records->displayUserAgent()->disallowed('*')->toArray();
```

**Example output:**

```
[
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": "*"
    },
    {
        "line": 32,
        "directive": "disallow",
        "path": "/admin/",
        "userAgent": "GPT-User"
    }
]
```

### Sitemaps

```
$sitemaps = $records->sitemaps()->toArray();
```

**Example output:**

```
[
    {
        "line": 52,
        "url": "https://example.com/sitemap.xml",
        "valid": true
    }
]
```

### Comments

```
$comments = $records->comments()->toArray();
```

**Example output:**

```
[
    {
        "line": 1,
        "comment": "File last updated May 5, 2025"
    }
]
```

### X-Robots-Tag Headers (from `parseUrl`)

When parsing from a URL, you can access X-Robots-Tag HTTP headers. Each header is validated with conflict detection, redundancy analysis, and user agent validation:

```
$headers = $records->headersDirectives()->toArray();
```

**Example output:**

```
[
    {
        "user_agent": "googlebot",
        "user_agent_valid": true,
        "raw": "googlebot: noindex, nofollow",
        "directives": [
            { "name": "noindex", "value": null, "type": "simple", "valid": true },
            { "name": "nofollow", "value": null, "type": "simple", "valid": true }
        ],
        "valid": true,
        "issues": [],
        "conflicts": [],
        "redundancies": [],
        "is_full_spec": false
    }
]
```

### Meta Tags (from `parseUrl`)

When parsing from a URL (non-robots.txt), you can access robots meta tags from the HTML. Supports `robots`, `googlebot`, `googlebot-news`, and `bingbot` meta tags with full validation:

```
$metaTags = $records->metaTagsDirectives()->toArray();
```

**Example output:**

```
[
    {
        "tag_name": "robots",
        "raw": "index, follow, max-image-preview:large",
        "directives": [
            { "name": "index", "value": null, "type": "simple", "valid": true },
            { "name": "follow", "value": null, "type": "simple", "valid": true },
            { "name": "max-image-preview", "value": "large", "type": "parametric", "valid": true }
        ],
        "valid": true,
        "issues": [],
        "conflicts": [],
        "redundancies": [],
        "is_full_spec": false
    }
]
```

### Syntax Errors

Check for parsing errors:

```
$errors = $records->syntaxErrors()->toArray();
```

**Example output:**

```
[
    {
        "line": 5,
        "message": "Directive must follow a user agent"
    }
]
```

### Checking Path Access

Check if a specific user agent is allowed to access a path:

```
// Check if GPTBot is allowed to access a specific path
$isAllowed = $records->uaAllowed('GPTBot', '/ja-jp/community/search?q=hello');
// Returns: false (if disallowed) or true (if allowed)

// Alias method
$isAllowed = $records->isAllowed('GPTBot', '/ko-kr/make/something');
// Returns: true
```

**Features:**

- **Case-insensitive user agent matching** - Works with any case variation
- **Automatic wildcard fallback** - If the user agent doesn't exist, falls back to `*` rules
- **Pattern matching support:**
    - `*` wildcard - Matches any sequence of characters (can appear multiple times)
    - `$` end anchor - Matches only at the end of the path
- **Rule specificity** - More specific (longer) paths take precedence
- **Default behavior** - Returns `true` (allowed) if no rules match

**Examples:**

```
$parser = new RobotsTxtParser();
$robots = $parser->parseUrl('https://example.com/robots.txt');
$records = $robots->records();

// Check specific user agent
$records->uaAllowed('GPTBot', '/allowed/path');     // true
$records->uaAllowed('GPTBot', '/blocked/path');     // false

// Falls back to wildcard if user agent not found
$records->uaAllowed('UnknownBot', '/path');          // Uses * rules

// Works with wildcard patterns
// If robots.txt has: Disallow: /blog/*?s=*
$records->uaAllowed('*', '/blog/article?s=test');  // false

// Works with end anchor
// If robots.txt has: Allow: /make/$
$records->uaAllowed('*', '/make/');                 // true
$records->uaAllowed('*', '/make/something');        // false
```

### Directive Validation

The `DirectiveValidator` can be used standalone to validate directive strings. It recognizes all standard simple directives, parametric directives (`max-snippet`, `max-image-preview`, `max-video-preview`), and detects conflicts, redundancies, and deprecated directives.

```
use Leopoletto\RobotsTxtParser\Validators\DirectiveValidator;

$validator = new DirectiveValidator();

// Validate a directive string
$result = $validator->validate('index, noindex, max-snippet:150');
```

**Example output:**

```
{
    "raw": "index, noindex, max-snippet:150",
    "source": "meta",
    "user_agent": null,
    "directives": [
        { "name": "index", "value": null, "type": "simple", "valid": true },
        { "name": "noindex", "value": null, "type": "simple", "valid": true },
        { "name": "max-snippet", "value": "150", "type": "parametric", "valid": true }
    ],
    "valid": false,
    "issues": [],
    "conflicts": [
        {
            "directives": ["index", "noindex"],
            "severity": "high",
            "message": "Conflicting directives: index and noindex",
            "resolution": "Most restrictive wins (noindex)"
        }
    ],
    "redundancies": [],
    "is_full_spec": false
}
```

You can also validate user agents for X-Robots-Tag headers:

```
$result = $validator->validateUserAgent('googlebot');
// { "user_agent": "googlebot", "valid": true, "issues": [] }

$result = $validator->validateUserAgent('unknownbot');
// { "user_agent": "unknownbot", "valid": false, "issues": [{ "type": "unknown_user_agent", ... }] }
```

**Validation features:**

- **17 simple directives** recognized, including `index`, `noindex`, `follow`, `nofollow`, `nosnippet`, `noimageindex`, `noarchive`, `archive`, `notranslate`, `translate`, `all`, `none`, `nositelinkssearchbox`, `noodp`, and `noydir`
- **Parametric directives** validated: `max-snippet` (integer), `max-image-preview` (none/standard/large), `max-video-preview` (integer)
- **Conflict detection** for contradictory pairs (index/noindex, follow/nofollow, nosnippet/max-snippet, all/none)
- **Redundancy detection** for shorthand overlaps (e.g., `all` already includes `index`)
- **Deprecation warnings** for `noodp` and `noydir` (see the sketch after this list)
- **Full spec detection** when index + all three parametric directives are present
- **Deduplication** of repeated directives
- **Case insensitive** parsing
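
As a rough illustration of the redundancy and deprecation checks, the sketch below inspects only the top-level `redundancies` and `issues` keys shown in the earlier example output; the assumption that deprecation warnings surface under `issues` is not confirmed above:

```
use Leopoletto\RobotsTxtParser\Validators\DirectiveValidator;

$validator = new DirectiveValidator();

// 'all' already implies 'index', and 'noodp' is deprecated, so redundancy
// and deprecation findings would be expected for this input.
$result = $validator->validate('all, index, noodp');

$hasRedundancy  = count($result['redundancies']) > 0; // redundancy entries (exact shape not documented here)
$hasDeprecation = count($result['issues']) > 0;       // deprecation warnings as issues (assumption)
```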

### Directive Merging

The `RobotsMerger` combines meta tag and header directives into a single effective ruleset. Header directives are applied after meta directives, allowing headers to override meta tags.

```
use Leopoletto\RobotsTxtParser\Helpers\RobotsMerger;

$metaDirectives = $records->metaTagsDirectives()->toArray();
$headerDirectives = $records->headersDirectives()->toArray();

$merged = RobotsMerger::merge($metaDirectives, $headerDirectives);
```

**Example output:**

```
{
    "effective_rules": {
        "index": true,
        "follow": true,
        "max_snippet": -1,
        "max_image_preview": "standard",
        "max_video_preview": -1,
        "archive": true,
        "translate": true,
        "image_index": true
    },
    "sources": {
        "meta_count": 1,
        "header_count": 2
    }
}
```

When no directives are present, the default permissive state is returned (all indexing/following allowed, no restrictions on snippets or previews).
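
As a minimal sketch of that fallback, assuming `merge()` accepts empty arrays when neither meta tags nor headers were collected:

```
use Leopoletto\RobotsTxtParser\Helpers\RobotsMerger;

// With no meta tag and no X-Robots-Tag directives, the merge is expected
// to return the default permissive ruleset (indexing and following allowed,
// no snippet or preview limits).
$merged = RobotsMerger::merge([], []);

$canIndex  = $merged['effective_rules']['index'];  // expected: true
$canFollow = $merged['effective_rules']['follow']; // expected: true
```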

Complete Example
----------------

Here's a complete example showing all available data:

```
use Leopoletto\RobotsTxtParser\RobotsTxtParser;

$parser = new RobotsTxtParser();
$parser->configureUserAgent('MyBot', '1.0', 'https://example.com/mybot');

// Parse from URL
$response = $parser->parseUrl('https://example.com');
$records = $response->records();

// Build comprehensive response
$data = [
    'size' => $response->size(),
    'lines' => $records->lines(),
    'user_agents' => $records->userAgents()->toArray(),
    'disallowed' => $records->displayUserAgent()->disallowed()->toArray(),
    'allowed' => $records->allowed()->toArray(),
    'crawls_delay' => $records->crawlDelay()->toArray(),
    'sitemaps' => $records->sitemaps()->toArray(),
    'comments' => $records->comments()->toArray(),
    'html' => $records->metaTagsDirectives()->toArray(),      // From parseUrl only
    'headers' => $records->headersDirectives()->toArray(),    // From parseUrl only
    'errors' => $records->syntaxErrors()->toArray(),
];

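// Return as JSON (the response() helper assumes a Laravel context)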
return response()->json($data);
```

See `public/example.json` for a complete example of the output structure.

User Agent Groups
-----------------

The library correctly handles consecutive User-agent declarations, which in the robots.txt format form a group sharing the same directives:

```
User-agent: *
User-agent: GPT-User
Disallow: /admin/
```

Both `*` and `GPT-User` will have the same directives. When you query by either user agent, you'll get the same results:

```
$disallowed1 = $records->disallowed('*')->toArray();
$disallowed2 = $records->disallowed('GPT-User')->toArray();
// Both return the same directives
```

Features
--------

- ✅ Parse robots.txt from URL, file, or text
- ✅ Extract X-Robots-Tag HTTP headers
- ✅ Extract robots meta tags from HTML pages (robots, googlebot, googlebot-news, bingbot)
- ✅ Handle consecutive User-agent declarations (groups)
- ✅ Efficient storage (no duplicate directives)
- ✅ Support for all standard directives (Allow, Disallow, Crawl-delay, Sitemap)
- ✅ Comments and syntax error detection
- ✅ Memory-efficient streaming for large files
- ✅ **User agent recognition** - Automatic description and category for recognized bots
- ✅ **Path access checking** - Check if a user agent is allowed to access a specific path
- ✅ **Pattern matching** - Support for `*` wildcards and `$` end anchors
- ✅ **Directive validation** - Validate simple and parametric directives with conflict, redundancy, and deprecation detection
- ✅ **Directive merging** - Merge meta tag and header directives with most-restrictive-wins resolution
- ✅ Comprehensive test coverage

Credits
-------

- [leopoletto](https://github.com/leopoletto)
- [All Contributors](../../contributors)

Contributing
------------

Contributions are welcome! Please feel free to submit a Pull Request.

License
-------

The MIT License (MIT). Please see [License File](LICENSE.md) for more information.
