survos/site-discovery-bundle
============================

Discover hosted SaaS tenant sites (\*.example.com) via web archive indexes. Currently supports Internet Archive CDX API only.

Discovers tenant hostnames under a shared SaaS domain — e.g. `*.pastperfectonline.com`, `*.omeka.net` — by querying web archive indexes.

**Current backend: Internet Archive CDX API only.** Common Crawl support is planned but not yet implemented.

**Requirements:** PHP 8.4+, Symfony 8.0+

---

Background: how the CDX API works
---------------------------------

The Internet Archive crawls the web continuously and records every captured URL in a CDX (Capture inDeX). Each record is keyed by the URL in **SURT** (Sort-friendly URI Reordering Transform) form, which reverses the domain label order:

```
https://fauquierhistory.pastperfectonline.com/path
  →  com,pastperfectonline,fauquierhistory)/path

```

For a SaaS platform like PastPerfect Online, every tenant has a subdomain. Their SURT keys look like:

```
com,pastperfectonline,fauquierhistory)/
com,pastperfectonline,bainbridgehistorymuseum)/advancedsearch

```

The tenant slug (`fauquierhistory`) sits between the shared SURT prefix (`com,pastperfectonline,`) and the closing `)`. This bundle queries CDX for all URLs under the registered domain, filters to subdomain-only rows, and extracts unique slugs.
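
The prefix and slug handling described above can be sketched in a few lines of PHP. This is an illustration only, not the bundle's actual code: `surtPrefix()` and `extractSlug()` are hypothetical helper names, and the sketch assumes slugs are a single lowercase label made of letters, digits, and hyphens.

```php
<?php
declare(strict_types=1);

/**
 * Build the SURT prefix for a registered domain:
 * reverse the labels, join with commas, add a trailing comma.
 * (Hypothetical helper, not part of the bundle.)
 */
function surtPrefix(string $domain): string
{
    $labels = array_reverse(explode('.', strtolower(trim($domain, '.'))));

    return implode(',', $labels) . ',';
}

/**
 * Extract the tenant slug from a SURT urlkey, or return null when the
 * key is not a single-label subdomain row under the given prefix.
 * (Hypothetical helper; the accepted slug characters are an assumption.)
 */
function extractSlug(string $urlkey, string $surtPrefix): ?string
{
    // The slug is whatever sits between the prefix and the closing ")".
    $pattern = '/^' . preg_quote($surtPrefix, '/') . '([a-z0-9][a-z0-9-]*)\)/';

    return preg_match($pattern, $urlkey, $m) === 1 ? $m[1] : null;
}
```

With these helpers, `surtPrefix('pastperfectonline.com')` gives `com,pastperfectonline,`, and `extractSlug('com,pastperfectonline,fauquierhistory)/', 'com,pastperfectonline,')` gives `fauquierhistory`. A bare-domain key such as `com,pastperfectonline)/` yields `null`.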

### Computing the SURT prefix

Reverse the domain labels, join with commas, add a trailing comma:

| Domain | SURT prefix |
|---|---|
| `pastperfectonline.com` | `com,pastperfectonline,` |
| `omeka.net` | `net,omeka,` |
| `myheritage.com` | `com,myheritage,` |
| `arcgis.com` | `com,arcgis,` |

---

Installation
------------

```
composer require survos/site-discovery-bundle
```

Register if not using Symfony Flex:

```
// config/bundles.php
return [
    Survos\SiteDiscoveryBundle\SurvosSiteDiscoveryBundle::class => ['all' => true],
];
```

---

Configuration
-------------

```
# config/packages/survos_site_discovery.yaml
survos_site_discovery:
    user_agent: "MyApp SiteDiscovery"   # defaults to "SurvosSiteDiscoveryBundle"
```

---

Console command
---------------

```
site:discover <domain> <surtPrefix> [options]

```

### Arguments

| Argument | Description |
|---|---|
| `domain` | Bare registered domain, e.g. `pastperfectonline.com` |
| `surtPrefix` | SURT prefix for subdomain rows, e.g. `com,pastperfectonline,` |

### Options

| Option | Default | Description |
|---|---|---|
| `--output` | stdout | Write JSONL to this file path |
| `--limit` | 0 | Stop after N unique sites (0 = unlimited). **Always use a small number during development.** |
| `--page-size` | 5000 | CDX rows per API request (max ~10 000) |
| `--scheme` | `https` | URL scheme used in `base_url` |

### Examples

```
# Discover PastPerfect Online sites, print to stdout (first 5 for testing)
bin/console site:discover pastperfectonline.com com,pastperfectonline, --limit=5

# Write to a JSONL file
bin/console site:discover pastperfectonline.com com,pastperfectonline, \
    --output=var/discovery/pastperfect-sites.jsonl

# Discover Omeka.net sites
bin/console site:discover omeka.net net,omeka, \
    --output=var/discovery/omeka-sites.jsonl

# Full discovery — slow, expect 10–30 s per CDX page
bin/console site:discover pastperfectonline.com com,pastperfectonline, \
    --output=var/discovery/pastperfect-sites.jsonl
```

### Output JSONL shape

One JSON object per line (shown pretty-printed here for readability):

```
{
  "slug":           "fauquierhistory",
  "host":           "fauquierhistory.pastperfectonline.com",
  "base_url":       "https://fauquierhistory.pastperfectonline.com",
  "discovered_via": "internet_archive_cdx",
  "validated":      false,
  "validated_at":   null
}
```
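
A consumer can read this file back one row at a time without loading it all into memory. The following is a minimal sketch, assuming the file holds one JSON object per line as described above. `readJsonl()` is a hypothetical helper name, not something the bundle provides.

```php
<?php
declare(strict_types=1);

/**
 * Lazily read a JSONL file, yielding one decoded row per non-empty line.
 * (Hypothetical helper, not part of the bundle.)
 *
 * @return \Generator<int, array<string, mixed>>
 */
function readJsonl(string $path): \Generator
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open $path");
    }

    try {
        while (($line = fgets($handle)) !== false) {
            $line = trim($line);
            if ($line === '') {
                continue; // tolerate blank lines
            }
            // Throws JsonException on a malformed line instead of returning null.
            yield json_decode($line, true, 512, JSON_THROW_ON_ERROR);
        }
    } finally {
        fclose($handle);
    }
}
```

Iterating `readJsonl('var/discovery/pastperfect-sites.jsonl')` then gives associative arrays with the keys shown above, such as `$row['host']` and `$row['base_url']`.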

---

PHP API
-------

### Inject `CdxDiscoveryService`

```
use Survos\SiteDiscoveryBundle\Service\CdxDiscoveryService;
use Survos\SiteDiscoveryBundle\Model\DiscoveredSite;

final class MyHarvester
{
    public function __construct(
        private readonly CdxDiscoveryService $cdx,
    ) {}

    public function run(): void
    {
        foreach ($this->cdx->discover('pastperfectonline.com', 'com,pastperfectonline,') as $site) {
            // $site is a DiscoveredSite value object
            echo $site->slug;     // "fauquierhistory"
            echo $site->host;     // "fauquierhistory.pastperfectonline.com"
            echo $site->baseUrl;  // "https://fauquierhistory.pastperfectonline.com"

            $row = $site->toArray(); // JSONL-ready associative array
        }
    }
}
```

### `CdxDiscoveryService::discover()` signature

```
public function discover(
    string $domain,       // e.g. "pastperfectonline.com"
    string $surtPrefix,   // e.g. "com,pastperfectonline,"
    string $scheme   = 'https',
    int    $limit    = 0,     // 0 = unlimited; set small for dev/testing
    int    $pageSize = 5000,
): \Generator  // yields DiscoveredSite
```
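
Because `discover()` returns a generator, results can be written out as they arrive. The sketch below shows one way to persist rows as JSONL from your own code. `writeJsonl()` is a hypothetical helper, not part of the bundle, and it takes plain arrays so that it runs on its own.

```php
<?php
declare(strict_types=1);

/**
 * Write each row as one JSON object per line and return the row count.
 * (Hypothetical helper, not part of the bundle.)
 *
 * @param iterable<array<string, mixed>> $rows
 */
function writeJsonl(iterable $rows, string $path): int
{
    $handle = fopen($path, 'w');
    if ($handle === false) {
        throw new \RuntimeException("Cannot open $path for writing");
    }

    $count = 0;
    try {
        foreach ($rows as $row) {
            // Keep URLs readable: do not escape "/" as "\/".
            fwrite($handle, json_encode($row, JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES) . "\n");
            $count++;
        }
    } finally {
        fclose($handle);
    }

    return $count;
}
```

With the bundle, the rows would come from calling `$site->toArray()` on each site yielded by `discover()`. Passing a small `limit` (for example `limit: 5` as a named argument) keeps test runs short.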

### `DiscoveredSite` value object

```
final readonly class DiscoveredSite
{
    public string $slug;           // "fauquierhistory"
    public string $host;           // "fauquierhistory.pastperfectonline.com"
    public string $baseUrl;        // "https://fauquierhistory.pastperfectonline.com"
    public string $discoveredVia;  // "internet_archive_cdx"

    public function toArray(): array;  // JSONL-ready
}
```

---

CDX API technical notes
-----------------------

These notes are provided for agents and developers integrating with or extending this bundle.

**Endpoint:** `https://web.archive.org/cdx/search/cdx`

**Parameters used by this bundle:**

| Parameter | Value | Purpose |
|---|---|---|
| `url` | e.g. `pastperfectonline.com` | Registered domain (no wildcard) |
| `matchType` | `domain` | Returns all URLs in the entire domain tree |
| `output` | `json` | Structured response; row 0 is a header array |
| `fl` | `urlkey` | Only fetch the SURT key column (cheapest option) |
| `collapse` | `urlkey` | CDX-level deduplication |
| `filter` | `urlkey:{surtPrefix}[a-z0-9]` | Restrict to subdomain rows; skips bare domain entries |
| `limit` | 5000 | Rows per page |
| `showResumeKey` | `true` | Enables pagination |
| `resumeKey` | `{key from previous page}` | Continue from prior page |

**Pagination:** when `showResumeKey=true`, the last row of each page is a resume-key string (not a urlkey). It does NOT start with the SURT prefix, which is how we distinguish it from real data rows. Pass it as `resumeKey` on the next request.
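
The request parameters and the resume-key rule can be sketched as two small functions. This illustrates the notes above and is not the bundle's internal code: `buildCdxUrl()` and `splitPage()` are hypothetical names, and the handling of blank separator rows is an assumption about the response shape.

```php
<?php
declare(strict_types=1);

const CDX_ENDPOINT = 'https://web.archive.org/cdx/search/cdx';

/** Build one CDX page request URL from the parameters in the table. */
function buildCdxUrl(string $domain, string $surtPrefix, int $pageSize = 5000, ?string $resumeKey = null): string
{
    $params = [
        'url'           => $domain,
        'matchType'     => 'domain',
        'output'        => 'json',
        'fl'            => 'urlkey',
        'collapse'      => 'urlkey',
        'filter'        => 'urlkey:' . $surtPrefix . '[a-z0-9]',
        'limit'         => $pageSize,
        'showResumeKey' => 'true',
    ];
    if ($resumeKey !== null) {
        $params['resumeKey'] = $resumeKey;
    }

    return CDX_ENDPOINT . '?' . http_build_query($params);
}

/**
 * Split a decoded JSON page into data urlkeys and the resume key, if any.
 *
 * @param array<int, array<int, string>> $rows decoded response; row 0 is the header
 * @return array{0: array<int, string>, 1: ?string}
 */
function splitPage(array $rows, string $surtPrefix): array
{
    $urlkeys = [];
    $resumeKey = null;

    foreach (array_slice($rows, 1) as $row) {
        $value = $row[0] ?? '';
        if ($value === '') {
            continue; // assumption: skip empty separator rows
        }
        if (str_starts_with($value, $surtPrefix)) {
            $urlkeys[] = $value;
        } else {
            $resumeKey = $value; // not a urlkey, so treat it as the resume key
        }
    }

    return [$urlkeys, $resumeKey];
}
```

A page whose `splitPage()` result has a `null` resume key is the last one. Otherwise, pass the key to `buildCdxUrl()` for the next request.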

**Why `fl=urlkey` instead of `fl=original`:** the `original` field contains the raw crawled URL, which requires URL parsing to extract the hostname. The `urlkey` encodes the slug directly and unambiguously. One regex match is all that is needed.

**Why `matchType=domain` instead of `matchType=host`:** `matchType=host` with a wildcard (`*.example.com`) returns empty results. `matchType=domain` with the bare registered domain returns the full tree.

**Latency:** CDX API requests with `matchType=domain` can take 10–30 seconds per page. The response is streamed; the bundle waits for the full response. Plan accordingly.

**Coverage gaps:** sites blocked by `robots.txt` during crawl, or newer than the most recent IA crawl, will not appear. Use the output as a candidate list to be validated, not as a definitive registry.

---

Rate limiting
-------------

The CDX API is free and unauthenticated. The Internet Archive does not publish a formal rate limit, but hammering the API with parallel requests is antisocial. This bundle makes one sequential request per page. Do not add concurrency.
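
For anyone writing their own pagination loop, the sequential pattern looks roughly like this. It is a sketch under stated assumptions: `walkPages()` and the `$fetchPage` callable are hypothetical, and the pause between pages is an extra courtesy added here for illustration, not documented bundle behaviour.

```php
<?php
declare(strict_types=1);

/**
 * Walk pages strictly one at a time, pausing between requests.
 * (Hypothetical helper, not part of the bundle.)
 *
 * @param callable(?string): array{0: array<int, string>, 1: ?string} $fetchPage
 *        hypothetical fetcher: takes a resume key, returns [rows, nextResumeKey]
 * @return \Generator<int, string>
 */
function walkPages(callable $fetchPage, int $pauseSeconds = 1): \Generator
{
    $resumeKey = null;

    do {
        // One request in flight at a time; the next starts only after this returns.
        [$rows, $resumeKey] = $fetchPage($resumeKey);

        foreach ($rows as $row) {
            yield $row;
        }

        if ($resumeKey !== null && $pauseSeconds > 0) {
            sleep($pauseSeconds); // avoid back-to-back requests
        }
    } while ($resumeKey !== null);
}
```

The loop stops as soon as a page comes back without a resume key, and a consumer that stops iterating early never triggers the remaining requests.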

---

Downstream validation
---------------------

This bundle only discovers candidate hostnames. It does not validate that a host is currently live or that it is still running the expected platform. Add a probe step in your consumer bundle, e.g.:

```
// Pseudo-code — implement in your bundle
$response = $httpClient->request('GET', $site->baseUrl . '/AdvancedSearch');
$isLive = $response->getStatusCode() === 200
    && str_contains($response->getContent(), 'pastperfectonline');
```

---

Planned backends
----------------

- Common Crawl Host Index (higher coverage; requires DuckDB or Athena)
- Static seed file (CSV/JSONL of known hosts, for offline or pre-seeded use)

Pull requests for additional backends are welcome. Use `CdxDiscoveryService` as the reference implementation: yield `DiscoveredSite` objects, accept a `$limit` parameter, and stream results lazily.

---

License
-------

MIT

