PHPackages                             toflar/fast-set - PHPackages - PHPackages  [Skip to content](#main-content)[PHPackages](/)[Directory](/)[Categories](/categories)[Trending](/trending)[Leaderboard](/leaderboard)[Changelog](/changelog)[Analyze](/analyze)[Collections](/collections)[Log in](/login)[Sign up](/register)

1. [Directory](/)
2. /
3. [Utility &amp; Helpers](/categories/utility)
4. /
5. toflar/fast-set

ActiveLibrary[Utility &amp; Helpers](/categories/utility)

toflar/fast-set
===============

Provides the means to create memory-efficient precompiled sets with very fast fingerprint lookups

1.0.1(1mo ago)71.2k1MITPHPPHP ^8.1CI passing

Since Mar 30Pushed 1mo agoCompare

[ Source](https://github.com/Toflar/FastSet)[ Packagist](https://packagist.org/packages/toflar/fast-set)[ RSS](/packages/toflar-fast-set/feed)WikiDiscussions main Synced today

READMEChangelog (2)Dependencies (5)Versions (4)Used By (1)

FastSet
=======

[](#fastset)

**FastSet** is a tiny, read-only, memory-efficient membership set for PHP.

It answers exactly one question:

> Does this string exist in my set?

…and it does so **very fast** and with **very little memory**, even for large sets.

---

Why FastSet exists
------------------

[](#why-fastset-exists)

PHP’s native arrays are extremely fast for lookups, but they are also **very memory-hungry**.
Storing large dictionaries (100k+ strings) quickly becomes impractical, especially when you need multiple sets at once.

FastSet is designed for cases where:

- The dictionary is **static** (no inserts, deletes, or updates).
- You only need **`has(string $entry): bool`**.
- **Memory usage matters**.
- Lookup speed must be **predictable and fast**.

Typical use cases:

- Linguistic dictionaries (e.g. word decomposition)
- Large allow/deny lists
- Static vocabularies
- Read-only validation sets

---

Design overview
---------------

[](#design-overview)

FastSet uses a **sorted fingerprint blob** instead of storing the original strings.

### Build time

[](#build-time)

1. Read to original dictionary built using the `SetBuilder`
2. Hash each entry using the configured hash
3. Sort all fingerprints.
4. Write two files:
    - `hashes_.bin` – concatenated fingerprints (without the 2-byte prefixes for maximum compression)
    - `index_.bin` – a 2-byte prefix index (65,537 offsets)

### Lookup time

[](#lookup-time)

1. Hash the input string with the same algorithm.
2. Use the **first 2 bytes of the hash** to select a small bucket.
3. Binary-search **only inside that bucket**.

Because hash prefixes are uniformly distributed, buckets are tiny. Real world tests on the fixture file that contains `~134,000` entries in this repository (`tests/Fixtures/terms_de.txt`) using the `xxh3` hash algorithm:

- Used buckets: `57,112 (≈ 87.2%)`
- Average non-empty bucket size: `2.36` entries
- Worst case bucket size (biggest bucket): `11` entries
- Worst case comparisons: `log₂(11) = 4`

Of course, the bigger your dictionary, the bigger the individual buckets.

All comparisons are fixed-width, not variable-length UTF-8 strings.

---

Properties
----------

[](#properties)

- ✅ **Very low memory usage**
- ✅ **Predictable lookup cost**
- ✅ **No PHP hash tables**
- ✅ **No database**
- ✅ **No dependencies**
- ⚠️ **Read-only**
- ⚠️ **Hash collisions are theoretically possible**
    (acceptable for many linguistic and heuristic use cases)

---

Installation
------------

[](#installation)

```
composer require toflar/fast-set

```

---

Usage
-----

[](#usage)

### 1. Build the set (one-time)

[](#1-build-the-set-one-time)

Hashes are an excellent tool for evenly distributing entries, which is exactly what makes lookups in `FastSet`extremely fast. However, hashes are not well suited for distribution:

- They are often larger than the original terms (especially for short words).
- They are effectively random, which means gzip compression performs very poorly.

Shipping prebuilt hash files would therefore often mean shipping more data than the original dictionary.

The solution: The `SetBuilder`

This library ships with a `SetBuilder` that is designed specifically for distribution size efficiency. Instead of shipping hashes, you ship a compressed dictionary that:

- exploits shared prefixes between terms
- avoids repeating identical prefixes
- compresses extremely well with gzip

The hash-based data structures are then generated locally at build time.

```
$myOriginalSet = __DIR__ . '/dictionary.txt'; // one entry per line

// Encode/Compress with the prefix algorithm:
SetBuilder::buildSet($myOriginalSet, './compressed.txt');

// Encode/Compress with the prefix algorithm and gzip on top (the .gz suffix determines that):
SetBuilder::buildSet($myOriginalSet, './compressed.gz');

// If your dictionary is not a line-feed separated file, but you have an array, you can also build it from an
// array directly:
SetBuilder::buildFromArray($myArray, './compressed.gz');

// You can use either xxh3 (default) or xxh128
// xxh128 will have a smaller probability for false-positives but require more memory
$hashAlgorithm = 'xxh3';

// You then ship either "compressed.txt" or "compressed.gz" with your application. Instantiating
// is then done as follows:
$set = new FastSet(__DIR__ . '/dict', $hashAlgorithm);
$set->build(__DIR__ . '/compressed.(txt|gz)'); // Must be a file built using the SetBuilder
```

Calling `build` creates the following files on-the-fly:

```
dict/
├── hashes_.bin
└── index_.bin

```

> Important: Do not ship `hashes_.bin` or `index_.bin`. Only ship the compressed dictionary created by the `SetBuilder`.

---

#### How to choose the right hashing algorithm

[](#how-to-choose-the-right-hashing-algorithm)

This library only supports the use of xxHash. Why? Because it's [extremely fast](https://xxhash.com/). It does, however, not make sense to allow all versions of xxHash. Hence, only `xxh3` (64bit) and `xxh128`(128bit) are supported.

`xxh3` will almost always be the right choice for you. That's why it's the default. But decide for yourself:

Collision probability by hash size for `100_000` terms:

HashBitsProbability of ≥1 collisionInterpretationxxh3232~0.31 (31%)❌ Unacceptablexxh364~2.7 × 10⁻¹⁰~1 in 3.7 billionxxh128128~1.5 × 10⁻²⁹Effectively zeroCollision probability by hash size for `500_000` terms:

HashBitsProbability of ≥1 collisionInterpretationxxh3232~1.0❌ Guaranteed collisionsxxh364~6.8 × 10⁻⁹~1 in 147 millionxxh128128~3.7 × 10⁻²⁸Effectively zeroWhich algorithm to choose depends on the risk of collisions you want to take but as long as you are &lt; 1 million terms, 64-bit is astronomically safe. Maybe this is a good rule of thumb:

> Use `xxh3` by default. Switch to `xxh128` if your set contains millions of terms, or you need essentially zero collision risk.

### 2. Lookup

[](#2-lookup)

Once you have initialized/built your `FastSet` calling `build()` so that the required files have been built, you can then use it as follows:

```
$set = new FastSet(__DIR__ . '/dict');

if ($set->has('look-me-up')) {
    // exists
}
```

The `hashes_.bin` and `index_.bin` files are loaded lazily on first lookup, but you can also call `initialize()`explicitly if you want to load them into memory at a specific point in time.

---

Caveats
-------

[](#caveats)

- **Not suitable if you need exact, collision-free guarantees**
- **Not suitable for mutable sets**
- Input normalization (case-folding, Unicode normalization) is intentionally left to the caller

---

License
-------

[](#license)

MIT

###  Health Score

45

—

FairBetter than 91% of packages

Maintenance94

Actively maintained with recent releases

Popularity24

Limited adoption so far

Community8

Small or concentrated contributor base

Maturity45

Maturing project, gaining track record

 Bus Factor1

Top contributor holds 100% of commits — single point of failure

How is this calculated?**Maintenance (25%)** — Last commit recency, latest release date, and issue-to-star ratio. Uses a 2-year decay window.

**Popularity (30%)** — Total and monthly downloads, GitHub stars, and forks. Logarithmic scaling prevents top-heavy scores.

**Community (15%)** — Contributors, dependents, forks, watchers, and maintainers. Measures real ecosystem engagement.

**Maturity (30%)** — Project age, version count, PHP version support, and release stability.

###  Release Activity

Cadence

Every ~63 days

Total

2

Last Release

31d ago

PHP version history (2 changes)1.0.0PHP ^8.3

1.0.1PHP ^8.1

### Community

Maintainers

![](https://avatars.githubusercontent.com/u/481937?v=4)[Yanick Witschi](/maintainers/Toflar)[@Toflar](https://github.com/Toflar)

---

Top Contributors

[![Toflar](https://avatars.githubusercontent.com/u/481937?v=4)](https://github.com/Toflar "Toflar (22 commits)")

---

Tags

hashmemorysetbinarydictionarylookupFingerprint

###  Code Quality

TestsPHPUnit

### Embed Badge

![Health badge](/badges/toflar-fast-set/health.svg)

```
[![Health](https://phpackages.com/badges/toflar-fast-set/health.svg)](https://phpackages.com/packages/toflar-fast-set)
```

###  Alternatives

[hashids/hashids

Generate short, unique, non-sequential ids (like YouTube and Bitly) from numbers

5.4k52.1M315](/packages/hashids-hashids)[phpcollection/phpcollection

General-Purpose Collection Library for PHP

1.0k64.7M34](/packages/phpcollection-phpcollection)[marc-mabe/php-enum

Simple and fast implementation of enumerations with native PHP

50458.3M110](/packages/marc-mabe-php-enum)[alchemy/binary-driver

A set of tools to build binary drivers

19511.1M40](/packages/alchemy-binary-driver)[equip/structure

Simple, immutable data structures

40202.4k5](/packages/equip-structure)[internal/destroy

561.5M16](/packages/internal-destroy)

PHPackages © 2026

[Directory](/)[Categories](/categories)[Trending](/trending)[Changelog](/changelog)[Analyze](/analyze)
