ezimuel/phpvector
=================

A vector database in PHP implementing HNSW for approximate nearest-neighbor search and BM25 for hybrid full-text + vector retrieval.

Version 0.2.0 · MIT · PHP ^8.1 · [Source](https://github.com/ezimuel/PHPVector) · [Packagist](https://packagist.org/packages/ezimuel/phpvector)

PHPVector
=========

A pure-PHP vector database implementing **HNSW** (Hierarchical Navigable Small World) for approximate nearest-neighbour search and **BM25** for full-text retrieval. Both engines can be combined into a single **hybrid search** pipeline.

Requirements
------------

- PHP 8.2+
- No external PHP extensions required for core functionality
- `ext-pcntl` (optional) — enables asynchronous document writes for lower insert latency

Installation
------------

```
composer require ezimuel/phpvector
```

Quick start
-----------

### 1. Insert documents

A `Document` holds a dense embedding vector, optional raw text for BM25, and any metadata you want returned with results. The `id` field is optional — if omitted, a random UUID v4 is assigned automatically.

```
use PHPVector\Document;
use PHPVector\VectorDatabase;

$db = new VectorDatabase();

$db->addDocuments([
    new Document(
        id: 1,
        vector: [0.12, 0.85, 0.44, 0.67],
        text: 'PHP vector database with HNSW index',
        metadata: ['url' => 'https://example.com/1', 'lang' => 'en'],
    ),
    new Document(
        id: 2,
        vector: [0.91, 0.23, 0.78, 0.05],
        text: 'Approximate nearest neighbour search in PHP',
        metadata: ['url' => 'https://example.com/2', 'lang' => 'en'],
    ),
    new Document(
        id: 3,
        vector: [0.33, 0.61, 0.19, 0.88],
        text: 'BM25 full-text ranking algorithm explained',
        metadata: ['url' => 'https://example.com/3', 'lang' => 'en'],
    ),
    // No id — a UUID v4 is assigned automatically.
    new Document(
        vector: [0.55, 0.42, 0.71, 0.30],
        text: 'Hybrid search with Reciprocal Rank Fusion',
    ),
]);
```

### 2. Vector search

Find the *k* most similar documents to a query vector using HNSW.

```
$queryVector = [0.10, 0.80, 0.50, 0.60];

$results = $db->vectorSearch(vector: $queryVector, k: 2);

foreach ($results as $result) {
    echo sprintf(
        "[%d] score=%.4f  %s\n",
        $result->rank,
        $result->score,
        $result->document->metadata['url'],
    );
}
// [1] score=0.9987  https://example.com/1
// [2] score=0.8341  https://example.com/3
```
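
HNSW returns an approximate answer. The exact answer it approximates can be found by scanning every vector, which is useful for checking results on small data sets. The following is a minimal sketch in plain PHP that does not use the library: `cosineSimilarity()` and `bruteForceSearch()` are hypothetical helper names, and the README does not state whether the library's `score` is raw cosine similarity, so the values need not match the output above.

```php
/**
 * Cosine similarity between two equal-length vectors.
 * Hypothetical helper, not part of PHPVector.
 */
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value * $value;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0; // A zero vector has no direction.
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/**
 * Exact k-nearest-neighbour search by scanning every stored vector.
 *
 * @param array<int|string, float[]> $vectors id => vector
 * @return array<int|string, float>  id => similarity, best first
 */
function bruteForceSearch(array $vectors, array $query, int $k): array
{
    $scores = [];
    foreach ($vectors as $id => $vector) {
        $scores[$id] = cosineSimilarity($vector, $query);
    }
    arsort($scores); // Highest similarity first, keys preserved.
    return array_slice($scores, 0, $k, true);
}

// The first three vectors from the "Insert documents" example.
$vectors = [
    1 => [0.12, 0.85, 0.44, 0.67],
    2 => [0.91, 0.23, 0.78, 0.05],
    3 => [0.33, 0.61, 0.19, 0.88],
];

$top = bruteForceSearch($vectors, [0.10, 0.80, 0.50, 0.60], 2);
echo implode(', ', array_keys($top)), "\n"; // 1, 3
```

A full scan costs one similarity computation per stored vector, which is the cost an HNSW graph is designed to avoid on large collections.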

### 3. Full-text search

Rank documents by BM25 relevance against a text query.

```
$results = $db->textSearch(query: 'nearest neighbour PHP', k: 2);

foreach ($results as $result) {
    echo sprintf(
        "[%d] score=%.4f  %s\n",
        $result->rank,
        $result->score,
        $result->document->metadata['url'],
    );
}
// [1] score=1.2430  https://example.com/2
// [2] score=0.8761  https://example.com/1
```
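
To make the ranking above less of a black box, here is a rough sketch of the textbook BM25 formula in plain PHP. It is not the library's code: `bm25Scores()` is a hypothetical name, the IDF variant `ln(1 + (N - n + 0.5) / (n + 0.5))` is an assumption, and the documents are pre-tokenised by hand, so the scores will not necessarily equal those printed by `textSearch()`.

```php
/**
 * Score every document against a query with the textbook BM25 formula.
 * Hypothetical helper, not part of PHPVector.
 *
 * @param array<int|string, string[]> $docs        id => list of tokens
 * @param string[]                    $queryTokens
 * @return array<int|string, float>   id => score, best first
 */
function bm25Scores(array $docs, array $queryTokens, float $k1 = 1.5, float $b = 0.75): array
{
    $n = count($docs);
    if ($n === 0) {
        return [];
    }
    $avgLength = array_sum(array_map('count', $docs)) / $n;

    // Document frequency: in how many documents each query term occurs.
    $df = [];
    foreach (array_unique($queryTokens) as $term) {
        $df[$term] = 0;
        foreach ($docs as $tokens) {
            if (in_array($term, $tokens, true)) {
                $df[$term]++;
            }
        }
    }

    $scores = [];
    foreach ($docs as $id => $tokens) {
        $tf = array_count_values($tokens); // Term frequency in this document.
        $length = count($tokens);
        $score = 0.0;
        foreach ($df as $term => $docFreq) {
            $freq = $tf[$term] ?? 0;
            if ($freq === 0) {
                continue;
            }
            // Assumed IDF variant; always positive.
            $idf = log(1 + ($n - $docFreq + 0.5) / ($docFreq + 0.5));
            // k1 controls TF saturation, b controls length normalisation.
            $norm = $k1 * (1 - $b + $b * $length / $avgLength);
            $score += $idf * ($freq * ($k1 + 1)) / ($freq + $norm);
        }
        $scores[$id] = $score;
    }
    arsort($scores);
    return $scores;
}

$docs = [
    1 => ['php', 'vector', 'database', 'with', 'hnsw', 'index'],
    2 => ['approximate', 'nearest', 'neighbour', 'search', 'in', 'php'],
    3 => ['bm25', 'full', 'text', 'ranking', 'algorithm', 'explained'],
];

$scores = bm25Scores($docs, ['nearest', 'neighbour', 'php']);
echo implode(', ', array_keys($scores)), "\n"; // 2, 1, 3
```

Document 2 matches all three query terms, document 1 matches only `php`, and document 3 matches none, so it scores zero.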

### 4. Hybrid search

Fuse vector similarity and BM25 scores into a single ranked list.

#### Reciprocal Rank Fusion (recommended)

RRF is rank-based and scale-invariant — no tuning required.

```
use PHPVector\HybridMode;

$results = $db->hybridSearch(
    vector: $queryVector,
    text:   'vector database PHP',
    k:      3,
    mode:   HybridMode::RRF,
);

foreach ($results as $result) {
    echo sprintf(
        "[%d] score=%.4f  %s\n",
        $result->rank,
        $result->score,
        $result->document->metadata['url'],
    );
}
```
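
As an illustration of why RRF needs no score normalisation, the sketch below fuses two ranked lists using only each document's position. It is a standalone approximation, not the library's implementation: `reciprocalRankFusion()` is a hypothetical name, and the constant `k = 60` is the value commonly used in the literature, assumed here because the README does not say which constant PHPVector uses.

```php
/**
 * Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document.
 * Hypothetical helper, not part of PHPVector.
 *
 * @param array<int, array<int, int|string>> $rankings lists of ids, best first
 * @return array<int|string, float>          id => fused score, best first
 */
function reciprocalRankFusion(array $rankings, int $k = 60): array
{
    $scores = [];
    foreach ($rankings as $ranking) {
        foreach (array_values($ranking) as $position => $id) {
            // $position is zero-based, ranks are one-based.
            $scores[$id] = ($scores[$id] ?? 0.0) + 1.0 / ($k + $position + 1);
        }
    }
    arsort($scores);
    return $scores;
}

$vectorRanking = [1, 3, 4]; // ids ordered by vector similarity
$textRanking   = [2, 1, 4]; // ids ordered by BM25 score

$fused = reciprocalRankFusion([$vectorRanking, $textRanking]);
echo implode(', ', array_keys($fused)), "\n"; // 1, 4, 2, 3
```

Documents that appear in both lists (1 and 4) outrank documents that top only one list, regardless of how large the underlying scores were.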

#### Weighted combination

Normalises both score ranges to [0, 1], then applies explicit weights.

```
$results = $db->hybridSearch(
    vector:       $queryVector,
    text:         'vector database PHP',
    k:            3,
    mode:         HybridMode::Weighted,
    vectorWeight: 0.7,
    textWeight:   0.3,
);
```
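
The weighted mode can be pictured as min-max normalisation followed by a weighted sum. The sketch below assumes exactly that; the README only says scores are normalised to [0, 1], so the choice of min-max scaling, the handling of documents missing from one list (treated as 0), and the helper names are all assumptions rather than the library's confirmed behaviour.

```php
/**
 * Rescale scores to [0, 1] with min-max normalisation.
 * Hypothetical helper, not part of PHPVector.
 *
 * @param array<int|string, float> $scores
 * @return array<int|string, float>
 */
function minMaxNormalise(array $scores): array
{
    if ($scores === []) {
        return [];
    }
    $min = min($scores);
    $range = max($scores) - $min;

    $normalised = [];
    foreach ($scores as $id => $score) {
        // Assumption: when every score is identical, map them all to 1.0.
        $normalised[$id] = $range > 0 ? ($score - $min) / $range : 1.0;
    }
    return $normalised;
}

/**
 * Weighted sum of two normalised score sets.
 *
 * @return array<int|string, float> id => fused score, best first
 */
function weightedFusion(array $vectorScores, array $textScores, float $vectorWeight, float $textWeight): array
{
    $vector = minMaxNormalise($vectorScores);
    $text   = minMaxNormalise($textScores);

    $fused = [];
    foreach (array_keys($vector + $text) as $id) {
        // A document missing from one list contributes 0 for that list (assumption).
        $fused[$id] = $vectorWeight * ($vector[$id] ?? 0.0)
                    + $textWeight * ($text[$id] ?? 0.0);
    }
    arsort($fused);
    return $fused;
}

$fused = weightedFusion(
    [1 => 0.99, 3 => 0.83, 2 => 0.51], // made-up vector scores
    [2 => 1.24, 1 => 0.88, 3 => 0.00], // made-up BM25 scores
    0.7,
    0.3,
);
echo implode(', ', array_keys($fused)), "\n"; // 1, 3, 2
```

Unlike RRF, this approach depends on the score distributions of both engines, which is why the weights usually need tuning per data set.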

Configuration
-------------

Both the HNSW and BM25 engines are fully configurable. Pass config objects to the `VectorDatabase` constructor.

```
use PHPVector\BM25\Config as BM25Config;
use PHPVector\BM25\SimpleTokenizer;
use PHPVector\Distance;
use PHPVector\HNSW\Config as HNSWConfig;
use PHPVector\VectorDatabase;

$db = new VectorDatabase(
    hnswConfig: new HNSWConfig(
        M:               16,    // Max connections per node per layer. Higher → better recall, more memory.
        efConstruction:  200,   // Beam width during index build. Higher → better graph quality, slower inserts.
        efSearch:        50,    // Beam width during query. Higher → better recall, slower queries.
        distance:        Distance::Cosine, // Cosine | Euclidean | DotProduct | Manhattan
        useHeuristic:    true,  // Diverse neighbour selection (recommended).
    ),
    bm25Config: new BM25Config(
        k1: 1.5,   // TF saturation. Range 1.2–2.0.
        b:  0.75,  // Length normalisation. 0 = none, 1 = full.
    ),
    tokenizer: new SimpleTokenizer(
        stopWords:      SimpleTokenizer::DEFAULT_STOP_WORDS,
        minTokenLength: 2,
    ),
);
```
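
For reference, the four options named in the `distance` comment above correspond to well-known formulas. The sketch below writes them out in plain PHP; the function names are hypothetical, and how the library turns dot product (a similarity, where larger means closer) into a distance is not described in the README.

```php
/** Sum of element-wise products. Larger means more similar. */
function dotProduct(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += $value * $b[$i];
    }
    return $sum;
}

/** Straight-line (L2) distance. */
function euclideanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += ($value - $b[$i]) ** 2;
    }
    return sqrt($sum);
}

/** Sum of absolute differences (L1). */
function manhattanDistance(array $a, array $b): float
{
    $sum = 0.0;
    foreach ($a as $i => $value) {
        $sum += abs($value - $b[$i]);
    }
    return $sum;
}

/** One minus cosine similarity; 0 means same direction. */
function cosineDistance(array $a, array $b): float
{
    $norms = sqrt(dotProduct($a, $a)) * sqrt(dotProduct($b, $b));
    // Assumption: a zero vector is treated as maximally dissimilar.
    return $norms > 0.0 ? 1.0 - dotProduct($a, $b) / $norms : 1.0;
}

$x = [1.0, 0.0];
$y = [0.0, 1.0];

echo manhattanDistance($x, $y), "\n"; // 2
echo dotProduct($x, $y), "\n";        // 0
echo cosineDistance($x, $y), "\n";    // 1
```

When both vectors already have unit length, the cosine denominator is 1, so cosine similarity reduces to a plain dot product and the two square roots can be skipped.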

### Distance metrics

| Metric | Best for |
|---|---|
| `Distance::Cosine` | Text embeddings, normalised vectors |
| `Distance::Euclidean` | Raw, unnormalized vectors |
| `Distance::DotProduct` | Unit-normalized vectors (faster than Cosine) |
| `Distance::Manhattan` | Sparse vectors, robustness to outliers |

### HNSW tuning cheat-sheet

| Goal | Knob |
|---|---|
| Better recall | Increase `efSearch` or `efConstruction` |
| Faster queries | Decrease `efSearch` |
| Less memory | Decrease `M` |
| Better graph on clustered data | Keep `useHeuristic: true` |

Persistence
-----------

PHPVector uses a **folder-based** persistence model. Each database lives in its own directory containing separate files for the HNSW graph, the BM25 index, and one file per document. This design has two key advantages:

- **Low memory footprint on load** — only the HNSW graph and BM25 index are loaded into memory. Individual document files (`docs/{n}.bin`) are read lazily, only for the documents that appear in search results.
- **Low insert latency** — document files are written to disk asynchronously in a forked child process (requires `ext-pcntl`), so `addDocument()` returns immediately.
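
The asynchronous write described in the second point can be pictured with the general fork-and-write pattern below. This is only a sketch of that pattern, not the library's code: the function names are hypothetical, `serialize()` stands in for whatever binary format the document files really use, and the synchronous fallback when `ext-pcntl` is missing is an assumption.

```php
/**
 * Write $bytes to $file in a forked child when ext-pcntl is available.
 * Hypothetical helper, not part of PHPVector.
 *
 * @return int|null Child PID, or null when the write happened synchronously.
 */
function writeInBackground(string $file, string $bytes): ?int
{
    if (!function_exists('pcntl_fork')) {
        file_put_contents($file, $bytes); // No ext-pcntl: blocking write.
        return null;
    }

    $pid = pcntl_fork();
    if ($pid === -1) {
        file_put_contents($file, $bytes); // Fork failed: blocking write.
        return null;
    }
    if ($pid === 0) {
        // Child: do the slow I/O, then exit so it never runs the parent's code.
        file_put_contents($file, $bytes);
        exit(0);
    }

    return $pid; // Parent: return immediately while the child writes.
}

/**
 * Block until every child in $pids has finished.
 *
 * @param int[] $pids
 */
function waitForWrites(array $pids): void
{
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }
}

$file = sys_get_temp_dir() . '/phpvector-sketch-' . getmypid() . '.bin';
$pid = writeInBackground($file, serialize(['id' => 1, 'text' => 'hello']));

// Before relying on the file, wait for the child (if one was started).
if ($pid !== null) {
    waitForWrites([$pid]);
}
```

The wait step is the reason a final flush has to block: index files written before the children finish could reference documents that are not on disk yet.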

### Folder layout

```
/var/data/mydb/
  meta.json       — distance metric, dimension, document ID map
  hnsw.bin        — HNSW graph (vectors + connections)
  bm25.bin        — BM25 inverted index
  docs/
    0.bin         — document 0 (id, text, metadata)
    1.bin         — document 1
    …

```

### Saving

Pass a `path` to the constructor to enable persistence. Each `addDocument()` call writes the document file to `docs/` (asynchronously when `ext-pcntl` is available). Call `save()` once to flush the HNSW graph and BM25 index — it waits for any outstanding async writes before proceeding.

```
use PHPVector\Document;
use PHPVector\VectorDatabase;

$db = new VectorDatabase(path: '/var/data/mydb');

$db->addDocuments([
    new Document(id: 1, vector: [0.12, 0.85, 0.44], text: 'PHP vector search', metadata: ['source' => 'blog']),
    new Document(id: 2, vector: [0.91, 0.23, 0.78], text: 'Approximate nearest neighbour'),
    // ... thousands more
]);

// Flush HNSW graph + BM25 index to disk (document files already written).
$db->save();
```

### Loading

Use `VectorDatabase::open()` to load a previously saved folder. Only `hnsw.bin` and `bm25.bin` are read into memory; document files are loaded on demand after search.

Pass an `HNSWConfig` with the same `distance` metric that was used when building the index; a `RuntimeException` is thrown on mismatch.

```
use PHPVector\VectorDatabase;

$db = VectorDatabase::open('/var/data/mydb');

// All three search modes work immediately.
$results = $db->vectorSearch(vector: $queryVector, k: 5);
$results = $db->textSearch(query: 'nearest neighbour', k: 5);
$results = $db->hybridSearch(vector: $queryVector, text: 'nearest neighbour', k: 5);
```

### Custom configuration on open

```
use PHPVector\BM25\Config as BM25Config;
use PHPVector\Distance;
use PHPVector\HNSW\Config as HNSWConfig;
use PHPVector\VectorDatabase;

$db = VectorDatabase::open(
    path:       '/var/data/mydb',
    hnswConfig: new HNSWConfig(
        M:        16,
        efSearch: 100,
        distance: Distance::Euclidean,  // must match the value used on save()
    ),
    bm25Config: new BM25Config(k1: 1.2, b: 0.8),
    tokenizer:  new MyCustomTokenizer(),
);
```

> **Note:** Only `efSearch` and `bm25Config`/`tokenizer` affect query-time behaviour and can differ from build time. `distance` and the graph parameters (`M`, `efConstruction`) are fixed at build time — `distance` is validated on `open()` and must match.

### Incremental updates

You can add new documents to a database that was loaded from disk, then call `save()` again. The existing document files are left in place; only the new ones are written along with updated index files.

```
$db = VectorDatabase::open('/var/data/mydb');
$db->addDocument(new Document(vector: [0.55, 0.42, 0.71], text: 'New document'));
$db->save(); // writes docs/N.bin + updated hnsw.bin, bm25.bin, meta.json
```

### Typical workflow: build once, serve many

```
// build.php — run once (or nightly)
$db = new VectorDatabase(
    hnswConfig: new HNSWConfig(M: 32, efConstruction: 400),
    path: '/var/data/mydb',
);
foreach (fetchDocumentsFromDatabase() as $doc) {
    $db->addDocument($doc);
}
$db->save();

// serve.php — loaded on every request or worker boot
$db = VectorDatabase::open('/var/data/mydb', new HNSWConfig(M: 32));
$results = $db->vectorSearch($queryVector, k: 10);
```

Multi-language stop words
-------------------------

Stop words are provided via `StopWordsProviderInterface`. Built-in providers:

```
use PHPVector\BM25\SimpleTokenizer;
use PHPVector\BM25\StopWords\EnglishStopWords;
use PHPVector\BM25\StopWords\ItalianStopWords;
use PHPVector\BM25\StopWords\FileStopWords;
use PHPVector\VectorDatabase;

// English (default)
$db = new VectorDatabase();

// Italian
$db = new VectorDatabase(
    tokenizer: new SimpleTokenizer(new ItalianStopWords()),
);

// Load from file (one word per line, # for comments)
$db = new VectorDatabase(
    tokenizer: new SimpleTokenizer(new FileStopWords('/path/to/stopwords.txt')),
);
```

### Stop words file format (`FileStopWords`)

Use a plain UTF-8 text file with one stop word per line.

Rules:

- Empty lines are ignored
- Lines starting with `#` are treated as comments
- Words are normalized to lowercase when loaded

Example (`stopwords-it.txt`):

```txt
# Italian stop words
e
di
a
che
il
la
```
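
The three rules above are simple enough to express in a few lines. The following sketch parses the contents of such a file; it is not the `FileStopWords` implementation. The function name is hypothetical, and trimming surrounding whitespace and dropping duplicates are assumptions the README does not confirm.

```php
/**
 * Parse stop words from file contents: one word per line, `#` comments,
 * empty lines skipped, words lowercased.
 * Hypothetical helper, not part of PHPVector. Requires ext-mbstring.
 *
 * @return string[]
 */
function parseStopWords(string $contents): array
{
    $words = [];
    // \R matches any line ending (LF, CRLF, CR).
    foreach (preg_split('/\R/u', $contents) ?: [] as $line) {
        $line = trim($line); // Assumption: surrounding whitespace is ignored.
        if ($line === '' || str_starts_with($line, '#')) {
            continue;
        }
        $words[] = mb_strtolower($line, 'UTF-8');
    }
    // Assumption: duplicates are dropped, first occurrence wins.
    return array_values(array_unique($words));
}

$words = parseStopWords("# Italian stop words\ne\ndi\n\nChe\ndi\n");
echo implode(', ', $words), "\n"; // e, di, che
```

To read an actual file, pass `file_get_contents('/path/to/stopwords.txt')` as the argument.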

To disable stop words entirely, pass an empty list:

```
// No stop words
$db = new VectorDatabase(
    tokenizer: new SimpleTokenizer(stopWords: []),
);
```

Available providers:
- `EnglishStopWords` - English stop words (default)
- `ItalianStopWords` - Italian stop words
- `FileStopWords` - Load from file

Custom tokenizer
----------------

Implement `TokenizerInterface` to plug in stemming, lemmatization, or any language-specific logic.

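```php
use PHPVector\BM25\TokenizerInterface;
use PHPVector\VectorDatabase;

final class PorterStemTokenizer implements TokenizerInterface
{
    /** @return string[] */
    public function tokenize(string $text): array
    {
        // Lowercase, then split on runs of whitespace (empty tokens dropped).
        $tokens = preg_split('/\s+/', mb_strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY) ?: [];

        // porter_stem() is a placeholder: substitute your own stemmer here.
        return array_map(fn (string $t): string => porter_stem($t), $tokens);
    }
}

$db = new VectorDatabase(tokenizer: new PorterStemTokenizer());
```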

Benchmark
---------

A [VectorDBBench](https://github.com/zilliztech/VectorDBBench)-style CLI benchmark lives in `benchmark/`. It measures index build throughput, serial QPS, P99 tail latency, Recall@k against brute-force ground truth, and persistence speed.

```
# Quick run (1 K and 10 K vectors, 128 dimensions)
php benchmark/benchmark.php

# Full run — save report to a file
php benchmark/benchmark.php --scenarios=xs,small,medium,large,highdim --output=report.md

# Large dataset, skip recall (brute-force would be slow)
php benchmark/benchmark.php --scenarios=large --no-recall --queries=500

# Tune HNSW parameters
php benchmark/benchmark.php --scenarios=small --ef-search=100 --m=32

# All options
php benchmark/benchmark.php --help
```

**Available scenarios**

| Key | Vectors | Dims | Notes |
|---|---|---|---|
| `xs` | 1,000 | 128 | Quick smoke test |
| `small` | 10,000 | 128 | SIFT-small scale |
| `medium` | 50,000 | 128 | SIFT-medium scale |
| `large` | 100,000 | 128 | Requires ~512 MB RAM |
| `highdim` | 10,000 | 768 | Text-embedding scale (Cohere-style) |

The report is printed as Markdown to stdout (or a file via `--output`). Progress messages go to stderr so piping works cleanly: `php benchmark/benchmark.php > report.md`.
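
Two of the reported metrics, Recall@k and P99 latency, have simple conventional definitions. The sketch below shows one common way to compute each; it is not taken from the benchmark script. The function names are hypothetical, and the nearest-rank percentile method is an assumption, since the script may use a different one.

```php
/**
 * Recall@k: the share of the exact top-k ids that the approximate
 * search also returned in its top k.
 * Hypothetical helper, not part of PHPVector.
 */
function recallAtK(array $approxIds, array $exactIds, int $k): float
{
    $approx = array_slice($approxIds, 0, $k);
    $exact  = array_slice($exactIds, 0, $k);
    if ($exact === []) {
        return 0.0;
    }
    return count(array_intersect($exact, $approx)) / count($exact);
}

/**
 * Percentile by the nearest-rank method (assumed, see above).
 *
 * @param float[] $samples e.g. per-query latencies in milliseconds
 */
function percentile(array $samples, float $p): float
{
    if ($samples === []) {
        throw new InvalidArgumentException('At least one sample is required.');
    }
    sort($samples);
    // Nearest rank is one-based: ceil(p / 100 * n).
    $rank = (int) ceil($p * count($samples) / 100);
    return (float) $samples[max(0, $rank - 1)];
}

// HNSW returned ids 1, 3, 7; brute force says the true top 3 are 1, 3, 4.
$recall = recallAtK([1, 3, 7], [1, 3, 4], 3);

// 100 made-up latencies of 1 ms to 100 ms.
$p99 = percentile(range(1, 100), 99);
echo $p99, "\n"; // 99
```

Here two of the three true neighbours were found, so `$recall` is two thirds.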

Running the tests
-----------------

```
composer install
./vendor/bin/phpunit
```

Copyright
---------

(C) 2026 by [Enrico Zimuel](https://www.zimuel.it)

