ezimuel/phpvector
=================

A vector database in PHP implementing HNSW for approximate nearest-neighbor search and BM25 for hybrid full-text + vector retrieval.

Latest release: 0.3.0 · License: MIT · Requires PHP ^8.2

[Source](https://github.com/ezimuel/PHPVector) · [Packagist](https://packagist.org/packages/ezimuel/phpvector)

PHPVector
=========


A pure-PHP vector database implementing **HNSW** (Hierarchical Navigable Small World) for approximate nearest-neighbour search and **BM25** for full-text retrieval. Both engines can be combined into a single **hybrid search** pipeline.

Requirements
------------


- PHP 8.2+
- No external PHP extensions required for core functionality
- `ext-pcntl` (optional) — enables asynchronous document writes for lower insert latency

Installation
------------


```
composer require ezimuel/phpvector
```

Quick start
-----------


### 1. Insert documents


A `Document` holds a dense embedding vector, optional raw text for BM25, and any metadata you want returned with results. The `id` field is optional — if omitted, a random UUID v4 is assigned automatically.

```
use PHPVector\Document;
use PHPVector\VectorDatabase;

$db = new VectorDatabase();

$db->addDocuments([
    new Document(
        id: 1,
        vector: [0.12, 0.85, 0.44, 0.67],
        text: 'PHP vector database with HNSW index',
        metadata: ['url' => 'https://example.com/1', 'lang' => 'en'],
    ),
    new Document(
        id: 2,
        vector: [0.91, 0.23, 0.78, 0.05],
        text: 'Approximate nearest neighbour search in PHP',
        metadata: ['url' => 'https://example.com/2', 'lang' => 'en'],
    ),
    new Document(
        id: 3,
        vector: [0.33, 0.61, 0.19, 0.88],
        text: 'BM25 full-text ranking algorithm explained',
        metadata: ['url' => 'https://example.com/3', 'lang' => 'en'],
    ),
    // No id — a UUID v4 is assigned automatically.
    new Document(
        vector: [0.55, 0.42, 0.71, 0.30],
        text: 'Hybrid search with Reciprocal Rank Fusion',
    ),
]);
```

### 2. Vector search


Find the *k* most similar documents to a query vector using HNSW.

```
$queryVector = [0.10, 0.80, 0.50, 0.60];

$results = $db->vectorSearch(vector: $queryVector, k: 2);

foreach ($results as $result) {
    echo sprintf(
        "[%d] score=%.4f  %s\n",
        $result->rank,
        $result->score,
        $result->document->metadata['url'],
    );
}
// [1] score=0.9987  https://example.com/1
// [2] score=0.8341  https://example.com/3
```
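
HNSW approximates what an exact scan over every vector would return. As a point of reference, the sketch below ranks the four quick-start vectors by brute-force cosine similarity in plain PHP. It is not PHPVector's implementation: `cosineSimilarity()` and `bruteForceTopK()` are names made up for this example, and the fourth document is given the key `4` only for brevity.

```php
/** Cosine similarity between two equal-length vectors. */
function cosineSimilarity(array $a, array $b): float
{
    $dot = $normA = $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot   += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

/** Exact top-k: score every vector, sort descending, keep the first k ids. */
function bruteForceTopK(array $vectors, array $query, int $k): array
{
    $scores = [];
    foreach ($vectors as $id => $vector) {
        $scores[$id] = cosineSimilarity($vector, $query);
    }
    arsort($scores); // highest similarity first, keys (ids) preserved
    return array_slice(array_keys($scores), 0, $k);
}

$vectors = [
    1 => [0.12, 0.85, 0.44, 0.67],
    2 => [0.91, 0.23, 0.78, 0.05],
    3 => [0.33, 0.61, 0.19, 0.88],
    4 => [0.55, 0.42, 0.71, 0.30], // hypothetical key for the id-less document
];

$topIds = bruteForceTopK($vectors, [0.10, 0.80, 0.50, 0.60], 2);
print_r($topIds); // ids 1 and 3
```

An exact scan like this costs one similarity computation per stored vector, which is the cost an HNSW index is meant to avoid on large collections.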

### 3. Full-text search


Rank documents by BM25 relevance against a text query.

```
$results = $db->textSearch(query: 'nearest neighbour PHP', k: 2);

foreach ($results as $result) {
    echo sprintf(
        "[%d] score=%.4f  %s\n",
        $result->rank,
        $result->score,
        $result->document->metadata['url'],
    );
}
// [1] score=1.2430  https://example.com/2
// [2] score=0.8761  https://example.com/1
```
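
For intuition, BM25 sums a per-term score over the query terms: an inverse document frequency weight multiplied by a term-frequency factor that saturates (controlled by `k1`) and is normalised by document length (controlled by `b`). The sketch below is a minimal plain-PHP version of that formula. It assumes a commonly used non-negative IDF variant; which variant PHPVector uses is not stated here, and `bm25Score()` is a hypothetical helper, not part of the library.

```php
/**
 * BM25 score of one document for a list of query terms.
 *
 * @param string[]   $queryTerms tokens of the query
 * @param string[]   $docTokens  tokens of the document being scored
 * @param string[][] $corpus     token lists of all documents
 */
function bm25Score(array $queryTerms, array $docTokens, array $corpus, float $k1 = 1.5, float $b = 0.75): float
{
    $docCount  = count($corpus);
    $avgLength = array_sum(array_map('count', $corpus)) / $docCount;
    $termFreqs = array_count_values($docTokens);
    $docLength = count($docTokens);

    $score = 0.0;
    foreach (array_unique($queryTerms) as $term) {
        $tf = $termFreqs[$term] ?? 0;
        if ($tf === 0) {
            continue; // term absent from this document
        }
        // Number of documents that contain the term.
        $df = count(array_filter($corpus, fn(array $tokens) => in_array($term, $tokens, true)));
        // Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)), never negative.
        $idf = log(1 + ($docCount - $df + 0.5) / ($df + 0.5));
        // Term-frequency saturation (k1) with length normalisation (b).
        $score += $idf * ($tf * ($k1 + 1)) / ($tf + $k1 * (1 - $b + $b * $docLength / $avgLength));
    }
    return $score;
}

$corpus = [
    ['php', 'vector', 'database', 'hnsw', 'index'],
    ['approximate', 'nearest', 'neighbour', 'search', 'php'],
    ['bm25', 'full-text', 'ranking', 'algorithm', 'explained'],
];
$query = ['nearest', 'neighbour', 'php'];

foreach ($corpus as $i => $tokens) {
    printf("doc %d: %.4f\n", $i, bm25Score($query, $tokens, $corpus));
}
```

The second document matches all three query terms and scores highest; the third matches none and scores zero.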

### 4. Hybrid search


Fuse vector similarity and BM25 scores into a single ranked list.

#### Reciprocal Rank Fusion (recommended)


RRF is rank-based and scale-invariant — no tuning required.

```
use PHPVector\HybridMode;

$results = $db->hybridSearch(
    vector: $queryVector,
    text:   'vector database PHP',
    k:      3,
    mode:   HybridMode::RRF,
);

foreach ($results as $result) {
    echo sprintf(
        "[%d] score=%.4f  %s\n",
        $result->rank,
        $result->score,
        $result->document->metadata['url'],
    );
}
```
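
To see why RRF needs no score normalisation, here is a minimal sketch of the fusion step in plain PHP. Each document receives `1 / (c + rank)` from every list it appears in, and the contributions are summed. The constant `c = 60` is the value conventionally used for RRF; whether PHPVector uses the same constant is an assumption, and `reciprocalRankFusion()` is a hypothetical helper.

```php
/**
 * Reciprocal Rank Fusion over several ranked lists of document ids.
 *
 * @param array[] $rankedLists each list is ordered best-first
 * @return array<int|string, float> fused score per id, best first
 */
function reciprocalRankFusion(array $rankedLists, int $c = 60): array
{
    $fused = [];
    foreach ($rankedLists as $list) {
        foreach (array_values($list) as $position => $id) {
            $rank = $position + 1; // ranks are 1-based
            $fused[$id] = ($fused[$id] ?? 0.0) + 1.0 / ($c + $rank);
        }
    }
    arsort($fused); // best fused score first
    return $fused;
}

$vectorRanking = [1, 3, 4, 2]; // ids ordered by vector similarity
$textRanking   = [2, 1, 4];    // ids ordered by BM25 score

$fused = reciprocalRankFusion([$vectorRanking, $textRanking]);
print_r(array_keys($fused)); // [1, 2, 4, 3]
```

Only ranks enter the formula, so the very different score scales of cosine similarity and BM25 never have to be reconciled.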

#### Weighted combination


Normalises both score ranges to \[0, 1\] then applies explicit weights.

```
$results = $db->hybridSearch(
    vector:       $queryVector,
    text:         'vector database PHP',
    k:            3,
    mode:         HybridMode::Weighted,
    vectorWeight: 0.7,
    textWeight:   0.3,
);
```
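
The weighted mode can be pictured as min-max normalisation followed by a weighted sum. The sketch below is one plausible reading of that description in plain PHP, not the library's code: `minMaxNormalise()` and `weightedFusion()` are hypothetical names, and treating a document that is missing from one list as scoring 0 there is an assumption.

```php
/** Rescale scores to [0, 1]; a constant list maps to 1.0 for every entry. */
function minMaxNormalise(array $scores): array
{
    if ($scores === []) {
        return [];
    }
    $min = min($scores);
    $max = max($scores);
    if ($max == $min) {
        return array_map(fn() => 1.0, $scores);
    }
    return array_map(fn($s) => ($s - $min) / ($max - $min), $scores);
}

/** Combine two id => score maps using explicit weights. */
function weightedFusion(array $vectorScores, array $textScores, float $vectorWeight, float $textWeight): array
{
    $v = minMaxNormalise($vectorScores);
    $t = minMaxNormalise($textScores);

    $fused = [];
    foreach (array_keys($v + $t) as $id) {
        // Assumption: a document missing from one list contributes 0 for that list.
        $fused[$id] = $vectorWeight * ($v[$id] ?? 0.0) + $textWeight * ($t[$id] ?? 0.0);
    }
    arsort($fused); // best combined score first
    return $fused;
}

$fused = weightedFusion(
    vectorScores: [1 => 0.99, 2 => 0.51, 3 => 0.90],
    textScores:   [1 => 0.80, 2 => 1.20],
    vectorWeight: 0.7,
    textWeight:   0.3,
);
print_r(array_keys($fused)); // [1, 3, 2]
```

Unlike RRF, the outcome depends on the chosen weights and on how the raw scores are distributed, which is why this mode needs tuning.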

Configuration
-------------


Both the HNSW and BM25 engines are fully configurable. Pass config objects to the `VectorDatabase` constructor.

```
use PHPVector\BM25\Config as BM25Config;
use PHPVector\BM25\SimpleTokenizer;
use PHPVector\Distance;
use PHPVector\HNSW\Config as HNSWConfig;
use PHPVector\VectorDatabase;

$db = new VectorDatabase(
    hnswConfig: new HNSWConfig(
        M:               16,    // Max connections per node per layer. Higher → better recall, more memory.
        efConstruction:  200,   // Beam width during index build. Higher → better graph quality, slower inserts.
        efSearch:        50,    // Beam width during query. Higher → better recall, slower queries.
        distance:        Distance::Cosine, // Cosine | Euclidean | DotProduct | Manhattan
        useHeuristic:    true,  // Diverse neighbour selection (recommended).
    ),
    bm25Config: new BM25Config(
        k1: 1.5,   // TF saturation. Range 1.2–2.0.
        b:  0.75,  // Length normalisation. 0 = none, 1 = full.
    ),
    tokenizer: new SimpleTokenizer(
        stopWords:      SimpleTokenizer::DEFAULT_STOP_WORDS,
        minTokenLength: 2,
    ),
);
```
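
The `distance` option in the configuration above decides how two vectors are compared. As a rough illustration in plain PHP (hypothetical helper functions, not the library's internals), the four measures can be written as follows. The last lines show that on unit-normalised vectors the dot product equals the cosine similarity, which is why it can stand in for cosine at lower cost.

```php
function dotProduct(array $a, array $b): float
{
    return array_sum(array_map(fn($x, $y) => $x * $y, $a, $b));
}

function euclideanDistance(array $a, array $b): float
{
    return sqrt(array_sum(array_map(fn($x, $y) => ($x - $y) ** 2, $a, $b)));
}

function manhattanDistance(array $a, array $b): float
{
    return array_sum(array_map(fn($x, $y) => abs($x - $y), $a, $b));
}

function cosine(array $a, array $b): float
{
    return dotProduct($a, $b) / (sqrt(dotProduct($a, $a)) * sqrt(dotProduct($b, $b)));
}

/** Scale a vector to length 1. */
function unitNormalise(array $v): array
{
    $length = sqrt(dotProduct($v, $v));
    return array_map(fn($x) => $x / $length, $v);
}

$a = [3.0, 4.0];
$b = [4.0, 3.0];

echo euclideanDistance($a, $b), "\n"; // sqrt(2), about 1.414
echo manhattanDistance($a, $b), "\n"; // 2
echo cosine($a, $b), "\n";            // about 0.96

// On unit-length vectors the plain dot product gives the same value as cosine.
echo dotProduct(unitNormalise($a), unitNormalise($b)), "\n"; // about 0.96
```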

### Distance metrics


| Metric | Best for |
|--------|----------|
| `Distance::Cosine` | Text embeddings, normalised vectors |
| `Distance::Euclidean` | Raw, unnormalized vectors |
| `Distance::DotProduct` | Unit-normalized vectors (faster than Cosine) |
| `Distance::Manhattan` | Sparse vectors, robustness to outliers |

### HNSW tuning cheat-sheet


| Goal | Knob |
|------|------|
| Better recall | Increase `efSearch` or `efConstruction` |
| Faster queries | Decrease `efSearch` |
| Less memory | Decrease `M` |
| Better graph on clustered data | Keep `useHeuristic: true` |

Persistence
-----------


PHPVector uses a **folder-based** persistence model. Each database lives in its own directory containing separate files for the HNSW graph, the BM25 index, and one file per document. This design has two key advantages:

- **Low memory footprint on load** — only the HNSW graph and BM25 index are loaded into memory. Individual document files (`docs/{n}.bin`) are read lazily, only for the documents that appear in search results.
- **Low insert latency** — document files are written to disk asynchronously in a forked child process (requires `ext-pcntl`), so `addDocument()` returns immediately.
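
The asynchronous write described in the second point follows a familiar fork-and-write pattern. The sketch below shows that pattern in isolation, with a synchronous fallback when `ext-pcntl` is missing. It is an illustration only: `writeInBackground()` and the temporary file name are made up for this example, and PHPVector's actual write path may differ.

```php
/**
 * Write $payload to $path, in a forked child when ext-pcntl is available.
 * Returns the child's PID, or null when the write happened synchronously.
 */
function writeInBackground(string $path, string $payload): ?int
{
    if (!function_exists('pcntl_fork')) {
        file_put_contents($path, $payload); // fallback: blocking write
        return null;
    }

    $pid = pcntl_fork();
    if ($pid === -1) {
        file_put_contents($path, $payload); // fork failed: blocking write
        return null;
    }
    if ($pid === 0) {
        file_put_contents($path, $payload); // child: write, then terminate
        exit(0);
    }
    return $pid; // parent: carries on without waiting for the write
}

$path = sys_get_temp_dir() . '/phpvector-sketch-' . getmypid() . '.bin';
$pid  = writeInBackground($path, serialize(['id' => 1, 'text' => 'hello']));

// A later flush step has to wait for outstanding children before continuing.
if ($pid !== null) {
    pcntl_waitpid($pid, $status);
}

echo file_get_contents($path), "\n";
unlink($path);
```

The parent only pays for the fork itself, so the caller gets control back before the file has reached the disk. The wait at the end is what makes a later flush safe.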

### Folder layout


```
/var/data/mydb/
  meta.json       — distance metric, dimension, document ID map
  hnsw.bin        — HNSW graph (vectors + connections)
  bm25.bin        — BM25 inverted index
  docs/
    0.bin         — document 0 (id, text, metadata)
    1.bin         — document 1
    …

```

### Saving


Pass a `path` to the constructor to enable persistence. Each `addDocument()` call writes the document file to `docs/` (asynchronously when `ext-pcntl` is available). Call `save()` once to flush the HNSW graph and BM25 index — it waits for any outstanding async writes before proceeding.

```
use PHPVector\Document;
use PHPVector\VectorDatabase;

$db = new VectorDatabase(path: '/var/data/mydb');

$db->addDocuments([
    new Document(id: 1, vector: [0.12, 0.85, 0.44], text: 'PHP vector search', metadata: ['source' => 'blog']),
    new Document(id: 2, vector: [0.91, 0.23, 0.78], text: 'Approximate nearest neighbour'),
    // ... thousands more
]);

// Flush HNSW graph + BM25 index to disk (document files already written).
$db->save();
```

### Loading


Use `VectorDatabase::open()` to load a previously saved folder. Only `hnsw.bin` and `bm25.bin` are read into memory; document files are loaded on demand after search.

If you pass an `HNSWConfig`, its `distance` metric must match the one used when building the index; a `RuntimeException` is thrown on mismatch.

```
use PHPVector\VectorDatabase;

$db = VectorDatabase::open('/var/data/mydb');

// All three search modes work immediately.
$results = $db->vectorSearch(vector: $queryVector, k: 5);
$results = $db->textSearch(query: 'nearest neighbour', k: 5);
$results = $db->hybridSearch(vector: $queryVector, text: 'nearest neighbour', k: 5);
```

### Custom configuration on open


```
use PHPVector\BM25\Config as BM25Config;
use PHPVector\Distance;
use PHPVector\HNSW\Config as HNSWConfig;
use PHPVector\VectorDatabase;

$db = VectorDatabase::open(
    path:       '/var/data/mydb',
    hnswConfig: new HNSWConfig(
        M:        16,
        efSearch: 100,
        distance: Distance::Euclidean,  // must match the value used on save()
    ),
    bm25Config: new BM25Config(k1: 1.2, b: 0.8),
    tokenizer:  new MyCustomTokenizer(),
);
```

> **Note:** Only `efSearch` and `bm25Config`/`tokenizer` affect query-time behaviour and can differ from build time. `distance` and the graph parameters (`M`, `efConstruction`) are fixed at build time — `distance` is validated on `open()` and must match.

### Incremental updates


You can add new documents to a database that was loaded from disk, then call `save()` again. The existing document files are left in place; only the new ones are written along with updated index files.

```
$db = VectorDatabase::open('/var/data/mydb');
$db->addDocument(new Document(vector: [0.55, 0.42, 0.71], text: 'New document'));
$db->save(); // writes docs/N.bin + updated hnsw.bin, bm25.bin, meta.json
```

### Typical workflow: build once, serve many


```
use PHPVector\HNSW\Config as HNSWConfig;
use PHPVector\VectorDatabase;

// build.php: run once (or nightly)
$db = new VectorDatabase(
    hnswConfig: new HNSWConfig(M: 32, efConstruction: 400),
    path: '/var/data/mydb',
);
foreach (fetchDocumentsFromDatabase() as $doc) {
    $db->addDocument($doc);
}
$db->save();

// serve.php — loaded on every request or worker boot
$db = VectorDatabase::open('/var/data/mydb', new HNSWConfig(M: 32));
$results = $db->vectorSearch($queryVector, k: 10);
```

Multi-language stop words
-------------------------


Stop words are provided via `StopWordsProviderInterface`. Built-in providers:

```
use PHPVector\BM25\SimpleTokenizer;
use PHPVector\BM25\StopWords\EnglishStopWords;
use PHPVector\BM25\StopWords\ItalianStopWords;
use PHPVector\BM25\StopWords\FileStopWords;
use PHPVector\VectorDatabase;

// English (default)
$db = new VectorDatabase();

// Italian
$db = new VectorDatabase(
    tokenizer: new SimpleTokenizer(new ItalianStopWords()),
);

// Load from file (one word per line, # for comments)
$db = new VectorDatabase(
    tokenizer: new SimpleTokenizer(new FileStopWords('/path/to/stopwords.txt')),
);

// No stop words
$db = new VectorDatabase(
    tokenizer: new SimpleTokenizer(stopWords: []),
);
```

### Stop words file format (`FileStopWords`)

Use a plain UTF-8 text file with one stop word per line.

Rules:

- Empty lines are ignored
- Lines starting with `#` are treated as comments
- Words are normalized to lowercase when loaded

Example (`stopwords-it.txt`):

```txt
# Italian stop words
e
di
a
che
il
la
```
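
The stop-word file rules are simple enough to sketch as a small parser. The function below is a hypothetical stand-in written for illustration, not the `FileStopWords` implementation. It takes the file contents as a string so that it runs without touching the disk, and it assumes `ext-mbstring` is available for lowercasing.

```php
/**
 * Parse stop words from text: one word per line, blank lines and
 * lines starting with "#" skipped, words lowercased.
 *
 * @return string[]
 */
function parseStopWords(string $contents): array
{
    $words = [];
    foreach (preg_split('/\R/u', $contents) as $line) {
        $line = trim($line);
        if ($line === '' || str_starts_with($line, '#')) {
            continue; // blank line or comment
        }
        $words[] = mb_strtolower($line, 'UTF-8');
    }
    return array_values(array_unique($words));
}

$contents = "# Italian stop words\ne\nDi\n\na\nche\n";
print_r(parseStopWords($contents)); // e, di, a, che
```

For a real file, the same function could be fed with `file_get_contents('/path/to/stopwords.txt')`.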

Available providers:

- `EnglishStopWords` - English stop words (default)
- `ItalianStopWords` - Italian stop words
- `FileStopWords` - Load from file

Deleting and updating documents
-------------------------------

```php
// Delete a document by ID
$deleted = $db->deleteDocument(1);  // returns true if found, false otherwise

// Update a document (delete + insert with same ID)
$updated = $db->updateDocument(new Document(
    id: 1,
    vector: [0.5, 0.5, 0.3, 0.2],
    text: 'Updated content here',
    metadata: ['version' => 2],
));

// After modifications, call save() to persist
$db->save();
```

Deleted documents are soft-deleted from the HNSW graph (kept for connectivity but excluded from results) and fully removed from the BM25 index. Document files are deleted from disk immediately.
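
The idea behind soft deletion can be shown on a toy graph: a deleted node is still traversed, so the nodes behind it stay reachable, but it is never reported as a result. The sketch below uses a plain breadth-first walk rather than HNSW's layered search, and every name in it is hypothetical.

```php
/** Visit every node reachable from $entry, returning only live (non-deleted) ids. */
function reachableLiveNodes(array $graph, array $deleted, int $entry): array
{
    $visited = [$entry => true];
    $queue   = [$entry];
    $live    = [];

    while ($queue !== []) {
        $node = array_shift($queue);
        if (!isset($deleted[$node])) {
            $live[] = $node; // only live nodes are reported
        }
        // Deleted nodes are still expanded, so the graph stays connected.
        foreach ($graph[$node] as $neighbour) {
            if (!isset($visited[$neighbour])) {
                $visited[$neighbour] = true;
                $queue[] = $neighbour;
            }
        }
    }
    return $live;
}

// Tiny neighbour graph: node id => neighbour ids. Node 2 bridges 1 and 3.
$graph   = [1 => [2], 2 => [1, 3], 3 => [2]];
$deleted = [2 => true]; // tombstones: ids marked as deleted

print_r(reachableLiveNodes($graph, $deleted, 1)); // [1, 3]
```

Had node 2 been physically removed, node 3 would no longer be reachable from node 1.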

Metadata filtering
------------------


Filter search results by document metadata. Filters can be combined with any search method — vector, text, or hybrid.

### Creating filters


Use the `MetadataFilter` value object. All eleven operators are supported:

```
use PHPVector\Metadata\MetadataFilter;

// Equality / inequality
$filter = MetadataFilter::eq('status', 'published');
$filter = MetadataFilter::neq('type', 'draft');

// Comparison operators
$filter = MetadataFilter::lt('price', 100);
$filter = MetadataFilter::lte('price', 100);
$filter = MetadataFilter::gt('rating', 4.0);
$filter = MetadataFilter::gte('rating', 4.0);

// Set membership
$filter = MetadataFilter::in('category', ['tech', 'science', 'engineering']);
$filter = MetadataFilter::notIn('status', ['deleted', 'archived']);

// Array containment — checks if metadata array contains the value
$filter = MetadataFilter::contains('tags', 'php');  // matches ['tags' => ['php', 'vector']]

// Existence checks — does a metadata key exist (regardless of value)?
$filter = MetadataFilter::exists('thumbnail');
$filter = MetadataFilter::notExists('deleted_at');
```

### Filtering search results


Pass filters to any search method. Multiple filters are ANDed together by default.

```
use PHPVector\Metadata\MetadataFilter;

// Vector search with filters
$results = $db->vectorSearch(
    vector: $queryVector,
    k: 10,
    filters: [
        MetadataFilter::eq('lang', 'en'),
        MetadataFilter::gt('year', 2020),
    ],
);

// Text search with filters
$results = $db->textSearch(
    query: 'machine learning',
    k: 10,
    filters: [
        MetadataFilter::in('category', ['tech', 'science']),
    ],
);

// Hybrid search with filters
$results = $db->hybridSearch(
    vector: $queryVector,
    text: 'machine learning',
    k: 10,
    filters: [
        MetadataFilter::eq('status', 'published'),
    ],
);
```

### OR groups (nested arrays)


Wrap filters in a nested array to create OR groups. Filters at the top level are ANDed; filters inside a nested array are ORed.

```
// (category = 'tech' OR category = 'science') AND status = 'published'
$results = $db->vectorSearch(
    vector: $queryVector,
    k: 10,
    filters: [
        [
            MetadataFilter::eq('category', 'tech'),
            MetadataFilter::eq('category', 'science'),
        ],  // OR group
        MetadataFilter::eq('status', 'published'),  // ANDed with the OR group
    ],
);
```
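
The AND/OR rule can be made concrete with a small evaluator. In this sketch filters are plain closures rather than `MetadataFilter` objects, and `$eq` is a hypothetical stand-in for `MetadataFilter::eq()`. It only illustrates the nesting semantics described above.

```php
/**
 * Evaluate a filter list against one metadata array.
 * Top-level entries are ANDed; a nested array is an OR group.
 * Each filter here is a closure: fn(array $metadata): bool.
 */
function matchesFilters(array $metadata, array $filters): bool
{
    foreach ($filters as $filter) {
        if (is_array($filter)) {
            $anyMatch = false;
            foreach ($filter as $alternative) {
                if ($alternative($metadata)) {
                    $anyMatch = true;
                    break;
                }
            }
            if (!$anyMatch) {
                return false; // no alternative in the OR group matched
            }
        } elseif (!$filter($metadata)) {
            return false; // a top-level (ANDed) filter failed
        }
    }
    return true;
}

// Hypothetical stand-in for MetadataFilter::eq(), using strict comparison.
$eq = fn(string $key, mixed $value) => fn(array $m) => array_key_exists($key, $m) && $m[$key] === $value;

$filters = [
    [$eq('category', 'tech'), $eq('category', 'science')], // OR group
    $eq('status', 'published'),                            // ANDed with the group
];

var_dump(matchesFilters(['category' => 'science', 'status' => 'published'], $filters)); // bool(true)
var_dump(matchesFilters(['category' => 'science', 'status' => 'draft'], $filters));     // bool(false)
var_dump(matchesFilters(['category' => 'art', 'status' => 'published'], $filters));     // bool(false)
```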

### Over-fetching for filtered queries


When filters are applied, the search may need to examine more candidates than `k` to find enough matching documents. By default, the search fetches `k * 5` candidates, then filters. You can tune this:

```
// Fetch 10× candidates before filtering (useful when filters are very selective)
$results = $db->vectorSearch(
    vector: $queryVector,
    k: 10,
    filters: [MetadataFilter::eq('rare_tag', 'value')],
    overFetch: 10,
);

// Or set the default multiplier at construction time
$db = new VectorDatabase(
    overFetchMultiplier: 10,
);
```

> **Note:** Filtered queries may return fewer than `k` results if not enough documents match.
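
The effect of the multiplier is easy to reproduce with a post-filtering sketch: take the best `k * overFetch` candidates, drop those that fail the filter, and keep at most `k`. The helper below is hypothetical and works on a plain ranked list of ids. It mirrors the behaviour described above, not the library's code.

```php
/**
 * @param array    $ranked    candidate ids ordered best-first
 * @param callable $predicate returns true for candidates that pass the filters
 */
function filteredTopK(array $ranked, callable $predicate, int $k, int $overFetch = 5): array
{
    $candidates = array_slice($ranked, 0, $k * $overFetch); // over-fetch
    $matching   = array_filter($candidates, $predicate);     // apply filters
    return array_slice(array_values($matching), 0, $k);      // keep at most k
}

$ranked = range(1, 100);                 // stand-in for ids ordered by similarity
$isRare = fn(int $id) => $id % 12 === 0; // a selective filter: 12, 24, 36, ...

print_r(filteredTopK($ranked, $isRare, k: 3));                // [12]
print_r(filteredTopK($ranked, $isRare, k: 3, overFetch: 20)); // [12, 24, 36]
```

With the default multiplier only 15 candidates are examined and a single one survives the filter. Raising the multiplier to 20 examines 60 candidates and fills all three slots.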

### Updating metadata


Update metadata on existing documents without re-indexing vectors or text:

```
// Add or update metadata keys
$db->patchMetadata(id: 1, patch: [
    'status' => 'archived',
    'updated_at' => '2026-03-24',
]);

// Remove metadata keys by setting to null
$db->patchMetadata(id: 1, patch: [
    'deprecated_field' => null,  // key will be removed
]);

// patchMetadata returns false if document not found
if (!$db->patchMetadata(id: 999, patch: ['key' => 'value'])) {
    echo "Document not found\n";
}
```

The `patchMetadata()` method:

- Merges the patch into the existing metadata (existing keys are preserved unless overwritten)
- Does NOT touch the HNSW or BM25 indexes (fast, metadata-only operation)
- Persists immediately when the database has a path configured
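
The merge rule amounts to a few lines of array handling. This is a sketch of the semantics only: `applyMetadataPatch()` is a hypothetical helper, not the method itself.

```php
/** Merge $patch into $metadata; a null value removes the key. */
function applyMetadataPatch(array $metadata, array $patch): array
{
    foreach ($patch as $key => $value) {
        if ($value === null) {
            unset($metadata[$key]); // null means "remove this key"
        } else {
            $metadata[$key] = $value; // add or overwrite
        }
    }
    return $metadata;
}

$metadata = ['status' => 'published', 'lang' => 'en', 'deprecated_field' => 'x'];
$patched  = applyMetadataPatch($metadata, [
    'status'           => 'archived',
    'deprecated_field' => null,
]);
print_r($patched); // ['status' => 'archived', 'lang' => 'en']
```

One consequence of this convention is that a metadata value cannot be set to `null` through a patch.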

### Metadata-only search


Query documents by metadata alone, without a vector or text query:

```
use PHPVector\Metadata\SortDirection;

// Find all documents matching filters
$results = $db->metadataSearch(
    filters: [MetadataFilter::eq('status', 'published')],
);

// With limit
$results = $db->metadataSearch(
    filters: [MetadataFilter::gt('year', 2020)],
    limit: 100,
);

// With sorting by metadata key
$results = $db->metadataSearch(
    filters: [MetadataFilter::eq('status', 'published')],
    sortBy: 'created_at',
    sortDirection: SortDirection::Desc,
);

// Empty filters returns all documents
$allDocs = $db->metadataSearch(filters: [], limit: 50);
```

> **Note:** Documents missing the `sortBy` key are placed at the end of results. All results have `score = 1.0` (no ranking).
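
The "missing keys go last" ordering can be expressed as a comparator. The sketch below represents each document as a plain array with `id` and `metadata` entries, which is an assumption made for this example, and `sortByMetadata()` is a hypothetical helper.

```php
/**
 * Sort documents by a metadata key; documents without the key go last.
 * Each document is represented here as ['id' => ..., 'metadata' => [...]].
 */
function sortByMetadata(array $documents, string $sortBy, bool $descending = false): array
{
    usort($documents, function (array $a, array $b) use ($sortBy, $descending): int {
        $hasA = array_key_exists($sortBy, $a['metadata']);
        $hasB = array_key_exists($sortBy, $b['metadata']);
        if (!$hasA || !$hasB) {
            return $hasB <=> $hasA; // the document that has the key sorts first
        }
        $order = $a['metadata'][$sortBy] <=> $b['metadata'][$sortBy];
        return $descending ? -$order : $order;
    });
    return $documents;
}

$documents = [
    ['id' => 1, 'metadata' => ['created_at' => '2026-01-10']],
    ['id' => 2, 'metadata' => []],
    ['id' => 3, 'metadata' => ['created_at' => '2026-03-24']],
];

$sorted = sortByMetadata($documents, 'created_at', descending: true);
print_r(array_column($sorted, 'id')); // [3, 1, 2]
```

The document without `created_at` ends up last regardless of the sort direction.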

### Strict type comparison


Metadata filtering uses **strict type comparison** (PHP `===`). This means:

- String `'5'` does NOT match integer `5`
- Float `1.0` does NOT match integer `1`

```
// Document with metadata: ['year' => 2024] (integer)
MetadataFilter::eq('year', 2024);    // ✓ matches
MetadataFilter::eq('year', '2024');  // ✗ does not match (string vs int)

// Document with metadata: ['rating' => 4.5] (float)
MetadataFilter::gt('rating', 4);     // ✓ matches (4.5 > 4)
MetadataFilter::eq('rating', 4.5);   // ✓ matches
MetadataFilter::eq('rating', '4.5'); // ✗ does not match (string vs float)
```

Custom tokenizer
----------------


Implement `TokenizerInterface` to plug in stemming, lemmatization, or any language-specific logic.

```
use PHPVector\BM25\TokenizerInterface;

final class PorterStemTokenizer implements TokenizerInterface
{
    public function tokenize(string $text): array
    {
        $tokens = preg_split('/\s+/', mb_strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
        return array_map(fn($t) => porter_stem($t), $tokens); // your stemmer here
    }
}

$db = new VectorDatabase(tokenizer: new PorterStemTokenizer());
```

Benchmark
---------


A [VectorDBBench](https://github.com/zilliztech/VectorDBBench)-style CLI benchmark lives in `benchmark/`. It measures index build throughput, serial QPS, P99 tail latency, Recall@k against brute-force ground truth, and persistence speed.

```
# Quick run (1 K and 10 K vectors, 128 dimensions)
php benchmark/benchmark.php

# Full run — save report to a file
php benchmark/benchmark.php --scenarios=xs,small,medium,large,highdim --output=report.md

# Large dataset, skip recall (brute-force would be slow)
php benchmark/benchmark.php --scenarios=large --no-recall --queries=500

# Tune HNSW parameters
php benchmark/benchmark.php --scenarios=small --ef-search=100 --m=32

# All options
php benchmark/benchmark.php --help
```

**Available scenarios**

| Key | Vectors | Dims | Notes |
|-----|---------|------|-------|
| `xs` | 1,000 | 128 | Quick smoke test |
| `small` | 10,000 | 128 | SIFT-small scale |
| `medium` | 50,000 | 128 | SIFT-medium scale |
| `large` | 100,000 | 128 | Requires ~512 MB RAM |
| `highdim` | 10,000 | 768 | Text-embedding scale (Cohere-style) |

The report is printed as Markdown to stdout (or a file via `--output`). Progress messages go to stderr so piping works cleanly: `php benchmark/benchmark.php > report.md`.
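
Two of the reported metrics, Recall@k and P99 latency, are easy to misread, so here is a minimal sketch of how they are commonly computed. The benchmark script's own definitions may differ in detail. Both helpers are hypothetical, and the percentile uses the nearest-rank method.

```php
/** Fraction of the exact top-k ids that the approximate search also returned. */
function recallAtK(array $approxIds, array $exactIds): float
{
    if ($exactIds === []) {
        return 0.0;
    }
    return count(array_intersect($exactIds, $approxIds)) / count($exactIds);
}

/** Nearest-rank percentile of a list of measurements. */
function percentile(array $values, float $p): float
{
    sort($values);
    $rank = (int) ceil($p * count($values) / 100); // 1-based nearest rank
    return (float) $values[max($rank, 1) - 1];
}

// The approximate search found 3 of the 4 true nearest neighbours.
echo recallAtK([7, 3, 9, 1], [3, 7, 1, 5]), "\n"; // 0.75

$latenciesMs = range(1, 100);              // stand-in for measured query latencies
echo percentile($latenciesMs, 99), "\n";   // 99
```

Recall needs the exact neighbours as ground truth, which is why skipping it (`--no-recall`) saves time on large scenarios.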

Running the tests
-----------------


```
composer install
./vendor/bin/phpunit
```

Copyright
---------


(C) 2026 by [Enrico Zimuel](https://www.zimuel.it)
