
illuma-law/laravel-semantic-deduper
===================================

Blazing-fast, generic semantic deduplication for Laravel collections and arrays.

v0.1.4 · MIT · PHP ^8.3

[Source](https://github.com/illuma-law/laravel-semantic-deduper) · [Packagist](https://packagist.org/packages/illuma-law/laravel-semantic-deduper)

Laravel Semantic Deduper
========================


[![Run Tests](https://github.com/illuma-law/laravel-semantic-deduper/actions/workflows/run-tests.yml/badge.svg)](https://github.com/illuma-law/laravel-semantic-deduper/actions/workflows/run-tests.yml)[![License: MIT](https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667)](https://opensource.org/licenses/MIT)


When dealing with search results or building context windows for LLMs (RAG pipelines), you often encounter near-duplicate content (e.g., an article snippet and its summary). Feeding redundant data to an LLM wastes tokens and degrades output quality.

This package uses an optimized Cosine Similarity algorithm to identify and remove these "near-duplicates" based on their vector embeddings, ensuring your final dataset is highly diverse and relevant.
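As a point of reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a minimal sketch of that formula; `cosineSimilaritySketch` is a hypothetical name used only for illustration and is not the package's internal implementation, which is described as more heavily optimized.

```php
<?php

/**
 * Cosine similarity between two equal-length embedding vectors.
 * Illustrative sketch only; not the package's own code.
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilaritySketch(array $a, array $b): float
{
    $n = count($a);
    if ($n === 0 || $n !== count($b)) {
        return 0.0; // Assumption: empty or mismatched vectors count as dissimilar
    }

    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dot += $a[$i] * $b[$i];
        $normA += $a[$i] * $a[$i];
        $normB += $b[$i] * $b[$i];
    }

    if ($normA === 0.0 || $normB === 0.0) {
        return 0.0; // A zero vector has no direction to compare
    }

    return $dot / sqrt($normA * $normB);
}

// Two nearly parallel vectors score close to 1.0 (roughly 0.999 here).
$similarity = cosineSimilaritySketch([0.1, 0.2, 0.3], [0.11, 0.19, 0.31]);
printf("%.4f\n", $similarity);
```

A score near 1.0 means the two embeddings point in almost the same direction, which is what the package treats as a near-duplicate.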

Features
--------


- **Blazing Fast**: The core cosine similarity math is highly optimized for PHP (strict float typing, loop unrolling, early returns).
- **Generic Structure**: Works with associative arrays or Laravel Collections. Each item only needs an embedding vector.
- **Fluent Builder API**: Configure thresholds, group limits, and grouping logic on the fly.
- **Clean DTOs**: Returns your deduplicated data wrapped in structured, type-safe Data Transfer Objects (`GroupedContext`, `ContextGroup`, `ContextItem`).
- **Flexible Grouping**: Group results by categories, sources, or document types before deduplicating them.
- **Fuzzy String Deduplication**: New utilities for scoring string similarity (Levenshtein/SimilarText) and a fluent Collection macro.
- **Text Chunking**: Optimized sliding-window chunker for RAG pipelines.
- **Data Merging**: Utilities for deep merging associative arrays and absorbing fields between records.

Installation
------------


You can install the package via composer:

```
composer require illuma-law/laravel-semantic-deduper
```

Publish the configuration file:

```
php artisan vendor:publish --tag="semantic-deduper-config"
```

Configuration
-------------


The published `config/semantic-deduper.php` file defines global fallback defaults:

```
return [
    // Maximum items to keep per group
    'max_per_group' => 3,

    // Hard limit on total items across all groups combined
    'max_total' => 12,

    // The cosine similarity threshold.
    // 1.0 means the vectors point in the same direction; values close to 1.0 mean very similar.
    // If a new item's similarity to an already accepted item is >= this threshold, it is dropped.
    'near_duplicate_threshold' => 0.92,
];
```
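To make the interaction between `near_duplicate_threshold` and `max_per_group` concrete, here is a rough sketch of a greedy filter over a single group: an item is kept only if it stays below the threshold against every item already kept, and the group stops growing at the cap. The function names `sketchCosine` and `sketchDedupeGroup` are hypothetical, and the package's real control flow (including how `max_total` is enforced across groups) may differ.

```php
<?php

/** Cosine similarity helper for this sketch (assumes equal-length, non-zero vectors). */
function sketchCosine(array $a, array $b): float
{
    $dot = $na = $nb = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $na += $value * $value;
        $nb += $b[$i] * $b[$i];
    }

    return $dot / sqrt($na * $nb);
}

/**
 * Greedy near-duplicate filter for one group.
 * Keeps an item only if it is below $threshold against every kept item.
 */
function sketchDedupeGroup(array $items, float $threshold, int $maxPerGroup): array
{
    $kept = [];
    foreach ($items as $item) {
        if (count($kept) >= $maxPerGroup) {
            break; // Group cap reached
        }
        foreach ($kept as $accepted) {
            if (sketchCosine($item['embedding'], $accepted['embedding']) >= $threshold) {
                continue 2; // Near-duplicate of an accepted item: skip it
            }
        }
        $kept[] = $item;
    }

    return $kept;
}

$group = [
    ['id' => 1, 'embedding' => [1.0, 0.0]],
    ['id' => 2, 'embedding' => [0.99, 0.05]], // Nearly parallel to #1
    ['id' => 3, 'embedding' => [0.0, 1.0]],   // Orthogonal to #1
];

// Using the default values from the config file above
$keptIds = array_column(sketchDedupeGroup($group, 0.92, 3), 'id');
print_r($keptIds);
```

In this sketch, item 2 is dropped because its similarity to item 1 is well above 0.92, while item 3 is kept because it is orthogonal to item 1.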

Usage & Integration
-------------------


### Basic Usage


Use the `SemanticClusterer` to group and deduplicate results in one go.

```
use IllumaLaw\SemanticDeduper\SemanticClusterer;

$results = [
    ['id' => 1, 'source' => 'web', 'embedding' => [0.1, 0.2, 0.3]],
    ['id' => 2, 'source' => 'web', 'embedding' => [0.11, 0.19, 0.31]], // Very similar to #1
    ['id' => 3, 'source' => 'pdf', 'embedding' => [0.9, 0.8, -0.1]],
];

$grouped = SemanticClusterer::make()
    ->groupBy('source')
    ->threshold(0.95) // Drop items whose cosine similarity to a kept item is >= 0.95
    ->maxPerGroup(2)
    ->cluster($results);

foreach ($grouped->groups as $group) {
    echo "Group: " . $group->label . "\n";
    foreach ($group->items as $item) {
        // Items #1 and #3 are kept; #2 is dropped as a near-duplicate of #1
        echo " - Item ID: " . $item->get('id') . "\n";
    }
}
```

### Advanced Configuration


The fluent API allows deep customization of how the data is grouped and evaluated.

```
use IllumaLaw\SemanticDeduper\SemanticClusterer;

$grouped = SemanticClusterer::make()
    ->maxPerGroup(5)
    ->maxTotal(20)
    ->threshold(0.90)
    ->embeddingKey('vector_data') // Tell it where your embeddings live (default: 'embedding')
    ->idKey('uuid')               // Default: 'id'
    ->groupBy(function ($row) {
        // Complex grouping closure
        return $row['category'] . '-' . $row['type'];
    })
    ->cluster($collection);
```

### Working with the DTOs

The `cluster()` method returns a `GroupedContext` object, which provides helpers for extracting the final dataset:

```
// Check if all data was dropped
if ($grouped->isEmpty()) {
    // ...
}

// Get the final count of retained items
$total = $grouped->totalCount();

// Get a flat array of all retained ContextItem objects across all groups
$items = $grouped->allItems();

// Commonly used in RAG: Extract just the IDs so you can fetch the real Eloquent models
$modelIds = $grouped->collectIdentifiers('id');

$models = Article::whereIn('id', $modelIds)->get();
```

Utilities
---------


### Fuzzy Collection Deduplication


This package adds a `dedupeFuzzy` macro to Laravel Collections:

```
$collection = collect([
    ['name' => 'John Doe'],
    ['name' => 'Jon Doe'], // fuzzy duplicate
    ['name' => 'Jane Smith'],
]);

$deduplicated = $collection->dedupeFuzzy('name', threshold: 85.0);
// Result contains 'John Doe' and 'Jane Smith'
```
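The idea behind the macro can be approximated in plain PHP with the built-in `similar_text()` function, which reports a similarity percentage. The sketch below keeps the first occurrence and drops later rows that score at or above the threshold; `sketchDedupeFuzzy` is a hypothetical helper, and the macro's actual scoring and tie-breaking rules are assumptions here.

```php
<?php

/**
 * Keep the first row of each fuzzy-duplicate cluster, comparing one string field.
 * Illustrative sketch only; the package's macro may score strings differently.
 */
function sketchDedupeFuzzy(array $rows, string $key, float $threshold): array
{
    $kept = [];
    foreach ($rows as $row) {
        foreach ($kept as $accepted) {
            // similar_text() writes a 0-100 similarity percentage into $percent
            similar_text((string) $row[$key], (string) $accepted[$key], $percent);
            if ($percent >= $threshold) {
                continue 2; // Fuzzy duplicate of a row we already kept
            }
        }
        $kept[] = $row;
    }

    return $kept;
}

$rows = [
    ['name' => 'John Doe'],
    ['name' => 'Jon Doe'],
    ['name' => 'Jane Smith'],
];

$names = array_column(sketchDedupeFuzzy($rows, 'name', 85.0), 'name');
print_r($names);
```

Because the first match wins, the order of the input determines which spelling survives.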

### String Similarity


```
use IllumaLaw\SemanticDeduper\Utils\StringSimilarity;

$score = StringSimilarity::score('Laravel', 'Laraval'); // ~85.7
$lev   = StringSimilarity::levenshteinScore('Laravel', 'Laraval');
```
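For intuition about where a score like ~85.7 comes from, the two metrics can be reproduced with PHP's built-in `similar_text()` and `levenshtein()` functions. The normalization used for the Levenshtein score below (one minus the edit distance divided by the longer string's length, as a percentage) is an assumption for illustration, and `sketchLevenshteinScore` is a hypothetical name; the package may normalize differently.

```php
<?php

/**
 * Turn a Levenshtein edit distance into a 0-100 similarity score.
 * Assumption: normalize by the byte length of the longer string.
 */
function sketchLevenshteinScore(string $a, string $b): float
{
    $maxLength = max(strlen($a), strlen($b));
    if ($maxLength === 0) {
        return 100.0; // Two empty strings are identical
    }

    return (1 - levenshtein($a, $b) / $maxLength) * 100;
}

// similar_text() reports its own percentage through the third argument
similar_text('Laravel', 'Laraval', $similarTextPercent);
$levenshteinPercent = sketchLevenshteinScore('Laravel', 'Laraval');

printf("%.1f %.1f\n", $similarTextPercent, $levenshteinPercent); // prints "85.7 85.7"
```

Both metrics agree on this pair because the strings differ by a single substituted character out of seven.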

### Text Chunking


Optimized for creating overlapping chunks for vector embeddings.

```
use IllumaLaw\SemanticDeduper\Utils\TextChunker;

$chunks = TextChunker::chunk($longText, chunkSize: 500, overlap: 50);
```
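A sliding-window chunker advances by `chunkSize - overlap` on each step, so consecutive chunks share `overlap` units at their boundary. The sketch below measures size in characters, which is an assumption (the package may count words or tokens instead), and `sketchChunk` is a hypothetical name.

```php
<?php

/**
 * Split text into overlapping, character-based windows.
 * Illustrative sketch only; not the package's TextChunker.
 *
 * @return list<string>
 */
function sketchChunk(string $text, int $chunkSize, int $overlap): array
{
    if ($chunkSize <= 0 || $overlap < 0 || $overlap >= $chunkSize) {
        throw new InvalidArgumentException('Require chunkSize > 0 and 0 <= overlap < chunkSize.');
    }

    // Split into UTF-8 characters so multibyte text is not cut mid-character
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Text must be valid UTF-8.');
    }

    $chunks = [];
    $length = count($chars);
    $step = $chunkSize - $overlap;
    for ($start = 0; $start < $length; $start += $step) {
        $chunks[] = implode('', array_slice($chars, $start, $chunkSize));
        if ($start + $chunkSize >= $length) {
            break; // The last window already reached the end of the text
        }
    }

    return $chunks;
}

$chunks = sketchChunk('abcdefghij', chunkSize: 4, overlap: 1);
print_r($chunks); // ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the overlap that keeps context from being cut at a boundary.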

### Data Merging


```
use IllumaLaw\SemanticDeduper\Utils\DataMerger;

$merged = DataMerger::deepMerge($canonicalArray, $duplicateArray);

// Identify which fields in $duplicate can fill blanks in $canonical
$updates = DataMerger::identifyAbsorbableUpdates($canonical, $duplicate, ['email', 'phone']);
```
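One plausible reading of "absorbable" is a field that is blank in the canonical record but filled in the duplicate. The sketch below implements that reading; `sketchAbsorbableUpdates` is a hypothetical helper, and treating `null` and the empty string as blank is an assumption rather than the documented behavior of `DataMerger`.

```php
<?php

/**
 * Return values from $duplicate that can fill blank fields in $canonical.
 * Assumption: "blank" means null or an empty string.
 */
function sketchAbsorbableUpdates(array $canonical, array $duplicate, array $fields): array
{
    $updates = [];
    foreach ($fields as $field) {
        $current = $canonical[$field] ?? null;
        $candidate = $duplicate[$field] ?? null;

        $currentBlank = $current === null || $current === '';
        $candidateBlank = $candidate === null || $candidate === '';

        if ($currentBlank && !$candidateBlank) {
            $updates[$field] = $candidate; // Fill the gap; never overwrite existing data
        }
    }

    return $updates;
}

$canonical = ['name' => 'John Doe', 'email' => '', 'phone' => '555-0100'];
$duplicate = ['name' => 'Jon Doe', 'email' => 'john@example.com', 'phone' => '555-0199'];

$updates = sketchAbsorbableUpdates($canonical, $duplicate, ['email', 'phone']);
print_r($updates); // ['email' => 'john@example.com']
```

The phone number is not absorbed because the canonical record already has one, so only genuinely missing data is carried over.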

Performance Note
----------------

While this package is highly optimized for PHP execution, comparing items by cosine similarity in memory is O(N²) in the size of each group. It is designed to run efficiently on result sets of up to a few thousand items (e.g., the raw output of a search engine query before sending to an LLM). Do not attempt to run it against your entire database.
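To get a feel for the quadratic growth, the snippet below counts the comparisons needed if every item in a group were compared against every other item, which is N(N-1)/2. This is a worst-case estimate under that assumption; the package's early exits and per-group limits may reduce the real number. `sketchWorstCaseComparisons` is a hypothetical helper.

```php
<?php

/** Number of unordered pairs among $n items: n * (n - 1) / 2. */
function sketchWorstCaseComparisons(int $n): int
{
    return intdiv($n * ($n - 1), 2);
}

foreach ([100, 1000, 5000] as $n) {
    printf("%5d items -> %d comparisons\n", $n, sketchWorstCaseComparisons($n));
}
```

Going from 1,000 to 5,000 items multiplies the pair count by about 25, which is why the input should be a pre-filtered result set rather than a whole table.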

Testing
-------


The package includes a comprehensive Pest test suite covering edge cases and mathematical precision.

```
composer test
```

License
-------


The MIT License (MIT). Please see [License File](LICENSE) for more information.


Maintainer: [miguelenes](https://github.com/miguelenes)

