serafim/tf-idf
==============

Library to calculate TF-IDF (Term Frequency - Inverse Document Frequency) for generic documents

 [![PHP 8.1+](https://camo.githubusercontent.com/519722ad41f806e38945ccda1b8d361f1b937fb63680de23a8e9d254f4e9ae76/68747470733a2f2f706f7365722e707567782e6f72672f7365726166696d2f74662d6964662f726571756972652f7068703f7374796c653d666f722d7468652d6261646765)](https://packagist.org/packages/serafim/tf-idf) [![Latest Stable Version](https://camo.githubusercontent.com/515aa50949ad75fb082b7c96e8a1b5be9bdd555897f56067bc1166c797f3eb73/68747470733a2f2f706f7365722e707567782e6f72672f7365726166696d2f74662d6964662f76657273696f6e3f7374796c653d666f722d7468652d6261646765)](https://packagist.org/packages/serafim/tf-idf) [![Latest Unstable Version](https://camo.githubusercontent.com/235f62156a1c84790ea31ae67b59ce76714641c0d35ad6a12f4a3661ce296a5e/68747470733a2f2f706f7365722e707567782e6f72672f7365726166696d2f74662d6964662f762f756e737461626c653f7374796c653d666f722d7468652d6261646765)](https://packagist.org/packages/serafim/tf-idf) [![Total Downloads](https://camo.githubusercontent.com/5ee7cc58e44e404b8bcb418c10b43784a190353c5b632b903e078c1ccd0243a9/68747470733a2f2f706f7365722e707567782e6f72672f7365726166696d2f74662d6964662f646f776e6c6f6164733f7374796c653d666f722d7468652d6261646765)](https://packagist.org/packages/serafim/tf-idf) [![License MIT](https://camo.githubusercontent.com/a51aa97b0ae38c8f58098119da4d7b0d103a15362fbac847270246d15e26bafe/68747470733a2f2f706f7365722e707567782e6f72672f7365726166696d2f74662d6964662f6c6963656e73653f7374796c653d666f722d7468652d6261646765)](https://raw.githubusercontent.com/SerafimArts/TF-IDF/master/LICENSE.md)

 [![](https://github.com/SerafimArts/TF-IDF/workflows/tests/badge.svg)](https://github.com/SerafimArts/TF-IDF/actions)

Introduction
------------

TF-IDF is an information retrieval method used to rank the importance of words in a document. It is based on the idea that a word is more relevant to a document the more often it appears in that document and the less often it appears in the other documents of the collection.

TF-IDF is the product of Term Frequency and Inverse Document Frequency:

```
TF-IDF = Term Frequency (TF) * Inverse Document Frequency (IDF)

```

### Term Frequency

Term frequency is the ratio of the number of occurrences of a word to the total number of words in the document. It measures the importance of the word $t$ within a single document $d$:

$\mathrm{tf}(t, d) = \frac{n_t}{\sum_k n_k}$

where $n_t$ is the number of occurrences of the word $t$ in the document, and the denominator is the total number of words in the document.
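
The ratio can be sketched in plain PHP as follows. This is only an illustration of the formula, not the library's implementation; `termFrequency()` is a hypothetical helper and the document is assumed to be already split into words.

```php
<?php

declare(strict_types=1);

/**
 * Term frequency: occurrences of $term divided by the total number of words.
 *
 * @param list<string> $words Words of a single, already tokenized document.
 */
function termFrequency(string $term, array $words): float
{
    if ($words === []) {
        return 0.0; // An empty document has no term frequencies.
    }

    // Count how many times each distinct word occurs: [word => n].
    $counts = \array_count_values($words);

    return ($counts[$term] ?? 0) / \count($words);
}

$words = ['php', 'is', 'fast', 'php', 'is', 'fun'];

echo termFrequency('php', $words), \PHP_EOL;  // 2 of 6 words
echo termFrequency('java', $words), \PHP_EOL; // absent word
```

Here "php" accounts for 2 of the 6 words, so its term frequency is one third, while a word that never occurs gets a frequency of zero.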

### Inverse Document Frequency

IDF is the inverse of the frequency with which a word occurs across the documents of the collection. The concept was introduced by [Karen Spärck Jones](https://en.wikipedia.org/wiki/Karen_Sp%C3%A4rck_Jones). Accounting for IDF reduces the weight of commonly used words. Each unique word has exactly one IDF value within a given collection of documents.

$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{\, d_i \in D \mid t \in d_i \,\}|}$

where

- $|D|$: the number of documents in the collection;
- $|\{\, d_i \in D \mid t \in d_i \,\}|$: the number of documents in collection $D$ where $t$ occurs (that is, where $n_t \neq 0$).

The choice of the base of the logarithm in the formula does not matter, since changing the base changes the weight of each word by a constant factor, which does not affect the weight ratio.
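
A small sketch of the formula, including a check that the base only rescales the weights. `inverseDocumentFrequency()` and the corpus numbers are made up for illustration and are not part of the library.

```php
<?php

declare(strict_types=1);

/**
 * Inverse document frequency: log(|D| / df), where df is the number of
 * documents that contain the term.
 */
function inverseDocumentFrequency(int $documents, int $df, float $base = \M_E): float
{
    if ($documents < 1 || $df < 1 || $df > $documents) {
        throw new \InvalidArgumentException('Expected 1 <= df <= |D|');
    }

    return \log($documents / $df, $base);
}

// A corpus of 8 documents: "rare" occurs in 1 of them, "common" in 4.
$rareNatural   = inverseDocumentFrequency(8, 1);      // ln(8)
$commonNatural = inverseDocumentFrequency(8, 4);      // ln(2)
$rareBinary    = inverseDocumentFrequency(8, 1, 2.0); // log2(8)
$commonBinary  = inverseDocumentFrequency(8, 4, 2.0); // log2(2)

// The ratio between the two weights is the same for both bases.
echo $rareNatural / $commonNatural, \PHP_EOL;
echo $rareBinary / $commonBinary, \PHP_EOL;
```

Switching from the natural logarithm to base 2 multiplies every IDF value by the same constant, so the rare word stays three times heavier than the common one in both cases.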

Thus, the TF-IDF measure is the product of two factors:

$\mathrm{tf\text{-}idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)$

High weight in TF-IDF will be given to words with high frequency within a particular document and low frequency in other documents.

Installation
------------

The library is available as a Composer package and can be installed by running the following command in the root of your project:

```
$ composer require serafim/tf-idf
```

Quick Start
-----------

Getting information about words:

```
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$vectorizer->addFile(__DIR__ . '/path/to/file-1.txt');
$vectorizer->addFile(__DIR__ . '/path/to/file-2.txt');

foreach ($vectorizer->compute() as $document => $entries) {
    var_dump($document);

    foreach ($entries as $entry) {
        var_dump($entry);
    }
}
```

Example Result:

```
Serafim\TFIDF\Document\FileDocument {
    locale: "ru_RU"
    pathname: "/home/example/how-it-works.md"
}

Serafim\TFIDF\Entry {
    term: "работает"
    occurrences: 4
    df: 1
    tf: 0.012012012012012
    idf: 0.69314718055995
    tfidf: 0.0083260922589783
}

Serafim\TFIDF\Entry {
    term: "php"
    occurrences: 26
    df: 2
    tf: 0.078078078078078
    idf: 0.0
    tfidf: 0.0
}

Serafim\TFIDF\Entry {
    term: "запуска"
    occurrences: 2
    df: 1
    tf: 0.006006006006006
    idf: 0.69314718055995
    tfidf: 0.0041630461294892
}

// ...etc...
```
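
The fields in this dump are consistent with the formulas from the introduction. The check below is a sketch under two assumptions that are inferred from the numbers rather than stated by the library: the corpus contains 2 documents, and the dumped document contains 333 counted terms (since 4 / 333 ≈ 0.012012). It also assumes the natural logarithm, which matches `idf: 0.693...` = ln 2.

```php
<?php

declare(strict_types=1);

// Assumptions inferred from the dump above, not stated by the library:
// the corpus holds 2 documents and the dumped document contains 333 terms.
$documents = 2;
$totalTerms = 333;

// term => [occurrences, df], copied from the example result.
$entries = [
    'работает' => [4, 1],
    'php'      => [26, 2],
    'запуска'  => [2, 1],
];

$rows = [];
foreach ($entries as $term => [$occurrences, $df]) {
    $tf  = $occurrences / $totalTerms;
    $idf = \log($documents / $df); // natural logarithm
    $rows[$term] = ['tf' => $tf, 'idf' => $idf, 'tfidf' => $tf * $idf];
}

foreach ($rows as $term => $row) {
    \printf("%s tf=%.6f idf=%.6f tfidf=%.6f\n", $term, $row['tf'], $row['idf'], $row['tfidf']);
}
```

Note that "php" is the most frequent term of the three, yet it scores zero: it occurs in both documents, so its IDF is log(2 / 2) = 0.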

### Adding Documents

The IDF (Inverse Document Frequency) calculation needs several documents in the corpus. You can add documents in several ways:

```
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$vectorizer->addFile(__DIR__ . '/path/to/file.txt');
$vectorizer->addFile(new \SplFileInfo(__DIR__ . '/path/to/file.txt'));
$vectorizer->addText('example text');
$vectorizer->addStream(fopen(__DIR__ . '/path/to/file.txt', 'rb'));

// OR

$vectorizer->add(new class implements \Serafim\TFIDF\Document\TextDocumentInterface {
    public function getLocale(): string { /* ... */ }
    public function getContent(): string { /* ... */ }
});
```

### Creating Documents

```
$vectorizer = new \Serafim\TFIDF\Vectorizer();

$file = $vectorizer->createFile(__DIR__ . '/path/to/file.txt');
$text = $vectorizer->createText('example text');
$stream = $vectorizer->createStream(fopen(__DIR__ . '/path/to/file.txt', 'rb'));
```

### Computing

To calculate TF-IDF for all loaded documents, use the `compute(): iterable` method:

```
foreach ($vectorizer->compute() as $document => $result) {
    // $document = object(Serafim\TFIDF\Document\DocumentInterface)
    // $result   = list
}
```

To calculate the TF-IDF between the loaded documents and the passed one, use the `computeFor(StreamingDocumentInterface|TextDocumentInterface): iterable` method:

```
$text = $vectorizer->createText('example text');

$result = $vectorizer->computeFor($text);

// $result = list
```

### Custom Memory Driver

By default, all counters are kept in memory. This is fast, but a large corpus can exhaust the available memory. If you need to reduce memory usage, you can write your own driver.

```
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Memory\FactoryInterface;
use Serafim\TFIDF\Memory\MemoryInterface;

$vectorizer = new Vectorizer(
    memory: new class implements FactoryInterface {
        // Method for creating a memory area for counters
        public function create(): MemoryInterface
        {
            return new class implements MemoryInterface, \IteratorAggregate {
                // Increment counter for the given term.
                public function inc(string $term): void { /* ... */ }

                // Return counter value for the given term or
                // 0 if the counter is not found.
                public function get(string $term): int { /* ... */ }

                // Should return TRUE if there is a counter for the
                // specified term.
                public function has(string $term): bool { /* ... */ }

                // Returns the number of registered counters.
                public function count(): int { /* ... */ }

                // Returns a list of terms and counter values in
                // format: [ WORD => 42 ]
                public function getIterator(): \Traversable { /* ... */ }

                // Destruction of the allocated memory area.
                public function __destruct() { /* ... */ }
            };
        }
    }
);
```
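
To make the contract above more concrete, here is a minimal array-backed sketch of such a counter storage. `ArrayCounters` is a hypothetical name. The class mirrors the methods listed in the comments above but deliberately does not declare `implements MemoryInterface`, so that it runs without the library; check the real interface signatures before adapting it. A driver that actually saves memory would replace the array with external storage such as a file or a database.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical array-backed counter storage. It mirrors the methods shown
 * above (inc, get, has, count, getIterator) but is a standalone class, so
 * the real MemoryInterface signatures still need to be checked.
 */
final class ArrayCounters implements \Countable, \IteratorAggregate
{
    /** @var array<array-key, int> */
    private array $counters = [];

    // Increment the counter for the given term, creating it if needed.
    public function inc(string $term): void
    {
        $this->counters[$term] = ($this->counters[$term] ?? 0) + 1;
    }

    // Return the counter value, or 0 if the term was never counted.
    public function get(string $term): int
    {
        return $this->counters[$term] ?? 0;
    }

    public function has(string $term): bool
    {
        return isset($this->counters[$term]);
    }

    public function count(): int
    {
        return \count($this->counters);
    }

    // Yields pairs in the format [ WORD => 42 ].
    public function getIterator(): \Traversable
    {
        foreach ($this->counters as $term => $value) {
            // PHP turns numeric string keys into integers; cast them back.
            yield (string) $term => $value;
        }
    }
}

$counters = new ArrayCounters();
foreach (['php', 'tf', 'php'] as $term) {
    $counters->inc($term);
}
```

After the loop, `$counters->get('php')` is 2, `$counters->get('idf')` is 0, and `\count($counters)` is 2.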

### Custom Stop Words

If you need your own set of "stop words" (words that are excluded from the result), provide a custom implementation.

> Please note that by default, the list of stop words from the [voku/stop-words](https://github.com/voku/stop-words) package is used.

```
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\StopWords\FactoryInterface;
use Serafim\TFIDF\StopWords\StopWordsInterface;

$vectorizer = new Vectorizer(
    stopWords: new class implements FactoryInterface {
        public function create(string $locale): StopWordsInterface
        {
            // You can use a different set of stop word drivers depending
            // on the locale ("$locale" argument) of the document.
            return new class implements StopWordsInterface {
                // TRUE should be returned if the word should be ignored.
                // For example prepositions.
                public function match(string $term): bool
                {
                    return \in_array($term, ['and', 'or', /* ... */], true);
                }
            };
        }
    }
);
```

### Custom Locale

```
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Locale\IntlRepository;

$vectorizer = new Vectorizer(
    locales: new class extends IntlRepository {
        // Specifying the default locale
        public function getDefault(): string
        {
            return 'en_US';
        }
    }
);
```

### Custom Tokenizer

If the default way of splitting text into words does not suit you, you can write your own tokenizer.

```
use Serafim\TFIDF\Vectorizer;
use Serafim\TFIDF\Tokenizer\TokenizerInterface;
use Serafim\TFIDF\Document\StreamingDocumentInterface;
use Serafim\TFIDF\Document\TextDocumentInterface;

$vectorizer = new Vectorizer(
    tokenizer: new class implements TokenizerInterface {
        // Please note that there can be several types of document:
        //  - Text Document: One that contains text in string representation.
        //  - Streaming Document: One that can be read and may contain a
        //    large amount of data.
        public function tokenize(StreamingDocumentInterface|TextDocumentInterface $document): iterable
        {
            $content = $document instanceof StreamingDocumentInterface
                ? \stream_get_contents($document->getContentStream())
                : $document->getContent();

            // Please note that the document also contains the locale, based on
            // which the term (word) separation logic can change.
            //
            // i.e. `if ($document->getLocale() === 'ar') { ... }`
            //

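            // Skip empty chunks (e.g. caused by leading whitespace) and fall
            // back to an empty list if preg_split() fails and returns false.
            return \preg_split('/[\s,]+/isum', $content, -1, \PREG_SPLIT_NO_EMPTY) ?: [];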
        }
    }
);
```
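
The example above reads the whole stream into a string, which defeats the purpose of a streaming document when the data is large. A possible alternative is to yield terms lazily, line by line. The sketch below is standalone: `tokenizeStream()` is a hypothetical helper that takes a plain stream resource, where a real tokenizer would pass `$document->getContentStream()` instead.

```php
<?php

declare(strict_types=1);

/**
 * Lazily yields terms from a stream, reading one line at a time so the
 * whole content never has to be held in memory.
 *
 * @param resource $stream
 * @return \Generator<int, string>
 */
function tokenizeStream($stream): \Generator
{
    while (($line = \fgets($stream)) !== false) {
        // Split on whitespace and commas; skip empty chunks.
        $terms = \preg_split('/[\s,]+/u', $line, -1, \PREG_SPLIT_NO_EMPTY);

        // preg_split() returns false on failure (e.g. invalid UTF-8).
        if ($terms !== false) {
            yield from $terms;
        }
    }
}

// Usage with an in-memory stream standing in for a large file.
$stream = \fopen('php://memory', 'w+b');
\fwrite($stream, "hello, world\nsecond line\n");
\rewind($stream);

// Keys restart at 0 on every line, so do not preserve them.
$terms = \iterator_to_array(tokenizeStream($stream), false);
\fclose($stream);
```

Because the function is a generator, it satisfies an `iterable` return type, and the caller can count terms as they arrive instead of waiting for the full list.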

### Community

Maintainers

![](https://avatars.githubusercontent.com/u/150420?v=4)[Ruslan Sharipov](/maintainers/Serafim)[@serafim](https://github.com/serafim)

