
turanjanin/serbian-language-tools
=================================

A set of tools for tokenization, transliteration and diacritic restoration of text written in the Serbian language.

Latest version: 1.0.3 · License: MIT · PHP ^7.4|^8.0

[Source](https://github.com/turanjanin/serbian-language-tools) · [Packagist](https://packagist.org/packages/turanjanin/serbian-language-tools)

Serbian Language Tools - PHP library for Transliteration & Diacritic Restoration
================================================================================

Serbian Language Tools is a PHP library for working with text written in the Serbian language. It features:

- Tokenizer
- **Diacritic restoration tool**
- Transliterator between Serbian Cyrillic and Latin alphabets
- Alphabet detection

Requirements
------------

This library requires PHP 7.4 or greater with the [sqlite3](https://www.php.net/manual/en/book.sqlite3.php), [intl](https://www.php.net/manual/en/book.intl.php) and [mbstring](https://www.php.net/manual/en/book.mbstring.php) extensions.

Installation
------------

You can install the package via Composer:

```
composer require turanjanin/serbian-language-tools
```

Usage
-----

To use the library, you first need to tokenize the input string. Tokenization is the process of splitting a string into a sequence of tokens, each made up of related characters. The library recognizes the following token types: Word, Whitespace, URI (which includes URLs, hashtags and at-mentions), Interpunction (punctuation), HTML and Emoticon.

Tokenizing is done by creating a new instance of the `Text` class using the named constructor:

```
use Turanjanin\SerbianLanguageTools\Text;

$text = Text::fromString('Zdravo svete, ovo je primer teksta!');
```

The `Text` object now holds an array of tokens that can be processed. You can index this object like any other PHP array, since it implements the `ArrayAccess` interface.

```
echo count($text) . "\n"; // 13
echo get_class($text[1]). "\n"; // Turanjanin\SerbianLanguageTools\Tokens\Whitespace
echo $text[9] . "\n"; // primer
```
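
To make the idea of tokenization more concrete, here is a minimal sketch in plain PHP. It is not the library's tokenizer: `sketchTokenize()` is a hypothetical helper that only separates words, whitespace and single remaining characters, and it knows nothing about URIs, HTML or emoticons. For the sample sentence above it happens to produce the same token boundaries.

```php
<?php

/**
 * Hypothetical helper (not part of the library): splits a string into
 * word, whitespace and single-character chunks.
 */
function sketchTokenize(string $input): array
{
    // \p{L}+ : a run of letters (a word)
    // \s+    : a run of whitespace
    // .      : any other single character (punctuation, digits, ...)
    preg_match_all('/\p{L}+|\s+|./us', $input, $matches);

    return $matches[0];
}

$tokens = sketchTokenize('Zdravo svete, ovo je primer teksta!');

echo count($tokens) . "\n"; // 13
echo $tokens[9] . "\n";     // primer
```

Unlike the library's `Text` object, this sketch returns plain strings, so the token type has to be inferred again by whoever consumes the array.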

### Diacritic Restoration / Diacritization

The Serbian Latin alphabet includes several letters that are not found in the ASCII table. These letters carry diacritics (č, ć, š, ž, dž, đ), which are often omitted in everyday communication (social media, emails and SMS), mainly due to the widespread use of English keyboard layouts.

This degraded Latin alphabet is easily understood by human readers, but it poses a significant challenge for search engines and natural language processing. Therefore, this library features an algorithm that automatically restores diacritics in ASCII text by using a [dictionary of Serbian words](dictionary/README.md) and phrases for context disambiguation.

The algorithm inspects all `Word` tokens and looks for restoration candidates, that is, words containing `s`, `c`, `z` or `dj`. The following two steps are then applied:

1. The text is searched for the most common phrases and, if any are found, their words are replaced with the diacritical equivalents. This step takes word context into consideration, which lets a less frequent variation win where the phrase calls for it. For example, `sto hiljada` won't be replaced with `što hiljada`, even though the form `što` *(why)* is much more frequent than the word `sto` *(hundred)*.
2. Every restoration candidate is looked up in the dictionary and, if there are known variations, the token is replaced with `RestoredWord` (if there is only one possible variation) or `MultipleRestoredWord` (if there are several). When there is more than one variation, the one with the highest frequency is marked as preferred.
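
The two steps above can be sketched in a few lines of plain PHP. This is a simplified illustration, not the library's implementation: `sketchRestore()` is a hypothetical function, the phrase list and the frequencies are invented for the example, and the input is assumed to be lowercase words separated by single spaces. It also returns a plain string instead of `RestoredWord` and `MultipleRestoredWord` tokens.

```php
<?php

// Illustrative data only: these entries and frequencies are made up and are
// not taken from the library's dictionary.
$phrases = [
    'sto hiljada' => 'sto hiljada',
];
$dictionary = [
    'sto'   => ['što' => 900, 'sto' => 100],
    'skoli' => ['školi' => 50],
];

function sketchRestore(string $text, array $phrases, array $dictionary): string
{
    $words = explode(' ', $text);
    $total = count($words);
    $result = [];

    for ($i = 0; $i < $total; $i++) {
        // Step 1: a known two-word phrase wins over single-word frequencies.
        if (isset($words[$i + 1])) {
            $pair = $words[$i] . ' ' . $words[$i + 1];
            if (isset($phrases[$pair])) {
                $result[] = $phrases[$pair];
                $i++; // the phrase consumed the next word as well
                continue;
            }
        }

        // Step 2: pick the dictionary variation with the highest frequency.
        $variations = $dictionary[$words[$i]] ?? [];
        if ($variations !== []) {
            arsort($variations); // highest frequency first, keys preserved
            $result[] = array_key_first($variations);
            continue;
        }

        $result[] = $words[$i]; // not a known candidate, keep as-is
    }

    return implode(' ', $result);
}

echo sketchRestore('sto je u skoli', $phrases, $dictionary) . "\n";     // što je u školi
echo sketchRestore('sto hiljada dinara', $phrases, $dictionary) . "\n"; // sto hiljada dinara
```

Running the phrase step first is what keeps `sto hiljada` intact; with only the frequency lookup, `sto` would always become `što`.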

Diacritic restoration can be performed by calling the invokable class:

```
use Turanjanin\SerbianLanguageTools\Text;
use Turanjanin\SerbianLanguageTools\Transformers\DiacriticRestorer;

$text = Text::fromString('Cetiri cavke cuceci dzangrizavo cijucu u zeleznickoj skoli.');
echo (new DiacriticRestorer)($text); // Četiri čavke čučeći džangrizavo cijuču u železničkoj školi.
```

The dictionary needed for this algorithm is stored in a custom-made SQLite database that ships with the library. You can extend this database or use a different storage solution by providing a custom implementation of the `Turanjanin\SerbianLanguageTools\Dictionary\Dictionary` interface.

### Transliteration

The library supports transliteration of text between the Cyrillic, Latin and ASCII alphabets. Transliteration is performed by calling the appropriate invokable class:

```
use Turanjanin\SerbianLanguageTools\Text;
use Turanjanin\SerbianLanguageTools\Transformers\ToAsciiLatin;
use Turanjanin\SerbianLanguageTools\Transformers\ToCyrillic;
use Turanjanin\SerbianLanguageTools\Transformers\ToLatin;

$cyrillic = Text::fromString('Ово је ћирилични текст');
$latin = Text::fromString('Primer latiničnog teksta');

echo (new ToLatin)($cyrillic); // Ovo je ćirilični tekst

echo (new ToCyrillic)($latin); // Пример латиничног текста

echo (new ToAsciiLatin)($cyrillic); // Ovo je cirilicni tekst
```

If you only need transliteration between the Latin and Cyrillic alphabets, take a look at the simpler library, [turanjanin/serbian-transliterator](https://github.com/turanjanin/serbian-transliterator).
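
For intuition, the Cyrillic-to-Latin direction is essentially a character map, which can be sketched with PHP's built-in `strtr()`. This is an illustration under stated assumptions, not how the library is implemented: `sketchToLatin()` and `sketchToAscii()` are hypothetical helpers, and mapping `đ` to `dj` in the ASCII step is an assumption based on the `dj` restoration candidates described earlier.

```php
<?php

// Serbian Cyrillic to Latin. Љ, Њ and Џ map to two Latin characters each.
const SKETCH_CYRILLIC_TO_LATIN = [
    'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G', 'Д' => 'D', 'Ђ' => 'Đ',
    'Е' => 'E', 'Ж' => 'Ž', 'З' => 'Z', 'И' => 'I', 'Ј' => 'J', 'К' => 'K',
    'Л' => 'L', 'Љ' => 'Lj', 'М' => 'M', 'Н' => 'N', 'Њ' => 'Nj', 'О' => 'O',
    'П' => 'P', 'Р' => 'R', 'С' => 'S', 'Т' => 'T', 'Ћ' => 'Ć', 'У' => 'U',
    'Ф' => 'F', 'Х' => 'H', 'Ц' => 'C', 'Ч' => 'Č', 'Џ' => 'Dž', 'Ш' => 'Š',
    'а' => 'a', 'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'ђ' => 'đ',
    'е' => 'e', 'ж' => 'ž', 'з' => 'z', 'и' => 'i', 'ј' => 'j', 'к' => 'k',
    'л' => 'l', 'љ' => 'lj', 'м' => 'm', 'н' => 'n', 'њ' => 'nj', 'о' => 'o',
    'п' => 'p', 'р' => 'r', 'с' => 's', 'т' => 't', 'ћ' => 'ć', 'у' => 'u',
    'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'č', 'џ' => 'dž', 'ш' => 'š',
];

// Latin letters with diacritics to plain ASCII (assumed mapping).
const SKETCH_LATIN_TO_ASCII = [
    'Č' => 'C', 'Ć' => 'C', 'Š' => 'S', 'Ž' => 'Z', 'Đ' => 'Dj',
    'č' => 'c', 'ć' => 'c', 'š' => 's', 'ž' => 'z', 'đ' => 'dj',
];

function sketchToLatin(string $input): string
{
    // strtr() with an array replaces every key it finds; characters that
    // are not in the map (spaces, punctuation, Latin text) pass through.
    return strtr($input, SKETCH_CYRILLIC_TO_LATIN);
}

function sketchToAscii(string $input): string
{
    // Go through Latin first so that Cyrillic input is handled as well.
    return strtr(sketchToLatin($input), SKETCH_LATIN_TO_ASCII);
}

echo sketchToLatin('Ово је ћирилични текст') . "\n"; // Ovo je ćirilični tekst
echo sketchToAscii('Ово је ћирилични текст') . "\n"; // Ovo je cirilicni tekst
```

The reverse direction is not attempted here, because it has to decide whether `lj`, `nj` and `dž` stand for one Cyrillic letter or two.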

### Alphabet Detection

The library can detect whether text is written in the Serbian Cyrillic or Latin alphabet:

```
use Turanjanin\SerbianLanguageTools\Text;

Text::fromString('Ovo je latinica')->isLatin(); // true
Text::fromString('Ovo je latinica')->isCyrillic(); // false
```
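
One simple way to approximate such a check in plain PHP is to count letters by Unicode script. The sketch below is only a heuristic written for illustration; `sketchDetectAlphabet()` is a hypothetical function and the library's own detection may work differently.

```php
<?php

/**
 * Hypothetical helper: reports which script has more letters in the input.
 * Returns 'cyrillic', 'latin', or 'unknown' when neither script dominates.
 */
function sketchDetectAlphabet(string $input): string
{
    // preg_match_all() returns the number of matches; the /u flag makes
    // the pattern operate on UTF-8 characters rather than bytes.
    $cyrillic = preg_match_all('/\p{Cyrillic}/u', $input);
    $latin = preg_match_all('/\p{Latin}/u', $input);

    if ($cyrillic > $latin) {
        return 'cyrillic';
    }
    if ($latin > $cyrillic) {
        return 'latin';
    }

    return 'unknown'; // no letters at all, or an exact tie
}

echo sketchDetectAlphabet('Ovo je latinica') . "\n"; // latin
echo sketchDetectAlphabet('Ово је ћирилица') . "\n"; // cyrillic
```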

Author
------


- [Jovan Turanjanin](https://github.com/turanjanin)

License
-------


The MIT License (MIT). Please see [License File](LICENSE.md) for more information.

