
pcrov/unicode
=============

Miscellaneous Unicode utility functions

Version 0.1.1 · MIT · PHP >=7.3

[Source](https://github.com/pcrov/Unicode) · [Packagist](https://packagist.org/packages/pcrov/unicode)

Unicode
=======


[![CI Status](https://github.com/pcrov/Unicode/workflows/CI/badge.svg)](https://github.com/pcrov/Unicode/actions?query=workflow%3ACI)[![License](https://camo.githubusercontent.com/c8244ee57e0480b7cd8729ef398814a492f6f2ea0e6713e34624c2f5d3e78bf3/68747470733a2f2f706f7365722e707567782e6f72672f7063726f762f756e69636f64652f6c6963656e7365)](https://github.com/pcrov/Unicode/blob/master/LICENSE)[![Latest Stable Version](https://camo.githubusercontent.com/210a469a0f5aa5bb5fe84686c160ccc5114ba584d559498d0dd0436bbfb726b4/68747470733a2f2f706f7365722e707567782e6f72672f7063726f762f756e69636f64652f762f737461626c65)](https://packagist.org/packages/pcrov/unicode)

Miscellaneous Unicode utility functions.

Functions
---------


Namespace `pcrov\Unicode`.

#### `surrogate_pair_to_code_point(int $high, int $low): int`


Translates a UTF-16 surrogate pair into a single code point. [Wikipedia's UTF-16 article](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF) explains the encoding fairly well.
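The arithmetic behind a surrogate pair is short enough to sketch. The function below is a hypothetical stand-in written for illustration, not the library's implementation; the range checks and the exception type in particular are assumptions.

```php
<?php
// Hypothetical helper (not part of pcrov\Unicode): combine a UTF-16
// surrogate pair into the supplementary-plane code point it encodes.
function sketch_surrogate_pair_to_code_point(int $high, int $low): int
{
    // Assumption: reject values outside the surrogate ranges.
    if ($high < 0xD800 || $high > 0xDBFF) {
        throw new InvalidArgumentException("High surrogate out of range");
    }
    if ($low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException("Low surrogate out of range");
    }

    // Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", sketch_surrogate_pair_to_code_point(0xD83D, 0xDE00)); // prints "U+1F600"
```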

#### `utf8_find_invalid_byte_sequence(string $string): ?int`


Returns the position of the first invalid byte sequence or null if the input is valid.

#### `utf8_get_invalid_byte_sequence(string $string): ?string`


Returns the first invalid byte sequence or null if the input is valid.

#### `utf8_get_state_machine(): array`


Provides a state machine letting you walk a (potentially endless) UTF-8 sequence byte by byte.

It takes the form `[byte => [valid next byte => ...,], ...]`.

Example use:

```
use function pcrov\Unicode\utf8_get_state_machine;

function utf8_generate_all_code_points(): string
{
    $generator = function (array $machine, string $buffer = "") use (&$generator) {
        // Completed a UTF-8 encoded code point.
        if ($buffer !== "" && isset($machine["\x0"])) {
            return $buffer;
        }

        $out = "";
        foreach ($machine as $byte => $next) {
            $out .= $generator($next, $buffer . $byte);
        }

        return $out;
    };

    return $generator(utf8_get_state_machine());
}
```
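To make the nested-array shape more concrete, here is a minimal sketch of a much smaller machine in the same form, covering only one- and two-byte sequences (U+0000..U+07FF). The function names are hypothetical, and building the machine from self-referencing arrays is an assumption about how such a structure can be made endless, not a description of the library's internals.

```php
<?php
// Hypothetical miniature machine (not the library's): handles only the
// one- and two-byte rows of Table 3-7.
function sketch_build_mini_machine(): array
{
    $start = [];    // state between code points
    $oneMore = [];  // state expecting exactly one continuation byte

    for ($b = 0x00; $b <= 0x7F; $b++) {
        $start[chr($b)] = &$start;    // ASCII byte: complete, back to start
    }
    for ($b = 0xC2; $b <= 0xDF; $b++) {
        $start[chr($b)] = &$oneMore;  // lead byte of a two-byte sequence
    }
    for ($b = 0x80; $b <= 0xBF; $b++) {
        $oneMore[chr($b)] = &$start;  // continuation byte: complete
    }

    return $start;
}

// Walk the input byte by byte; each lookup yields the next state.
function sketch_accepts(array $machine, string $bytes): bool
{
    $state = $machine;
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        $byte = $bytes[$i];
        if (!isset($state[$byte])) {
            return false;             // no transition for this byte
        }
        $state = $state[$byte];
    }

    // Only the start state has a transition for NUL, so its presence
    // means the input ended on a code point boundary.
    return isset($state["\x00"]);
}

$machine = sketch_build_mini_machine();
var_dump(sketch_accepts($machine, "caf\xC3\xA9")); // bool(true)
var_dump(sketch_accepts($machine, "caf\xC3"));     // bool(false), truncated
```

The final `isset($state["\x00"])` check mirrors the completion test used in the example above.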

#### `utf8_validate(string $string): bool`


Does what it says on the box: returns `true` if `$string` is valid UTF-8, `false` otherwise.

Data
----


The `test/data` directory holds two files containing all possible UTF-8 encoded characters, all 1,112,064 of them: one as plain text, the other as JSON. These are not included in packaged stable releases but can be generated with the example `utf8_generate_all_code_points()` function above, which returns the plain text string.
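The figure of 1,112,064 is the 0x110000 code points from U+0000 to U+10FFFF minus the 2,048 surrogates (U+D800..U+DFFF), which have no UTF-8 encoding. As a rough cross-check that does not need the library, the sketch below encodes every scalar value by hand using the bit distribution from Table 3-6 and counts them. `sketch_utf8_encode()` is a hypothetical helper written for this illustration.

```php
<?php
// Hypothetical helper (not part of pcrov\Unicode): encode one Unicode
// scalar value as UTF-8 following the bit layout of Table 3-6.
function sketch_utf8_encode(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a Unicode scalar value");
    }
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6))
            . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Count every scalar value, skipping the surrogate range.
$count = 0;
for ($cp = 0; $cp <= 0x10FFFF; $cp++) {
    if ($cp >= 0xD800 && $cp <= 0xDFFF) {
        continue;
    }
    sketch_utf8_encode($cp);
    $count++;
}

echo $count, "\n"; // prints 1112064
```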

Excerpts from the [Unicode 10.0.0 standard](http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#page=55):
------------------------------------------------------------------------------------------------------------


Recreated here for ease of reference. Nobody likes PDFs.

### Table 3-6. UTF-8 Bit Distribution


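| Scalar Value               | First Byte | Second Byte | Third Byte | Fourth Byte |
|----------------------------|------------|-------------|------------|-------------|
| 00000000 0xxxxxxx          | 0xxxxxxx   |             |            |             |
| 00000yyy yyxxxxxx          | 110yyyyy   | 10xxxxxx    |            |             |
| zzzzyyyy yyxxxxxx          | 1110zzzz   | 10yyyyyy    | 10xxxxxx   |             |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu   | 10uuzzzz    | 10yyyyyy   | 10xxxxxx    |

### Table 3-7. Well-Formed UTF-8 Byte Sequences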


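| Code Points        | First Byte | Second Byte  | Third Byte | Fourth Byte |
|--------------------|------------|--------------|------------|-------------|
| U+0000..U+007F     | 00..7F     |              |            |             |
| U+0080..U+07FF     | C2..DF     | 80..BF       |            |             |
| U+0800..U+0FFF     | E0         | ***A0***..BF | 80..BF     |             |
| U+1000..U+CFFF     | E1..EC     | 80..BF       | 80..BF     |             |
| U+D000..U+D7FF     | ED         | 80..***9F*** | 80..BF     |             |
| U+E000..U+FFFF     | EE..EF     | 80..BF       | 80..BF     |             |
| U+10000..U+3FFFF   | F0         | ***90***..BF | 80..BF     | 80..BF      |
| U+40000..U+FFFFF   | F1..F3     | 80..BF       | 80..BF     | 80..BF      |
| U+100000..U+10FFFF | F4         | 80..***8F*** | 80..BF     | 80..BF      |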
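Table 3-7 translates almost row for row into a byte-oriented regular expression. The sketch below uses that to report the offset of the first byte that cannot start a well-formed sequence. The constant and function names are hypothetical, and this is not how the library's functions are implemented, nor is it guaranteed to report the same positions they do.

```php
<?php
// One alternative per row of Table 3-7. No /u flag, so PCRE matches raw bytes.
const SKETCH_WELL_FORMED_UTF8 = '/^(?:
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | [\xEE-\xEF][\x80-\xBF]{2}
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
)*+/x';

// Hypothetical helper: byte offset where well-formed input stops,
// or null if the whole string is well-formed.
function sketch_first_ill_formed_offset(string $s): ?int
{
    // The pattern can match the empty string, so it always matches at offset 0.
    if (preg_match(SKETCH_WELL_FORMED_UTF8, $s, $m) !== 1) {
        throw new RuntimeException("PCRE error while scanning input");
    }
    $wellFormedLength = strlen($m[0]);

    return $wellFormedLength === strlen($s) ? null : $wellFormedLength;
}

var_dump(sketch_first_ill_formed_offset("h\xC3\xA9llo"));     // NULL
var_dump(sketch_first_ill_formed_offset("ab\xED\xA0\x80"));   // int(2), encoded surrogate
```

The emphasized bounds in the table (A0, 9F, 90, 8F) are the ones that rule out overlong encodings, surrogates, and values above U+10FFFF, which is why those rows need their own alternatives.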

