pcrov/unicode
=============

Miscellaneous Unicode utility functions

Latest release: 0.1.1 · License: MIT · Requires PHP >=7.3

[Source](https://github.com/pcrov/Unicode) · [Packagist](https://packagist.org/packages/pcrov/unicode)


Unicode
=======

[![CI Status](https://github.com/pcrov/Unicode/workflows/CI/badge.svg)](https://github.com/pcrov/Unicode/actions?query=workflow%3ACI)[![License](https://camo.githubusercontent.com/c8244ee57e0480b7cd8729ef398814a492f6f2ea0e6713e34624c2f5d3e78bf3/68747470733a2f2f706f7365722e707567782e6f72672f7063726f762f756e69636f64652f6c6963656e7365)](https://github.com/pcrov/Unicode/blob/master/LICENSE)[![Latest Stable Version](https://camo.githubusercontent.com/210a469a0f5aa5bb5fe84686c160ccc5114ba584d559498d0dd0436bbfb726b4/68747470733a2f2f706f7365722e707567782e6f72672f7063726f762f756e69636f64652f762f737461626c65)](https://packagist.org/packages/pcrov/unicode)

Miscellaneous Unicode utility functions.

Functions
---------

All functions live in the `pcrov\Unicode` namespace.

#### `surrogate_pair_to_code_point(int $high, int $low): int`

Translates a UTF-16 surrogate pair into a single code point. [Wikipedia's UTF-16 article](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF) explains how surrogate pairs encode the code points U+10000 to U+10FFFF.
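
The arithmetic behind that translation is short enough to sketch. The function below is a hypothetical stand-in written for illustration, not the library's code; its name and the exception it throws on out-of-range input are assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch (not part of pcrov/unicode): combines a high and a low
 * surrogate into the code point they encode.
 */
function sketch_surrogate_pair_to_code_point(int $high, int $low): int
{
    // Assumption: reject values outside the surrogate ranges.
    if ($high < 0xD800 || $high > 0xDBFF || $low < 0xDC00 || $low > 0xDFFF) {
        throw new InvalidArgumentException('Not a valid surrogate pair.');
    }

    // The high surrogate carries the top 10 bits, the low surrogate the
    // bottom 10 bits, of the code point minus 0x10000.
    return 0x10000 + (($high - 0xD800) << 10) + ($low - 0xDC00);
}

printf("U+%X\n", sketch_surrogate_pair_to_code_point(0xD83D, 0xDE00)); // U+1F600
```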

#### `utf8_find_invalid_byte_sequence(string $string): ?int`

Returns the position of the first invalid byte sequence or null if the input is valid.

#### `utf8_get_invalid_byte_sequence(string $string): ?string`

Returns the first invalid byte sequence or null if the input is valid.

#### `utf8_get_state_machine(): array`

Provides a state machine letting you walk a (potentially endless) UTF-8 sequence byte by byte.

The machine is a nested array of the form `[byte => [valid next byte => ...], ...]`.

Example use:

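```php
use function pcrov\Unicode\utf8_get_state_machine;

function utf8_generate_all_code_points(): string
{
    // Depth-first walk: $machine is the current state, $buffer the bytes consumed so far.
    $generator = function (array $machine, string $buffer = "") use (&$generator) {
        // Completed a UTF-8 encoded code point: the current state accepts "\x0" again.
        if ($buffer !== "" && isset($machine["\x0"])) {
            return $buffer;
        }

        // Otherwise follow every valid next byte from this state.
        $out = "";
        foreach ($machine as $byte => $next) {
            $out .= $generator($next, $buffer . $byte);
        }

        return $out;
    };

    return $generator(utf8_get_state_machine());
}
```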

#### `utf8_validate(string $string): bool`

Returns `true` if the input is well-formed UTF-8, `false` otherwise.
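
For comparison, the same yes/no answer can be had from PHP's bundled PCRE functions. This is a sketch of that common idiom, not how this library validates; `sketch_utf8_is_valid` is a hypothetical name. With the `u` modifier, `preg_match()` checks that the subject is valid UTF-8 and returns `false` rather than `1` when it is not.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical stand-in (not part of pcrov/unicode): validates UTF-8 by
 * matching an empty pattern in PCRE's UTF-8 mode.
 */
function sketch_utf8_is_valid(string $string): bool
{
    // The empty pattern always matches, so anything other than 1 means
    // PCRE rejected the subject as ill-formed UTF-8.
    return preg_match('//u', $string) === 1;
}

var_dump(sketch_utf8_is_valid("h\u{E9}llo")); // bool(true)
var_dump(sketch_utf8_is_valid("\xC0\x80"));   // bool(false), overlong encoding
```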

Data
----

The `test/data` directory holds two files containing every possible UTF-8 encoded character, all 1,112,064 of them: one as plain text, the other as JSON. These are not included in packaged stable releases, but can be generated with the example `utf8_generate_all_code_points()` function above, which returns the plain-text string.
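
The figure of 1,112,064 can be checked with a little arithmetic: it is the number of code points from U+0000 to U+10FFFF, minus the surrogates U+D800 to U+DFFF, which have no well-formed UTF-8 encoding. The snippet below is only that calculation; the variable names are illustrative.

```php
<?php
declare(strict_types=1);

// Every code point U+0000..U+10FFFF, minus the surrogate block.
$allCodePoints = 0x10FFFF + 1;        // 1,114,112
$surrogates    = 0xDFFF - 0xD800 + 1; // 2,048
$scalarValues  = $allCodePoints - $surrogates;

// The same total, split by encoded length in bytes.
$byLength = [
    1 => 0x80,                          // U+0000..U+007F
    2 => 0x800 - 0x80,                  // U+0080..U+07FF
    3 => 0x10000 - 0x800 - $surrogates, // U+0800..U+FFFF without surrogates
    4 => 0x110000 - 0x10000,            // U+10000..U+10FFFF
];

echo number_format($scalarValues), "\n"; // 1,112,064
```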

Excerpts from the [Unicode 10.0.0 standard](http://www.unicode.org/versions/Unicode10.0.0/ch03.pdf#page=55):
------------------------------------------------------------------------------------------------------------

Recreated here for ease of reference. Nobody likes PDFs.

### Table 3-6. UTF-8 Bit Distribution

| Scalar Value               | First Byte | Second Byte | Third Byte | Fourth Byte |
|----------------------------|------------|-------------|------------|-------------|
| 00000000 0xxxxxxx          | 0xxxxxxx   |             |            |             |
| 00000yyy yyxxxxxx          | 110yyyyy   | 10xxxxxx    |            |             |
| zzzzyyyy yyxxxxxx          | 1110zzzz   | 10yyyyyy    | 10xxxxxx   |             |
| 000uuuuu zzzzyyyy yyxxxxxx | 11110uuu   | 10uuzzzz    | 10yyyyyy   | 10xxxxxx    |

### Table 3-7. Well-Formed UTF-8 Byte Sequences

| Code Points        | First Byte | Second Byte  | Third Byte | Fourth Byte |
|--------------------|------------|--------------|------------|-------------|
| U+0000..U+007F     | 00..7F     |              |            |             |
| U+0080..U+07FF     | C2..DF     | 80..BF       |            |             |
| U+0800..U+0FFF     | E0         | ***A0***..BF | 80..BF     |             |
| U+1000..U+CFFF     | E1..EC     | 80..BF       | 80..BF     |             |
| U+D000..U+D7FF     | ED         | 80..***9F*** | 80..BF     |             |
| U+E000..U+FFFF     | EE..EF     | 80..BF       | 80..BF     |             |
| U+10000..U+3FFFF   | F0         | ***90***..BF | 80..BF     | 80..BF      |
| U+40000..U+FFFFF   | F1..F3     | 80..BF       | 80..BF     | 80..BF      |
| U+100000..U+10FFFF | F4         | 80..***8F*** | 80..BF     | 80..BF      |
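
Table 3-7 translates fairly directly into a byte-range check. The following is a minimal sketch under stated assumptions, not the library's implementation: `sketch_find_invalid_utf8` is a hypothetical name, and reporting the offset where the ill-formed sequence starts is an assumed convention that may differ from what `utf8_find_invalid_byte_sequence()` returns.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch (not part of pcrov/unicode): returns the byte offset
 * where the first ill-formed sequence starts, or null if $s is well-formed.
 */
function sketch_find_invalid_utf8(string $s): ?int
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        // $tail lists the allowed [min, max] range of each trailing byte,
        // one row of Table 3-7 per branch.
        if ($b <= 0x7F) {
            $tail = [];
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $tail = [[0x80, 0xBF]];
        } elseif ($b === 0xE0) {
            $tail = [[0xA0, 0xBF], [0x80, 0xBF]];
        } elseif (($b >= 0xE1 && $b <= 0xEC) || $b === 0xEE || $b === 0xEF) {
            $tail = [[0x80, 0xBF], [0x80, 0xBF]];
        } elseif ($b === 0xED) {
            $tail = [[0x80, 0x9F], [0x80, 0xBF]];
        } elseif ($b === 0xF0) {
            $tail = [[0x90, 0xBF], [0x80, 0xBF], [0x80, 0xBF]];
        } elseif ($b >= 0xF1 && $b <= 0xF3) {
            $tail = [[0x80, 0xBF], [0x80, 0xBF], [0x80, 0xBF]];
        } elseif ($b === 0xF4) {
            $tail = [[0x80, 0x8F], [0x80, 0xBF], [0x80, 0xBF]];
        } else {
            return $i; // 80..C1 and F5..FF never start a sequence.
        }

        foreach ($tail as $k => [$min, $max]) {
            $pos = $i + 1 + $k;
            if ($pos >= $len) {
                return $i; // Input ends in the middle of a sequence.
            }
            $t = ord($s[$pos]);
            if ($t < $min || $t > $max) {
                return $i; // Trailing byte outside its allowed range.
            }
        }

        $i += 1 + count($tail);
    }

    return null;
}

var_dump(sketch_find_invalid_utf8("a\xC0\x80")); // int(1)
```

The emphasized cells in the table are the ranges that differ from the usual `80..BF`; they are what rule out overlong encodings, surrogates, and code points above U+10FFFF.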

