nick-jones/php-ucd
==================

Interface into the Unicode Character Database

v3.1.0 (6y ago) · MIT · PHP >=5.6 · [1 issue](https://github.com/nick-jones/php-ucd/issues) · [Source](https://github.com/nick-jones/php-ucd) · [Packagist](https://packagist.org/packages/nick-jones/php-ucd)

PHP UCD
=======

[![Travis](https://camo.githubusercontent.com/2f3ef7f2d880f3121d55d90353d2cedd16af96005261d8206de9dd27d4c12e39/68747470733a2f2f696d672e736869656c64732e696f2f7472617669732f6e69636b2d6a6f6e65732f7068702d7563642e7376673f7374796c653d666c61742d737175617265)](https://travis-ci.org/nick-jones/php-ucd)[![Scrutinizer](https://camo.githubusercontent.com/7872f18c4fbb538b28794d64b68ff8c3a1f4844fcf70137f5cb535a85da2ce44/68747470733a2f2f696d672e736869656c64732e696f2f7363727574696e697a65722f672f6e69636b2d6a6f6e65732f7068702d7563642e7376673f7374796c653d666c61742d737175617265)](https://scrutinizer-ci.com/g/nick-jones/php-ucd/)[![Minimum PHP Version](https://camo.githubusercontent.com/0aa7445f06e06d72b9552b4ace117e3765a60fa64d6c973cac177557fd20368f/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7068702d253345253344253230352e352d3838393242462e7376673f7374796c653d666c61742d737175617265)](https://php.net/)

This project aims to present a PHP interface into the [Unicode Character Database](http://unicode.org/ucd/) (UCD). It provides a means to look up, filter, and interrogate the metadata and properties of Unicode characters.

Installation
------------

You can install this [library](https://packagist.org/packages/nick-jones/php-ucd) via [composer](http://getcomposer.org):

`composer require nick-jones/php-ucd`

Usage
-----

The primary interface to utilise is `UCD\Database`. This provides a number of methods to interrogate "codepoint assigned" entities (i.e. `Character`, `NonCharacter`, and `Surrogate` instances) that reside within the UCD:

- `Database::getByCodepoint(Codepoint $codepoint)` - resolves a codepoint assigned entity
- `Database::getCharacterByCodepoint(Codepoint $codepoint)` - as above, but will only return `Character` instances
- `Database::getByCodepoints(Codepoint\Collection $codepoints)` - resolves multiple codepoint assigned entities
- `Database::getCodepointsByBlock(Block $block)` - resolves codepoints residing in the supplied block
- `Database::getByBlock(Block $block)` - resolves codepoint assigned entities residing in the supplied block
- `Database::getCodepointsByCategory(GeneralCategory $category)` - resolves codepoints residing in the supplied category
- `Database::getByCategory(GeneralCategory $category)` - resolves codepoint assigned entities residing in the supplied category
- `Database::getCodepointsByScript(Script $script)` - resolves codepoints residing in the supplied script
- `Database::getByScript(Script $script)` - resolves codepoint assigned entities residing in the supplied script
- `Database::all()` - returns a `Collection` instance containing everything assigned a codepoint within the database
- `Database::onlyCharacters()` - returns a `Collection` instance containing only `Character` instances
- `Database::onlyNonCharacters()` - returns a `Collection` instance containing only `NonCharacter` instances
- `Database::onlySurrogates()` - returns a `Collection` instance containing only `Surrogate` instances

The `UCD\Unicode\Character\Collection` class, returned by a number of these methods, provides methods for filtering, traversal, and codepoint extraction, among other things.

It is likely that you will want to use the default `Character\Repository` for resolution of characters, etc., in which case calling `UCD\Database::fromDisk()` will give you an instance backed by `FileRepository`. You can, of course, use a different `Character\Repository` implementation by providing it to the constructor of `UCD\Database`.

Because this project makes heavy use of [generators](https://php.net/generators), interrogating the dataset has a small memory footprint.
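To illustrate why generators keep memory use low, here is a minimal sketch in plain PHP (7.1+). It does not use this library: `codepoints()` and `filterLazily()` are hypothetical helpers standing in for a large dataset and a lazy filter, and the library's own internals may differ.

```php
<?php
// Hypothetical stand-in for a large dataset: yields codepoint integers one at a time.
function codepoints(int $first, int $last): Generator
{
    for ($cp = $first; $cp <= $last; $cp++) {
        yield $cp;
    }
}

// Lazily passes through only the items accepted by $predicate.
function filterLazily(iterable $items, callable $predicate): Generator
{
    foreach ($items as $item) {
        if ($predicate($item)) {
            yield $item;
        }
    }
}

// No intermediate array is built; each value is produced, tested, and discarded.
$evens = filterLazily(codepoints(0x0, 0x10FFFF), function (int $cp) {
    return $cp % 2 === 0;
});

$count = 0;
foreach ($evens as $cp) {
    $count++;
}

printf("%d even codepoints\n", $count);
```

The whole Unicode codepoint range is walked here without ever holding more than one value in memory at once.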

Caveats
-------

As of Unicode 8.0 there are more than 260,000 items assigned codepoints. Reading, filtering, and traversing all of these takes a few seconds. If your intention is to identify items by filtering rules, you would therefore be well advised to cache the output in some suitable form (e.g. a regex, or a PHP array of codepoints) that can then be interrogated, rather than filtering and traversing the database every time. If your intention is to look up items by codepoint, it is fine to call into this library as and when required, as this is an efficient operation.
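One possible way to cache a filtered result as a PHP array is sketched below. This is an illustration only, assuming PHP 7.2+ with the mbstring extension: the `$codepoints` list is hard-coded as a stand-in for the output of an expensive filter, and the cache file name is hypothetical.

```php
<?php
// Stand-in for the result of an expensive filter: U+09E6..U+09EF (Bengali digits).
$codepoints = range(0x9E6, 0x9EF);

// Persist the list as a PHP file that simply returns the array.
$cacheFile = sys_get_temp_dir() . '/numeric-bengali-codepoints.php';
file_put_contents($cacheFile, '<?php return ' . var_export($codepoints, true) . ';');

// Later runs load the array directly instead of traversing the database.
$cached = require $cacheFile;

// Flipping values to keys turns membership checks into hash lookups.
$lookup = array_flip($cached);

var_dump(isset($lookup[mb_ord('১', 'UTF-8')])); // bool(true)
var_dump(isset($lookup[mb_ord('1', 'UTF-8')])); // bool(false)
```

Writing the cache as a PHP file lets it be loaded with a plain `require`; a serialized or JSON file would work just as well.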

Examples
--------

### Manual Filtering + Traversal

Say you wish to dump all characters that hold a numeric property and reside outside of the Basic Latin (ASCII) block. You could use the `Collection::filterWith(callable $filter)` method to interrogate the properties of each `Character` instance, and then dump their Latin equivalent representation by calling `::getNumber()` on the `Numericity` property. For example:

```
use UCD\Unicode\Character;
use UCD\Unicode\Character\Properties\General\Block;
use UCD\Database;

$filter = function (Character $character) {
    $properties = $character->getProperties();
    $general = $properties->getGeneral();
    $block = $general->getBlock();

    return $properties->isNumeric()
        && !$block->equals(Block::fromValue(Block::BASIC_LATIN));
};

$dumper = function (Character $character) {
    $codepoint = $character->getCodepoint();
    $properties = $character->getProperties();
    $numericity = $properties->getNumericity();
    $number = $numericity->getNumber();
    $utf8 = $codepoint->toUTF8();

    printf("%s: %s (~ %s)\n", $codepoint, $utf8, $number);
};

Database::fromDisk()
    ->onlyCharacters()
    ->filterWith($filter)
    ->traverseWith($dumper);

// outputting:
//  U+B2: ² (~ 2)
//  U+B3: ³ (~ 3)
//  U+B9: ¹ (~ 1)
//  U+BC: ¼ (~ 1/4)
//  U+BD: ½ (~ 1/2)
//  U+BE: ¾ (~ 3/4)
//  U+660: ٠ (~ 0)
//  U+661: ١ (~ 1)
//  U+662: ٢ (~ 2)
//  U+663: ٣ (~ 3)
//
```

### Codepoint Lookup

Locating an individual character by its codepoint value is trivial:

```
use UCD\Database;
use UCD\Unicode\Codepoint;

$database = Database::fromDisk();
$codepoint = Codepoint::fromInt(9731);
// ..or $codepoint = Codepoint::fromHex('2603');
// ..or $codepoint = Codepoint::fromUTF8('☃');
$character = $database->getCharacterByCodepoint($codepoint);
$properties = $character->getProperties();
$general = $properties->getGeneral();
$names = $general->getNames();

// prints "U+2603: SNOWMAN"
printf("%s: %s\n", $character->getCodepoint(), $names->getPrimary());
```
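The three ways of naming the snowman codepoint above (integer, hex string, UTF-8 character) all describe the same value. The following sketch checks that using only standard PHP functions, assuming PHP 7.2+ with mbstring; it does not call into this library.

```php
<?php
$fromInt  = 9731;                  // decimal codepoint value
$fromHex  = hexdec('2603');        // the same value written in hexadecimal
$fromUtf8 = mb_ord('☃', 'UTF-8');  // decoded from the UTF-8 encoded character

var_dump($fromInt === $fromHex && $fromHex === $fromUtf8); // bool(true)

// Convert back from the integer to the encoded character.
printf("U+%X: %s\n", $fromInt, mb_chr($fromInt, 'UTF-8')); // U+2603: ☃
```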

It is just as trivial to interrogate multiple codepoints. For example, you could print the name of every codepoint residing within a string:

```
use UCD\Database;
use UCD\Unicode\Codepoint;

$database = Database::fromDisk();
$string = 'abc';
$codepoints = Codepoint\Collection::fromUTF8($string);
$assigned = $database->getByCodepoints($codepoints);

foreach ($assigned->getCharacters() as $character) {
    $properties = $character->getProperties();
    $general = $properties->getGeneral();
    $names = $general->getNames();

    printf("%s: %s\n", $character->getCodepoint(), $names->getPrimary());
}

// outputting:
//  U+61: LATIN SMALL LETTER A
//  U+62: LATIN SMALL LETTER B
//  U+63: LATIN SMALL LETTER C
```

Factory methods are available on the `Codepoint` and `Codepoint\Collection` classes to construct instances based on UTF-8, UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE encoded character(s).
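For intuition about what decoding from these encodings involves, here is a rough sketch in plain PHP (7.0+ with mbstring). `codepointsOf()` is a hypothetical helper, not one of the library's factory methods, and the library may decode differently.

```php
<?php
// Returns the codepoints of $string, which is encoded as $encoding.
function codepointsOf(string $string, string $encoding): array
{
    // UTF-32BE stores every codepoint as one unsigned 32-bit big-endian integer.
    $utf32 = mb_convert_encoding($string, 'UTF-32BE', $encoding);

    // unpack() indexes from 1, so re-index the result from 0.
    return array_values(unpack('N*', $utf32));
}

$utf8 = 'a☃';
$utf16le = mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

var_dump(codepointsOf($utf8, 'UTF-8') === [0x61, 0x2603]);       // bool(true)
var_dump(codepointsOf($utf16le, 'UTF-16LE') === [0x61, 0x2603]); // bool(true)
```

The byte sequences differ per encoding, but the resulting codepoints are identical.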

### Regex Building

The library provides a means to build regular expression character classes based on codepoints that have been extracted or aggregated from a collection of characters. For example, if you wanted to produce a regular expression that matches numeric Bengali characters, you could use something along the lines of:

```
use UCD\Database;
use UCD\Unicode\Character;
use UCD\Unicode\Character\Properties\General\Block;

$filter = function (Character $character) {
    $properties = $character->getProperties();
    $general = $properties->getGeneral();
    $block = $general->getBlock();

    return $properties->isNumeric()
        && $block->equals(Block::fromValue(Block::BENGALI));
};

$cc = Database::fromDisk()
    ->onlyCharacters()
    ->filterWith($filter)
    ->extractCodepoints()
    ->aggregate()
    ->toRegexCharacterClass();

$regex = sprintf('/^%s$/u', $cc);

var_dump($regex); // string(37) "/^[\x{9E6}-\x{9EF}\x{9F4}-\x{9F9}]$/u"
var_dump(preg_match($regex, '১')); // int(1)
var_dump(preg_match($regex, '1')); // int(0)
```
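Conceptually, the aggregation step collapses consecutive codepoints into ranges before the character class is rendered. The sketch below reproduces that idea in plain PHP under stated assumptions: `aggregateRanges()` and `toCharacterClass()` are hypothetical names, the input list is hard-coded to match the output shown above, and the library's actual implementation may differ.

```php
<?php
// Collapses integers into [first, last] pairs covering runs of consecutive values.
function aggregateRanges(array $codepoints): array
{
    $codepoints = array_unique($codepoints);
    sort($codepoints);
    $ranges = [];

    foreach ($codepoints as $cp) {
        $last = count($ranges) - 1;
        if ($last >= 0 && $ranges[$last][1] === $cp - 1) {
            $ranges[$last][1] = $cp; // extend the current run
        } else {
            $ranges[] = [$cp, $cp];  // start a new run
        }
    }

    return $ranges;
}

// Renders the ranges as a PCRE character class using \x{...} escapes.
function toCharacterClass(array $ranges): string
{
    $class = '';
    foreach ($ranges as list($first, $last)) {
        $class .= sprintf('\x{%X}', $first);
        if ($last > $first) {
            $class .= sprintf('-\x{%X}', $last);
        }
    }

    return '[' . $class . ']';
}

$codepoints = array_merge(range(0x9E6, 0x9EF), range(0x9F4, 0x9F9));
$cc = toCharacterClass(aggregateRanges($codepoints));

echo $cc, "\n"; // [\x{9E6}-\x{9EF}\x{9F4}-\x{9F9}]
```

Aggregating first keeps the generated expression short: sixteen codepoints become two ranges.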

### Map Building

This library can be used for building maps for various purposes. One such example is building a lowercase → uppercase character map. This is relatively simple to achieve; interrogate the properties of each character to check whether a mapping to a different character exists - if one does, print it out in PHP syntax:

```
use UCD\Database;

$characters = Database::fromDisk()
    ->onlyCharacters();

echo 'static $map = [' . PHP_EOL;

foreach ($characters as $character) {
    $codepoint = $character->getCodepoint();
    $properties = $character->getProperties();
    $case = $properties->getLetterCase();
    $mappings = $case->getMappings();
    $upperMapping = $mappings->getUppercase();
    $upper = $upperMapping->getSimple();

    if (!$upper->equals($codepoint)) {
        $from = $codepoint->toUnicodeEscape();
        $to = $upper->toUnicodeEscape();
        printf('    "%s" => "%s",', $from, $to);
        echo PHP_EOL;
    }
}

echo '];';

// outputting:
//  static $map = [
//      "\u{61}" => "\u{41}",
//      "\u{62}" => "\u{42}",
//      "\u{63}" => "\u{43}",
//
//      "\u{1E942}" => "\u{1E920}",
//      "\u{1E943}" => "\u{1E921}",
//  ];
```

This can then be used as follows:

```
$lower = 'aς!';
$upper = '';

for ($i = 0; $i < mb_strlen($lower); $i++) {
    $char = mb_substr($lower, $i, 1);
    $upper .= $map[$char] ?? $char;
}

var_dump($upper); // string(4) "AΣ!"
```
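As an alternative to the character-by-character loop, PHP's built-in `strtr()` can apply such a map in a single call. The sketch below assumes PHP 7.0+ (for the `\u{...}` escapes) and uses a hand-written two-entry excerpt of the map for illustration, not the full generated output.

```php
<?php
// Hand-written excerpt of a generated lowercase-to-uppercase map (illustration only).
$map = [
    "\u{61}"  => "\u{41}",  // a to A
    "\u{3C2}" => "\u{3A3}", // final sigma to capital sigma
];

// strtr() replaces every key found in the string; unmapped bytes are left untouched.
var_dump(strtr('aς!', $map)); // string(4) "AΣ!"
```

Because the keys are complete UTF-8 sequences, a key can only match at a character boundary, so no multi-byte character is split.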

Executable
----------

The primary intention of this project is to act as a library; however, a small utility command is available for testing and for database generation and manipulation. `bin/ucd search ` will dump character information, and `bin/ucd repository-transfer  ` will transfer characters from one repository implementation to another. Run `bin/ucd` for more detailed help.

Properties
----------

The intention is to make the most interesting of the available character properties, as described in [Unicode Standard Annex #44, Unicode Character Database - Properties](http://www.unicode.org/reports/tr44/), available for interrogation. There are, however, a large number of them, so this remains a work in progress. The following are currently covered:

- Name
- Block
- Age
- General Category
- Numeric Value
- Numeric Type
- Normalization
- Canonical Combining Class
- Decomposition Mapping
- Decomposition Type
- Join Control
- Joining Group
- Joining Type
- Bidi Class
- Bidi Control
- Bidi Mirrored
- Bidi Mirroring Glyph
- Bidi Paired Bracket
- Bidi Paired Bracket Type

Tests
-----

[PhpSpec](http://www.phpspec.net/) class specifications and [PHPUnit](https://phpunit.de/) backed integration tests are provided. The easiest way to run them is via the Makefile; simply run `make test`.
