PHPackages                             joshdaugherty/ipa-unicode-inventory - PHPackages - PHPackages  [Skip to content](#main-content)[PHPackages](/)[Directory](/)[Categories](/categories)[Trending](/trending)[Leaderboard](/leaderboard)[Changelog](/changelog)[Analyze](/analyze)[Collections](/collections)[Log in](/login)[Sign up](/register)

1. [Directory](/)
2. /
3. [Validation &amp; Sanitization](/categories/validation)
4. /
5. joshdaugherty/ipa-unicode-inventory

ActiveLibrary[Validation &amp; Sanitization](/categories/validation)

joshdaugherty/ipa-unicode-inventory
===================================

Versioned IPA and extIPA Unicode allowlist (JSON + JSON Schema) with optional normalization rules.

v1.6.2(2w ago)0208↑25%[1 issues](https://github.com/joshdaugherty/ipa-unicode-inventory/issues)MITPHPPHP ^8.1CI passing

Since Apr 12Pushed 2w agoCompare

[ Source](https://github.com/joshdaugherty/ipa-unicode-inventory)[ Packagist](https://packagist.org/packages/joshdaugherty/ipa-unicode-inventory)[ Docs](https://github.com/joshdaugherty/ipa-unicode-inventory)[ RSS](/packages/joshdaugherty-ipa-unicode-inventory/feed)WikiDiscussions main Synced 1w ago

READMEChangelog (9)Dependencies (2)Versions (12)Used By (0)

IPA Unicode Inventory
=====================

[](#ipa-unicode-inventory)

Standalone, language-agnostic **source data** and **generated artifacts** for Unicode scalars treated as IPA-relevant under an explicit, documented policy. Use this instead of ad hoc regex allowlists inside apps.

- **Canonical data:** `data/inventory.json` (**corpus\_inclusive**), `data/inventory.phonetic-strict.json` (**phonetic\_strict**), optional `data/normalization.json`
- **Schemas:** `schema/` (JSON Schema draft 2020-12)
- **PHP:** Composer package `joshdaugherty/ipa-unicode-inventory` (see Consumer quick start)
- **Build:** Node.js 18+ — `npm ci` then `npm run build` → `build/output/`

Versioning and data shape for `schema_version`, `dataset_version`, and categories are defined in `schema/*.schema.json` and `data/inventory.json` → `meta`.

Policy (current release)
------------------------

[](#policy-current-release)

FieldValue`policy_id``ipa-extipa-corpus-inclusive` (default bundle)`profile_id``corpus_inclusive` (in `inventory.json` meta)`dataset_version``1.6.2``schema_version``1.0.0``dataset_version` **1.6.2** matches the current npm/Composer release line. **PHP:** install via Composer on [Packagist](https://packagist.org/) as `joshdaugherty/ipa-unicode-inventory` (submit the repo and tag releases such as **`v1.6.2`**).

The inventory covers **core IPA** and **extIPA-oriented Unicode** (as above) plus **in-band transcription and corpus punctuation**:

- parentheses, square brackets, slashes, braces, angle brackets (ASCII and U+27E8/U+27E9), comma, full stop, pipe, colon, hyphen, equals, plus, underscore, quotes (ASCII and common typographic), guillemets, ellipsis, and similar tier markers, all tagged **`delimiter`** where applicable;
- **ASCII digits and space** are **`other`** for tone indices, timing labels, and running text;
- **ASCII Latin** beside IPA is **lowercase only** (uppercase **A–Z** are not listed).

Consumers can **strip `delimiter`** (and optionally space/digits) for phonetic-only checks, or load the bundled **`phonetic_strict`** inventory (below), which omits those rows. It still **does not** assert phonological well-formedness or a Unicode `Is_IPA` property — see the policy paragraph above and [Extensions to the IPA](https://en.wikipedia.org/wiki/Extensions_to_the_International_Phonetic_Alphabet) for the clinical symbol set.

### Policy profiles

[](#policy-profiles)

`profile_id`File`policy_id`Role`corpus_inclusive``data/inventory.json``ipa-extipa-corpus-inclusive`Default: phonetic symbols **plus** delimiter rows and ASCII space/digits for transcriptions and corpora.`phonetic_strict``data/inventory.phonetic-strict.json``ipa-extipa-phonetic-strict`Subset: same phonetic Unicode rows **without** `delimiter` category, ASCII space, or ASCII digits (ASCII Latin **lowercase** remains for mixed orthography). Normalization targets (e.g. U+02BC) stay valid.**`meta.dataset_version`**, **`schema_version`**, and **`unicode_version_min`** match across both profiles and **`normalization.json`**. **`MetaConstants`** reflects the **default** (`corpus_inclusive`) bundle only. PHP: **`Resources::inventoryJsonPathForProfile(PolicyProfile::PHONETIC_STRICT)`** (or **`CORPUS_INCLUSIVE`**) and **`composer.json` → `extra.ipa-unicode-inventory.paths.profiles`**.

Consumer quick start
--------------------

[](#consumer-quick-start)

1. **JSON:** Read `data/inventory.json` (or **`inventory.phonetic-strict.json`**) or the minified `build/output/inventory.min.json` / **`inventory.phonetic-strict.min.json`**. Build a `Set` of `cp` integers in memory.
2. **PCRE (UTF-8 + `/u`):** Insert `build/output/pcre-class-fragment.txt` or **`pcre-class-fragment.phonetic-strict.txt`** inside a character class, e.g. `/^[...fragment...]+$/u` — the fragment uses `\x{H...}` escapes only (no surrounding `[` `]`).
3. **PHP (Composer):** `composer require joshdaugherty/ipa-unicode-inventory`, then use `JoshDaugherty\IpaUnicodeInventory\Resources` for paths to the bundled JSON and `InventoryLoader::loadInventory()` / `InventoryLoader::codePointLookup()` for decoded data. **Tooling:** `composer.json` → **`extra.ipa-unicode-inventory.paths`** lists **`inventory_json`**, **`normalization_json`**, **`schema_directory`**, and **`profiles`** (`corpus_inclusive`, `phonetic_strict`) relative to the package root. **`MetaConstants`** exposes **`DATASET_VERSION`**, **`POLICY_ID`**, **`PROFILE_ID`**, and **`SCHEMA_VERSION`** from the default `inventory.json` → `meta` (generated into `src/MetaConstants.php` by **`npm run build`**; **`npm test`** checks it stays in sync). For a **cached scalar allowlist**, use `Inventory::fromDisk()` (optional path) and `isScalarAllowed(int $cp)` — surrogates and out-of-range code points return false. **`TranscriptionValidator::fromDisk()`** runs delimiter stripping (**none**, inventory **`delimiter`** rows, **custom** code points, or **`STRIP_DELIMITERS_WIKIMEDIA_SLASH_BRACKETS`** for Wikimedia **`$stripRegex`** — only `/` `[` `]`), optional **`normalization.json`** (longest `from` first), optional **Wikimedia-style ASCII** (`'`→ˈ, `:`→ː, `,`→ˌ), optional **Google/TTS** normalization (parentheses removal, modifier-letter → ASCII map, then strip **U+0300–U+036F**; requires **`wikimediaLegacyAscii`**), optional **`SEGMENT_GRAPHEME_CLUSTER`** final walk (**`ext-intl`**, **`IntlBreakIterator`** — same per-scalar allowlist as default), then **`isValid()`** — requires **`ext-mbstring`**; grapheme mode additionally suggests **`ext-intl`**. Delimiter stripping runs *before* legacy ASCII; **`'`** is an inventory **`delimiter`**, so **`STRIP_DELIMITERS_INVENTORY`** removes it before **`'`→ˈ** — use **`STRIP_DELIMITERS_NONE`**, **`CUSTOM`**, or **`WIKIMEDIA_SLASH_BRACKETS`** if you need that mapping. For **phonetic-only** validation without corpus delimiters, use **`Resources::inventoryJsonPathForProfile(PolicyProfile::PHONETIC_STRICT)`**. Submit the Git repo to [Packagist](https://packagist.org/) and tag a release (e.g. **`v1.6.2`**) so the package resolves.
4. **PHP (generated array):** After `npm run build`, include `build/output/php/AllowedCodePoints.php` or **`AllowedCodePoints.phonetic-strict.php`** for a `0xNNN => true` map (generated only; not committed).
5. **Integrity:** Check `build/output/manifest.json` SHA-256 digests after downloading release assets.

### Distribution: Composer archives, git clones, and release assets

[](#distribution-composer-archives-git-clones-and-release-assets)

**Packagist / Composer dist** (what you get from `composer require joshdaugherty/ipa-unicode-inventory`) is a **slim zip** defined by **`composer.json` → `archive.exclude`** and **`.gitattributes` → `export-ignore`**. It **includes** at least:

- **`src/`** — PHP (`Inventory`, `TranscriptionValidator`, `Resources`, `MetaConstants`, etc.)
- **`data/`** — `inventory.json`, `inventory.phonetic-strict.json`, `normalization.json`
- **`schema/`** — JSON Schemas for strict validation
- **`docs/`** — e.g. `mediawiki-parity.md`
- Root **`composer.json`**, **`README.md`**, **`LICENSE`**, **`CONTRIBUTING.md`**, **`CHANGELOG.md`**, **`phpunit.xml.dist`** (present in the archive even though tests are omitted)

It **omits** **`tests/`**, **`scripts/`**, **`package.json`**, **`package-lock.json`**, **`.github/`**, **`.gitignore`**, and **`node_modules/`** (not committed). **`build/output/`** is **not** in the git tag at all (`build/` is gitignored), so **`pcre-class-fragment.txt`**, **`inventory.min.json`**, **`manifest.json`**, and generated **`AllowedCodePoints*.php`** under **`build/output/`** do **not** ship with Composer. **PHP-only consumers** who want the PCRE fragment or minified JSON should **build locally** (`npm ci && npm run build`) or **download release assets** (below).

**GitHub “Source code”** archives (zip/tarball on a tag) are the **full repository tree** at that revision: same as a `git clone` without unpublished files. They still **exclude** generated **`build/output/`** unless you commit it (this project does not).

**GitHub Releases — attached binaries:** This repository’s **maintainer checklist** is to attach **`npm run build`** outputs for consumers who do not use Node, for example **`inventory.min.json`**, **`inventory.phonetic-strict.min.json`**, **`manifest.json`**, **`pcre-class-fragment.txt`**, **`pcre-class-fragment.phonetic-strict.txt`**, **`code_points.txt`**, **`code_points.phonetic-strict.txt`**, and optionally **`php/AllowedCodePoints.php`** / **`php/AllowedCodePoints.phonetic-strict.php`**. Published releases also receive **`mediawiki-parity.md`** (and **`.log`**) from **`.github/workflows/release-parity.yml`** automatically.

### Normalization

[](#normalization)

If you apply `data/normalization.json`, apply rules **longest-`from` first**, then validate scalars against the inventory. **U+2018** and **U+2019** map to MODIFIER LETTER APOSTROPHE (U+02BC). Both are also listed as in-band **delimiters**, so strings may validate without normalization; use normalization when you want a single preferred glottal apostrophe scalar.

### Optional strict JSON Schema validation (PHP)

[](#optional-strict-json-schema-validation-php)

By default, **`InventoryLoader`** only checks that **`meta`** and **`code_points`** / **`rules`** exist. To validate the full document against the bundled **draft 2020-12** wrappers under **`schema/`**:

1. Install the optional dependency: **`composer require justinrainbow/json-schema`** (see **`suggest`** in this package’s `composer.json`). The repo’s **`require-dev`** includes it so **`composer test`** can cover strict mode.
2. Pass **`true`** for schema validation where each API documents it: **`InventoryLoader::loadInventory($path, true)`**, **`loadNormalization($path, true)`**, **`codePointLookup($path, true)`**, **`delimiterScalarSet($path, true)`**; **`Inventory::fromDisk($path, true)`**; **`TranscriptionValidator::fromDisk(..., $segmentationMode: TranscriptionValidator::SEGMENT_SCALARS, $validateSchema: true)`** (last parameter is **`$validateSchema`**).
3. Or decode JSON yourself and call **`BundleSchemaValidator::assertInventoryDocumentValid($data)`** / **`assertNormalizationDocumentValid($data)`**. Use **`BundleSchemaValidator::isAvailable()`** if you need to branch before requiring the package.

If strict mode is requested but **`justinrainbow/json-schema`** is not installed, a **`RuntimeException`** explains how to add it. Validation failures throw **`RuntimeException`** with schema error details.

### Validation model: Unicode scalars (default) and optional grapheme-cluster walk

[](#validation-model-unicode-scalars-default-and-optional-grapheme-cluster-walk)

**`Inventory::isScalarAllowed()`** is always **per Unicode scalar** (code point).

**`TranscriptionValidator::isValid()`** (default **`SEGMENT_SCALARS`**) walks the post-pipeline string **scalar-by-scalar** with `mb_str_split` / `mb_ord` — each scalar is checked against the allowlist independently.

**Optional `SEGMENT_GRAPHEME_CLUSTER`** (requires **`ext-intl`**, **`TranscriptionValidator::graphemeSegmentationAvailable()`**): the final pass uses **`IntlBreakIterator::createCharacterInstance()`** (ICU **extended grapheme cluster** boundaries) as the iteration unit, but the rule is still **Option A — every scalar inside each cluster must be allowlisted** (same outcome as scalar mode for typical IPA where one grapheme is one scalar, but alignment matches UI / copy-paste segmentation).

- **In scope:** Supplementary planes, BMP letters, combining marks (e.g. U+0301) as separate inventory rows; delimiter code points; optional EGC **walk** without changing the scalar allowlist.
- **Out of scope:** **Cluster-as-token** allowlists (only whole clusters listed), tailored locale collation, or NFC/NFD as a built-in normalization rule. Precomposed vs decomposed encodings of the “same” letter still require each involved scalar to appear in the inventory (or a prior normalization step).
- **PCRE:** A pattern like `/^[…fragment…]+$/u` is **per UTF-8 code point**, not per extended grapheme cluster — use **`TranscriptionValidator`** grapheme mode if you need ICU-consistent cluster boundaries in PHP.

### Migrating from Wikimedia IPAValidator

[](#migrating-from-wikimedia-ipavalidator)

Upstream library: [`mediawiki-libs-IPAValidator`](https://github.com/wikimedia/mediawiki-libs-IPAValidator) ([Packagist `wikimedia/ipa-validator`](https://packagist.org/packages/wikimedia/ipa-validator)). It validates against a single **`$ipaRegex`** (whole string must match after optional strip/normalize). This repository is **policy data + optional PHP helpers**; behavior overlaps but is not identical.

TopicWikimedia `IPAValidator\Validator`This package**Primary check**`preg_match` on normalized string vs `$ipaRegex`Scalar allowlist from `data/inventory.json` (or generated PCRE class fragment for whole-string regex)**Normalization**Optional: ASCII `'`→ˈ, `:`→ː, `,`→ˌ (`$normalize`)`normalization.json` (e.g. U+2018/U+2019→U+02BC, longest-`from` first) **plus** optional same ASCII map in `TranscriptionValidator` (`wikimediaLegacyAscii`)**Delimiter handling**Optional `stripRegex` when `$strip`Inventory **`delimiter`** rows, custom code points, none, or **`STRIP_DELIMITERS_WIKIMEDIA_SLASH_BRACKETS`** (only `/` `[` `]` — Wikimedia **`$stripRegex`**)**Pipeline order**Strip → normalize (normalize may strip again)Strip delimiters → `normalization.json` → optional Wikimedia ASCII → optional Google/TTS → scalar checks (optional EGC walk via **`SEGMENT_GRAPHEME_CLUSTER`**)**Google / TTS mode**`$google` (extra replacements + diacritic stripping)**`TranscriptionValidator::fromDisk(..., $wikimediaLegacyAscii: true, $googleTtsNormalization: true)`** — same char map + **U+0300–U+036F** removal as upstream; **requires** legacy ASCII enabled**`@` (U+0040)**Not in `$ipaRegex` (fails validation if present)In inventory as **`delimiter`** (allowed in-band unless you strip delimiters)**Ligatures / digraph letters**Allowed only if in `$ipaRegex`Allowed if listed (e.g. **ʧ** U+02A7); no special “decompose ligature” step**Parity tooling**—`npm run compare:mediawiki` diffs **regex class** vs inventory (not full PHP behavior)Start from **`TranscriptionValidator::fromDisk()`** if you want strip + normalize + scalar checks in one place. For Wikimedia **`$strip`** parity on **`/` `[` `]`** only, use **`STRIP_DELIMITERS_WIKIMEDIA_SLASH_BRACKETS`** (equivalent to **`preg_replace('/[\/\[\]]/u', '', $s)`** on well-formed UTF-8); that keeps ASCII **`'`** so you can enable **`wikimediaLegacyAscii`** for **`'`→ˈ** without **`STRIP_DELIMITERS_NONE`**. **`STRIP_DELIMITERS_INVENTORY`** removes every inventory **`delimiter`**, including **`'`**, so it does **not** match upstream **`$stripRegex`**. For **`$google`**, pass **`googleTtsNormalization: true`** after **`wikimediaLegacyAscii: true`**; Google **strips combining marks in U+0300–U+036F**, so validate policy implications for narrow IPA.

Authoring fixtures and source files with correct UTF-8
------------------------------------------------------

[](#authoring-fixtures-and-source-files-with-correct-utf-8)

This inventory only works if the consuming code stores IPA characters as canonical UTF-8 — the same byte sequences the inventory itself uses. The most common authoring mistake is **Windows-1252 double-encoding**: an editor (or a `git`/`Composer`/CI step) reads UTF-8 bytes as cp1252, then re-encodes the (now wrong) codepoints back to UTF-8. The result is a string that *looks* normal to a tolerant PHP/Node runtime but is silently mojibake at the byte level, and stricter runtimes (Ubuntu PHP 8.4, ICU-backed `preg_match`, etc.) reject it. See [issue #1](https://github.com/joshdaugherty/ipa-unicode-inventory/issues/1) for the originating downstream incident.

### Canonical byte reference for common IPA scalars

[](#canonical-byte-reference-for-common-ipa-scalars)

If you copy these characters into a test fixture, the on-disk bytes must match the **Canonical UTF-8** column exactly. The **Mojibake** column lists the byte sequence you would see if the file was double-encoded via cp1252 — that is what a CI guard should reject.

ScalarCodepointCanonical UTF-8cp1252-double-encoded mojibake`ʰ`U+02B0`CA B0``C3 8A C2 B0``ʤ`U+02A4`CA A4``C3 8A C2 A4``ʊ`U+028A`CA 8A``C3 8A CB 86``ɪ`U+026A`C9 AA``C3 89 C2 AA``ə`U+0259`C9 99``C3 89 E2 84 A2``ɚ`U+025A`C9 9A``C3 89 C5 A1``ɛ`U+025B`C9 9B``C3 89 E2 80 BA``ɑ`U+0251`C9 91``C3 89 E2 80 98``ˈ`U+02C8`CB 88``C3 8B CB 86``ː`U+02D0`CB 90``C3 8B C2 90``̥`U+0325`CC A5``C3 8C C2 A5``̊`U+030A`CC 8A``C3 8C CB 86`For example, the worked downstream string `pʰə̥ˈkj̊uːliɚ` is canonically `70 CA B0 C9 99 CC A5 CB 88 6B 6A CC 8A 75 CB 90 6C 69 C9 9A` on disk; any deviation (extra `C3 8A`, `C3 89`, `C3 8B`, `C3 8C` bytes, or the longer `E2 …` triplets shown above) means the file has been double-encoded.

### How to verify a file's bytes

[](#how-to-verify-a-files-bytes)

```
# PowerShell (Windows): dump a file's bytes as hex
Format-Hex .\path\to\fixture.php

# Or just the IPA characters of interest
[System.Text.Encoding]::UTF8.GetBytes('ʰə̥') | ForEach-Object { '{0:X2}' -f $_ }
```

```
# POSIX: same with xxd
xxd path/to/fixture.php | head
printf 'ʰə̥' | xxd
```

A correct fixture will show `CA B0 C9 99 CC A5` for `ʰə̥`. A double-encoded one will show the longer `C3 8A C2 B0 C3 89 …` runs from the table above.

### Editor configuration

[](#editor-configuration)

The corruption almost always originates at editor-save time on a Windows host with a non-UTF-8 default. Set the file encoding explicitly:

- **VS Code:** `"files.encoding": "utf8"` and `"files.autoGuessEncoding": false` in workspace settings; the status-bar encoding indicator should read **UTF-8** (not "Windows 1252" or "ISO 8859-1").
- **PhpStorm / IntelliJ:** *Settings → Editor → File Encodings* → **Project Encoding = UTF-8**, **BOM policy = "do not use BOM"**.
- **Notepad / generic editors:** avoid; if unavoidable, "Save As → UTF-8 (without BOM)".

If you suspect an existing fixture is corrupt, the one-shot repair is:

```
file_put_contents($path, mb_convert_encoding(file_get_contents($path), 'ISO-8859-1', 'UTF-8'));
```

This interprets the file's existing UTF-8 bytes as codepoints, then writes those codepoints as Latin-1 — producing the original (pre-corruption) UTF-8 byte stream.

### Pre-commit / CI hook

[](#pre-commit--ci-hook)

A guard that greps tracked files for the byte sequences in the **Mojibake** column above will catch the regression at commit / CI time. The patterns are not IPA-specific — the same cp1252 round-trip mangles em-dashes, curly quotes, currency symbols, modifier letters, and combining marks — so the same guard is reusable in any downstream repo that authors UTF-8 fixtures on Windows.

Development
-----------

[](#development)

```
npm ci
npm test        # validate schemas, meta alignment, build, fixture tests, manifest digests
npm run build   # write build/output/ and src/MetaConstants.php from inventory meta
npm run compare:mediawiki   # optional; needs network
npm run compare:ipa-chart   # optional; needs network
```

**PHP (Composer package):** `composer install` then `composer test` (PHPUnit golden strings under `tests/`).

### Compare to Wikimedia IPAValidator

[](#compare-to-wikimedia-ipavalidator)

To diff this repo’s allowlist against the character class baked into [mediawiki-libs-IPAValidator `Validator.php`](https://github.com/wikimedia/mediawiki-libs-IPAValidator/blob/main/src/Validator.php) (`$ipaRegex`):

```
npm run compare:mediawiki
```

The script loads `data/inventory.json`, fetches the upstream PHP file when network is available (otherwise uses an embedded snapshot of the class body), expands regex ranges inside `[...]`, and prints any **MediaWiki scalars missing from our inventory**, plus how many **extra** scalars we allow (this project is usually a **superset**). Use `node scripts/compare-mediawiki-validator.mjs --strict` if you want a non-zero exit code when parity is incomplete (for example in a custom CI check).

**Markdown report (checked in):** [`docs/mediawiki-parity.md`](docs/mediawiki-parity.md) — regenerate with `npm run compare:mediawiki:doc` (or `--write-markdown `). **CI** uploads `mediawiki-parity.md` and a console **`.log`** as workflow artifacts (`mediawiki-parity`). **Published releases** attach the same files via `.github/workflows/release-parity.yml`.

That validator also applies `stripRegex` and optional normalization before matching; the comparison is **only** the static allowlist implied by `$ipaRegex`, not full PHP behavior. With corpus delimiters included, the MediaWiki class should report **zero** missing scalars (full literal parity).

### Compare to westonruter/ipa-chart

[](#compare-to-westonruteripa-chart)

[westonruter/ipa-chart](https://github.com/westonruter/ipa-chart) is a Unicode IPA chart (and keyboard) in HTML. Code points are declared on pickable symbols as `title="U+XXXX: …"`. To list any chart scalars **missing** from our inventory (and how many of ours are **not** on that chart):

```
npm run compare:ipa-chart
```

This fetches `index.html` and `accessiblechart.html` from the default branch. Use `--strict` for a non-zero exit if anything is missing. Our inventory is expected to be a **superset** of the 2005 chart glyphs; gaps usually mean a deliberate policy choice or a chart update worth reviewing.

**Runtime:** Node **18+** for `scripts/build.js`, `scripts/validate-schemas.mjs`, and tests. **Python 3** is optional, for `scripts/gen-inventory.py` when regenerating the default inventory from Unicode ranges.

`build/output/` is gitignored; CI builds on every push/PR. See **Consumer quick start → Distribution** for what Composer ships vs what to attach to **GitHub Releases**.

Sources, authorities, and why this repo is still “policy-defined”
-----------------------------------------------------------------

[](#sources-authorities-and-why-this-repo-is-still-policy-defined)

No live service exposes a complete, normative **`Is_IPA`** property the way the [UCD](https://www.unicode.org/ucd/) exposes character properties. What you can cite and trace is a **small set of authorities**, then **this repository applies policy** wherever Unicode is ambiguous or broader than you want.

### Strong primary sources (by role)

[](#strong-primary-sources-by-role)

1. **[International Phonetic Association (IPA)](https://www.internationalphoneticassociation.org/)** — The official chart and handbook define which *linguistic* symbols count as IPA. That is the right basis for “is this symbol on the chart / in official extensions?” The IPA does not typically ship a maintained, machine-readable Unicode table; you still map chart cells to scalars yourself.
2. **[Unicode Consortium](https://www.unicode.org/charts/)** — [Code charts](https://www.unicode.org/charts/) and the [UCD](https://www.unicode.org/ucd/) give **objective encoding**: assigned code points, names, and block boundaries (e.g. **IPA Extensions** U+0250–U+02AF, **Phonetic Extensions**, and related blocks). Treat that as a **mechanical superset**, not “IPA-only,” because those blocks can include historical or non–chart-specific letters until you filter.
3. **[SIL International](https://www.sil.org/)** — Widely used IPA ↔ Unicode reference material (charts, keyboards, which code point corresponds to which glyph). Strong **practical** alignment with what linguists type; still curated documentation, not an `Is_IPA` API.

Together, **IPA (symbols)** + **Unicode (code points)** + optionally **SIL (mapping practice)** give a **documented basis** for “IPA-relevant.” This inventory remains **policy-defined** in the narrower sense: which blocks, which chart edition, whether extIPA, digraphs, or delimiters are in scope, and how you normalize (e.g. ligature vs decomposed spelling), even when each line traces back to those sources.

### Secondary references (useful, not normative alone)

[](#secondary-references-useful-not-normative-alone)

- Wikipedia’s IPA and Unicode articles — quick cross-checks, not a standards body.
- Community tools such as [westonruter/ipa-chart](https://westonruter.github.io/ipa-chart/) — helpful for UX and spot checks; your **policy** should still point at IPA + Unicode (and SIL if you want a third leg).

### Beyond “textbook” IPA

[](#beyond-textbook-ipa)

**extIPA** and similar extensions are defined by clinical / phonetic communities (separate charts and guidelines), not by a single Unicode flag. Treat them as an **additional documented policy layer** on top of core IPA if you need them.

### Attribution (encoding facts)

[](#attribution-encoding-facts)

Scalar identities and names follow the [Unicode Standard](https://www.unicode.org/versions/latest/). This dataset is **not** an official Unicode “IPA property”; it is a versioned, machine-readable allowlist plus optional normalization rules under explicit policy in `data/inventory.json` → `meta`.

Maintainer
----------

[](#maintainer)

[Josh Daugherty](https://github.com/joshdaugherty)

License
-------

[](#license)

SPDX: **MIT** — see `LICENSE`.

###  Health Score

44

—

FairBetter than 90% of packages

Maintenance96

Actively maintained with recent releases

Popularity16

Limited adoption so far

Community6

Small or concentrated contributor base

Maturity49

Maturing project, gaining track record

 Bus Factor1

Top contributor holds 100% of commits — single point of failure

How is this calculated?**Maintenance (25%)** — Last commit recency, latest release date, and issue-to-star ratio. Uses a 2-year decay window.

**Popularity (30%)** — Total and monthly downloads, GitHub stars, and forks. Logarithmic scaling prevents top-heavy scores.

**Community (15%)** — Contributors, dependents, forks, watchers, and maintainers. Measures real ecosystem engagement.

**Maturity (30%)** — Project age, version count, PHP version support, and release stability.

###  Release Activity

Cadence

Every ~5 days

Total

8

Last Release

19d ago

### Community

Maintainers

![](https://www.gravatar.com/avatar/c1dc6ba5b4116150b6d478982b4ee6d9d47fac2e379f53bd27dcd9c28720dee8?d=identicon)[joshdaugherty](/maintainers/joshdaugherty)

---

Top Contributors

[![joshdaugherty](https://avatars.githubusercontent.com/u/15675878?v=4)](https://github.com/joshdaugherty "joshdaugherty (39 commits)")

---

Tags

extipaipajson-schemalinguisticsnodejsopen-dataphoneticstranscriptionunicodevalidationjson-schemaunicodeipaTranscriptionPhoneticsextipa

###  Code Quality

TestsPHPUnit

### Embed Badge

![Health badge](/badges/joshdaugherty-ipa-unicode-inventory/health.svg)

```
[![Health](https://phpackages.com/badges/joshdaugherty-ipa-unicode-inventory/health.svg)](https://phpackages.com/packages/joshdaugherty-ipa-unicode-inventory)
```

###  Alternatives

[opis/json-schema

Json Schema Validator for PHP

64841.2M257](/packages/opis-json-schema)[geraintluff/jsv4

A (coercive) JSON Schema v4 Validator for PHP

115461.4k4](/packages/geraintluff-jsv4)

PHPackages © 2026

[Directory](/)[Categories](/categories)[Trending](/trending)[Changelog](/changelog)[Analyze](/analyze)
