shoutenji/mkdict
================

PHP application that generates the JMDict database used in the Android app Manakyun.

[Source](https://github.com/shoutenji/MKDict) | [Packagist](https://packagist.org/packages/shoutenji/mkdict)

[![Minimum PHP Version](https://camo.githubusercontent.com/90eed33e7df559b70b174e97d37a4907946803c7ab691640166d2518d8cd2118/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f7068702d253345253344253230372e302d3838393242462e7376673f7374796c653d666c61742d737175617265)](https://php.net/)

MKDict
======

A Japanese-English dictionary based on Jim Breen's JMDict: [http://www.edrdg.org/jmdict/j\_jmdict.html](http://www.edrdg.org/jmdict/j_jmdict.html)

Overview
--------

A standalone PHP application that generates the database used in the Android app Manakyun. It downloads the latest JMDict file, parses it, and inserts any changes relative to the previous version into the database. MKDict aims to be a complete, self-contained, and open wrapper for the JMDict file that aids the development of apps based on Breen's dictionary (like Manakyun).

Features
--------

- Requires PHP 7.0 or later
- A regex-based DTD parser (if the JMDict DTD changes, MKDict will notice)
- Polyfilled Unicode normalization for string data, etc. (the installation process downloads the needed Unicode data files)
- Exports the dictionary in an XML format that has several advantages over the JMDict format
- Low memory footprint (XML parsing, checksum verification, etc. do not require loading the entire XML document as a string)
- A buffered database layer, which brings the net processing time down to about 1 hour (without buffering, processing took 72 hours)
- A log file detailing any errors or elements that failed to import due to invalid data (a good way to catch errors in the original JMDict file)
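
As a rough illustration of the low-memory point above, a checksum can be computed by feeding a file to PHP's incremental hashing functions one chunk at a time. This is only a sketch under assumptions: the function name `streamingCrc32` and the chunk size are made up for the example, and MKDict's own implementation may differ.

```php
<?php
/**
 * Compute the CRC32 of a file by hashing fixed-size chunks, so memory
 * use stays bounded regardless of file size.
 * Hypothetical helper: the name and default chunk size are not from MKDict.
 */
function streamingCrc32($path, $chunkSize = 65536)
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$path}");
    }

    // 'crc32b' is the CRC32 variant that PHP's crc32() function also computes.
    $context = hash_init('crc32b');
    try {
        while (!feof($handle)) {
            $chunk = fread($handle, $chunkSize);
            if ($chunk === false) {
                throw new RuntimeException("Read error on {$path}");
            }
            hash_update($context, $chunk);
        }
    } finally {
        fclose($handle);
    }

    // Returns the checksum as lowercase hexadecimal digits.
    return hash_final($context);
}
```

The result can be compared against a checksum obtained elsewhere without the file ever being held in memory as a single string.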

Upcoming Features
-----------------

- Sentence examples (from tatoeba.org and the JEITA corpus)
- Collocations (done with NLTK in Python)

The XML export format
---------------------

See the XSD document in the Exporter folder. One of the main advantages of this format is that every element is given a unique id, and cross references point to that id. Another advantage is that, in addition to the raw string, Unicode normal forms are also given, so each element has a proper canonical form that can be relied upon for searching and sorting. In other words, this XML file is basically JMDict again, but cleaned up and better organized.
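
To show why a normal form helps with searching, here is a small sketch that stores a raw string next to its NFC form. Everything in it is an assumption made for illustration: it relies on the `Normalizer` class from PHP's intl extension rather than MKDict's own polyfill, it picks NFC as one example normal form, and the `id`/`raw`/`nfc` array shape is hypothetical (the real structure is defined by the XSD).

```php
<?php
/**
 * Return the NFC form of a string so that canonically equivalent
 * spellings compare equal. Uses the intl extension; MKDict's polyfill
 * is not reproduced here.
 */
function canonicalForm($raw)
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('The intl extension (Normalizer) is not available');
    }
    $normalized = Normalizer::normalize($raw, Normalizer::FORM_C);
    if ($normalized === false) {
        throw new InvalidArgumentException('Input could not be normalized');
    }
    return $normalized;
}

/**
 * Hypothetical element: the raw string is kept alongside its normal form,
 * and the id is what cross references would point to.
 */
function makeElement($id, $raw)
{
    return ['id' => $id, 'raw' => $raw, 'nfc' => canonicalForm($raw)];
}
```

With this, a precomposed "が" and the sequence "か" followed by a combining voiced sound mark (U+3099) differ as raw strings but share the same `nfc` value, so a lookup on the normal form finds both.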

JMDict Errors and data integrity
--------------------------------

The main JMDict file has contained several errors in every release, such as a cross reference that points to a previously removed element. These errors are not propagated into the database. Also, given the data-reliant nature of a dictionary app, assurances of the data's integrity are mandatory, which is why this app does a lot to filter and sanitize the original JMDict file. Namely, with this database you can be sure that:

- Cross references are not broken
- Numbers are numbers, and strings are strings (types are as they should be, and both always contain reasonable values)
- The problematic "・" character, used to separate reading and kanji entries, no longer acts as such a delimiter
- No duplicate or invalid sequence ids
- No invalid UTF-8
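
A few of the guarantees above can be sketched as a filtering pass over parsed entries. This is not MKDict's code: the entry shape (`seq`, `text`, `xrefs`), the function name, and the log messages are all hypothetical, and the UTF-8 check uses the PCRE `/u` flag as one simple way to detect invalid byte sequences.

```php
<?php
/**
 * Filter parsed entries and return [kept entries, log messages].
 * Each entry is assumed to look like
 *   ['seq' => int, 'text' => string, 'xrefs' => int[]]
 * which is a hypothetical shape, not MKDict's.
 */
function filterEntries(array $entries)
{
    $kept = [];
    $log = [];

    foreach ($entries as $entry) {
        $seq = isset($entry['seq']) ? $entry['seq'] : null;
        if (!is_int($seq) || $seq <= 0) {
            $log[] = 'invalid sequence id: ' . var_export($seq, true);
            continue;
        }
        if (isset($kept[$seq])) {
            $log[] = "duplicate sequence id: {$seq}";
            continue;
        }
        $text = isset($entry['text']) ? $entry['text'] : null;
        // With the /u flag, preg_match() returns false when the subject is not valid UTF-8.
        if (!is_string($text) || preg_match('//u', $text) !== 1) {
            $log[] = "invalid UTF-8 in entry {$seq}";
            continue;
        }
        $kept[$seq] = $entry;
    }

    // Second pass: drop cross references whose target was not kept.
    foreach ($kept as $seq => $entry) {
        $valid = [];
        $xrefs = isset($entry['xrefs']) ? $entry['xrefs'] : [];
        foreach ($xrefs as $target) {
            if (isset($kept[$target])) {
                $valid[] = $target;
            } else {
                $log[] = "entry {$seq}: broken cross reference to {$target}";
            }
        }
        $kept[$seq]['xrefs'] = $valid;
    }

    return [$kept, $log];
}
```

Rejected items end up in the returned log rather than in the result, which mirrors the idea of a log file that doubles as a report of errors in the source data.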

I hope to add config options that control how MKDict reacts to invalid data from the JMDict file (i.e. whether to truncate an excessively long string or safely ignore the element containing it).

Installer Options
-----------------

(I plan to remove the bash scripts and create a single install/import phar file.)

### --create-db

Create the manakyun database

### --test-db

For development. Creates a temp table, populates it with UTF-8 data, and queries it. Uses the DB library.

### --utf-tests

TODO

### --generate-utf-data

TODO

Importer Options
----------------

### --local-copy

Use a local copy of JMDict instead of downloading it. The local file should be placed in MANAKYUN\_DIR/var/data and should be a gz file. You must also pass --gz-file to specify the filename.

### --gz-file

Specify the local JMDict file to use, e.g. --gz-file=20D06965F4FEE90A8\_1620819068.gz
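
The relationship between `--local-copy` and `--gz-file` could be checked along the following lines. This is a sketch, not MKDict's option handling: the function name is hypothetical, `$manakyunDir` stands in for MANAKYUN\_DIR, and `$options` is assumed to have the shape returned by PHP's `getopt('', ['local-copy', 'gz-file:'])`, where a flag without a value appears as a key mapped to `false`.

```php
<?php
/**
 * Resolve the path of the local JMDict archive from parsed options.
 * Returns null when no local copy was requested, i.e. the file should
 * be downloaded instead. Hypothetical helper, not part of MKDict.
 */
function resolveLocalCopy(array $options, $manakyunDir)
{
    if (!array_key_exists('local-copy', $options)) {
        return null;
    }
    if (!isset($options['gz-file']) || !is_string($options['gz-file'])) {
        throw new InvalidArgumentException('--local-copy requires --gz-file=<filename>');
    }

    // basename() keeps the lookup inside var/data even if a path is passed.
    $name = basename($options['gz-file']);
    if (substr($name, -3) !== '.gz') {
        throw new InvalidArgumentException('The local JMDict file should be a .gz file');
    }

    return rtrim($manakyunDir, '/') . '/var/data/' . $name;
}
```

Failing early when `--gz-file` is missing gives a clearer message than a later "file not found" error would.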

### --parse-dictionary

TODO

### --version-dictionary

TODO

### --validate-crc32

TODO

### --validate-utf8

TODO

### --with-rollback

TODO

Exporter Options
----------------

### --export-version

TODO

### --export-type

TODO

General Options
---------------

### --debug-version

For development. Turns on error\_reporting(E\_ALL) and libxml\_use\_internal\_errors(true).
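
For readers unfamiliar with these two calls, the sketch below shows their effect: with internal errors enabled, libxml stores parse problems so they can be read back with `libxml_get_errors()` instead of surfacing as PHP warnings. The helper name `collectXmlErrors` and the use of SimpleXML as the parser are assumptions made for the example, not a description of how MKDict parses JMDict.

```php
<?php
// The two settings named in the option description.
error_reporting(E_ALL);
libxml_use_internal_errors(true);

/**
 * Parse an XML string and return libxml's error messages as an array.
 * Hypothetical helper for illustration only.
 */
function collectXmlErrors($xml)
{
    libxml_clear_errors();
    simplexml_load_string($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        // Each item is a LibXMLError object with line and message properties.
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }
    libxml_clear_errors();

    return $messages;
}
```

Calling the helper with a well-formed document returns an empty array, while a document with mismatched tags returns one or more messages.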
