farzai/thai-word
================

Thai word segmentation library for PHP

Version 1.1.0 · MIT license · PHP ^8.4 · [Source](https://github.com/parsilver/thai-word-php) · [Packagist](https://packagist.org/packages/farzai/thai-word)

Thai Word Segmentation - PHP Library
====================================

[![Latest Version on Packagist](https://camo.githubusercontent.com/f2ac3c5bf7127e420fad198c36dece4d5555c2d9d48aa83ef2e489510656b2f6/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f6661727a61692f746861692d776f72642e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/farzai/thai-word)[![Tests](https://camo.githubusercontent.com/6f6ed87e40101f74bd6e90d538c4cd40af29a9a9735463dbbffe6303c1436e5f/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f616374696f6e732f776f726b666c6f772f7374617475732f70617273696c7665722f746861692d776f72642d7068702f72756e2d74657374732e796d6c3f6272616e63683d6d61696e266c6162656c3d7465737473267374796c653d666c61742d737175617265)](https://github.com/parsilver/thai-word-php/actions/workflows/run-tests.yml)[![codecov](https://camo.githubusercontent.com/da20a980e43b5832c3e7b121d79889a4221f6b6c180198fc1d07a4e809d7065a/68747470733a2f2f636f6465636f762e696f2f67682f70617273696c7665722f746861692d776f72642d7068702f6272616e63682f6d61696e2f67726170682f62616467652e737667)](https://codecov.io/gh/parsilver/thai-word-php)[![Total Downloads](https://camo.githubusercontent.com/d13f5ee6cd94e83565796d554c34f1db1d78f8cce521b03821c3c2ae92e944e7/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f64742f6661727a61692f746861692d776f72642e7376673f7374796c653d666c61742d737175617265)](https://packagist.org/packages/farzai/thai-word)

A library for Thai word segmentation in PHP.

Features
--------

- **Thai word segmentation** with high accuracy
- **Word suggestions** for typos and misspellings
- **Dictionary loading** from local files and remote URLs
- **Performance optimizations** with caching and memory management
- **Batch processing** for large text volumes
- **Custom configuration** for caching, memory limit, and batch size
- **Mixed content support** (Thai, English, numbers, punctuation)

Requirements
------------

- PHP 8.4+
- Composer

Installation
------------

You can install the package via composer:

```
composer require farzai/thai-word
```

Basic Usage
-----------

### Using the Facade (Recommended)

```
use Farzai\ThaiWord\Composer;

// Simple text segmentation
$words = Composer::segment('สวัสดีครับผมชื่อสมชาย');
// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']

// Segment with custom delimiter
$text = Composer::segmentToString('สวัสดีครับผมชื่อสมชาย', ' ');
// Result: 'สวัสดี ครับ ผม ชื่อ สมชาย'

// Batch processing for multiple texts
$results = Composer::segmentBatch(['สวัสดีครับ', 'ขอบคุณค่ะ']);
// Result: [['สวัสดี', 'ครับ'], ['ขอบคุณ', 'ค่ะ']]

// Enable word suggestions via facade
// Use threshold 0.4-0.5 for single characters, 0.6-0.7 for multi-character words
Composer::enableSuggestions(['threshold' => 0.5]);

// Get suggestions for misspelled words
$suggestions = Composer::suggest('สวัสด');
// Result: [
//     ['word' => 'สวัสดี', 'score' => 0.833],
//     ['word' => 'สวัสดิ์', 'score' => 0.714],
//     ['word' => 'สวัสติ', 'score' => 0.667]
// ]

// Segment with automatic suggestions for single unrecognized characters
$result = Composer::segmentWithSuggestions('โอเคอไร');
// Result: [
//     ['word' => 'โอเค'],
//     ['word' => 'อ', 'suggestions' => [
//         ['word' => 'กอ', 'score' => 0.5],
//         ['word' => 'ขอ', 'score' => 0.5],
//         ['word' => 'คอ', 'score' => 0.5]
//     ]],
//     ['word' => 'ไร']
// ]

// Get performance statistics
$stats = Composer::getStats();
```

### Using ThaiSegmenter Directly

```
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;

$segmenter = new ThaiSegmenter();
$words = $segmenter->segment('สวัสดีครับผมชื่อสมชาย');

// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
```

### Word Suggestions for Typos

```
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;

$segmenter = new ThaiSegmenter();

// Enable word suggestions
$segmenter->enableSuggestions([
    'threshold' => 0.5,        // Minimum similarity score (0.0-1.0)
    'max_suggestions' => 5     // Maximum suggestions per word
]);

// Get suggestions for a misspelled word
$suggestions = $segmenter->suggest('สวัสด'); // Missing last character
// Result: [
//     ['word' => 'สวัสดี', 'score' => 0.833],
//     ['word' => 'สวัสดิ์', 'score' => 0.714],
//     ['word' => 'สวัสติ', 'score' => 0.667]
// ]

// Segment text with automatic suggestions for single unrecognized characters
$result = $segmenter->segmentWithSuggestions('ชื่ออไรนะ'); // 'อ' is unrecognized single character
// Result: [
//     ['word' => 'ชื่อ'],
//     ['word' => 'อ', 'suggestions' => [
//         ['word' => 'กอ', 'score' => 0.5],
//         ['word' => 'ขอ', 'score' => 0.5],
//         ['word' => 'คอ', 'score' => 0.5]
//     ]],
//     ['word' => 'ไร'],
//     ['word' => 'นะ']
// ]
```

How It Works
------------

This library segments Thai text into words using dictionary-based longest matching and can optionally suggest corrections for unrecognized input. The steps below walk through the process:

### Step 1: Text Input & Validation

- You provide Thai text as a string to the `ThaiSegmenter`
- Example: `'สวัสดีครับผมชื่อสมชาย'`
- The library validates UTF-8 encoding and handles empty strings
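
The validation step can be pictured with a small standalone function. This is a minimal sketch, not the library's code: `prepareInput()` is a hypothetical name, and returning `null` for empty input is an assumption about what "handles empty strings" means. It relies on the `mbstring` extension.

```php
/**
 * Hypothetical helper (not part of the library's API): rejects input that is
 * not valid UTF-8 and reports whether there is anything left to segment.
 */
function prepareInput(string $text): ?string
{
    // mb_check_encoding() returns false for malformed byte sequences.
    if (!mb_check_encoding($text, 'UTF-8')) {
        throw new InvalidArgumentException('Input text must be valid UTF-8.');
    }

    // Assumption: empty or whitespace-only input means "nothing to segment".
    $trimmed = trim($text);

    return $trimmed === '' ? null : $trimmed;
}

$ready = prepareInput('สวัสดีครับ'); // valid, non-empty input is returned unchanged
$empty = prepareInput('   ');        // whitespace-only input yields null
```

Failing early on malformed bytes keeps every later step free to assume well-formed multi-byte characters.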

### Step 2: Dictionary Loading (Automatic)

The library automatically loads Thai words by trying the following sources in order, falling back to the next one when a source is unavailable:

- **LibreOffice Thai Dictionary**: Downloads from official LibreOffice repository (primary source)
- **Local Dictionary Files**: Falls back to local dictionary files if available
- **Basic Dictionary**: Uses built-in common Thai words as last resort

The dictionary is stored in a `HashDictionary` with **O(1) lookup performance**.
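
To make the fallback order and the hash lookup concrete, here is a rough sketch. `SketchDictionary` and `loadWithFallback()` are hypothetical names, not the library's `HashDictionary` or `DictionaryLoaderService`, and the three closures are stand-ins for the remote, local, and built-in sources.

```php
/**
 * Illustrative hash-set dictionary (hypothetical, not the library's class):
 * words are stored as array keys, so has() is one average-case O(1) lookup.
 */
final class SketchDictionary
{
    /** @var array<string, true> */
    private array $words = [];

    public function add(string $word): void
    {
        $this->words[$word] = true;
    }

    public function has(string $word): bool
    {
        return isset($this->words[$word]);
    }

    public function count(): int
    {
        return count($this->words);
    }
}

/**
 * Tries each source in order and keeps the first one that yields words.
 *
 * @param list<callable(): list<string>> $sources
 */
function loadWithFallback(array $sources): SketchDictionary
{
    $dictionary = new SketchDictionary();

    foreach ($sources as $source) {
        try {
            $words = $source();
        } catch (Throwable $e) {
            continue; // this source failed, so fall through to the next one
        }

        if ($words !== []) {
            foreach ($words as $word) {
                $dictionary->add($word);
            }
            break;
        }
    }

    return $dictionary;
}

$dictionary = loadWithFallback([
    fn (): array => throw new RuntimeException('download failed'), // stand-in: remote source
    fn (): array => [],                                            // stand-in: missing local file
    fn (): array => ['สวัสดี', 'ครับ', 'ผม'],                        // stand-in: built-in word list
]);
```

Storing words as array keys rather than values is what keeps each lookup a single hash probe instead of a scan.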

### Step 3: Smart Text Processing

The `LongestMatchingStrategy` algorithm classifies characters before matching them against the dictionary:

**Character Classification**:

- **Thai characters**: Detected by the Unicode range U+0E00 to U+0E7F
- **English words**: Handled as complete word units
- **Numbers**: Processed as number sequences (with decimals, commas)
- **Punctuation**: Handled appropriately with whitespace normalization
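
The classification above can be approximated with code point comparisons. The sketch below is an assumption about how such a check might look; `classifyChar()` and its category names are hypothetical, and only the Thai range comes from the list above. It needs the `mbstring` extension.

```php
/**
 * Hypothetical classifier: labels one UTF-8 character by its code point.
 */
function classifyChar(string $char): string
{
    $codePoint = mb_ord($char, 'UTF-8');

    return match (true) {
        // Thai Unicode block, U+0E00 to U+0E7F
        $codePoint >= 0x0E00 && $codePoint <= 0x0E7F => 'thai',
        // ASCII letters A-Z and a-z
        ($codePoint >= 0x41 && $codePoint <= 0x5A)
            || ($codePoint >= 0x61 && $codePoint <= 0x7A) => 'english',
        // ASCII digits 0-9
        $codePoint >= 0x30 && $codePoint <= 0x39 => 'number',
        // space, tab, line feed, carriage return
        in_array($codePoint, [0x20, 0x09, 0x0A, 0x0D], true) => 'whitespace',
        default => 'punctuation',
    };
}

$labels = array_map('classifyChar', mb_str_split('ก A7!', 1, 'UTF-8'));
// $labels is ['thai', 'whitespace', 'english', 'number', 'punctuation']
```

A range comparison on the code point avoids a regular expression per character, which is presumably what "fast detection" refers to.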

### Step 4: Longest Matching Algorithm

```
Input: สวัสดีครับผมชื่อสมชาย
       ↓
Position 0: Check สวัสดี (6 chars) → Found in dictionary ✓
Position 6: Check ครับ (4 chars) → Found in dictionary ✓
Position 10: Check ผม (2 chars) → Found in dictionary ✓
Position 12: Check ชื่อ (4 chars) → Found in dictionary ✓
Position 16: Check สมชาย (5 chars) → Found in dictionary ✓
       ↓
Output: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
```
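
The trace above corresponds to a greedy longest-match loop, sketched here in plain PHP. `longestMatch()` is a hypothetical function, and the library's `LongestMatchingStrategy` may differ in details such as how it treats unknown characters. This sketch emits them one at a time and counts positions in Unicode code points.

```php
/**
 * Greedy longest-match segmentation over a Thai-only string (illustrative).
 *
 * @param array<string, true> $dictionary words stored as keys for fast lookup
 * @return list<string>
 */
function longestMatch(string $text, array $dictionary): array
{
    // The longest dictionary entry bounds how far ahead we need to look.
    $maxWordLength = 0;
    foreach (array_keys($dictionary) as $word) {
        $maxWordLength = max($maxWordLength, mb_strlen((string) $word, 'UTF-8'));
    }

    $chars = mb_str_split($text, 1, 'UTF-8');
    $total = count($chars);
    $words = [];
    $position = 0;

    while ($position < $total) {
        $matched = null;

        // Try the longest candidate first, then shrink one character at a time.
        for ($length = min($maxWordLength, $total - $position); $length >= 1; $length--) {
            $candidate = implode('', array_slice($chars, $position, $length));
            if (isset($dictionary[$candidate])) {
                $matched = $candidate;
                break;
            }
        }

        // Unknown character: emit it alone so scanning can continue.
        $matched ??= $chars[$position];
        $words[] = $matched;
        $position += mb_strlen($matched, 'UTF-8');
    }

    return $words;
}

$dictionary = array_fill_keys(['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย'], true);
$words = longestMatch('สวัสดีครับผมชื่อสมชาย', $dictionary);
// $words is ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']
```

Trying the longest candidate first is what makes the match greedy. It is simple and fast, but it can pick a long word where two shorter ones were intended.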

### Step 5: Word Suggestion System (Optional)

When enabled, the library can suggest corrections for typos by ranking dictionary words according to their edit distance from the input:

**Levenshtein Distance Algorithm**:

```
Input: สวัสด (missing last character)
       ↓
1. Filter dictionary words by length similarity (±3 characters)
2. Calculate Unicode-aware Levenshtein distance for each candidate
3. Convert distance to similarity score (0.0 to 1.0)
4. Filter by threshold (default 0.6) and sort by score
       ↓
Output: [
    ['word' => 'สวัสดี', 'score' => 0.833],  // 1 character difference
    ['word' => 'สวัสดิ์', 'score' => 0.714], // 2 character difference
    ['word' => 'สวัสติ', 'score' => 0.667]  // 2 character difference
]
```
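
The four steps above can be sketched in plain PHP. This is an illustration, not the library's `LevenshteinSuggestionStrategy`: all function names are hypothetical, and the scoring formula `1 - distance / length of the longer word` is an assumption, chosen because it reproduces the scores in the example (1 - 1/6 ≈ 0.833, 1 - 2/7 ≈ 0.714, 1 - 2/6 ≈ 0.667). PHP's built-in `levenshtein()` compares bytes, so the sketch splits strings into characters with `mb_str_split()` first.

```php
/** Levenshtein distance over Unicode characters rather than bytes. */
function unicodeLevenshtein(string $a, string $b): int
{
    $aChars = mb_str_split($a, 1, 'UTF-8');
    $bChars = mb_str_split($b, 1, 'UTF-8');
    $previous = range(0, count($bChars)); // distances from the empty prefix of $a

    foreach ($aChars as $i => $aChar) {
        $current = [$i + 1];
        foreach ($bChars as $j => $bChar) {
            $current[] = min(
                $previous[$j + 1] + 1,                      // deletion
                $current[$j] + 1,                           // insertion
                $previous[$j] + ($aChar === $bChar ? 0 : 1) // substitution
            );
        }
        $previous = $current;
    }

    return $previous[count($bChars)];
}

/** Assumed scoring: 1 - distance / length of the longer word. */
function similarity(string $a, string $b): float
{
    $longest = max(mb_strlen($a, 'UTF-8'), mb_strlen($b, 'UTF-8'));

    return $longest === 0 ? 1.0 : 1.0 - unicodeLevenshtein($a, $b) / $longest;
}

/**
 * @param list<string> $dictionary
 * @return list<array{word: string, score: float}>
 */
function suggestWords(string $input, array $dictionary, float $threshold = 0.6, int $maxLengthDiff = 3): array
{
    $inputLength = mb_strlen($input, 'UTF-8');
    $suggestions = [];

    foreach ($dictionary as $word) {
        // Step 1: skip words whose length differs too much.
        if (abs(mb_strlen($word, 'UTF-8') - $inputLength) > $maxLengthDiff) {
            continue;
        }
        // Steps 2-3: distance converted to a similarity score.
        $score = round(similarity($input, $word), 3);
        // Step 4: keep only scores at or above the threshold.
        if ($score >= $threshold) {
            $suggestions[] = ['word' => $word, 'score' => $score];
        }
    }

    // Step 4, continued: best score first.
    usort($suggestions, fn (array $x, array $y): int => $y['score'] <=> $x['score']);

    return $suggestions;
}

$suggestions = suggestWords('สวัสด', ['สวัสติ', 'ขอบคุณ', 'สวัสดิ์', 'สวัสดี']);
```

The length filter in step 1 is the cheap part: it discards most of a large dictionary before any distance matrix is computed.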

**Smart Suggestion Integration**:

- **Single-character only**: `segmentWithSuggestions()` only provides suggestions for single-character segments that are NOT in the dictionary
- **Multi-character words**: Use `suggest()` method directly for multi-character word suggestions
- **Threshold requirements**: Single-character similarities max out at 0.5, so use threshold ≤ 0.5 for best results
- **Configurable similarity thresholds**: 0.4-0.5 for single characters, 0.6-0.7 for multi-character words
- **Performance-optimized**: Caching and length-based filtering for large dictionaries
- **Unicode-aware**: Distances are computed per character, not per byte, for proper Thai character handling

### Step 6: Performance Optimizations

The library includes several optimizations:

- **Caching**: Recently segmented texts are cached for faster repeat processing
- **Batch Processing**: Large texts are processed in chunks to manage memory
- **Memory Management**: Automatic garbage collection and memory optimization
- **Adaptive Processing**: Different strategies for short, medium, and long texts
- **Suggestion Caching**: Distance calculations cached for repeated similarity checks
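
As an illustration of the caching idea, the sketch below memoizes the results of any segmentation function and evicts the oldest entry once the cache is full. `CachingSegmenter` is a hypothetical wrapper, and the library's own cache and eviction policy are not documented here and may differ.

```php
/**
 * Hypothetical wrapper: caches results per input text with a size cap.
 */
final class CachingSegmenter
{
    /** @var array<string, list<string>> */
    private array $cache = [];

    private int $hits = 0;

    public function __construct(
        private Closure $segment,       // the real segmentation function
        private int $maxEntries = 1000  // cap on cached texts
    ) {
    }

    /** @return list<string> */
    public function segment(string $text): array
    {
        if (isset($this->cache[$text])) {
            $this->hits++;

            return $this->cache[$text];
        }

        $words = ($this->segment)($text);

        // PHP arrays keep insertion order, so the first key is the oldest entry.
        if ($this->cache !== [] && count($this->cache) >= $this->maxEntries) {
            unset($this->cache[array_key_first($this->cache)]);
        }

        return $this->cache[$text] = $words;
    }

    public function hits(): int
    {
        return $this->hits;
    }
}

$calls = 0;
$segmenter = new CachingSegmenter(function (string $text) use (&$calls): array {
    $calls++;

    return explode(' ', $text); // stand-in for real segmentation
});

$segmenter->segment('ผม ครับ');
$segmenter->segment('ผม ครับ'); // served from the cache; the closure runs only once
```

A size cap matters because a cache keyed by full input text would otherwise grow without bound on varied input.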

### Step 7: Mixed Content Handling

```
$segmenter = new ThaiSegmenter();
$result = $segmenter->segment('ผมใช้ Computer ทำงาน');
// Result: ['ผม', 'ใช้', 'Computer', 'ทำงาน']
```

- Thai words are processed with dictionary lookup
- English words are kept as complete units
- Numbers and punctuation are handled appropriately
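
One way to picture mixed content handling is a pre-pass that splits the input into script runs, so that only Thai runs go through dictionary matching. The sketch below is an assumption about such a pre-pass, not the library's implementation; `splitRuns()` and its regular expression are hypothetical.

```php
/**
 * Splits mixed text into typed runs (illustrative). Thai runs would then be
 * passed to dictionary matching; the other runs are already whole tokens.
 *
 * @return list<array{type: string, text: string}>
 */
function splitRuns(string $text): array
{
    $pattern = '/(?<thai>[\x{0E00}-\x{0E7F}]+)'      // run of Thai characters
        . '|(?<english>[A-Za-z]+)'                   // whole English word
        . '|(?<number>[0-9]+(?:[.,][0-9]+)*)'        // digits with decimals or commas
        . '|(?<punctuation>\S)/u';                   // any other non-space character

    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $runs = [];
    foreach ($matches as $match) {
        foreach (['thai', 'english', 'number', 'punctuation'] as $type) {
            // Groups that did not take part in the match are empty or absent.
            if (($match[$type] ?? '') !== '') {
                $runs[] = ['type' => $type, 'text' => $match[$type]];
                break;
            }
        }
    }

    return $runs;
}

$runs = splitRuns('ผมใช้ Computer ทำงาน 3.5 ครั้ง!');
// texts: 'ผมใช้', 'Computer', 'ทำงาน', '3.5', 'ครั้ง', '!'
```

Whitespace is dropped rather than emitted, which matches the example result above where no space tokens appear.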

### Key Components

1. **ThaiSegmenter**: Main orchestrator with performance monitoring and suggestion integration
2. **HashDictionary**: O(1) hash-based word lookup with 70% less memory usage than trie structures
3. **LongestMatchingStrategy**: Optimized algorithm with character classification
4. **LevenshteinSuggestionStrategy**: Unicode-aware word suggestion algorithm with caching
5. **DictionaryLoaderService**: Loads dictionaries from local files and remote URLs

### Performance Features

- **3-5x faster** processing speed with optimized algorithms
- **50% lower memory** usage with hash-based dictionary
- **Intelligent suggestions** with configurable accuracy thresholds
- **Automatic optimization** based on text characteristics
- **Built-in statistics** for performance monitoring
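
The figures above are stated without a baseline, so it can be useful to measure on your own texts. The helper below is a generic sketch that times any callable with `hrtime()`. `benchmark()` is a hypothetical function, and the workload shown is a stand-in that you would replace with a call such as `$segmenter->segment($text)`.

```php
/**
 * Hypothetical benchmark helper: average wall-clock time per call and the
 * process's peak memory so far.
 *
 * @return array{avg_ms: float, peak_memory_mb: float}
 */
function benchmark(callable $task, int $iterations = 100): array
{
    $iterations = max(1, $iterations); // avoid division by zero
    $start = hrtime(true);             // monotonic clock, in nanoseconds

    for ($i = 0; $i < $iterations; $i++) {
        $task();
    }

    $elapsedNs = hrtime(true) - $start;

    return [
        'avg_ms' => $elapsedNs / $iterations / 1_000_000,
        'peak_memory_mb' => memory_get_peak_usage(true) / 1024 / 1024,
    ];
}

// Stand-in workload; swap in the segmentation call you want to measure.
$result = benchmark(fn () => mb_str_split(str_repeat('สวัสดีครับ', 100), 1, 'UTF-8'), 50);
printf("avg %.3f ms per call, peak %.1f MB\n", $result['avg_ms'], $result['peak_memory_mb']);
```

Running the same helper with caching enabled and disabled gives a like-for-like comparison on one machine.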

### Real Usage Examples

**Using the Facade (Simple & Clean)**

```
use Farzai\ThaiWord\Composer;

// Basic segmentation
$words = Composer::segment('สวัสดีครับผมชื่อสมชาย');
// Result: ['สวัสดี', 'ครับ', 'ผม', 'ชื่อ', 'สมชาย']

// Get performance statistics
$stats = Composer::getStats();
echo "Processing time: {$stats['avg_processing_time']}ms";

// Add custom words
Composer::getDictionary()->add('คำใหม่');

// Batch processing for multiple texts
$results = Composer::segmentBatch(['ข้อความ1', 'ข้อความ2']);

// Custom configuration
Composer::updateConfig([
    'enable_caching' => true,
    'memory_limit_mb' => 200
]);
```

**Using ThaiSegmenter Directly (Advanced Control)**

```
use Farzai\ThaiWord\Composer;
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;

// Create segmenter with custom configuration
$segmenter = new ThaiSegmenter(null, null, [
    'enable_caching' => true,
    'batch_size' => 500
]);

// Or use the facade to create custom instances
$customSegmenter = Composer::create(null, null, ['memory_limit_mb' => 150]);

// Set custom segmenter for facade
Composer::setSegmenter($customSegmenter);
```

This architecture ensures both accuracy and performance while remaining simple to use.

Advanced Usage
--------------

### Custom Suggestion Strategies

```
use Farzai\ThaiWord\Segmenter\ThaiSegmenter;
use Farzai\ThaiWord\Suggestions\Strategies\LevenshteinSuggestionStrategy;

// Create custom suggestion strategy
$suggestionStrategy = new LevenshteinSuggestionStrategy;
$suggestionStrategy->setThreshold(0.8)              // Higher accuracy
                   ->setMaxWordLengthDiff(2);       // Stricter length filtering

// Initialize segmenter with custom strategy
$segmenter = new ThaiSegmenter(null, null, $suggestionStrategy);

// Or set strategy later
$segmenter->setSuggestionStrategy($suggestionStrategy);
```

### Performance Monitoring with Suggestions

```
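use Farzai\ThaiWord\Segmenter\ThaiSegmenter;
use Farzai\ThaiWord\Suggestions\Strategies\LevenshteinSuggestionStrategy;

$segmenter = new ThaiSegmenter();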
$segmenter->enableSuggestions();

// Process text
$result = $segmenter->segmentWithSuggestions('สวัสดีครบผมชื่อโจน');

// Get detailed statistics
$stats = $segmenter->getStats();
echo "Cache hit ratio: " . ($stats['cache_hit_ratio'] * 100) . "%\n";

// Get suggestion-specific statistics
$suggestionStrategy = $segmenter->getSuggestionStrategy();
if ($suggestionStrategy instanceof LevenshteinSuggestionStrategy) {
    $cacheStats = $suggestionStrategy->getCacheStats();
    echo "Suggestion cache size: " . $cacheStats['cache_size'] . "\n";
    echo "Memory usage: " . $cacheStats['memory_usage_mb'] . "MB\n";
}
```

### Batch Processing with Suggestions

```
$texts = [
    'โอเคอไร',       // Contains a stray single character ('อ')
    'ชื่ออไรนะ'       // Contains a stray single character ('อ')
];

$segmenter = new ThaiSegmenter();
// Single-character similarities max out at 0.5, so keep the threshold at or below 0.5
$segmenter->enableSuggestions(['threshold' => 0.5]);

foreach ($texts as $text) {
    $result = $segmenter->segmentWithSuggestions($text);

    foreach ($result as $item) {
        if (isset($item['suggestions'])) {
            echo "'{$item['word']}' → Suggested: '{$item['suggestions'][0]['word']}'\n";
        }
    }
}

// Example output (suggestions are attached only to unrecognized single characters;
// call suggest() directly for multi-character typos such as 'ครบ'):
// 'อ' → Suggested: 'กอ'
// 'อ' → Suggested: 'กอ'
```

### Understanding Suggestion Behavior

**Important**: The `segmentWithSuggestions()` method only provides suggestions for **single-character segments** that are NOT found in the dictionary.

```
$segmenter = new ThaiSegmenter();
$segmenter->enableSuggestions(['threshold' => 0.5]);

// ✅ Will get suggestions - 'อ' is single character not in dictionary
$result = $segmenter->segmentWithSuggestions('โอเคอไร');
// 'อ' gets suggestions: ['กอ', 'ขอ', 'คอ', ...]

// ❌ Won't get suggestions - 'ครบ' is multi-character and in dictionary
$result = $segmenter->segmentWithSuggestions('สวัสดีครบ');
// 'ครบ' gets NO suggestions (even though 'ครับ' might be intended)

// ✅ For multi-character suggestions, use suggest() directly
$suggestions = $segmenter->suggest('ครบ');
// Returns scored suggestions such as 'ครับ', 'ครอบ', 'คราบ', ...
```

**Threshold Guidelines**:

- **Single characters**: Use 0.4-0.5 (similarities max out at 0.5)
- **Multi-character words**: Use 0.6-0.7 (higher precision possible)
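
The 0.5 ceiling follows if the score is computed as `1 - distance / length of the longer word`, which is an assumption consistent with the scores shown in this README. A one-character input compared with a two-character word that contains it has distance 1, giving 1 - 1/2 = 0.5, and longer candidates score lower. A small hypothetical helper can pick a threshold accordingly:

```php
/**
 * Hypothetical helper: picks a suggestion threshold from the guidelines above.
 */
function thresholdFor(string $word): float
{
    // A single character cannot score above 0.5 against a different word,
    // so a higher cut-off would filter out every suggestion.
    return mb_strlen($word, 'UTF-8') <= 1 ? 0.5 : 0.6;
}

$singleCharOptions = ['threshold' => thresholdFor('อ')];  // ['threshold' => 0.5]
$multiCharOptions = ['threshold' => thresholdFor('ครบ')]; // ['threshold' => 0.6]
```

The resulting array has the same shape as the options passed to `enableSuggestions()` in the examples above.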

### Configuration Options

```
$segmenter = new ThaiSegmenter();

// Enable suggestions with proper threshold for single characters
$segmenter->enableSuggestions([
    'threshold' => 0.5,         // Optimal for single characters
    'max_suggestions' => 3      // Maximum suggestions per word
]);

// Update segmenter configuration
$segmenter->updateConfig([
    'enable_caching' => true,
    'memory_limit_mb' => 150,
    'suggestion_threshold' => 0.5,  // Adjusted for single characters
    'max_suggestions' => 5
]);

// Disable suggestions when not needed
$segmenter->disableSuggestions();
```

Testing
-------

```
composer test
```

Changelog
---------

Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed recently.

Contributing
------------

Please see [CONTRIBUTING](https://github.com/parsilver/.github/blob/main/CONTRIBUTING.md) for details.

Security Vulnerabilities
------------------------

Please review [our security policy](https://github.com/parsilver/thai-word-php/security/policy) on how to report security vulnerabilities.

Credits
-------

- [parsilver](https://github.com/parsilver)
- [All Contributors](https://github.com/parsilver/thai-word-php/contributors)

### Data Sources

- [LibreOffice Thai Dictionary](https://github.com/LibreOffice/dictionaries/tree/master/th_TH) - Primary Thai word dictionary source

License
-------

The MIT License (MIT). Please see [License File](LICENSE.md) for more information.
