rlerdorf/ext-llama
==================

PHP extension for running GGUF models via llama.cpp

v0.1.1 · PHP-3.01 · PHP >=8.4

ext-llama
=========

A PHP extension for running GGUF large language models directly in PHP using [llama.cpp](https://github.com/ggml-org/llama.cpp). No HTTP servers, no exec(), no Python. Just load a model and generate text from your PHP script.

Why not just use llama-server?
------------------------------

For larger models and high-concurrency workloads, you probably should. llama.cpp ships with [llama-server](https://github.com/ggml-org/llama.cpp/tree/master/tools/server), an HTTP server that exposes an OpenAI-compatible API. You can talk to it from PHP with any HTTP client. llama-server is the better choice when:

- **High concurrency.** llama-server holds a single copy of the model and handles parallel requests via slots. With ext-llama, each PHP-FPM worker creates its own inference context. Model *weights* are shared across workers via mmap (no duplication in system RAM), but GPU (CUDA/Metal) memory is per-process. If you're offloading a 7B model to GPU and running 4 FPM workers, that's 4x the VRAM. A dedicated llama-server avoids this entirely.
- **Large models.** For 13B+ models on GPU, the single-process architecture of llama-server is more memory-efficient.
- **Multi-language / multi-app.** If other services besides PHP need the same model, a shared server makes more sense than loading it in every process.
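
The VRAM arithmetic from the first bullet can be sketched as a tiny helper. This is only an illustration: `estimateTotalVramGiB()` is a hypothetical function, and the 4.5 GiB per-process figure is an assumed placeholder, not a measurement from ext-llama.

```php
<?php
declare(strict_types=1);

/**
 * Rough total-VRAM estimate when every PHP-FPM worker offloads its own
 * copy of the model to the GPU. Hypothetical helper, not part of ext-llama.
 */
function estimateTotalVramGiB(int $workers, float $perProcessGiB): float
{
    if ($workers < 1 || $perProcessGiB < 0) {
        throw new InvalidArgumentException('workers must be >= 1 and VRAM must be >= 0');
    }
    return $workers * $perProcessGiB;
}

// Assumed figure: about 4.5 GiB of VRAM per process for an offloaded 7B model.
$fpmTotal    = estimateTotalVramGiB(4, 4.5); // four FPM workers, one copy each
$serverTotal = estimateTotalVramGiB(1, 4.5); // one llama-server process

printf("4 FPM workers: %.1f GiB, llama-server: %.1f GiB\n", $fpmTotal, $serverTotal);
```

Measure the real per-process figure on your own hardware (for example with `nvidia-smi`) before sizing a worker pool.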

ext-llama is a better fit for **embedded / low-concurrency setups** where simplicity matters:

- Small to medium models (1-7B) running on CPU, or on GPU with only one or a few FPM workers, where the per-worker VRAM cost is acceptable
- Dedicated appliances, IoT, edge servers, or internal tools where you want one less daemon to manage
- Use cases like RAG, structured extraction, or chat where a single PHP process handles the request end-to-end
- LoRA hot-swapping per request, allowing you to switch "personalities" in sub-millisecond time without touching a server config

| | ext-llama | llama-server + HTTP client |
|---|---|---|
| Moving parts | Just PHP | PHP + separate server process |
| Deployment | `extension=llama` in php.ini | Manage a sidecar daemon |
| Latency | Direct C calls | HTTP round-trip (~1ms loopback) |
| Model memory (CPU) | mmap shared across workers | Single process |
| Model memory (GPU) | Per-worker VRAM allocation | Single VRAM allocation |
| LoRA hot-swap | Sub-millisecond, per-request | Server restart or API call |
| Streaming | Native PHP `Iterator` | SSE parsing |
| Concurrency | Limited by FPM workers | Built-in parallel slots |

Requirements
------------

- PHP 8.4+
- llama.cpp built with shared libraries

Installation
------------

### 1. Build llama.cpp

```
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && mkdir build && cd build
cmake .. -DBUILD_SHARED_LIBS=ON
make -j$(nproc) llama ggml common
sudo make install  # installs libllama.so and headers to /usr/local
```

For CUDA (NVIDIA GPU) support, add `-DGGML_CUDA=ON` to the cmake line. Other backends like Vulkan (`-DGGML_VULKAN=ON`) and Metal (macOS, enabled by default) work the same way. Switching backends only requires rebuilding libllama; the PHP extension does not need to be recompiled.

### 2. Build the extension

Point `--with-llama` at the **llama.cpp source tree** (not the install prefix). This is important because it gives the build system access to `libcommon.a` and the vendored `nlohmann/json` headers, which are needed for JSON schema constrained generation. These files are not installed by `make install`.

Via PIE:

```
pie install rlerdorf/ext-llama --with-llama=/path/to/llama.cpp
```

Or manually:

```
git clone https://github.com/rlerdorf/ext-llama
cd ext-llama
phpize
./configure --with-llama=/path/to/llama.cpp
make
sudo make install
```

If you point `--with-llama` at a system prefix like `/usr/local` instead of the source tree, the extension will still build and work, but the `json_schema` option will not be available. GBNF grammars (the `grammar` option) always work regardless. The configure output will tell you which features are enabled:

```
checking for llama.cpp common library (json-schema-to-grammar)... yes
```

### 3. Enable the extension

Add to your `php.ini`:

```
extension=llama
```
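
To confirm the extension actually loaded, a quick runtime check along these lines may help. It is a sketch that assumes the module registers under the name `llama` (matching the `extension=llama` line above); `llamaExtensionAvailable()` is a hypothetical helper, not part of ext-llama.

```php
<?php
declare(strict_types=1);

/**
 * Returns true when the extension is loaded and its Model class exists.
 * Assumption: the module name is "llama", as in `extension=llama`.
 */
function llamaExtensionAvailable(): bool
{
    return extension_loaded('llama') && class_exists('Llama\\Model');
}

if (!llamaExtensionAvailable()) {
    // error_log() works under both CLI and PHP-FPM.
    error_log('ext-llama is not loaded; check php.ini and the output of `php -m`.');
}
```

Keep in mind that the CLI and PHP-FPM often read different `php.ini` files, so run the check under the SAPI you plan to deploy.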

Quick Start
-----------

```
$model = new Llama\Model('/path/to/model.gguf');
$ctx = new Llama\Context($model, ['n_ctx' => 2048]);

echo $ctx->complete("The capital of France is", ['max_tokens' => 32]);
```

API
---

### Llama\\Model

```
// Load a GGUF model (cached across requests in PHP-FPM)
$model = new Llama\Model('/path/to/model.gguf', [
    'n_gpu_layers' => -1,    // offload all layers to GPU (-1=all, 0=CPU only)
    'use_mmap'     => true,  // default: true
    'use_mlock'    => true,  // default: true, pin pages in RAM
]);

$model->desc();              // "llama 3B Q4_K - Medium"
$model->size();              // model file size in bytes
$model->nParams();           // parameter count
$model->nEmbd();             // embedding dimensions
$model->nLayer();            // layer count
$model->chatTemplate();      // built-in Jinja chat template, or null
$model->meta('general.name');// read GGUF metadata by key

$model->tokenize("Hello");       // [1, 15043]
$model->detokenize([1, 15043]);  // " Hello"
```

### Llama\\Context

```
$ctx = new Llama\Context($model, [
    'n_ctx'      => 2048,  // context size
    'n_batch'    => 512,   // batch size
    'n_threads'  => 4,     // CPU threads
    'embeddings' => false, // set true for embed()
    'flash_attn' => false, // flash attention
]);
```

**Text completion:**

```
$text = $ctx->complete("Once upon a time", [
    'max_tokens'     => 256,
    'temperature'    => 0.8,
    'top_k'          => 40,
    'top_p'          => 0.95,
    'min_p'          => 0.05,
    'repeat_penalty' => 1.1,
    'seed'           => 42,
]);
```

**Chat** (applies the model's built-in chat template):

```
$reply = $ctx->chat([
    ['role' => 'system', 'content' => 'You are a helpful assistant.'],
    ['role' => 'user',   'content' => 'What is PHP?'],
], ['max_tokens' => 256]);
```
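
For multi-turn conversations, the message array can be grown between calls. The sketch below uses a hypothetical `appendMessage()` helper and makes two assumptions that the text above does not confirm: that `chat()` returns the reply as a string, and that the model's template accepts an `assistant` role.

```php
<?php
declare(strict_types=1);

/**
 * Appends one message to a chat history in the role/content shape that
 * chat() accepts. Hypothetical helper, not part of ext-llama.
 *
 * @param list<array{role: string, content: string}> $history
 * @return list<array{role: string, content: string}>
 */
function appendMessage(array $history, string $role, string $content): array
{
    // 'assistant' is an assumed role name; 'system' and 'user' appear above.
    if (!in_array($role, ['system', 'user', 'assistant'], true)) {
        throw new InvalidArgumentException("Unknown role: $role");
    }
    $history[] = ['role' => $role, 'content' => $content];
    return $history;
}

$history = appendMessage([], 'system', 'You are a helpful assistant.');
$history = appendMessage($history, 'user', 'What is PHP?');

// With the extension loaded, one turn might then look like this:
// $reply   = $ctx->chat($history, ['max_tokens' => 256]);
// $history = appendMessage($history, 'assistant', $reply);
```

The history grows with every turn, so long conversations will eventually need trimming to stay within `n_ctx`.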

**Streaming** (token by token):

```
foreach ($ctx->stream("Tell me a story", ['max_tokens' => 256]) as $piece) {
    echo $piece;
    flush();
}
```
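
If the full text is needed after streaming (for logging or caching, say), the pieces can be collected while they are echoed. This is a minimal sketch with a hypothetical `collectStream()` helper; it accepts any `iterable`, so a stand-in generator is used here to keep it runnable without a model.

```php
<?php
declare(strict_types=1);

/**
 * Echoes each piece as it arrives and returns the concatenated text.
 * Hypothetical helper; accepts arrays, generators, or any Iterator.
 */
function collectStream(iterable $pieces): string
{
    $text = '';
    foreach ($pieces as $piece) {
        echo $piece;
        flush();          // push the piece to the client right away
        $text .= $piece;  // keep a copy for later use
    }
    return $text;
}

// Stand-in generator so the sketch runs without the extension.
$fake = (function (): Generator {
    yield 'Once';
    yield ' upon';
    yield ' a time';
})();

$story = collectStream($fake);

// With the extension loaded:
// $story = collectStream($ctx->stream("Tell me a story", ['max_tokens' => 256]));
```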

**Embeddings:**

```
$ctx = new Llama\Context($model, ['embeddings' => true]);
$vector = $ctx->embed("Some text");  // float[]
```
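
For use cases like RAG, embedding vectors are typically compared by cosine similarity. The helper below is a plain-PHP sketch and not part of ext-llama; `cosineSimilarity()` is a hypothetical name, and the vectors are stand-ins for what `embed()` would return.

```php
<?php
declare(strict_types=1);

/**
 * Cosine similarity between two equal-length numeric vectors.
 * Returns 0.0 if either vector has zero length (norm).
 *
 * @param list<float> $a
 * @param list<float> $b
 */
function cosineSimilarity(array $a, array $b): float
{
    if ($a === [] || count($a) !== count($b)) {
        throw new InvalidArgumentException('Vectors must be non-empty and the same length');
    }
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $x) {
        $dot   += $x * $b[$i];
        $normA += $x * $x;
        $normB += $b[$i] * $b[$i];
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Stand-in vectors pointing in the same direction, so the score is close to 1.
$score = cosineSimilarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]);

// With the extension loaded:
// $score = cosineSimilarity($ctx->embed("cat"), $ctx->embed("kitten"));
```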

**Constrained generation** with GBNF grammar or JSON schema:

```
// Force yes/no output
$answer = $ctx->complete("Is the sky blue? ", [
    'grammar' => 'root ::= ("yes" | "no")',
]);

// Force valid JSON matching a schema
$json = $ctx->complete("Output a person as JSON:", [
    'json_schema' => json_encode([
        'type' => 'object',
        'properties' => [
            'name' => ['type' => 'string'],
            'age'  => ['type' => 'integer'],
        ],
        'required' => ['name', 'age'],
    ]),
]);
// {"name":"Alice","age":30}
```
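
Schema-constrained output still arrives as a string, so it is usually decoded and checked before use. The sketch below uses a hypothetical `decodePerson()` helper whose field names mirror the schema above. It assumes output could still be cut short, for instance if a `max_tokens` limit is reached before the JSON closes.

```php
<?php
declare(strict_types=1);

/**
 * Decodes schema-constrained output and checks the two required fields.
 * Hypothetical helper, not part of ext-llama.
 *
 * @return array{name: string, age: int}
 */
function decodePerson(string $json): array
{
    // JSON_THROW_ON_ERROR turns malformed or truncated output into a JsonException.
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($data)
        || !is_string($data['name'] ?? null)
        || !is_int($data['age'] ?? null)) {
        throw new UnexpectedValueException('Output does not match the expected person shape');
    }
    return ['name' => $data['name'], 'age' => $data['age']];
}

// Sample string taken from the example output above.
$person = decodePerson('{"name":"Alice","age":30}');
```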

### Llama\\LoRA

```
// Load adapters (one-time cost, ~200ms each)
$code = new Llama\LoRA($model, '/path/to/code-lora.gguf');
$chat = new Llama\LoRA($model, '/path/to/chat-lora.gguf');

// Hot-swap in sub-millisecond time
$ctx->applyLoRA($code);
$ctx->applyLoRA($chat);          // replaces previous
$ctx->applyLoRA($chat, 0.5);     // with scale

// Blend multiple LoRAs
$ctx->applyLoRA([$code, $chat], [0.6, 0.4]);

// Remove all adapters
$ctx->clearLoRA();

// Read adapter metadata
$code->meta('general.name');
```

### Llama\\Exception

All errors throw `Llama\Exception` (extends `\Exception`):

```
try {
    $model = new Llama\Model('/nonexistent.gguf');
} catch (Llama\Exception $e) {
    echo $e->getMessage(); // "Model file not found: /nonexistent.gguf"
}
```

Memory Model
------------

In a PHP-FPM deployment with 10 workers serving a 4GB model:

| What | Memory | Lifetime |
|---|---|---|
| Model weights (mmap) | 4GB shared | Process (shared across all workers) |
| Model metadata | ~KB per worker | Worker (persistent across requests) |
| KV cache | ~MB per context | Request |
| LoRA adapters | ~MB each | Worker |
| LoRA hot-swap | 0 bytes | Instant |

License
-------

PHP License (same as PHP itself).
