ghostjat/dna
============

Description of project DNA.

[Source](https://github.com/ghostjat/DNA) · [Packagist](https://packagist.org/packages/ghostjat/dna)

🧬 PHP-ML DNA Classification Tutorial
====================================

> Build a **multi-class DNA sequence classifier** in pure PHP using **PHP-ML** — from raw data to predictions.

---

📌 Introduction
--------------

This tutorial demonstrates how to use the **PHP-ML** library to build a machine learning model that classifies DNA sequences into:

- 🦠 Bacteria
- 🐾 Animal
- 🍄 Fungi
- 🧫 Virus
- 🌿 Plant

You’ll go through the complete pipeline:

- Data preparation
- Exploratory Data Analysis (EDA)
- Model training
- Evaluation
- Prediction

All examples are located in:

```
example/dna/
├── eda.php
├── train.php
└── predict.php
```

---

🧪 Problem Overview
------------------

DNA sequences contain patterns that can be used to identify their biological origin. Rather than a binary task such as promoter detection, this project performs **multi-class classification** across five organism types.

---

🤖 Why Machine Learning?
-----------------------

Machine learning helps by:

- Automatically discovering patterns in DNA sequences
- Scaling to large biological datasets
- Providing fast and accurate classification

---

⚙️ Prerequisites
----------------

Ensure you have:

- **PHP ≥ 8.2**
- **Composer**
- Basic command-line knowledge

Then install PHP-ML (the version constraint is quoted so the shell does not expand `*`):

```
composer require "ghostjat/pml:*"
```

---

📂 Dataset Overview
------------------

### 📊 Summary

- **Total Samples:** 244,447
- **Features:** 256 (k-mer frequencies)
- **Classes:** 5
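
A feature count of 256 is consistent with k-mers of length 4 over the alphabet A, C, G, T (4⁴ = 256). The sketch below shows one way such a vector could be computed. The function `kmerFrequencies()` is hypothetical and not part of PHP-ML, and how this dataset was actually built (window handling, normalization, column order) is an assumption.

```php
<?php
// Hypothetical helper (not part of PHP-ML): normalized k-mer frequency vector.
function kmerFrequencies(string $sequence, int $k = 4): array
{
    $alphabet = ['A', 'C', 'G', 'T'];
    $sequence = strtoupper($sequence);

    // Enumerate all 4^k k-mers in lexicographic order (AAAA, AAAC, ...).
    $kmers = [''];
    for ($i = 0; $i < $k; $i++) {
        $next = [];
        foreach ($kmers as $prefix) {
            foreach ($alphabet as $base) {
                $next[] = $prefix . $base;
            }
        }
        $kmers = $next;
    }
    $counts = array_fill_keys($kmers, 0);

    // Slide a window of length k; windows with other symbols (e.g. N) are skipped.
    $total = 0;
    for ($i = 0, $last = strlen($sequence) - $k; $i <= $last; $i++) {
        $kmer = substr($sequence, $i, $k);
        if (isset($counts[$kmer])) {
            $counts[$kmer]++;
            $total++;
        }
    }

    // Divide by the number of counted windows so sequence length does not dominate.
    return array_map(
        fn (int $c): float => $total > 0 ? $c / $total : 0.0,
        array_values($counts)
    );
}

$features = kmerFrequencies('ACGTACGTAC');
echo count($features), PHP_EOL; // 256
```

Because the k-mers are enumerated in a fixed order, every sequence maps to a vector with the same column layout, which is what a tabular classifier needs.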

### 🧬 Classes

- bacteria
- animal
- fungi
- virus
- plant

### 📁 Storage

```
datasets/train_*.csv
```

---

🔍 Step 1: Exploratory Data Analysis (`eda.php`)
-----------------------------------------------

This script loads and inspects the dataset.

```
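// Collect every training file; stop early if none are found.
$trainFiles = glob(__DIR__ . '/datasets/train_*.csv');
if (!$trainFiles) {
    exit("No files match datasets/train_*.csv\n");
}

// loadDna() is a helper that is not shown in this excerpt.
$dataset = loadDna($trainFiles[0]);

// Stack the remaining files onto the first to form one dataset.
foreach (array_slice($trainFiles, 1) as $file) {
    $dataset = $dataset->stack(loadDna($file));
}

// Read the class categories from the last column of the first file.
$df0 = DataFrame::fromCSV($trainFiles[0], false);
$cols0 = $df0->columns();
$classes = $df0->categories(end($cols0));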
```

### 🔎 What it does

- Loads multiple CSV files
- Merges them into one dataset
- Reads the class categories from the last column, the basis for the class distribution
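
To make the class-distribution step concrete, here is a minimal plain-PHP sketch that counts rows per class. It assumes the label is the last field of each row, as the snippet above suggests. `classDistribution()` and the toy rows are hypothetical and do not use the PHP-ML API.

```php
<?php
// Hypothetical helper: count rows per class, assuming the label is the last field.
function classDistribution(iterable $rows): array
{
    $counts = [];
    foreach ($rows as $row) {
        $label = (string) end($row);
        $counts[$label] = ($counts[$label] ?? 0) + 1;
    }
    arsort($counts); // largest class first
    return $counts;
}

// Toy rows standing in for parsed CSV lines (features..., label).
$rows = [
    [0.01, 0.03, 'bacteria'],
    [0.02, 0.00, 'virus'],
    [0.00, 0.05, 'bacteria'],
    [0.04, 0.01, 'plant'],
];

foreach (classDistribution($rows) as $label => $n) {
    printf("%-10s %d (%.1f%%)\n", $label, $n, 100 * $n / count($rows));
}
```

A strongly skewed distribution is worth knowing about before training, because overall accuracy can then look good while small classes are misclassified.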

---

🧠 Step 2: Training & Evaluation (`train.php`)
-------------------------------------------------

Train a neural network using `MLPClassifier`.

```
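// Transformers are listed first, followed by the classifier they feed.
$pipeline = new Pipeline(
    [new NumericStringConverter(), new ZScaleStandardizer()],
    new MLPClassifier(
        architecture: [32, 16],
        epochs: 10,
        learningRate: 0.01,
        batchSize: 32
    )
);

// Fix the seed so the shuffle, and therefore the split, is repeatable.
Dataset::seed(42);
$dataset->randomize();

// 80% of the rows for training, the remaining 20% for validation.
[$train, $val] = $dataset->split(0.8);

$pipeline->train($train);

// Score predictions against the held-out validation labels.
$valPreds = $pipeline->predict($val);
$valAcc = (new Accuracy())->score($valPreds, $val->labels());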
```
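
The name `ZScaleStandardizer` suggests z-score standardization, where each feature is rescaled as (x − mean) / std. The sketch below illustrates that idea in plain PHP. It is an assumption about what the transformer does, not its actual implementation, and `fitZScore()` and `applyZScore()` are hypothetical names.

```php
<?php
// Hypothetical helpers, not part of PHP-ML: column-wise z-score scaling.

/** Learn the per-column mean and population standard deviation. */
function fitZScore(array $rows): array
{
    $n = count($rows);
    $d = count($rows[0]);
    $mean = array_fill(0, $d, 0.0);
    $std  = array_fill(0, $d, 0.0);

    foreach ($rows as $row) {
        foreach ($row as $j => $x) {
            $mean[$j] += $x;
        }
    }
    foreach ($mean as $j => $sum) {
        $mean[$j] = $sum / $n;
    }
    foreach ($rows as $row) {
        foreach ($row as $j => $x) {
            $std[$j] += ($x - $mean[$j]) ** 2;
        }
    }
    foreach ($std as $j => $sumSq) {
        $sd = sqrt($sumSq / $n);
        $std[$j] = $sd > 1e-12 ? $sd : 1.0; // constant column: avoid dividing by zero
    }
    return [$mean, $std];
}

/** Apply (x - mean) / std to every value, column by column. */
function applyZScore(array $rows, array $mean, array $std): array
{
    $out = [];
    foreach ($rows as $row) {
        $scaled = [];
        foreach ($row as $j => $x) {
            $scaled[$j] = ($x - $mean[$j]) / $std[$j];
        }
        $out[] = $scaled;
    }
    return $out;
}

$train = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]];
[$mean, $std] = fitZScore($train);
print_r(applyZScore($train, $mean, $std));
```

The statistics are learned from the training rows only and then reused for any other split, so validation data does not influence the scaling.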

### ⚡ Training Details

- **Train Samples:** 195,558
- **Validation Samples:** 48,889
- **Validation Accuracy:** ~90.07%
- **Training Time:** ~20 seconds
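
The sample counts above follow from the 0.8 split ratio applied to the 244,447 rows. A small sketch of the arithmetic, assuming the training share is rounded to the nearest whole row (`splitCounts()` is a hypothetical helper, not a PHP-ML function):

```php
<?php
// Hypothetical helper: number of rows in each part for a given split ratio.
// Assumption: the training share is rounded to the nearest whole row.
function splitCounts(int $total, float $ratio): array
{
    $train = (int) round($total * $ratio);
    return [$train, $total - $train];
}

[$trainCount, $valCount] = splitCounts(244447, 0.8);
printf("train=%d validation=%d\n", $trainCount, $valCount);
// train=195558 validation=48889
```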

---

🔮 Step 3: Prediction (`predict.php`)
------------------------------------

Use a trained model to classify new DNA sequences.

```
// ── 1. Load model + class map ─────────────────────────────────────────────────
$logger->info('Loading model …');
$pipeline = Pipeline::load($modelDir);
$classes  = json_decode(file_get_contents($modelDir . '/classes.json'), true);
$logger->info('Model loaded', ['classes' => $classes]);

// ── 2. Load unknown CSV ───────────────────────────────────────────────────────
$logger->info('Loading unknown data …');
$df   = DataFrame::fromCSV($unknownCsv, false);
$cols = $df->columns();

// Check if last col is a label (STRING) or a feature (float32)
$dtypes      = $df->dtypes();
$lastCol     = end($cols);
$hasLabels   = ($dtypes[$lastCol] === 'string');

$X       = $df->drop($hasLabels ? [$lastCol] : [])->toTensor();
$dataset = new Dataset($X);
$logger->info('Data ready', ['rows' => $dataset->numRows(), 'features' => $dataset->numColumns()]);

// ── 3. Predict ────────────────────────────────────────────────────────────────
$logger->info('Predicting …');
$predIndices = $pipeline->predict($dataset)->toFlatArray();   // [N] class indices

// ── 4. Evaluate if labels available ──────────────────────────────────────────
if ($hasLabels) {
    $yTrue = $df->castToFloat($lastCol)->col($lastCol)->squeeze();
    $predT = \Pml\Tensor::fromArray($predIndices);
    $acc   = (new Accuracy())->score($predT, $yTrue);
    $logger->info(sprintf('Test accuracy: %.4f  (%.2f%%)', $acc, $acc * 100));
}
```
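
The script yields class indices, so a final step is usually needed to turn them into readable labels. The sketch below assumes `classes.json` decodes to a list whose position is the class index. The real file layout is not shown here, and the class order in the example is made up. `indicesToLabels()` is a hypothetical helper.

```php
<?php
// Assumption: classes.json decodes to a list where position == class index.
// The order below is illustrative only.
$classes = json_decode(
    '["animal","bacteria","fungi","plant","virus"]',
    true,
    512,
    JSON_THROW_ON_ERROR
);

// Toy stand-in for the indices returned by the model.
$predIndices = [1, 1, 4, 0, 3];

// Hypothetical helper: map each index to its label, tolerating float indices.
function indicesToLabels(array $indices, array $classes): array
{
    return array_map(
        fn ($i): string => $classes[(int) $i] ?? 'unknown',
        $indices
    );
}

$labels = indicesToLabels($predIndices, $classes);
print_r(array_count_values($labels));
```

Casting to `int` covers the case where predictions arrive as floats such as `1.0`, and the `'unknown'` fallback keeps an out-of-range index from raising a warning.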

---

▶️ Running the Example
----------------------

### 🔍 EDA

```
php eda.php
```

### 🧠 Training

```
php train.php       # softmax

php trainMLP.php
```

### 🔮 Prediction

```
php predict.php
```

---

📊 Interpreting Results
----------------------

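- **Accuracy** → Fraction of samples whose predicted class matches the true class
- **Multi-class Predictions** → Each prediction is a class index that resolves to one of the 5 labels through `classes.json`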
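
As a rough illustration of these two points, the sketch below computes overall accuracy and a per-class breakdown in plain PHP. Both functions are hypothetical and independent of PHP-ML's `Accuracy` class. The per-class view is an optional addition that shows which classes drive the overall number.

```php
<?php
// Hypothetical helper: share of predictions that equal the true label.
function accuracy(array $pred, array $true): float
{
    if ($pred === [] || count($pred) !== count($true)) {
        throw new InvalidArgumentException('Inputs must be non-empty and equal in length.');
    }
    $hits = 0;
    foreach ($pred as $i => $p) {
        if ($p === $true[$i]) {
            $hits++;
        }
    }
    return $hits / count($pred);
}

// Hypothetical helper: for each true class, the share of its samples predicted correctly.
function perClassRecall(array $pred, array $true): array
{
    $seen = [];
    $hit = [];
    foreach ($true as $i => $t) {
        $seen[$t] = ($seen[$t] ?? 0) + 1;
        $hit[$t]  = ($hit[$t] ?? 0) + (int) ($pred[$i] === $t);
    }
    $recall = [];
    foreach ($seen as $class => $n) {
        $recall[$class] = $hit[$class] / $n;
    }
    return $recall;
}

$true = [0, 0, 0, 1, 1, 2];
$pred = [0, 0, 1, 1, 1, 0];

printf("accuracy: %.4f\n", accuracy($pred, $true));
print_r(perClassRecall($pred, $true));
```

In this toy data the overall accuracy is 4 out of 6, yet class 2 is never predicted correctly, which a single accuracy figure would hide.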

---

🚀 Extending the Tutorial
------------------------

- Increase epochs beyond the 10 used here, which may improve accuracy (re-check on the validation split)
- Try deeper architectures
- Experiment with other classifiers
- Add cross-validation
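
For the cross-validation idea, one possible starting point is to partition row indices into k folds and rotate which fold is held out. The sketch below only builds the index sets. `kFoldIndices()` is a hypothetical helper, and wiring it to the pipeline's `train()` and `predict()` calls is left open because that part of the API is not covered here.

```php
<?php
// Hypothetical helper: split row indices 0..n-1 into k shuffled folds.
function kFoldIndices(int $n, int $k, int $seed = 42): array
{
    if ($k < 2 || $n < $k) {
        throw new InvalidArgumentException('Need k >= 2 and n >= k.');
    }
    $indices = range(0, $n - 1);
    mt_srand($seed);   // seed the generator so the shuffle is repeatable
    shuffle($indices);

    // Deal the shuffled indices round-robin so fold sizes differ by at most one.
    $folds = array_fill(0, $k, []);
    foreach ($indices as $pos => $idx) {
        $folds[$pos % $k][] = $idx;
    }
    return $folds;
}

$folds = kFoldIndices(10, 5);
foreach ($folds as $f => $valIdx) {
    // Every fold except the current one forms the training set.
    $trainIdx = array_merge(...array_values(array_diff_key($folds, [$f => true])));
    printf("fold %d: %d train / %d validation\n", $f, count($trainIdx), count($valIdx));
}
```

Averaging the validation accuracy over all folds gives a steadier estimate than a single 80/20 split, at the cost of training the model k times.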

---

🏁 Conclusion
------------

You now have a complete workflow for building a **multi-class DNA classifier in PHP**.

---

❤️ Final Note
-------------

Push PHP beyond traditional limits — even into machine learning.

**Happy coding! 🚀**
