hocvt/php-apache-tika
=====================

Apache Tika bindings for PHP: extracts text from documents and images (with OCR), metadata and more...

Version v0.9.2 · License: MIT · Requires PHP >=5.4.0

[Source](https://github.com/vuthaihoc/php-apache-tika) · [Packagist](https://packagist.org/packages/hocvt/php-apache-tika)

[![Current release](https://camo.githubusercontent.com/4b242251ae25b8fa02dc7082db4aef75946f64fead2f58cba5ba592238b87078/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f72656c656173652f7661697465732f7068702d6170616368652d74696b612e737667)](https://github.com/vaites/php-apache-tika/releases/latest)[![Package at Packagist](https://camo.githubusercontent.com/cc3b274dc89a2e63c63c91d538685041d89f37500467e1a1823d86a828894db0/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f64742f7661697465732f7068702d6170616368652d74696b612e737667)](https://packagist.org/packages/vaites/php-apache-tika)[![Build status](https://camo.githubusercontent.com/8acaefcebb583e41f8c6c9c7e713365783b3d9b1cea76afa236a8a77ecb0647f/68747470733a2f2f7472617669732d63692e6f72672f7661697465732f7068702d6170616368652d74696b612e7376673f6272616e63683d6d6173746572)](https://travis-ci.org/vaites/php-apache-tika)[![Code coverage](https://camo.githubusercontent.com/3e0f3dd6ff254131e98a3a14ba10c9f068203e6859b7f448fe70ee82e336dd9f/68747470733a2f2f696d672e736869656c64732e696f2f636f6465636f762f632f6769746875622f7661697465732f7068702d6170616368652d74696b612e737667)](https://codecov.io/github/vaites/php-apache-tika)[![Code quality](https://camo.githubusercontent.com/220c7994ed1f02fedf98b1de44ec9f12e45445a0f41a708148c3f22b4c35076c/68747470733a2f2f696d672e736869656c64732e696f2f7363727574696e697a65722f7175616c6974792f672f7661697465732f7068702d6170616368652d74696b612e737667)](https://scrutinizer-ci.com/g/vaites/php-apache-tika/)[![Code 
insight](https://camo.githubusercontent.com/6014074625e8025aad69fa6821c77cbd88b3843b2f2f69eea9331643ca639780/68747470733a2f2f696d672e736869656c64732e696f2f73656e73696f6c6162732f692f65633036363530322d306664652d343435352d396663332d3865396665363836373833342e737667)](https://insight.sensiolabs.com/projects/ec066502-0fde-4455-9fc3-8e9fe6867834)[![License](https://camo.githubusercontent.com/dd51a0f2af19b522b827de53cdee7643705c98a81405d2779e5be6ccc5e7c2e9/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6c6963656e73652f7661697465732f7068702d6170616368652d74696b612e7376673f636f6c6f723d253233393939393939)](https://github.com/vaites/php-apache-tika/blob/master/LICENSE)

PHP Apache Tika
===============

This tool provides [Apache Tika](https://tika.apache.org) bindings for PHP, allowing you to extract text and metadata from documents, images and other formats.

The following modes are supported:

- **App mode**: run the app JAR via the command-line interface
- **Server mode**: make HTTP requests to the [JSR 311 network server](https://cwiki.apache.org/confluence/display/TIKA/TikaServer)

Server mode is recommended because it is 5 times faster, but some shared hosts don't allow running processes in the background.

Although the library contains a list of supported versions, any version of Apache Tika should be compatible as long as the Tika team maintains backward compatibility. Therefore, you do not need to wait for a library update to work with new versions of Tika.

Features
--------

- Simple class interface to Apache Tika features:
    - Text and HTML extraction
    - Metadata extraction
    - OCR recognition
- Standardized metadata for documents
- Support for local and remote resources
- No heavyweight library dependencies
- Compatible with Apache Tika 1.7 or greater
    - Tested up to 1.24

Requirements
------------

- PHP 5.4 or greater
    - [Multibyte String support](http://php.net/manual/en/book.mbstring.php)
    - [cURL extension](http://php.net/manual/en/book.curl.php)
- Apache Tika 1.7 or greater
- Oracle Java or OpenJDK
    - Java 6 for Tika up to 1.9
    - Java 7 for Tika 1.10 or greater
- [Tesseract](https://github.com/tesseract-ocr/tesseract) (optional for OCR recognition)

Installation
------------

Install using Composer:

```
composer require vaites/php-apache-tika
```

If you want to use OCR you must install [Tesseract](https://github.com/tesseract-ocr/tesseract):

- **Fedora/CentOS**: `sudo yum install tesseract` (use dnf instead of yum on Fedora 22 or greater)
- **Debian/Ubuntu**: `sudo apt-get install tesseract-ocr`
- **Mac OS X**: `brew install tesseract` (using [Homebrew](http://brew.sh))

The library assumes the `tesseract` binary is in the `PATH`, so you can compile it yourself or install it using any other method.

Usage
-----

Start the Apache Tika server with [caution](http://www.openwall.com/lists/oss-security/2015/08/13/5):

```
java -jar tika-server-x.xx.jar
```

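If you are using a JRE instead of a JDK and have Java 9 or greater, you must run the following instead (the `java.se.ee` module was removed in Java 11, so this flag only applies to Java 9 and 10):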

```
java --add-modules java.se.ee -jar tika-server-x.xx.jar
```

Instantiate the class; `make()` checks that the JAR exists or that the server is running:

```
$client = \Vaites\ApacheTika\Client::make('localhost', 9998);           // server mode (default)
$client = \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar');     // app mode
```
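
Since `make()` performs its check right away, a failed check has to be handled somewhere. The sketch below assumes that a failed check surfaces as an exception (this README does not say which type, so the base `\Exception` is caught). `firstAvailableClient()` is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical helper, not part of the library: calls each factory in order
// and returns the first client that is created without an exception.
function firstAvailableClient(array $factories)
{
    $errors = [];

    foreach ($factories as $name => $factory) {
        try {
            return call_user_func($factory);
        } catch (\Exception $e) {
            // Assumption: a failed JAR/server check is reported as an exception.
            $errors[] = $name . ': ' . $e->getMessage();
        }
    }

    throw new \RuntimeException('No Tika client available (' . implode('; ', $errors) . ')');
}

// Usage, assuming the library is installed and autoloaded:
// $client = firstAvailableClient([
//     'server' => function () { return \Vaites\ApacheTika\Client::make('localhost', 9998); },
//     'app'    => function () { return \Vaites\ApacheTika\Client::make('/path/to/tika-app.jar'); },
// ]);
```

Trying server mode first keeps the faster path as the default and falls back to app mode only when the server cannot be reached.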

If you want to use dependency injection, serialize the class or just delay the check:

```
$client = \Vaites\ApacheTika\Client::prepare('localhost', 9998);
$client = \Vaites\ApacheTika\Client::prepare('/path/to/tika-app.jar');
```

You can use a URL too:

```
$client = \Vaites\ApacheTika\Client::make('http://localhost:9998');
$client = \Vaites\ApacheTika\Client::prepare('http://localhost:9998');
```

Use the class to extract text from documents:

```
$language = $client->getLanguage('/path/to/your/document');
$metadata = $client->getMetadata('/path/to/your/document');

$html = $client->getHTML('/path/to/your/document');
$text = $client->getText('/path/to/your/document');
```
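
The calls above can be grouped into one small function when several results are needed for the same file. This is a minimal sketch: `describeDocument()` is a hypothetical helper, `$client` is any object exposing the methods listed in this README, and `getText()` is assumed to return a string.

```php
<?php
// Hypothetical helper, not part of the library: gathers the results of
// several extraction calls for one file into a single array.
function describeDocument($client, $file)
{
    return [
        'file'     => $file,
        'mime'     => $client->getMIME($file),
        'language' => $client->getLanguage($file),
        'metadata' => $client->getMetadata($file),
        // Assumption: getText() returns a string, so trim() is safe here.
        'text'     => trim($client->getText($file)),
    ];
}

// Usage, assuming $client was created with Client::make() as shown above:
// $info = describeDocument($client, '/path/to/your/document');
// echo $info['mime'], ' (', $info['language'], ')', PHP_EOL;
```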

Or use it to extract text from images:

```
$client = \Vaites\ApacheTika\Client::make($host, $port);
$metadata = $client->getMetadata('/path/to/your/image');

$text = $client->getText('/path/to/your/image');
```

You can use a URL instead of a file path and the library will download the file and pass it to Apache Tika. There's **no need** to add `-enableUnsecureFeatures -enableFileUrl` to the command line when starting the server, as described [here](https://wiki.apache.org/tika/TikaJAXRS#Specifying_a_URL_Instead_of_Putting_Bytes).
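
Because remote files are fetched by the library itself, it can be worth restricting which URLs reach the client when they come from user input. A minimal sketch, assuming `$client` is any object with the `getText()` method shown above; `getRemoteText()` is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical helper, not part of the library: only lets http(s) URLs
// through to the client, so other schemes (for example file://) are rejected
// before any download is attempted.
function getRemoteText($client, $url)
{
    // parse_url() returns null or false when no scheme can be found;
    // the cast turns both into an empty string.
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));

    if (!in_array($scheme, ['http', 'https'], true)) {
        throw new \InvalidArgumentException('Only http and https URLs are accepted: ' . $url);
    }

    return $client->getText($url);
}

// Usage, assuming $client was created as shown above:
// $text = getRemoteText($client, 'https://example.com/report.pdf');
```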

### Methods

Here is the full list of available methods.

#### Common

Tika file related methods:

```
$client->getMetadata($file);
$client->getRecursiveMetadata($file, 'text');
$client->getLanguage($file);
$client->getMIME($file);
$client->getHTML($file);
$client->getText($file);
$client->getMainText($file);
```

Other Tika related methods:

```
$client->getSupportedMIMETypes();
$client->getAvailableDetectors();
$client->getAvailableParsers();
$client->getVersion();
```

Encoding methods:

```
$client->getEncoding();
$client->setEncoding('UTF-8');
```

Supported versions related methods:

```
$client->getSupportedVersions();
$client->isVersionSupported($version);
```
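
Since a Tika version outside the supported list may still work, a soft warning is often more useful than a hard failure. The sketch below rests on two assumptions that this README does not confirm: that `getVersion()` may return the number with a textual prefix (such as `Apache Tika 1.24`), and that `isVersionSupported()` expects the bare number. Both functions are hypothetical helpers.

```php
<?php
// Hypothetical helper: extracts "1.24" from strings such as "Apache Tika 1.24".
// Returns null when no dotted version number is found.
function extractVersionNumber($versionString)
{
    if (preg_match('/\d+(?:\.\d+)+/', $versionString, $matches)) {
        return $matches[0];
    }

    return null;
}

// Hypothetical helper: returns a warning message, or null when the running
// Tika version is in the library's supported list.
function warnIfUnsupported($client)
{
    $number = extractVersionNumber((string) $client->getVersion());

    // Assumption: isVersionSupported() takes the bare version number.
    if ($number === null || !$client->isVersionSupported($number)) {
        return 'Tika version "' . $number . '" is not in the supported list; it may still work.';
    }

    return null;
}
```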

Set/get a callback for sequential read of the response:

```
$client->setCallback($callback);
$client->getCallback();
```

Set/get the chunk size for sequential read:

```
$client->setChunkSize($size);
$client->getChunkSize();
```
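
A callback is mostly useful for large documents, where holding the whole response in memory is undesirable. The sketch below writes each chunk straight to a file. It assumes the library passes every chunk to the callback as a single string argument and that the chunk size is given in bytes; neither is stated in this README. `makeChunkWriter()` is a hypothetical helper.

```php
<?php
// Hypothetical helper, not part of the library: returns a callback that
// appends each chunk to a file instead of buffering the extracted text.
function makeChunkWriter($path)
{
    $handle = fopen($path, 'wb');

    if ($handle === false) {
        throw new \RuntimeException('Cannot open ' . $path . ' for writing');
    }

    // Assumption: the library calls this closure once per chunk, with the
    // chunk as a string. The file is closed when the closure is destroyed.
    return function ($chunk) use ($handle) {
        fwrite($handle, $chunk);
    };
}

// Usage, assuming $client was created as shown earlier in this README:
// $client->setChunkSize(65536);   // assumed to be a size in bytes
// $client->setCallback(makeChunkWriter('/tmp/extracted.txt'));
// $client->getText('/path/to/your/document');
```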

Enable/disable the internal remote file downloader:

```
$client->setDownloadRemote(true);
$client->getDownloadRemote();
```

#### Command line client

Set/get JAR/Java paths (only CLI mode):

```
$client->setPath($path);
$client->getPath();

$client->setJava($java);
$client->getJava();
```

#### Web client

Set/get host properties:

```
$client->setHost($host);
$client->getHost();

$client->setPort($port);
$client->getPort();

$client->setUrl($url);
$client->getUrl();

$client->setRetries($retries);
$client->getRetries();
```

Set/get [cURL client options](http://php.net/manual/en/function.curl-setopt.php):

```
$client->setOptions($options);
$client->getOptions();
$client->setOption($option, $value);
$client->getOption($option);
```

Set/get cURL client common options:

```
$client->setTimeout($seconds);
$client->getTimeout();
```
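
The web client setters can be applied in one place so that every client in an application gets the same network settings. A minimal sketch: `applyWebSettings()` is a hypothetical helper, and the setters are called one by one because this README does not say whether they return the client for chaining.

```php
<?php
// Hypothetical helper, not part of the library: applies timeout, retries and
// extra cURL options to a client created in server mode.
function applyWebSettings($client, $timeout, $retries, array $curlOptions = [])
{
    $client->setTimeout($timeout);
    $client->setRetries($retries);

    // Keys are expected to be CURLOPT_* constants, as accepted by curl_setopt().
    foreach ($curlOptions as $option => $value) {
        $client->setOption($option, $value);
    }

    return $client;
}

// Usage, assuming $client was created in server mode and the cURL extension
// is loaded:
// applyWebSettings($client, 60, 3, [CURLOPT_CONNECTTIMEOUT => 5]);
```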

Troubleshooting
---------------

### Empty responses or unexpected results

This library is only a *proxy*, so if you get an empty response or unexpected results, the most common cause is Tika itself. A simple test is to use the GUI to check the response:

1. Run the Tika app without arguments: `java -jar tika-app-x.xx.jar`
2. Drop your file or select it using *File -> Open*
3. Wait until the metadata appears
4. Get the text or HTML using the *View* menu

If the results are the same, check [Tika's Jira](https://issues.apache.org/jira/projects/TIKA/issues) and open an issue if necessary.

### Encoding

By default the returned text is encoded as UTF-8, but there are some encoding issues when using app mode. The `Client::setEncoding()` method lets you set the expected encoding (this will be fixed in the upcoming 1.0 release).
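
When text from app mode still arrives in another encoding, it can be normalized after extraction with the Multibyte String extension that the library already requires. This is a minimal sketch and not the library's own mechanism: `ensureUtf8()` is a hypothetical helper, and the fallback encoding is a guess that the caller has to supply.

```php
<?php
// Hypothetical helper, not part of the library: returns $text unchanged when
// it is already valid UTF-8, otherwise converts it from $assumedEncoding.
function ensureUtf8($text, $assumedEncoding = 'ISO-8859-1')
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }

    // The source encoding cannot be detected reliably, so it is an assumption
    // made by the caller.
    return mb_convert_encoding($text, 'UTF-8', $assumedEncoding);
}

// Usage, assuming $client was created in app mode:
// $text = ensureUtf8($client->getText('/path/to/your/document'));
```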

Tests
-----

Tests are designed to **cover all features for all supported versions** of Apache Tika in app mode and server mode. There are a few samples to test against:

- **sample1**: document metadata and text extraction
- **sample2**: image metadata
- **sample3**: text recognition
- **sample4**: unsupported media
- **sample5**: huge text for callbacks
- **sample6**: remote calls
- **sample7**: text encoding

Known issues
------------

Some issues were found during tests that are not related to this library:

- Version 1.9 running on Java 7 in server mode throws random 500 errors (*Unexpected RuntimeException*)
- Version 1.14 in server mode throws random errors (*Expected ';', got ','*) when parsing image metadata
- Tesseract slows down document parsing, as described in [TIKA-2359](https://issues.apache.org/jira/browse/TIKA-2359)

Integrations
------------

- [Symfony2 Bundle](https://github.com/welcoMattic/ApacheTikaBundle)
