jocoon/parquet
==============

Thrift-based PHP implementation for using the Apache Parquet format

php-parquet
===========

[![Build Status (Github Actions)](https://github.com/Jocoon/php-parquet/actions/workflows/test.yml/badge.svg?branch=master)](https://github.com/Jocoon/php-parquet/actions/workflows/test.yml)

[![GitHub Workflow Status (event)](https://camo.githubusercontent.com/e4f62ae64fd7478ae8f8a04ddc048684611bac1588446e9fda7e5746bd4a5439/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f776f726b666c6f772f7374617475732f6a6f636f6f6e2f7068702d706172717565742f556e697425323054657374733f6576656e743d70757368266c6162656c3d72656c656173652532306275696c64)](https://github.com/Jocoon/php-parquet/actions/workflows/test.yml?query=event%3Apush)[![GitHub Workflow Status (event)](https://camo.githubusercontent.com/8b65a101d233d978f1409152b44bde2635ed8b34a7c971bb645f5b20b04dadf6/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f776f726b666c6f772f7374617475732f6a6f636f6f6e2f7068702d706172717565742f556e697425323054657374733f6576656e743d776f726b666c6f775f6469737061746368266c6162656c3d6465762532306275696c64)](https://github.com/Jocoon/php-parquet/actions/workflows/test.yml?query=event%3Aworkflow_dispatch)

[![Packagist Version](https://camo.githubusercontent.com/09cd7004008995a86d99244928ee4cf5737bf638f72bd9a4fe77c0c347b431b1/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f762f6a6f636f6f6e2f70617271756574)](https://packagist.org/packages/jocoon/parquet)[![Packagist PHP Version Support](https://camo.githubusercontent.com/c8e2886467625cf89b5aaf50fb35b00e788d3b033e93666560be5f07814d8601/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f7068702d762f6a6f636f6f6e2f70617271756574)](https://packagist.org/packages/jocoon/parquet)[![Packagist Downloads](https://camo.githubusercontent.com/e5765dd609c35dfde1472617a9bcb244fb649453f8c1c4245d7856db2ffbc733/68747470733a2f2f696d672e736869656c64732e696f2f7061636b61676973742f64742f6a6f636f6f6e2f706172717565743f6c6162656c3d7061636b6167697374253230696e7374616c6c73)](https://packagist.org/packages/jocoon/parquet)

This is the first Parquet file format reader/writer implementation in PHP, based on the Thrift sources provided by the Apache Foundation. Extensive parts of the code and concepts have been ported from parquet-dotnet, so thanks go out to Ivan Gavryliuk.

This package enables you to read and write Parquet files/streams. It has (almost?) 100% test compatibility with parquet-dotnet, regarding the core functionality, done via PHPUnit.

Migration notice
----------------

As of 2021-11-10 (and `v0.5.1`), this project is moving to a new repository and package due to organizational changes.

---

⭐ **Development continues** at [**github.com/codename-hub/php-parquet**](https://github.com/codename-hub/php-parquet) under the new composer package name [**codename/parquet**](https://packagist.org/packages/codename/parquet). ⭐

---

Please note, this also involves a namespace change from `jocoon\parquet` to `codename\parquet`, but no other breaking changes (a simple search-and-replace across your code should be enough). The project will continue to stick to Semver conventions. The existing package will continue to work, but there will be no bugfixes or new features.
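As a rough illustration of that search-and-replace, the following sketch rewrites the namespace prefix in a piece of PHP source text. The helper name `migrateParquetNamespace` is hypothetical and not part of either package; in practice an IDE-wide replace does the same job.

```php
<?php
// Hypothetical helper (not part of the library): swaps the old root
// namespace for the new one in a string of PHP source code.
function migrateParquetNamespace(string $source): string
{
    // Plain string replacement; both literals contain a single backslash.
    return str_replace('jocoon\\parquet', 'codename\\parquet', $source);
}

$before = "use jocoon\\parquet\\ParquetReader;\nuse jocoon\\parquet\\data\\Schema;\n";
$after  = migrateParquetNamespace($before);

echo $after;
```

Remember to change the composer requirement as well, since the package name changes along with the namespace.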

Please use the new package and file issues/PRs there, if any. The new package is on the verge of providing more essential features of the Parquet format, including new format specs while improving the overall implementation, especially regarding nested and repeated fields.

Preamble
--------

For some parts of this package, new patterns had to be invented, as I couldn't find any implementation that met the requirements. In most cases, there weren't any implementations available at all.

Some highlights:

- **GZIP Stream Wrappers** (that also write headers and checksums) for usage with fopen() and similar functions
- **Snappy Stream Wrappers** (Snappy compression algorithm) for usage with fopen() and similar functions
- Stream Wrappers that specify/open/wrap a **resource id** instead of (or in addition to) a file path or URI
- **TStreamTransport** as a TTransport implementation for pure streaming Thrift data
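
To give an idea of what a stream wrapper around a **resource id** can look like, here is a minimal sketch built on PHP's `stream_wrapper_register()`. It is not the implementation shipped with this package: the class name `ResourceIdStreamWrapper`, the `resid://` scheme and the static registry are assumptions made for illustration only.

```php
<?php
// Hypothetical sketch: a stream wrapper that opens an already-open
// stream by its resource id instead of a file path.
final class ResourceIdStreamWrapper
{
    /** @var resource[] open streams, keyed by resource id */
    private static $registry = [];

    /** @var resource|null set by PHP when a context is passed to fopen() */
    public $context;

    /** @var resource the wrapped inner stream */
    private $inner;

    // Remember a stream and return a URI that refers to it by id.
    public static function uriFor($stream): string
    {
        $id = (int) $stream; // casting a resource to int yields its id
        self::$registry[$id] = $stream;
        return 'resid://' . $id;
    }

    public function stream_open($path, $mode, $options, &$openedPath): bool
    {
        $id = (int) substr($path, strlen('resid://'));
        if (!isset(self::$registry[$id])) {
            return false;
        }
        $this->inner = self::$registry[$id];
        return true;
    }

    public function stream_read($count) { return fread($this->inner, $count); }
    public function stream_write($data) { return fwrite($this->inner, $data); }
    public function stream_eof()        { return feof($this->inner); }
    public function stream_close()      { /* the inner stream stays open */ }
}

stream_wrapper_register('resid', ResourceIdStreamWrapper::class);

// Wrap an in-memory stream and read it back through the wrapper.
$memory = fopen('php://memory', 'w+');
fwrite($memory, 'parquet');
rewind($memory);

$wrapped   = fopen(ResourceIdStreamWrapper::uriFor($memory), 'r');
$roundTrip = fread($wrapped, 7);
```

A compression wrapper would follow the same shape, transforming the data inside `stream_read()` and `stream_write()` before handing it to the inner stream.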

Background
----------

I started developing this library because there was simply no implementation for PHP.

At my company, we needed a quick solution to archive huge amounts of data from a database in a format that is still queryable, extensible from a schema perspective and fault-tolerant. We started testing live 'migrations' via AWS DMS to S3, which ended up crashing on certain amounts of data due to memory limitations. It was also simply too db-oriented, and it made it easy to accidentally delete data from previous loads. As we have a heavily SDS-oriented and platform-agnostic architecture, storing data as a 1:1 clone of a database, like a dump, is not my preferred way. Instead, I wanted the ability to store data structured dynamically, the way I wanted, in the same way DMS was exporting to S3. In the end, that project died for the reasons mentioned above.

But I couldn't get the parquet format out of my head...

The top search result looked promising, suggesting it would not take that much effort to have a PHP implementation, but in fact it did take some (about 2 weeks of non-consecutive work). For me, as a PHP and C# developer, parquet-dotnet was a perfect starting point, not merely because its benchmarks are simply too compelling. But I expected the PHP implementation not to reach these levels of performance, as this is an initial implementation showing the principle. Additionally, no one had done it before.

### Raison d'être

As PHP has a huge share of web-related projects, this is a *MUST-HAVE* in times of growing need for big data applications and scenarios. For my personal motivation, this is a way to show PHP has (physically, virtually?) surpassed its reputation as a 'scripting language'. I think - or at least I hope - there are people out there who will benefit from this package and the message it transports. Not only Thrift objects. Pun intended.

Requirements
------------

You'll need several extensions to use this library to the full extent.

- **bcmath** (today, this should be a must-have anyway)
- **gmp** (for working with arbitrary large integers - and indirectly huge decimals!)
- **zlib** (for GZIP (de-)compression)
- **snappy** (sadly, not published to PECL yet, so you'll have to compile it yourself; see Installation)
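
If you want to verify the list above on a given machine, a small check based on PHP's `extension_loaded()` could look like this. The helper name `missingExtensions` is made up for this sketch and is not provided by the package.

```php
<?php
// Hypothetical helper: returns those names from $required that are
// not loaded in the current PHP runtime.
function missingExtensions(array $required): array
{
    return array_values(array_filter($required, function (string $name): bool {
        return !extension_loaded($name);
    }));
}

$missing = missingExtensions(['bcmath', 'gmp', 'zlib', 'snappy']);

if ($missing !== []) {
    echo 'Missing PHP extensions: ' . implode(', ', $missing) . PHP_EOL;
}
```

A missing `snappy` entry only matters if you actually read or write snappy-compressed files.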

This library was originally developed on and for PHP 7.3, but it should work on PHP > 7 and will be tested on 8, when released. At the moment, tests on PHP 7.1 and 7.2 fail due to some DateTime issues; I'll have a look at it. Tests fully pass on PHP 7.3 and 7.4. At the time of writing, 8.0.0 RC2 is also performing well.

This library highly depends on

- **apache/thrift** for working with the Thrift-related objects and data
- **nelexa/buffer** for reading and writing binary data (I decided not to do a C# BinaryWriter clone. UPDATE 2020-11-04: I just did my own clone, see below.)
- **pear/Math\_BigInteger** for working with binary stored arbitrary-precision decimals (paradox, I know)

As of v0.2, I've also switched to an implementation-agnostic approach of using readers and writers. Now, we're dealing with BinaryReader(Interface) and BinaryWriter(Interface) implementations that abstract the underlying mechanism. I've noticed **mdurrant/php-binary-reader** is just way too slow. I just didn't want to refactor everything just to try out Nelexa's reading powers. Instead, I've made those two interfaces mentioned above to abstract various packages delivering binary reading/writing. This finally leads to an optimal way of testing/benchmarking different implementations - and also mixing, e.g. using wapmorgan's package for reading while using Nelexa's for writing.

As of v0.2.1 I've done the binary reader/writer implementations myself, as no implementation met the performance requirements. Especially for writing, this ultra-lightweight implementation delivers thrice\* the performance of Nelexa's buffer.
\* intended, I love this word

Alternative 3rd party binary reading/writing packages in scope:

- **nelexa/buffer**
- **mdurrant/php-binary-reader** (reading only)
- **wapmorgan/binary-stream**

Installation
------------

Install this package via composer, e.g.

```
composer require jocoon/parquet
```

---

**Please note:** as of 2021-11-10 I'd like to encourage switching to `codename/parquet`, see migration notice above.

---

The included *Dockerfile* gives you an idea of the needed system requirements. The most important step is to clone and install **php-ext-snappy**. At the time of writing, it *has not been published to PECL* yet.

```
...
# NOTE: this is a dockerfile snippet. Bare metal machines will be a little bit different

RUN git clone --recursive --depth=1 https://github.com/kjdev/php-ext-snappy.git \
  && cd php-ext-snappy \
  && phpize \
  && ./configure \
  && make \
  && make install \
  && docker-php-ext-enable snappy \
  && ls -lna

...
```

Please note: php-ext-snappy is a little bit quirky to compile and install on Windows, so these are just brief instructions for installation and usage on Linux-based systems. As long as you don't need snappy compression for reading or writing, you can use php-parquet without compiling the extension yourself.

Helping tools to make life easier
---------------------------------

I've found ParquetViewer by Mukunku to be a great way of looking into the data to be read or verifying some stuff on a Windows desktop machine. At the very least, it helps in understanding certain mechanisms, as it more or less visually assists by simply displaying the data as a table.

API
---

Usage is almost the same as with parquet-dotnet. Please note, PHP has no `using ( ... ) { }` like C#. So you have to make sure to close/dispose unused resources yourself, or let PHP's GC handle it automatically through its refcounting algorithm. (This is the reason why I don't make use of destructors like parquet-dotnet does.)
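
If you miss `using`, a `try`/`finally` block gives you a similar guarantee for plain stream handles. The following is only a sketch; the helper `usingStream` is hypothetical and not part of php-parquet, and it uses an in-memory stream so it runs without any file on disk.

```php
<?php
// Hypothetical helper that mimics C#'s using block for a PHP stream:
// the stream is closed when $body returns or throws.
function usingStream(string $path, string $mode, callable $body)
{
    $stream = fopen($path, $mode);
    if ($stream === false) {
        throw new RuntimeException("Unable to open $path");
    }
    try {
        return $body($stream);
    } finally {
        fclose($stream); // always runs, also when $body throws
    }
}

// Example: write four bytes to an in-memory stream.
$length = usingStream('php://memory', 'w+', function ($stream) {
    return fwrite($stream, 'PAR1');
});
```

Inside the callback you would create your reader or writer on `$stream` and finish it before returning.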

### General remarks

As PHP's type system is completely different from C#'s, we have to make some additions on how to handle certain data types. For example, a PHP integer is nullable, in a way. An **int** in C# isn't. This is a point where I'm still unsure how to deal with it. For now, I've set int (PHP *integer*) to be nullable, whereas parquet-dotnet treats it as not nullable. You can always adjust this behaviour by manually setting `->hasNulls = true;` on your DataField. Additionally, php-parquet uses a dual way of determining a type. In PHP, a primitive has its own type (integer, bool, float/double, etc.). For class instances (especially DateTime/DateTimeImmutable), the type returned by `gettype()` is always object. This is the reason a second property exists on the DataTypeHandlers to match, determine and process it: phpClass.
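
The two-step lookup is easy to see with plain PHP. The sketch below is not the library's handler code; `describeType` is a hypothetical helper that only shows why `gettype()` alone is not enough for objects such as DateTimeImmutable.

```php
<?php
// Hypothetical helper: reports the primitive type and, for objects,
// the class name that a second lookup step would need.
function describeType($value): array
{
    $type = gettype($value);
    return [
        'type'     => $type,
        'phpClass' => $type === 'object' ? get_class($value) : null,
    ];
}

$int  = describeType(42);
$date = describeType(new DateTimeImmutable('2020-11-04'));

print_r($int);
print_r($date);
```

Note that `gettype()` reports floats as `double`, which is why both names show up when talking about PHP primitives.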

At the time of writing, not every DataType supported by parquet-dotnet is supported here. For example, I've skipped Int16, SignedByte and some more, but it shouldn't be too complicated to extend to full binary compatibility.

At the moment, this library serves the **core** functionality needed for reading and writing parquet files/streams. It doesn't include parquet-dotnet's Table, Row, Enumerators/helpers from the C# namespace `Parquet.Data.Rows`.

### Reading files

```
use jocoon\parquet\ParquetReader;

// open file stream (in this example for reading only)
$fileStream = fopen(__DIR__.'/test.parquet', 'r');

// open parquet file reader
$parquetReader = new ParquetReader($fileStream);

// get file schema (available straight after opening parquet reader)
// however, get only data fields as only they contain data values
$dataFields = $parquetReader->schema->GetDataFields();

// enumerate through row groups in this file
for($i = 0; $i < $parquetReader->getRowGroupCount(); $i++)
{
  // create row group reader
  $groupReader = $parquetReader->OpenRowGroupReader($i);
  // read all columns inside each row group (you have the option to read only
  // the required columns if you need to)
  $columns = [];
  foreach($dataFields as $field) {
    $columns[] = $groupReader->ReadColumn($field);
  }

  // get first column, for instance
  $firstColumn = $columns[0];

  // $data member, accessible through ->getData() contains an array of column data
  $data = $firstColumn->getData();

  // Print data or do other stuff with it
  print_r($data);
}
```

### Writing files

```
use jocoon\parquet\ParquetWriter;

use jocoon\parquet\data\Schema;
use jocoon\parquet\data\DataField;
use jocoon\parquet\data\DataColumn;

//create data columns with schema metadata and the data you need
$idColumn = new DataColumn(
  DataField::createFromType('id', 'integer'), // NOTE: this is a little bit different to C# due to the type system of PHP
  [ 1, 2 ]
);

$cityColumn = new DataColumn(
  DataField::createFromType('city', 'string'),
  [ "London", "Derby" ]
);

// create file schema
$schema = new Schema([$idColumn->getField(), $cityColumn->getField()]);

// create file handle with w+ flag, to create a new file - if it doesn't exist yet - or truncate, if it exists
$fileStream = fopen(__DIR__.'/test.parquet', 'w+');

$parquetWriter = new ParquetWriter($schema, $fileStream);

// create a new row group in the file
$groupWriter = $parquetWriter->CreateRowGroup();

$groupWriter->WriteColumn($idColumn);
$groupWriter->WriteColumn($cityColumn);

// As we have no 'using' in PHP, I implemented finish() methods
// for ParquetWriter and ParquetRowGroupWriter

$groupWriter->finish();   // finish inner writer(s)
$parquetWriter->finish(); // finish the parquet writer last
```

Performance
-----------

This package also provides the same benchmark as parquet-dotnet. These are the results on **my machine**:

| | Parquet.Net (.NET Core 2.1) | php-parquet (bare metal 7.3) | php-parquet (dockerized\* 7.3) | Fastparquet (python) | parquet-mr (Java) |
|---|---|---|---|---|---|
| Read | 255ms | 1'090ms | 1'244ms | 154ms\*\* | *untested* |
| Write (uncompressed) | 209ms | 1'272ms | 1'392ms | 237ms\*\* | *untested* |
| Write (gzip) | 1'945ms | 3'314ms | 3'695ms | 1'737ms\*\* | *untested* |

\* Dockerized on a Windows 10 machine with bind-mounts, which slow down most of those high-IOPS processes.

\*\* It seems fastparquet or Python does some internal caching - the original results on first file opening are way worse (~ 2'700ms).

In general, these tests were performed with gzip compression level 6 for php-parquet. The gzip write time will roughly halve at level 1 (minimum compression) and almost double at level 9 (maximum compression). Note, the latter might not yield the smallest file size, but always the longest compression time.
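
To get a feeling for how the gzip level trades time against size on your own machine, you can time zlib's `gzencode()` directly. This sketch does not go through php-parquet's write path, and the sample payload is made up, so the absolute numbers will differ from the table above.

```php
<?php
// Build a highly compressible sample payload (illustrative data only).
$payload = str_repeat("id,city\n1,London\n2,Derby\n", 20000);

$results = [];
foreach ([1, 6, 9] as $level) {
    $start      = microtime(true);
    $compressed = gzencode($payload, $level);

    $results[$level] = [
        'ms'    => (microtime(true) - $start) * 1000,
        'bytes' => strlen($compressed),
        // sanity check: decompressing must give back the original payload
        'ok'    => gzdecode($compressed) === $payload,
    ];
}

foreach ($results as $level => $r) {
    printf("level %d: %.1f ms, %d bytes\n", $level, $r['ms'], $r['bytes']);
}
```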

Coding Style
------------

As this is a partial port of a package from a completely different programming language, the programming style is pretty much a pure mess. I decided to keep most of the casing (e.g. `$writer->CreateRowGroup()` instead of `->createRowGroup()`) to keep a certain 'visual compatibility' with parquet-dotnet. At least from my perspective, this is a desirable state, as it makes comparing and extending much easier during the initial development stages.

Acknowledgements
----------------

Some code parts and concepts have been ported from C#/.NET, namely from parquet-dotnet.

License
-------

php-parquet is licensed under the MIT license. See file LICENSE.

Contributing
------------

Feel free to open a PR, if you want. Info on how to contribute is coming soon.
