PHPackages                             jpuck/etl - PHPackages - PHPackages  [Skip to content](#main-content)[PHPackages](/)[Directory](/)[Categories](/categories)[Trending](/trending)[Leaderboard](/leaderboard)[Changelog](/changelog)[Analyze](/analyze)[Collections](/collections)[Log in](/login)[Sign up](/register)

1. [Directory](/)
2. /
3. [Utility &amp; Helpers](/categories/utility)
4. /
5. jpuck/etl

ActiveLibrary[Utility &amp; Helpers](/categories/utility)

jpuck/etl
=========

Tools for extract, transform, and load processes.

0.7.0(9y ago)041[12 issues](https://github.com/jpuck/etl/issues)GPL-3.0+PHPPHP ^7.0

Since Sep 26Pushed 9y ago1 watchersCompare

[ Source](https://github.com/jpuck/etl)[ Packagist](https://packagist.org/packages/jpuck/etl)[ RSS](/packages/jpuck-etl/feed)WikiDiscussions master Synced today

READMEChangelog (5)Dependencies (3)Versions (12)Used By (0)

PHP tools for ETL
=================

[](#php-tools-for-etl)

This is a collection of PHP 7 classes useful for [extracting, transforming, and loading](https://en.wikipedia.org/w/index.php?title=Extract,_transform,_load&oldid=738013120) data between sources.

Hierarchical XML and JSON can be automatically converted to relational SQL. Support includes extracting data documents from a file system or REST API, and then loading the data into a DBMS like Microsoft SQL Server.

Values are surveyed for datatypes, numeric cardinality, and unique natural key candidates. Then this information is used to create a normalized multi-table database structure suited to insert the data.

BranchTestsCode Coverage[master](https://github.com/jpuck/etl/tree/master)[![Build Status](https://camo.githubusercontent.com/a24373b69e0d1ed6051a4bfcbcd7e28adda5835c738b37ff873ab6d562fa1cd4/68747470733a2f2f7472617669732d63692e6f72672f6a7075636b2f65746c2e7376673f6272616e63683d6d6173746572)](https://travis-ci.org/jpuck/etl)[![Codecov](https://camo.githubusercontent.com/f541bc4952d7e573f6b0e95f590a4d3037a268be4df05be93ad48340d7bcccbe/68747470733a2f2f696d672e736869656c64732e696f2f636f6465636f762f632f6769746875622f6a7075636b2f65746c2f6d61737465722e737667)](https://codecov.io/gh/jpuck/etl/branch/master)[dev](https://github.com/jpuck/etl/tree/dev)[![Build Status](https://camo.githubusercontent.com/7e8bda93e6655769087c9819546606ec2c42386e1e114f6c4aa100e80dbd24d3/68747470733a2f2f7472617669732d63692e6f72672f6a7075636b2f65746c2e7376673f6272616e63683d646576)](https://travis-ci.org/jpuck/etl)[![Codecov](https://camo.githubusercontent.com/9b86c18b0244bb3ae1dab7382573f5adeb6f4bf52003816edd2b77572aef79a8/68747470733a2f2f696d672e736869656c64732e696f2f636f6465636f762f632f6769746875622f6a7075636b2f65746c2f6465762e737667)](https://codecov.io/gh/jpuck/etl/branch/dev)Requirements
------------

[](#requirements)

PHP &gt;= 7.0

Installation
------------

[](#installation)

This library is registered on [packagist](https://packagist.org/packages/jpuck/etl) and can be easily installed into your project using [composer](https://getcomposer.org/).

```
composer require jpuck/etl

```

---

Getting Started
===============

[](#getting-started)

There are 3 basic groups of interrelated classes: *Sources* provide *Data* which have *Schemata*.

1. Sources

    Sources extend the abstract `Source` class and transport `Datum` objects. In particular, the abstract `DB` class has concrete class implementations such as `MicrosoftSQLServer`.
2. Data

    Data classes extend `Datum` and must implement a valid parser, satisfied by `ParseValidator`. It uses the `Schematizer` to construct the object from raw data, which can be overridden by passing an existing `Schema`.
3. Schemata

    A `Schema` is a concrete class with a `Validator` to enforce structure. The `Merger` class is for combining Schemas to create super-set Schemas. The `DDL` trait is used by the `DB` class to generate [SQL Data Definition Language](https://en.wikipedia.org/wiki/Data_definition_language) which contains abstract methods to be implemented by a specific database management system.

Schematizer
-----------

[](#schematizer)

The `Schematizer` class is for surveying the structure of the data. It includes node names, the [count of distinct element groupings](http://stackoverflow.com/q/39260573/4233593), numeric cardinality for relationships between subnodes, and descriptive statistics about the values including uniqueness. Categorically, it recognizes datetime, integer, and decimal datatypes. Decimals will include scale and precision measurements suitable for SQL.

`Schematizer::getPrecision` returns the scale and precision of numeric values suitable for the SQL `DECIMAL(scale,precision)` datatype. This function has notable behavior in that trailing zeros are discarded when passed as raw PHP float types. However, when passed as a string, then the trailing zeros are preserved in the precision. See `SchematizerUtilitiesTest::precisionDataProvider` for examples. Note that in the `XML` class, parsed values are represented as strings in PHP, so *trailing zeros should be represented in the precision values*.

```
node name
├── count
│   ├── max
│   │   ├── measure
│   │   └── value
│   └── min
│       ├── measure
│       └── value
├── unique (all values)
├── primaryKey
├── varchar          ────┐
│   ├── max              │
│   │   ├── measure      │
│   │   └── value        │
│   └── min              │
│       ├── measure      │
│       └── value        │
├── datetime             ├ datatypes
│   ├── max              │
│   │   └── value        │
│   └── min              │
│       └── value        │
├── int/decimal          │
│   ├── max              │
│   │   └── value        │
│   └── min              │
│       └── value    ────┘
├── scale            ────┐
│   ├── max              │
│   │   ├── measure      │
│   │   └── value        │
│   └── min              │
│       ├── measure      │
│       └── value        │
├── precision            ├ if decimal
│   ├── max              │
│   │   ├── measure      │
│   │   └── value        │
│   └── min              │
│       ├── measure      │
│       └── value    ────┘
├── children
│   ├── distinct (count of children)
│   └── count
│       ├── max
│       │   └── measure
│       └── min
│           └── measure
├── attributes
│   └── ... (excluding count, which must be 1)
└── elements
    └── ...

```

Database Connections
--------------------

[](#database-connections)

The `DB` class requires an instance of [`PDO`](http://php.net/manual/en/book.pdo.php) in the constructor to connect, but it is possible to pass a `null` value if only utilizing the class for DDL.

SQL Data Definition Language
----------------------------

[](#sql-data-definition-language)

When one-to-many XML nodes are used to represent one-to-one relationships, then the `Schematizer` recognizes this and a `DDL` class flattens them as columns on a table. If a node has more than one of its name or grandchildren, then the one-to-many relationship is preserved in a separate normalized table. Surrogate keys are created to maintain the Primary/Foreign Key referential integrity.

If the `Schema` has a `primaryKey` set, then that field will be used for DDL generation instead of the surrogate. However, this `Schema` must also be passed to the `Datum` constructor prior to being used with `DB::insert`, otherwise the surrogate keys will be used by default and will result in a failed insertion if the surrogate columns don't exist.

Saving Schemas
--------------

[](#saving-schemas)

Generating Schemas can take a long time and may require customization, such as adding `primaryKey` flags or removing unwanted fields to be ignored. Here are some examples for exporting and importing:

```
$xml = file_get_contents("sample.xml");
$xml = new XML($xml);
$schema = $xml->schema();

// normal JSON_PRETTY_PRINT
echo $schema;
```

By simply echoing the object, output can be redirected to a file from console:

```
php script.php > myschema.json

```

Use [`file_put_contents`](http://php.net/manual/en/function.file-put-contents.php) to write to disk. `Schema::toJSON` accepts all the [`json_encode`](http://php.net/manual/en/function.json-encode.php) options.

```
$string = $schema->toJSON(JSON_UNESCAPED_UNICODE);
file_put_contents('myschema.json', $string);

// native php array
$array = var_export($schema->toArray(), true);
$array = "
