jbzoo/csv-blueprint
===================

CLI utility for validating and generating CSV files based on custom rules. It ensures your data meets specified criteria, streamlining data management and integrity checks.

Version 1.1.0 · MIT license · PHP ^8.4 · 573.2k downloads · CI passing · [Source](https://github.com/JBZoo/Csv-Blueprint) · [Packagist](https://packagist.org/packages/jbzoo/csv-blueprint)

CSV Blueprint
=============



Strict and automated line-by-line CSV validation tool based on customizable Yaml schemas.

In seconds, make sure every char in a gigabyte file meets your expectations.

[![Intro](.github/assets/intro.png)](.github/assets/intro.png)

I believe it is the simplest yet most flexible and powerful CSV validator in the world. ☺️

Features
--------


- Just create a simple and [friendly YAML](#schema-definition) file describing your CSV schema, and the tool will validate your files line by line. You will get a very [detailed report](#report-examples) with row, column, and rule accuracy.
- Out of the box, you have access to [over 330 validation rules](schema-examples/full.yml) that can be combined to control the severity of validation.
- You can validate each value (e.g., a date has a strict format on every line) or the entire column (e.g., the median of all values is within limits). It's up to you to choose the severity of the rules.
- Use it anywhere, as it is packaged in [Docker](#usage) and even available as part of your [GitHub Actions](#github-action-format).
- Create a CSV in your pipelines/ETL/CI and ensure that it meets the most stringent expectations.
- Prepare your own libraries with complex rules using [presets](#presets-and-reusable-schemas). This will help you work with hundreds of different files at the same time.
- [Create a schema on the fly](#complete-cli-help-message) based on an existing CSV file, and analyze CSV data to find out what is stored in your file and get a summary report.
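The distinction between per-value checks and whole-column checks mentioned above can be sketched in a few lines of Python. This is purely illustrative: the real tool is a PHP CLI, and `check_column`, `cell`, and `agg` are invented names for the sketch:

```python
def check_column(values, cell_rules, aggregate_rules):
    """Run cell rules on every value; run aggregate rules once per column."""
    errors = []
    for line_num, value in enumerate(values, start=1):
        for rule_name, check in cell_rules.items():
            if not check(value):
                errors.append((line_num, rule_name))
    for rule_name, check in aggregate_rules.items():
        if not check(values):
            errors.append(("column", rule_name))
    return errors

# Two toy rules: one per-cell, one over the whole column.
cell = {"is_int": lambda v: v.lstrip("-").isdigit()}
agg = {"is_unique": lambda vs: len(set(vs)) == len(vs)}

print(check_column(["1", "7", "x"], cell, agg))  # [(3, 'is_int')]
```

The report thus pinpoints the exact row for cell rules, while aggregate rules can only flag the column as a whole.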

 CLICK to see a typical cross-team workflow

1. **Data Preparation:** Team A generates CSV data adhering to a predefined format and places the file in a shared location accessible to Team B (e.g., a shared repository or cloud storage).
2. **Notification:** Team A notifies Team B that the data is ready via corporate communication channels (email, chat, task management system).
3. **Validation:** Team B uses predefined validation rules stored in the repository to check the CSV file for accuracy and integrity before importing. This includes verifying date formats, numerical values, and the presence of required columns.
4. **Data Import:** After successful validation, Team B imports the data from the CSV file into their system for further processing.
5. **Error Handling:** If validation identifies errors, the process halts, and Team B provides feedback to Team A for data correction.

**Why Validation is Necessary:**

- **Reduce Errors:** Validating data before import minimizes the likelihood of errors, enhancing data quality.
- **Efficiency:** Prevents time loss on manual error correction post-import.
- **Data Consistency:** Ensures data meets the expectations and requirements of Team B, facilitating accurate processing and analysis.
- **Automation:** Storing validation rules in the repository makes it easy to automate checks and to update validation criteria.

### Live demo


As a live demonstration of how the tool works, you can explore the super minimal repository at [demo](https://github.com/jbzoo/csv-blueprint-demo). For more complex examples and various reporting methods, take a look at the [demo pipeline](https://github.com/JBZoo/CSV-Blueprint/actions/runs/8667852752/job/23771733937) with different report types.

See also:

- [PR as a live demo](https://github.com/jbzoo/csv-blueprint-demo/pull/1/files) - Note the automatic comments in the PR's diff.
- [.github/workflows/demo.yml](.github/workflows/demo.yml)
- [demo\_invalid.yml](tests/schemas/demo_invalid.yml)
- [demo\_valid.yml](tests/schemas/demo_valid.yml)
- [demo.csv](tests/fixtures/demo.csv)

Table of contents
-----------------


- [Features](#features)
- [Table of contents](#table-of-contents)
- [Usage](#usage)
- [Schema definition](#schema-definition)
- [Presets and reusable schemas](#presets-and-reusable-schemas)
- [Parallel processing](#parallel-processing)
- [Complete CLI help message](#complete-cli-help-message)
- [Report examples](#report-examples)
- [Benchmarks](#benchmarks)
- [Disadvantages?](#disadvantages)
- [Coming soon](#coming-soon)
- [Contributing](#contributing)
- [License](#license)
- [See also](#see-also)

Usage
-----


### Docker container


Ensure you have Docker installed on your machine.

```
# Pull the Docker image
docker pull jbzoo/csv-blueprint:latest

# Run the tool inside Docker.
# "validate-csv" — see available commands and options below.
# "--csv" — your CSV file(s). "--schema" — your schema file(s).
docker run --rm                                  \
    --workdir=/parent-host                       \
    -v "$(pwd)":/parent-host                     \
    jbzoo/csv-blueprint:latest                   \
    validate-csv                                 \
    --csv=./tests/fixtures/demo.csv              \
    --schema=./tests/schemas/demo_invalid.yml    \
    --ansi

# OR build it from source.
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint
make docker-build  # local tag is "jbzoo/csv-blueprint:local"
```

### GitHub Action


You can find launch examples in the [workflow demo](https://github.com/JBZoo/Csv-Blueprint-Demo/blob/master/.github/workflows/demo.yml#L18-L22).

```
- uses: jbzoo/csv-blueprint@master # See the specific version on releases page. `@master` is latest.
  with:
    # Specify the path(s) to the CSV files you want to validate.
    #   This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
    #   Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
    csv: './tests/**/*.csv'

    # Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
    #   Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
    #   Examples: p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
    schema: './tests/**/*.yml'

    # Report format. Available options: text, table, github, gitlab, teamcity, junit.
    # Default value: 'table'
    report: 'table'

    # Apply all schemas (also without `filename_pattern`) to all CSV files found as global rules.
    #   Available options:
    #   auto: If no glob pattern (*) is used for --schema, the schema is applied to all found CSV files.
    #   yes: Apply all schemas to all CSV files. Schemas without `filename_pattern` are applied as global rules.
    #   no: Apply only schemas with a non-empty `filename_pattern` that matches the CSV files.
    # Default value: 'auto'
    apply-all: 'auto'

    # Quick mode. It will not validate all rows. It will stop after the first error.
    # Default value: 'no'
    quick: 'no'

    # Skip schema validation. If you are sure that the schema is correct, you can skip this check.
    # Default value: 'no'
    skip-schema: 'no'

    # Extra options for the CSV Blueprint. Only for debugging and profiling.
    # Available options:
    #   Add flag `--parallel` if you want to validate CSV files in parallel.
    #   Add flag `--dump-schema` if you want to see the final schema after all includes and inheritance.
    #   Add flag `--debug` if you want to see much more detailed output.
    #   Add flag `--profile` if you want to see profiling info. Add details with `-vvv`.
    #   Verbosity level: Available options: `-v`, `-vv`, `-vvv`
    #   ANSI output. You can disable ANSI colors if you want with `--no-ansi`.
    # Default value: 'options: --ansi'
    # You can skip it.
    extra: 'options: --ansi'
```
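How `filename_pattern` and `apply-all` interact can be sketched roughly in Python. This is a simplified illustration, not the action's actual matching code; `schemas_for` and the schema dictionaries are invented for the example, and PHP-style `/.../i` delimiters are ignored in favor of plain regexes:

```python
import re

def schemas_for(csv_name, schemas, apply_all="auto"):
    """Return the names of schemas that apply to one CSV file."""
    matched = []
    for schema in schemas:
        pattern = schema.get("filename_pattern")
        if pattern:
            if re.search(pattern, csv_name):
                matched.append(schema["name"])
        elif apply_all == "yes":
            # Patternless schemas act as global rules in "yes" mode.
            matched.append(schema["name"])
    return matched

schemas = [
    {"name": "users", "filename_pattern": r"users-\d+\.csv$"},
    {"name": "global"},  # no filename_pattern
]
print(schemas_for("users-01.csv", schemas, apply_all="yes"))  # ['users', 'global']
```

With the default mode, a schema without a `filename_pattern` is simply skipped for files it cannot claim.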

### Phar binary


 CLICK to see how to use the PHAR file

Ensure you have PHP installed on your machine.

```
# Just download the latest version
wget https://github.com/jbzoo/csv-blueprint/releases/latest/download/csv-blueprint.phar
chmod +x ./csv-blueprint.phar
./csv-blueprint.phar validate-csv               \
   --csv=./tests/fixtures/demo.csv              \
   --schema=./tests/schemas/demo_invalid.yml

# OR create project via Composer (--no-dev is optional)
composer create-project --no-dev jbzoo/csv-blueprint
cd ./csv-blueprint
./csv-blueprint validate-csv                    \
    --csv=./tests/fixtures/demo.csv             \
    --schema=./tests/schemas/demo_invalid.yml

# OR build from source
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint
make build
./csv-blueprint validate-csv                    \
    --csv=./tests/fixtures/demo.csv             \
    --schema=./tests/schemas/demo_invalid.yml
```

Schema definition
-----------------


Define your CSV validation schema in YAML for clear and structured configuration. Alternative formats are also supported: [JSON](schema-examples/full.json) and [PHP](schema-examples/full.php), accommodating various preferences and workflow requirements.

The provided example illustrates a schema for a CSV file with a header row. It mandates that the `id` column must not be empty and should contain only integer values. Additionally, the `name` column is required to have a minimum length of 3 characters, ensuring basic data integrity and usefulness.

### Example schema in YAML


```
name: Simple CSV Schema
filename_pattern: /my-favorite-csv-\d+\.csv$/i
csv:
  delimiter: ';'

columns:
  - name: id
    rules:
      not_empty: true
      is_int: true
    aggregate_rules:
      is_unique: true
      sorted: [ asc, numeric ]

  - name: name
    rules:
      length_min: 3
    aggregate_rules:
      count: 10
```
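To make the semantics concrete, here is a toy Python re-implementation of the cell rules for the two columns above (the real validator is a PHP CLI; `validate_row` is an invented helper). Following the tool's convention, rules other than `not_empty` ignore empty strings:

```python
def validate_row(row):
    """Check one row against the 'id' and 'name' cell rules of the example."""
    errors = []
    id_value, name = row["id"], row["name"]
    if id_value == "":
        errors.append("id: not_empty")
    elif not id_value.lstrip("-").isdigit():  # is_int; empty strings are ignored
        errors.append("id: is_int")
    if name != "" and len(name) < 3:          # length_min: 3; empty strings are ignored
        errors.append("name: length_min")
    return errors

print(validate_row({"id": "42", "name": "Bob"}))  # []
print(validate_row({"id": "", "name": "Al"}))     # ['id: not_empty', 'name: length_min']
```

The aggregate rules (`is_unique`, `sorted`, `count`) would run once over the full column rather than per row.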

### Full schema description


In the [example YAML file](schema-examples/full.yml), a detailed description of all features is provided. This documentation is verified through automated tests, ensuring it remains current.

**Notes:**

- The traditional typing of columns (e.g., `type: integer`) has been intentionally omitted in favor of rules. These rules can be sequenced and combined freely, offering extensive flexibility for CSV file validation.
- All options are optional unless stated otherwise. You have the liberty to include or omit them as you see fit.
- Specifying an incorrect rule name, using non-existent values (not listed below), or assigning an incompatible variable type for any option will result in a schema validation error. To bypass these errors, you may opt to use the `--skip-schema` flag at your discretion, allowing the use of your custom keys in the schema.
- All rules except `not_empty` ignore empty strings (length 0). To enforce a non-empty value, apply `not_empty: true`. Note that a single space counts as a character, making the string length `1`. To prevent such scenarios, add `is_trimmed: true`.
- Rules operate independently; they have no knowledge of or influence over one another.
- When a rule's value is `is_some_rule: true`, it merely serves as an activation toggle. Other values represent rule parameters.
- The sequence of rule execution follows their order in the schema, affecting only the order of error messages in the report.
- Unless explicitly stated, most rules are case-sensitive.
- As a fallback, the `regex` rule is available. However, using clear rule combinations is recommended for greater clarity on validation errors.
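The empty-string and whitespace notes above can be demonstrated with a few one-line Python checks (illustrative re-implementations, not the tool's code):

```python
# A lone space passes not_empty (length 1) but fails an is_trimmed-style check;
# rules other than not_empty simply skip empty strings.
def not_empty(value): return len(value) != 0
def is_trimmed(value): return value == value.strip()
def length_min(value, n): return value == "" or len(value) >= n  # empty ignored

print(not_empty(" "))        # True — one space is a character
print(is_trimmed(" "))       # False
print(length_min("", 3))     # True — empty strings are ignored by length rules
print(length_min("ab", 3))   # False
```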

Below is a comprehensive list of rules, each accompanied by a brief explanation and example for clarity. This section is also validated through automated tests, ensuring the information is consistently accurate.

 CLICK to see details about each rule

```
# It's a complete example of the CSV schema file in YAML format.
# See copy of the file without comments here ./schema-examples/full_clean.yml

# Just meta
name: CSV Blueprint Schema Example      # Name of a CSV file. Not used in the validation process.
description: |-                         # Any description of the CSV file. Not used in the validation process.
  This YAML file provides a detailed description and validation rules for CSV files
  to be processed by CSV Blueprint tool. It includes specifications for file name patterns,
  CSV formatting options, and extensive validation criteria for individual columns and their values,
  supporting a wide range of data validation rules from basic type checks to complex regex validations.
  This example serves as a comprehensive guide for creating robust CSV file validations.

# Include another schema and define an alias for it.
presets:
  my-preset: ./preset_users.yml         # Define preset alias "my-preset". See README.md for details.

# Regular expression to match the file name. If not set, then no pattern check.
# This allows you to pre-validate the file name before processing its contents.
# Feel free to check parent directories as well.
# See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
filename_pattern: /\.csv$/i
#  preset: my-preset                    # See README.md for details.

# Here are default values to parse CSV file.
# You can skip this section if you don't need to override the default values.
csv:
  preset: my-preset                     # See README.md for details.
  header: true                          # If the first row is a header. If true, name of each column is required.
  delimiter: ,                          # Delimiter character in CSV file.
  quote_char: \                         # Quote character in CSV file.
  enclosure: '"'                        # Enclosure for each field in CSV file.
  encoding: utf-8                       # (Experimental) Only utf-8, utf-16, utf-32.
  bom: false                            # (Experimental) If the file has a BOM (Byte Order Mark) at the beginning.

# Structural rules for the CSV file. These rules are applied to the entire CSV file.
# They are not(!) related to the data in the columns.
# You can skip this section if you don't need to override the default values.
structural_rules: # Here are default values.
  preset: my-preset                     # See README.md for details.
  strict_column_order: true             # Ensure columns in CSV follow the same order as defined in this YML schema. It works only if "csv.header" is true.
  allow_extra_columns: false            # Allow CSV files to have more columns than specified in this YML schema.

# Add any extra data you want. It will be ignored by the tool but available for your own code.
# You can use any format and store anything. Examples:
# extra: 'some text'
# extra: [some, options, here]
# extra: 42
extra:
  - key: "value"

# Description of each column in CSV.
# It is recommended to present each column in the same order as presented in the CSV file.
# This will not affect the validator, but will make it easier for you to navigate.
# For convenience, use the first line as a header (if possible).
columns:
  - preset: my-preset/login             # Add preset rules for the column. See README.md for details.
    name: Column Name (header)          # Any custom name of the column in the CSV file (first row). Required if "csv.header" is true.
    description: Lorem ipsum            # Description of the column. Not used in the validation process.
    example: Some example               # Example of the column value. Schema will also check this value on its own.

    # If the column is required. If true, the column must be present in the CSV file. If false, the column can be missing in the CSV file.
    # So, if you want to make the column optional, set this value to false, and it will validate the column only if it is present.
    # By default, the column is required. It works only if "csv.header" is true and "structural_rules.allow_extra_columns" is false.
    required: true

    # Add any extra data you want. It will be ignored by the tool but available for your own code.
    # You can use any format and store anything. Examples:
    # extra: 'some text'
    # extra: [some, options, here]
    # extra: 42
    extra:
      - key: "value"

    ####################################################################################################################
    # Data validation for each(!) value in the column. Please, see notes in README.md
    # Every rule is optional.
    rules:
      preset: my-preset/login           # Add preset rules for the column. See README.md for details.

      # General rules
      not_empty: true                   # Value is not an empty string. Actually checks if the string length is not 0.
      exact_value: Some string          # Exact value for string in the column.
      allow_values: [ y, n, "" ]        # Strict set of values that are allowed.
      not_allow_values: [ invalid ]     # Strict set of values that are NOT allowed.

      # Any valid regex pattern. See: https://www.php.net/manual/en/reference.pcre.pattern.syntax.php
      # Of course it's a super powerful tool to verify any sort of string data.
      # Please, be careful. Regex is a powerful tool, but it can be very dangerous if used incorrectly.
      # Remember that if you want to solve a problem with regex, you now have two problems.
      # But have it your way, then happy debugging! https://regex101.com
      regex: /^[\d]{2}$/

      # Checks length of a string including spaces (multibyte safe).
      length_min: 1                     # x >= 1
      length_greater: 2                 # x >  2
      length_not: 0                     # x != 0
      length: 7                         # x == 7
      length_less: 8                    # x <  8
      length_max: 9                     # x <= 9
      word_count_min: 1                 # x >= 1
      word_count_greater: 2             # x >  2
      word_count_not: 0                 # x != 0
      word_count: 7                     # x == 7
      word_count_less: 8                # x <  8
      word_count_max: 9                 # x <= 9
      num_min: 1.0                      # x >= 1.0
      num_greater: 2.0                  # x >  2.0
      num_not: 5.0                      # x != 5.0
      num: 7.0                          # x == 7.0
      num_less: 8.0                     # x <  8.0
      num_max: 9.0                      # x <= 9.0
      precision_min: 1                  # x >= 1
      precision_greater: 2              # x >  2
      precision_not: 0                  # x != 0
      precision: 7                      # x == 7
      precision_less: 8                 # x <  8
      precision_max: 9                  # x <= 9
      date_age_min: 1                   # x >= 1
      date_age_greater: 14              # x >  14
      date_age_not: 18                  # x != 18
      date_age: 21                      # x == 21
      date_age_less: 99                 # x <  99
      date_age_max: 100                 # x <= 100
      password_strength_min: 1          # x >= 1
      password_strength_greater: 2      # x >  2
      password_strength_not: 0          # x != 0
      password_strength: 7              # x == 7
      password_strength_less: 8         # x <  8
      password_strength_max: 9          # x <= 9
      count_zero_min: 1                 # x >= 1
      count_zero_greater: 2             # x >  2
      count_zero_not: 0                 # x != 0
      count_zero: 7                     # x == 7
      count_zero_less: 8                # x <  8
      count_zero_max: 9                 # x <= 9
      count_even_min: 1                 # x >= 1
      count_even_greater: 2             # x >  2
      count_even_not: 0                 # x != 0
      count_even: 7                     # x == 7
      count_even_less: 8                # x <  8
      count_even_max: 9                 # x <= 9
      count_odd_min: 1                  # x >= 1
      count_odd_greater: 2              # x >  2
      count_odd_not: 0                  # x != 0
      count_odd: 7                      # x == 7
      count_odd_less: 8                 # x <  8
      count_odd_max: 9                  # x <= 9
      count_prime_min: 1                # x >= 1
      count_prime_greater: 2            # x >  2
      count_prime_not: 0                # x != 0
      count_prime: 7                    # x == 7
      count_prime_less: 8               # x <  8
      count_prime_max: 9                # x <= 9
      median_min: 1.0                   # x >= 1.0
      median_greater: 2.0               # x >  2.0
      median_not: 5.0                   # x != 5.0
      median: 7.0                       # x == 7.0
      median_less: 8.0                  # x <  8.0
      median_max: 9.0                   # x <= 9.0
      harmonic_mean_min: 1.0            # x >= 1.0
      harmonic_mean_greater: 2.0        # x >  2.0
      harmonic_mean_not: 5.0            # x != 5.0
      harmonic_mean: 7.0                # x == 7.0
      harmonic_mean_less: 8.0           # x <  8.0
      harmonic_mean_max: 9.0            # x <= 9.0
      geometric_mean_min: 1.0           # x >= 1.0
      geometric_mean_greater: 2.0       # x >  2.0
      geometric_mean_not: 5.0           # x != 5.0
      geometric_mean: 7.0               # x == 7.0
      geometric_mean_less: 8.0          # x <  8.0
      geometric_mean_max: 9.0           # x <= 9.0
      contraharmonic_mean_min: 1.0      # x >= 1.0
      contraharmonic_mean_greater: 2.0  # x >  2.0
      contraharmonic_mean_not: 5.0      # x != 5.0
      contraharmonic_mean: 7.0          # x == 7.0
      contraharmonic_mean_less: 8.0     # x <  8.0
      contraharmonic_mean_max: 9.0      # x <= 9.0
      root_mean_square_min: 1.0         # x >= 1.0
      root_mean_square_greater: 2.0     # x >  2.0
      root_mean_square_not: 5.0         # x != 5.0
      root_mean_square: 7.0             # x == 7.0
      root_mean_square_less: 8.0        # x <  8.0
      root_mean_square_max: 9.0         # x <= 9.0
      interquartile_mean_min: 1.0       # x >= 1.0
      interquartile_mean_greater: 2.0   # x >  2.0
      interquartile_mean_not: 5.0       # x != 5.0
      interquartile_mean: 7.0           # x == 7.0
      interquartile_mean_less: 8.0      # x <  8.0
      interquartile_mean_max: 9.0       # x <= 9.0

  # […]
  - preset: users/phone           # Column name is overridden -> "phone".
    name: phone
  - preset: users/password        # Overridden value to force a strong password.
    rules: { length_min: 10 }
  - name: admin_note              # New column, specific to this schema only.
    description: Admin note
    rules:
      not_empty: true
      length_min: 1
      length_max: 10
    aggregate_rules:              # In practice this will be a rare case, but the opportunity is there.
      preset: db/id               # Take only aggregate rules from the preset.
      is_unique: true             # Added new specific aggregate rule.
```

 CLICK to see what it looks like in memory

```
# Schema file is "./schema-examples/preset_usage.yml"
name: 'Schema uses presets and add new columns + specific rules.'
description: 'This schema uses presets. Also, it demonstrates how to override preset values.'
presets:
  users: ./schema-examples/preset_users.yml
  db: ./schema-examples/preset_database.yml
csv:
  delimiter: ;
  enclosure: '|'
columns:
  - name: id
    description: 'Unique identifier, usually used to denote a primary key in databases.'
    example: 12345
    extra:
      custom_key: 'custom value'
    rules:
      not_empty: true
      is_trimmed: true
      is_int: true
      num_min: 1
    aggregate_rules:
      is_unique: true
      sorted:
        - asc
        - numeric

  - name: status
    description: 'Status in database'
    example: active
    rules:
      not_empty: true
      allow_values:
        - active
        - inactive
        - pending
        - deleted

  - name: login
    description: "User's login name"
    example: johndoe
    rules:
      not_empty: true
      length_min: 3
      length_max: 20
      is_trimmed: true
      is_lowercase: true
      is_slug: true
      is_alnum: true
    aggregate_rules:
      is_unique: true

  - name: email
    description: "User's email address"
    example: user@example.com
    rules:
      not_empty: true
      is_trimmed: true
      is_lowercase: true
      is_email: true
    aggregate_rules:
      is_unique: true

  - name: full_name
    description: "User's full name"
    example: 'John Doe Smith'
    rules:
      not_empty: true
      is_trimmed: true
      is_capitalize: true
      word_count_min: 2
      word_count_max: 8
      contains: ' '
      charset: UTF-8
    aggregate_rules:
      is_unique: true

  - name: birthday
    description: "Validates the user's birthday."
    example: '1990-01-01'
    rules:
      not_empty: true
      is_trimmed: true
      is_date: true
      date_max: now
      date_format: Y-m-d
      date_age_greater: 0
      date_age_less: 150

  - name: phone
    description: "User's phone number in US"
    example: '+1 650 253 00 00'
    rules:
      not_empty: true
      is_trimmed: true
      starts_with: '+1'
      phone: US

  - name: password
    description: "User's password"
    example: 9RfzENKD
    rules:
      not_empty: true
      length_min: 10
      length_max: 20
      is_trimmed: true
      contains_none:
        - password
        - '123456'
        - qwerty
        - ' '
      password_strength_min: 7
      is_password_safe_chars: true
      charset: UTF-8

  - name: admin_note
    description: 'Admin note'
    rules:
      not_empty: true
      length_min: 1
      length_max: 10
    aggregate_rules:
      is_unique: true
      sorted:
        - asc
        - numeric
```

As a result, readability and maintainability become dramatically easier. You can easily add new rules, change existing ones, and so on.

### Complete example with all available syntax


 CLICK to see available syntax

```
name: Complete list of preset features
description: This schema contains all the features of the presets.

presets:
  # The basepath for the preset is `.` (current directory of the current schema file).
  # Define alias "db" for schema in `./preset_database.yml`.
  db: preset_database.yml           # Or `db: ./preset_database.yml`. It's up to you.

  # For example, you can use a relative path.
  users: ./../schema-examples/preset_users.yml

  # Or you can use an absolute path.
  # db: /full/path/preset_database.yml

filename_pattern: { preset: users } # Take the filename pattern from the preset.
structural_rules: { preset: users } # Take the global rules from the preset.
csv: { preset: users }              # Take the CSV settings from the preset.

columns:
  # Use name of column from the preset.
  # "db" is alias. "id" is column `name` in `preset_database.yml`.
  - preset: 'db/id'

  # Use column index. "db" is alias. "0" is column index in `preset_database.yml`.
  - preset: 'db/0'
  - preset: 'db/0:'

  # Use column index and column name. It is useful when the column name is not unique.
  - preset: 'db/0:id'

  # Use only `rules` of "status" column from the preset.
  - name: My column
    rules:
      preset: 'db/status'

  # Override only `aggregate_rules` from the preset.
  # Use only `aggregate_rules` of "id" column from the preset.
  # We strictly take only the very first column (index = 0).
  - name: My column
    aggregate_rules:
      preset: 'db/0:id'

  # Combo!!! If you're a risk-taker or have a high level of inner zen. :)
  # Creating a column from three other columns.
  # In fact, it will merge all three at once with key replacement.
  - name: Crazy combo!
    description: >                  # Just some great advice.
      I like to take risks, too.
      Be careful. Use your power wisely.
    example: ~                      # Ignore inherited "example" value. Set it `null`.
    preset: 'users/login'
    rules:
      preset: 'users/email'
      not_empty: true               # Override the rule from the preset.
    aggregate_rules:
      preset: 'db/0'
```

**Note:** All provided YAML examples pass built-in validation, yet they may not make practical sense. These are intended solely for demonstration and to illustrate potential configurations and features.
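Conceptually, preset inheritance behaves like the "merge with key replacement" described in the comments above. A minimal Python sketch under that assumption: the preset column is the base, local keys win, and nested rule maps are merged key by key (`apply_preset` is an invented name, not part of the tool):

```python
def apply_preset(preset_column, overrides):
    """Merge a local column definition on top of a preset column."""
    merged = dict(preset_column)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = {**merged[key], **value}  # merge nested rules
        else:
            merged[key] = value                     # local scalar keys win
    return merged

preset = {"name": "password", "rules": {"not_empty": True, "length_min": 8}}
local = {"rules": {"length_min": 10}}  # force a stronger password
print(apply_preset(preset, local))
```

With `--dump-schema` you can inspect the real merged result the tool produces after all includes and inheritance.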

Parallel processing
-------------------

[](#parallel-processing)

The `--parallel` option speeds up the validation of CSV files by utilizing CPU resources more effectively.

### Key Points

[](#key-points)

- **Experimental Feature:** This feature is currently experimental and requires further debugging and testing. Although it performs well in synthetic autotests and benchmarks, more practical use cases are needed to validate its stability.
- **Use Case:** This option is beneficial if you are processing dozens of CSV files, with each file taking 1 second or more to process.
- **Default Behavior:** If you use `--parallel` without specifying a value, it defaults to using the maximum number of available CPU cores.
- **Thread Pool Size:** You can set a specific number of threads for the pool. For example, `--parallel=10` sets the thread pool size to 10. It doesn't make much sense to specify more than the number of logical cores in your CPU; doing so only slows things down slightly due to the system overhead of handling multithreading.
- **Disabling Parallelism:** Using `--parallel=1` disables parallel processing, which is the default setting if the option is not specified.
- **Implementation:** The feature relies on the `ext-parallel` PHP extension, which enables the creation of lightweight threads rather than processes. The extension is already included in our Docker image; if you are not using the Docker image, make sure `ext-parallel` is installed, since it is essential for parallel processing. Without the extension, the application always runs in single-threaded mode.
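
As a rough sketch of the "default = all CPU cores" behavior, this is how the core count can be derived portably. The `nproc`/`sysctl` calls are illustrative assumptions (the tool detects cores internally), and the paths in the commented invocation are placeholders:

```shell
# Determine the number of logical cores, mirroring what `--parallel`
# defaults to when no value is given (nproc on Linux, sysctl on macOS).
CORES=$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
echo "Thread pool size: $CORES"

# Hypothetical invocation (adjust the paths to your project):
# ./csv-blueprint validate-csv --csv='./data/*.csv' --schema='./schema.yml' --parallel="$CORES"
```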

Complete CLI help message
-------------------------

[](#complete-cli-help-message)

This section outlines all available options and commands provided by the tool, leveraging the JBZoo/Cli package for its CLI. The tool offers a comprehensive set of options to cater to various needs and scenarios, ensuring flexibility and efficiency in CSV file validation and manipulation.

For detailed information on each command and option, refer to the [JBZoo/Cli documentation](https://github.com/JBZoo/Cli). This resource provides insights into the functionality and application of the CLI commands, helping users make the most out of the tool's capabilities.

`./csv-blueprint validate-csv --help`

**Click to see the `validate-csv` help message:**

```
Description:
  Validate CSV file(s) by schema(s).

Usage:
  validate-csv [options]
  validate:csv

Options:
  -c, --csv=CSV                    Specify the path(s) to the CSV files you want to validate.
                                   This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
                                   Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
                                    (multiple values allowed)
  -s, --schema=SCHEMA              Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
                                   Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
                                   Examples: p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
                                    (multiple values allowed)
  -S, --skip-schema[=SKIP-SCHEMA]  Skips schema validation for quicker checks when the schema's correctness is certain.
                                   Use any non-empty value or "yes" to activate
                                    [default: "no"]
  -a, --apply-all[=APPLY-ALL]      Apply all schemas (also without `filename_pattern`) to all CSV files found as global rules.
                                   Available options:
                                   - auto: If no glob pattern (*) is used for --schema, the schema is applied to all found CSV files.
                                   - yes|y|1: Apply all schemas to all CSV files, Schemas without `filename_pattern` are applied as a global rule.
                                   - no|n|0: Apply only schemas with not empty `filename_pattern` and match the CSV files.
                                   Note. If specify the option `--apply-all` without value, it will be treated as "yes".
                                    [default: "auto"]
  -Q, --quick[=QUICK]              Stops the validation process upon encountering the first error,
                                   accelerating the check but limiting error visibility.
                                   Returns a non-zero exit code if any error is detected.
                                   Enable by setting to any non-empty value or "yes".
                                    [default: "no"]
  -r, --report=REPORT              Determines the report's output format.
                                   Available options: text, table, github, gitlab, teamcity, junit
                                    [default: "table"]
      --dump-schema                Dumps the schema of the CSV file if you want to see the final schema after inheritance.
      --debug                      Intended solely for debugging and advanced profiling purposes.
                                   Activating this option provides detailed process insights,
                                   useful for troubleshooting and performance analysis.
      --parallel[=PARALLEL]        EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
                                   You can specify the number of threads.
                                   If you do not specify a value, the number of threads will be equal to the number of CPU cores.
                                   By default, the process is launched in a single-threaded mode. [default: "1"]
      --no-progress                Disable progress bar animation for logs. It will be used only for text output format.
      --mute-errors                Mute any sort of errors. So exit code will be always "0" (if it's possible).
                                   It has major priority then --non-zero-on-error. It's on your own risk!
      --stdout-only                For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
      --non-zero-on-error          None-zero exit code on any StdErr message.
      --timestamp                  Show timestamp at the beginning of each message.It will be used only for text output format.
      --profile                    Display timing and memory usage information.
      --output-mode=OUTPUT-MODE    Output format. Available options:
                                   text - Default text output format, userfriendly and easy to read.
                                   cron - Shortcut for crontab. It's basically focused on human-readable logs output.
                                   It's combination of --timestamp --profile --stdout-only --no-progress -vv.
                                   logstash - Logstash output format, for integration with ELK stack.
                                    [default: "text"]
      --cron                       Alias for --output-mode=cron. Deprecated!
  -h, --help                       Display help for the given command. When no command is given display help for the list command
      --silent                     Do not output any message
  -q, --quiet                      Only errors are displayed. All other output is suppressed
  -V, --version                    Display this application version
      --ansi|--no-ansi             Force (or disable --no-ansi) ANSI output
  -n, --no-interaction             Do not ask any interactive question
  -v|vv|vvv, --verbose             Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
```

`./csv-blueprint validate-schema --help`

**Click to see the `validate-schema` help message:**

```
Description:
  Validate syntax in schema file(s).

Usage:
  validate-schema [options]

Options:
  -s, --schema=SCHEMA            Specify the path(s) to the schema file(s), supporting YAML, JSON, or PHP formats.
                                 Similar to CSV paths, you can direct to specific files or search directories with glob patterns.
                                 Examples: /full/path/name.yml; p/file.yml; p/*.yml; p/**/*.yml; p/**/name-*.yml; **/*.yml
                                  (multiple values allowed)
  -Q, --quick[=QUICK]            Stops the validation process upon encountering the first error,
                                 accelerating the check but limiting error visibility.
                                 Returns a non-zero exit code if any error is detected.
                                 Enable by setting to any non-empty value or "yes".
                                  [default: "no"]
  -r, --report=REPORT            Determines the report's output format.
                                 Available options: text, table, github, gitlab, teamcity, junit
                                  [default: "table"]
      --dump-schema              Dumps the schema of the CSV file if you want to see the final schema after inheritance.
      --debug                    Intended solely for debugging and advanced profiling purposes.
                                 Activating this option provides detailed process insights,
                                 useful for troubleshooting and performance analysis.
      --parallel[=PARALLEL]      EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
                                 You can specify the number of threads.
                                 If you do not specify a value, the number of threads will be equal to the number of CPU cores.
                                 By default, the process is launched in a single-threaded mode. [default: "1"]
      --no-progress              Disable progress bar animation for logs. It will be used only for text output format.
      --mute-errors              Mute any sort of errors. So exit code will be always "0" (if it's possible).
                                 It has major priority then --non-zero-on-error. It's on your own risk!
      --stdout-only              For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
      --non-zero-on-error        None-zero exit code on any StdErr message.
      --timestamp                Show timestamp at the beginning of each message.It will be used only for text output format.
      --profile                  Display timing and memory usage information.
      --output-mode=OUTPUT-MODE  Output format. Available options:
                                 text - Default text output format, userfriendly and easy to read.
                                 cron - Shortcut for crontab. It's basically focused on human-readable logs output.
                                 It's combination of --timestamp --profile --stdout-only --no-progress -vv.
                                 logstash - Logstash output format, for integration with ELK stack.
                                  [default: "text"]
      --cron                     Alias for --output-mode=cron. Deprecated!
  -h, --help                     Display help for the given command. When no command is given display help for the list command
      --silent                   Do not output any message
  -q, --quiet                    Only errors are displayed. All other output is suppressed
  -V, --version                  Display this application version
      --ansi|--no-ansi           Force (or disable --no-ansi) ANSI output
  -n, --no-interaction           Do not ask any interactive question
  -v|vv|vvv, --verbose           Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
```

`./csv-blueprint dump-schema --help`

**Click to see the `debug-schema` help message:**

```
Description:
  Show the internal representation of the schema taking into account presets.

Usage:
  debug-schema [options]

Options:
  -s, --schema=SCHEMA            Specify the path to a schema file, supporting YAML, JSON, or PHP formats.
                                 Examples: /full/path/name.yml; p/file.yml
  -d, --hide-defaults            Hide default values in the output.
      --no-progress              Disable progress bar animation for logs. It will be used only for text output format.
      --mute-errors              Mute any sort of errors. So exit code will be always "0" (if it's possible).
                                 It has major priority then --non-zero-on-error. It's on your own risk!
      --stdout-only              For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
      --non-zero-on-error        None-zero exit code on any StdErr message.
      --timestamp                Show timestamp at the beginning of each message.It will be used only for text output format.
      --profile                  Display timing and memory usage information.
      --output-mode=OUTPUT-MODE  Output format. Available options:
                                 text - Default text output format, userfriendly and easy to read.
                                 cron - Shortcut for crontab. It's basically focused on human-readable logs output.
                                 It's combination of --timestamp --profile --stdout-only --no-progress -vv.
                                 logstash - Logstash output format, for integration with ELK stack.
                                  [default: "text"]
      --cron                     Alias for --output-mode=cron. Deprecated!
  -h, --help                     Display help for the given command. When no command is given display help for the list command
      --silent                   Do not output any message
  -q, --quiet                    Only errors are displayed. All other output is suppressed
  -V, --version                  Display this application version
      --ansi|--no-ansi           Force (or disable --no-ansi) ANSI output
  -n, --no-interaction           Do not ask any interactive question
  -v|vv|vvv, --verbose           Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
```

`./csv-blueprint create-schema --help`

**Note:** It's beta. Work in progress.

**Click to see the `create-schema` help message:**

```
Description:
  Analyze CSV files and suggest a schema based on the data found.

Usage:
  create-schema [options]

Options:
  -c, --csv=CSV                      Specify the path(s) to the CSV files you want to analyze.
                                     This can include a direct path to a file or a directory to search with a maximum depth of 10 levels.
                                     Examples: p/file.csv; p/*.csv; p/**/*.csv; p/**/name-*.csv; **/*.csv
                                      (multiple values allowed)
  -H, --header[=HEADER]              Force the presence of a header row in the CSV files. [default: "yes"]
  -L, --lines[=LINES]                The number of lines to read when detecting parameters. Minimum is 1. [default: 10000]
  -C, --check-syntax[=CHECK-SYNTAX]  Check the syntax of the suggested schema. [default: "yes"]
  -r, --report=REPORT                Determines the report's output format.
                                     Available options: text, table, github, gitlab, teamcity, junit
                                      [default: "table"]
      --dump-schema                  Dumps the schema of the CSV file if you want to see the final schema after inheritance.
      --debug                        Intended solely for debugging and advanced profiling purposes.
                                     Activating this option provides detailed process insights,
                                     useful for troubleshooting and performance analysis.
      --parallel[=PARALLEL]          EXPERIMENTAL! Launches the process in parallel mode (if possible). Works only with ext-parallel.
                                     You can specify the number of threads.
                                     If you do not specify a value, the number of threads will be equal to the number of CPU cores.
                                     By default, the process is launched in a single-threaded mode. [default: "1"]
      --no-progress                  Disable progress bar animation for logs. It will be used only for text output format.
      --mute-errors                  Mute any sort of errors. So exit code will be always "0" (if it's possible).
                                     It has major priority then --non-zero-on-error. It's on your own risk!
      --stdout-only                  For any errors messages application will use StdOut instead of StdErr. It's on your own risk!
      --non-zero-on-error            None-zero exit code on any StdErr message.
      --timestamp                    Show timestamp at the beginning of each message.It will be used only for text output format.
      --profile                      Display timing and memory usage information.
      --output-mode=OUTPUT-MODE      Output format. Available options:
                                     text - Default text output format, userfriendly and easy to read.
                                     cron - Shortcut for crontab. It's basically focused on human-readable logs output.
                                     It's combination of --timestamp --profile --stdout-only --no-progress -vv.
                                     logstash - Logstash output format, for integration with ELK stack.
                                      [default: "text"]
      --cron                         Alias for --output-mode=cron. Deprecated!
  -h, --help                         Display help for the given command. When no command is given display help for the list command
      --silent                       Do not output any message
  -q, --quiet                        Only errors are displayed. All other output is suppressed
  -V, --version                      Display this application version
      --ansi|--no-ansi               Force (or disable --no-ansi) ANSI output
  -n, --no-interaction               Do not ask any interactive question
  -v|vv|vvv, --verbose               Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
```

Report examples
---------------

[](#report-examples)

The validation process culminates in a human-readable report detailing any errors identified within the CSV file. While the default report format is a table, the tool supports various output formats, including text, GitHub, GitLab, TeamCity, and JUnit, to best suit your project's needs and your personal or team preferences.

### GitHub Action format

[](#github-action-format)

To see user-friendly error outputs in your pull requests (PRs), specify `report: github`. This utilizes [annotations](https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#setting-a-warning-message) to highlight bugs directly within the GitHub interface at the PR level. This feature allows errors to be displayed at the exact location within the CSV file, right in the diff of your pull requests. For a practical example, view [this live demo PR](https://github.com/jbzoo/csv-blueprint-demo/pull/1/files).
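
For reference, a minimal workflow step might look like this (the input names follow the project's GitHub Action; the paths are placeholders to adapt to your repository):

```yml
- name: Validate CSV files
  uses: jbzoo/csv-blueprint@master
  with:
    csv: ./tests/**/*.csv
    schema: ./tests/schema.yml
    report: github
```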

[![GitHub Actions - PR](.github/assets/github-actions-pr.png)](.github/assets/github-actions-pr.png)

**Click to see an example in the GitHub Actions terminal:**

[![GitHub Actions - Terminal](.github/assets/github-actions-termintal.png)](.github/assets/github-actions-termintal.png)

### Text format

[](#text-format)

An optional `text` format with highlighted keywords for quick navigation.

[![Report - Text](.github/assets/output-text.png)](.github/assets/output-text.png)

### Table format

[](#table-format)

When using the `table` format (default), the output is organized in a clear, easily interpretable table that lists all discovered errors. This format is ideal for quick reviews and sharing with team members for further action.

[![Table format](.github/assets/output-table.png)](.github/assets/output-table.png)

**Notes:**

- The report format for GitHub Actions is `table` by default.
- The tool uses [JBZoo/CI-Report-Converter](https://github.com/JBZoo/CI-Report-Converter) as an SDK to convert reports to different formats, so you can easily integrate it with any CI system.

Benchmarks
----------

[](#benchmarks)

Understanding the performance of this tool is crucial, but it's important to note that its efficiency is influenced by several key factors:

- **File Size:** The dimensions of the CSV file, both in terms of rows and columns, directly impact processing time. Performance scales linearly with file size and is dependent on the capabilities of your hardware, such as CPU and SSD speed.
- **Number of Rules:** More validation rules per column mean more iterations for processing. Each rule operates independently, so the total time and memory consumption are cumulative across all rules.
- **Rule Intensity:** While most validation rules are optimized for speed and low memory usage, some, like `interquartile_mean`, can be significantly slower. For instance, `interquartile_mean` might process around 4,000 lines per second, whereas other rules can handle upwards of 50 million lines per second.

However, to gain a general understanding of performance, refer to the table below.

- All tests were conducted on a dataset comprising `2 million lines` plus an additional line for the header.
- These results are derived from the most current version, as verified by tests run using [GitHub Actions](https://github.com/jbzoo/csv-blueprint/actions/workflows/benchmark.yml) ([See workflow.yml](.github/workflows/benchmark.yml)). The link provides access to a variety of builds, which are essential for different testing scenarios and experiments. The most representative data can be found under `Docker (latest, XX)`.
- Developer mode was activated for these tests, using the flags `-vvv --debug --profile`.
- Testing environment included the latest Ubuntu + Docker. For more information about the GitHub Actions (GA) hardware used, please [see details about GA hardware](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners/about-github-hosted-runners#standard-github-hosted-runners-for-private-repositories).
- The primary performance indicator is the processing speed, measured in lines per second. Note that speeds are presented in thousands of lines per second (`100K` equals `100,000 lines per second`).
- Peak RAM usage throughout the duration of each test case serves as an additional performance metric.
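
To make the speed figures concrete, here is the arithmetic behind one entry: a speed of `786K` (786,000 lines per second) on the 2-million-line test file implies roughly 2.5 seconds of wall time.

```shell
# Expected wall time for the 2M-line benchmark file at a given speed,
# where the speed is expressed in thousands of lines per second (e.g. 786K).
awk 'BEGIN { lines = 2000000; speed_k = 786; printf "%.1f sec\n", lines / (speed_k * 1000) }'
# -> 2.5 sec
```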

Profiles:

- **[Quickest:](tests/Benchmarks/bench_0_quickest_combo.yml)** Focuses on the fastest rules, either cell or aggregation, providing a baseline.
- **[Minimum:](tests/Benchmarks/bench_1_mini_combo.yml)** Uses a set of normal performance rules, with two instances of each.
- **[Realistic:](tests/Benchmarks/bench_2_realistic_combo.yml)** Represents a mix of rules likely encountered in typical use cases.
- **[All Aggregations:](tests/Benchmarks/bench_3_all_agg.yml)** Tests all aggregation rules simultaneously, illustrating maximum load.

Divisions:

- **Cell Rules:** Tests only individual cell validation rules.
- **Agg Rules:** Focuses solely on column-wide aggregation rules.
- **Cell + Agg:** Combines cell and aggregation rules for comprehensive validation.
- **Peak Memory:** Indicates the maximum RAM usage, particularly relevant in scenarios with aggregation rules.

**Note:** The `Peak Memory` metric is primarily of interest when aggregation rules are used, as non-aggregation scenarios typically require no more than 2-4 megabytes of memory, regardless of file size or rule count.

These benchmarks offer a snapshot of the tool's capabilities across a range of scenarios, helping you gauge its suitability for your specific CSV validation needs.

| File / Profile | Metric | Quickest | Minimum | Realistic | All aggregations |
|:---|:---|:---|:---|:---|:---|
| **Columns: 1**<br>Size: ~8 MB | Cell rules | 786K, 2.5 sec | 386K, 5.2 sec | 189K, 10.6 sec | 184K, 10.9 sec |
| | Agg rules | 1187K, 1.7 sec | 1096K, 1.8 sec | 667K, 3.0 sec | 96K, 20.8 sec |
| | Cell + Agg | 762K, 2.6 sec | 373K, 5.4 sec | 167K, 12.0 sec | 63K, 31.7 sec |
| | Peak Memory | 52 MB | 68 MB | 208 MB | 272 MB |
| **Columns: 5**<br>Size: 64 MB | Cell rules | 545K, 3.7 sec | 319K, 6.3 sec | 174K, 11.5 sec | 168K, 11.9 sec |
| | Agg rules | 714K, 2.8 sec | 675K, 3.0 sec | 486K, 4.1 sec | 96K, 20.8 sec |
| | Cell + Agg | 538K, 3.7 sec | 308K, 6.5 sec | 154K, 13.0 sec | 61K, 32.8 sec |
| | Peak Memory | 52 MB | 68 MB | 208 MB | 272 MB |
| **Columns: 10**<br>Size: 220 MB | Cell rules | 311K, 6.4 sec | 221K, 9.0 sec | 137K, 14.6 sec | 135K, 14.8 sec |
| | Agg rules | 362K, 5.5 sec | 354K, 5.6 sec | 294K, 6.8 sec | 96K, 20.8 sec |
| | Cell + Agg | 307K, 6.5 sec | 215K, 9.3 sec | 125K, 16.0 sec | 56K, 35.7 sec |
| | Peak Memory | 52 MB | 68 MB | 208 MB | 272 MB |
| **Columns: 20**<br>Size: 1.2 GB | Cell rules | 103K, 19.4 sec | 91K, 22.0 sec | 72K, 27.8 sec | 71K, 28.2 sec |
| | Agg rules | 108K, 18.5 sec | 107K, 18.7 sec | 101K, 19.8 sec | 96K, 20.8 sec |
| | Cell + Agg | 102K, 19.6 sec | 89K, 22.5 sec | 69K, 29.0 sec | 41K, 48.8 sec |
| | Peak Memory | 52 MB | 68 MB | 208 MB | 272 MB |

**Additional Benchmark Insights:**

When running the same validation tests on different hardware configurations, the performance of the tool can vary significantly. Notably, testing on a **MacBook 14" M2 Max (2023)** yields results that are approximately twice as fast as those observed on the GitHub Actions hardware. This indicates the tool's exceptional performance on modern, high-spec devices.

Conversely, tests conducted on a **MacBook Pro (2019) with an Intel 2.4 GHz processor** align closely with the GitHub Actions results, suggesting that the benchmark table provided reflects an average performance level for typical engineering hardware.

### Brief conclusions

[](#brief-conclusions)

- **Cell Rules**: These rules are highly CPU-intensive but require minimal RAM, typically around 1-2 MB at peak. The more cell rules applied to a column, the longer the validation process takes due to the additional actions performed on each value.
- **Aggregation Rules**: These rules operate at incredible speeds, processing anywhere from 10 million to billions of rows per second. However, they are significantly more RAM-intensive. Interestingly, adding over 100 different aggregation rules does not substantially increase memory consumption.
- **PHP Array Functions**: Not all PHP array functions can operate by reference (`&$var`). Whether or not a dataset in a column can be manipulated in this way is highly dependent on the specific algorithm used. For example, a 20 MB dataset might be duplicated during processing, leading to a peak memory usage of 40 MB. Consequently, optimization techniques that rely on passing data by reference are often ineffective.
- **Practical Implications**: If processing a 1 GB file within 30-60 seconds is acceptable, and if there is 200-500 MB of RAM available, there may be little need to overly concern oneself with these performance considerations.
- **Memory Management**: Throughout testing, no memory leaks were observed.

### Examples of CSV Files

[](#examples-of-csv-files)

The CSV files utilized for benchmark testing are described below. These files were initially generated using [PHP Faker](tests/Benchmarks/Commands/CreateCsv.php) to create the first 2000 lines. Subsequently, they were replicated [1000 times within themselves](tests/Benchmarks/create-csv.sh), allowing for the creation of significantly large random files in a matter of seconds.

A key principle observed in these files is that as the number of columns increases, the length of the values within these columns also tends to increase, following a pattern akin to exponential growth.
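
The "replicate the file into itself" trick can be sketched in a few lines of shell. The file names here are hypothetical (the real script lives in `tests/Benchmarks/create-csv.sh`); each loop iteration doubles the body, so the file grows exponentially:

```shell
# Build a large CSV from a small seed by repeatedly appending the body to itself.
printf 'id\n1\n2\n' > seed.csv
tail -n +2 seed.csv > body.csv                   # strip the header before duplicating
for i in 1 2 3; do
  cat body.csv body.csv > tmp.csv && mv tmp.csv body.csv
done
{ head -n 1 seed.csv; cat body.csv; } > big.csv  # header + 2 * 2^3 = 17 lines
```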

**Columns: 1, Size: 8.48 MB**

```
id
1
2
```

**Columns: 5, Size: 64 MB**

```
id,bool_int,bool_str,number,float
1,0,false,289566,864360.14285714
2,1,true,366276,444761.71428571
```

**Columns: 10, Size: 220 MB**

```
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4
1,1,true,779914,1101964.2857143,2011-02-04,"2000-03-02 00:33:57",erdman.net,germaine.brakus@yahoo.com,32.51.181.238
2,0,true,405408,695839.42857143,1971-01-29,"1988-08-12 21:25:27",bode.com,tatyana.cremin@yahoo.com,76.79.155.73
```

**Columns: 20, Size: 1.2 GB**

```
id,bool_int,bool_str,number,float,date,datetime,domain,email,ip4,uuid,address,postcode,latitude,longitude,ip6,sentence_tiny,sentence_small,sentence_medium,sentence_huge
1,1,false,884798,1078489.5714286,2006-02-09,"2015-12-07 22:59:06",gerhold.com,alisa93@barrows.com,173.231.203.134,5a2b6f01-0bac-35b2-bef1-5be7bb3c2d78,"776 Moises Coves Apt. 531; Port Rylan, DC 80810",10794,-69.908375,136.780034,78cb:75d9:4dd:8248:f190:9f3c:b0e:9afc,"Ea iusto non.","Qui sapiente qui ut nihil sit.","Modi et voluptate blanditiis aliquid iure eveniet voluptas facilis ipsum omnis velit.","Minima in molestiae nam ullam voluptatem sapiente corporis sunt in ut aut alias exercitationem incidunt fugiat doloribus laudantium ducimus iusto nemo assumenda non ratione neque labore voluptatem."
2,0,false,267823,408705.14285714,1985-07-19,"1996-11-18 08:21:44",keebler.net,wwolff@connelly.com,73.197.210.145,29e076ab-a769-3a1f-abd4-2bc73ab17c99,"909 Sabryna Island Apt. 815; West Matteoside, CO 54360-7141",80948,7.908256,123.666864,bf3b:abab:3dcb:c335:b1a:b5d6:60e9:107e,"Aut dolor distinctio quasi.","Alias sit ut perferendis quod at dolores.","Molestiae est eos dolore deserunt atque temporibus.","Quisquam velit aut saepe temporibus officia labore quam numquam eveniet velit aliquid aut autem quis voluptatem in ut iste sunt omnis iure laudantium aspernatur tenetur nemo consequatur aliquid sint nostrum aut nostrum."
```

### Run benchmark locally

[](#run-benchmark-locally)

Make sure you have PHP 8.2+ and Docker installed.

```
# Clone the latest version
git clone git@github.com:jbzoo/csv-blueprint.git csv-blueprint
cd csv-blueprint

# Download dependencies and build the tool.
make build              # Needed to build the benchmark tool. See the `./tests/Benchmarks` folder.
make build-phar-file    # Optional. Only if you want to test it.
make docker-build       # Recommended. The local tag is "jbzoo/csv-blueprint:local".

# Create random CSV files with 5 columns (max: 20).
BENCH_COLS=5 make bench-create-csv

# Run the benchmark for the recent CSV file.
BENCH_COLS=5 make bench-docker # Recommended
BENCH_COLS=5 make bench-phar
BENCH_COLS=5 make bench-php

# It's a shortcut that combines CSV file creation and Docker run.
# By default BENCH_COLS=10
make bench
```

Disadvantages?
--------------

[](#disadvantages)

The perception that PHP is inherently slow is a common misconception. However, with the right optimization strategies, PHP can perform exceptionally well. For evidence, refer to the article [Processing One Billion CSV Rows in PHP!](https://dev.to/realflowcontrol/processing-one-billion-rows-in-php-3eg0), which demonstrates that PHP can process, aggregate, and calculate data from CSV files at approximately **15 million lines per second**! While not all optimizations are currently implemented, the performance is already quite impressive.

- Yes, it's acknowledged that this tool might not be the fastest available, but it's also far from the slowest. For more details, see the link provided above.
- Yes, the tool is built with PHP—not Python, Go, or PySpark—which may not be the first choice for such tasks.
- Yes, it functions like a standalone binary. The recommendation is simply to use it without overthinking its internal workings.
- Yes, it's recognized that this cannot be used as a Python SDK within a pipeline.

However, for the majority of scenarios, these are not deal-breakers. The utility effectively addresses the challenge of validating CSV files in continuous integration (CI) environments. 👍

This utility is designed for immediate use without necessitating a deep understanding of its inner mechanics. It adheres to rigorous testing standards, including strict typing, approximately seven linters and static analyzers at the highest rule level. Furthermore, every pull request is subjected to around ten different checks on GitHub Actions, spanning a matrix of PHP versions and modes, ensuring robustness. The extensive coverage and precautions are due to the unpredictability of usage conditions, embodying the spirit of the Open Source community.

In summary, the tool is developed with the highest standards of modern PHP practices, ensuring it performs as expected.

Coming soon
-----------

These are random ideas and plans, with no promises or deadlines. Feel free to [help me!](#contributing)

- **Batch processing**

    - If the `--csv` option is not specified, read from STDIN, so you can build pipelines on Unix-like systems.
- **Validation**

    - Multiple `filename_pattern`s: support a list of regexes.
    - Multiple values in one cell.
    - Custom cell rule as a callback, useful when you have a complex rule that can't be described in the schema file.
    - Custom aggregate rule as a callback, for the same reason.
    - Configurable keyword for null/empty values. By default it's an empty string, but you may want `null`, `nil`, `none`, `empty`, etc. Overridable at the column level.
    - Handle empty files, files with only a header row, or files with a single line of data. A single column without a header should also be possible.
    - If the `--schema` option is not specified, validate only the most basic things (like "is it a valid CSV file?").
    - Complex rules (like "if field `A` is not empty, then field `B` should be not empty too").
    - Extending with custom rules and custom report formats. Plugins?
    - Input encoding detection + `BOM` handling (currently experimental). It works, but not very accurately; UTF-8 is the safest choice for now.
- **Performance and optimization**

    - Using [vectors](https://www.php.net/manual/en/class.ds-vector.php) instead of arrays to optimize memory usage and access speed.
    - Multithreading support for parallel validation of a CSV by columns.
- **Mock data generation**

    - Create CSV files based on the schema (like "create 1000 rows with random data based on schema and rules").
    - Use [Faker](https://github.com/FakerPHP/Faker) for random data generation.
    - [ReverseRegex](https://github.com/enso-media/ReverseRegex) to generate text from regex.
- **Reporting**

    - More report formats (like JSON, XML, etc). Any ideas?
    - Unify the GitLab and JUnit reports into a single structure. Not easy to implement, but a good idea.
    - Merge reports from multiple CSV files into one report, so errors from many files appear in one place. Especially useful for GitLab and JUnit reports.
- **Misc**

    - Rewrite in Go/Rust. It's a good idea to have a standalone binary with the same functionality.
    - Install via `brew` on macOS.
    - Install via `apt` on Ubuntu.
    - Use it as a PHP SDK, with examples in the README.
    - Warnings about deprecated options and features.
    - Add an option `--recommendation` to suggest rules for the schema or flag potential issues in the CSV file or schema. Useful when you're not sure which rules to use.
    - Add an option `--error=[level]` to show only errors of a specific level. Useful when you have many warnings and only want to see errors.
    - More examples and documentation.
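
For readers skimming the roadmap, a present-day schema (which the ideas above would extend) looks roughly like the minimal sketch below. The rule names are taken from the schema reference documented earlier in this README; double-check them there before relying on this:

```yml
# Minimal sketch of a schema the roadmap ideas would extend.
# Verify rule names against the full schema reference in this README.
filename_pattern: /\.csv$/i
columns:
  - name: id
    rules:
      not_empty: true
      is_int: true
    aggregate_rules:
      is_unique: true
```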

PS. [There is a file](tests/schemas/todo.yml) with my ideas and imagination. It's not a valid schema file, just a draft. I'm not sure I'll implement all of them, but I'll do my best.

Contributing
------------

If you have any ideas or suggestions, feel free to open an issue or create a pull request.

```
# Fork the repo and build project
git clone git@github.com:jbzoo/csv-blueprint.git ./jbzoo-csv-blueprint
cd ./jbzoo-csv-blueprint
make build

# Make your local changes

# Autofix code style
make test-phpcsfixer-fix test-phpcs

# Run all tests and check code style
make test
make codestyle

# Create your pull request and check all tests in CI (Github Actions)
# ???
# Profit!
```

License
-------

[MIT License](LICENSE): It's like free pizza - enjoy it, share it, just don't sell it as your own. And remember, no warranty for stomach aches! 😅

See also
--------

- [Cli](https://github.com/JBZoo/Cli) - A framework that helps create complex CLI apps and provides new tools for Symfony/Console.
- [CI-Report-Converter](https://github.com/JBZoo/CI-Report-Converter) - It converts different error reporting standards for popular CI systems.
- [Composer-Diff](https://github.com/JBZoo/Composer-Diff) - See what packages have changed after `composer update`.
- [Composer-Graph](https://github.com/JBZoo/Composer-Graph) - Dependency graph visualization of `composer.json` based on [Mermaid JS](https://mermaid.js.org/).
- [Mermaid-PHP](https://github.com/JBZoo/Mermaid-PHP) - Generate diagrams and flowcharts with the help of the [mermaid](https://mermaid.js.org/) script language.
- [Utils](https://github.com/JBZoo/Utils) - Collection of useful PHP functions, mini-classes, and snippets for every day.
- [Image](https://github.com/JBZoo/Image) - Provides an object-oriented way to manipulate images as simply as possible.
- [Data](https://github.com/JBZoo/Data) - Extended implementation of ArrayObject. Use Yml/PHP/JSON/INI files as config. Forget about arrays.
- [Retry](https://github.com/JBZoo/Retry) - Tiny PHP library providing retry/backoff functionality with strategies and jitter.

**Just an interesting fact:** I've achieved a personal milestone. The [initial release](https://github.com/jbzoo/csv-blueprint/releases/tag/0.1) of the project was built from the ground up in about 3 days, interspersed with regular breaks to care for a 4-month-old baby. Judging by the first commit and the earliest git tag, it was all done over a weekend, using spare moments on my personal laptop.


