
randak/charlotte
================

Charlotte crawls through your website and captures data in a database, including links between pages, H1 elements, scripts, stylesheets, and keywords.

Version 1.2.1 | License: GPL-3.0 | Requires PHP >=5.3.3

[Source](https://github.com/randak/charlotte) | [Packagist](https://packagist.org/packages/randak/charlotte) | [5 issues](https://github.com/randak/charlotte/issues)

Charlotte
=========

Author: Kristian Randall. Copyright 2014.

PHP-based web crawler for site analysis. Crawls your website and stores information about your pages, scripts and stylesheets in a Neo4j graph database. (Can be extended to use any database.)

Installation
------------

Install using [Composer](https://getcomposer.org/):

> composer require randak/charlotte:dev-master

Depending on your Composer settings, you may need to run `composer require everyman/neo4jphp:dev-master` before you can install Charlotte. If you get an error saying that package is not available, this is the likely fix.

In addition to Charlotte itself, you'll need a running Neo4j instance, either on the same machine or on another server.

Configuration
-------------

After installation, you will need to set up your configuration. Currently, there is an example config file in the `examples` folder. The config will look something like this:

```
    crawler:
        start: http://www.example.com
        exclude:
            - "/^javascript\:void\(0\)$/"
            - "/^#.*/"
            - "/^\\/$/"
            - "/\.(pdf|zip|zi|png|jpg|jpeg|doc|ppt)$/i"
    connections:
        Neo4j:
            host: localhost
            port: 7474
```

Set `start` to the homepage of the website you wish to crawl.

The exclude patterns are regular expressions that match URLs you don't want to crawl. In the example above, they skip `javascript:void(0)` links, any URL that starts with `#`, the bare `/` path, and certain file types such as PDFs and images.
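
As a rough illustration of how such patterns behave, the sketch below checks a few URLs against the exclude list from the example config using PHP's `preg_match`. The `is_excluded` helper and the sample URLs are hypothetical and not part of Charlotte's API; this only shows what the regular expressions match, not how Charlotte applies them internally.

```php
<?php
// Hypothetical helper (not part of Charlotte): returns true when the URL
// matches at least one of the given exclude patterns.
function is_excluded($url, array $patterns)
{
    foreach ($patterns as $pattern) {
        if (preg_match($pattern, $url) === 1) {
            return true;
        }
    }
    return false;
}

// The exclude patterns from the example config, written as PHP string literals.
$patterns = array(
    '/^javascript\:void\(0\)$/',
    '/^#.*/',
    '/^\/$/',
    '/\.(pdf|zip|zi|png|jpg|jpeg|doc|ppt)$/i',
);

// Sample URLs, assumed for illustration only.
$urls = array('#top', 'javascript:void(0)', '/', '/files/report.PDF', '/about');

foreach ($urls as $url) {
    echo $url . ' => ' . (is_excluded($url, $patterns) ? 'excluded' : 'crawled') . "\n";
}
```

Note that the file-type pattern is anchored with `$`, so a URL such as `/files/report.pdf?download=1` would not be excluded by it. If your site serves files with query strings, you may want to adjust the pattern accordingly.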

Usage
-----

Charlotte is currently designed to be run from the command line only.

Create a file called `crawl.php`.

> touch crawl.php

Insert the following:

```
