Schema

[ ]:
import arche
from arche import *

Creating

A schema can be inferred from a job item. basic_json_schema() returns a Python dict representation.

[ ]:
schema = basic_json_schema("381798/1/3"); schema
[ ]:
schema.raw

There is also a json() method. Notice the difference in boolean values and regex.

[ ]:
schema.json()
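The difference between a Python dict and its JSON text can be reproduced with the standard library alone. The sketch below is only an illustration: `schema_dict` is a made-up fragment, not the schema inferred from the job above, and it assumes nothing about how json() is implemented inside arche.

```python
import json

# Illustrative fragment only; not the schema inferred from the job above.
schema_dict = {
    "additionalProperties": False,
    "properties": {
        "price": {"type": "string", "pattern": r"^\d+\.\d{2}$"},
    },
}

# Serialise the dict to JSON text.
as_json = json.dumps(schema_dict, indent=4)
print(as_json)

# In the JSON text the boolean is lowercase (false) and every backslash
# of the pattern is written as two characters.
# Parsing the text gives back an equal Python dict.
restored = json.loads(as_json)
print(restored == schema_dict)  # True
```

The Python value `False` becomes `false` in JSON, and the regex is escaped for JSON, which is why the two outputs look different even though they describe the same schema.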

Setting

[ ]:
a = Arche("381798/1/3")

You can set a JSON schema in different ways: by passing a schema argument to an Arche instance or by setting the schema property.

From a dict

[ ]:
a.schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": False,
    "type": "object",
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        "price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
        "description": {"type": "string"},
        "title": {"type": "string", "tag": ["unique"]},
    },
    "required": [
        "category",
        "description",
        "price",
        "title"
    ]
}
a.schema
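To see what a pattern like the one on "price" accepts, it can be tried directly with the `re` module. This is a minimal sketch: `PRICE_PATTERN` and `matches_price` are names introduced here for illustration, and the pattern is written as a raw string with the dot escaped.

```python
import re

# Same shape as the "price" pattern above, as a raw string with an escaped dot.
PRICE_PATTERN = r"^£\d{2}\.\d{2}$"


def matches_price(value: str) -> bool:
    """Return True if value satisfies the pattern.

    JSON Schema "pattern" is not implicitly anchored, so search() is used
    and the anchors come from the pattern itself.
    """
    return re.search(PRICE_PATTERN, value) is not None


print(matches_price("£51.77"))  # True
print(matches_price("£5.00"))   # False: only one digit before the dot
print(matches_price("51.77"))   # False: the pound sign is missing
```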

From a URL

[ ]:
a.schema = "https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/books.json"
a.schema, a.schema_source

From a private repo

GitHub

For GitHub, specify the raw link, which contains a token at the end. The token expires after 5 minutes.

a.schema = "https://raw.githubusercontent.com/manycoding/repo/master/schema.json?token=AJ6jjTtZtWZr5zyw7DuWduieMJ2ms1ks5ctRC6wA%3%3D"

Bitbucket

For Bitbucket, you have to set the BITBUCKET_USER and BITBUCKET_PASSWORD environment variables. For example, in Jupyter it looks like:

[ ]:
%env BITBUCKET_USER=your_id
%env BITBUCKET_PASSWORD=your_pass

Besides the user’s username and password, you can use Bitbucket’s app passwords.
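Outside Jupyter, where the %env magic is not available, the same variables can be set from plain Python before the schema is loaded. A minimal sketch with placeholder values; the early check at the end is an addition for illustration, not something arche requires.

```python
import os

# Placeholder credentials; replace with your own user and (app) password.
os.environ["BITBUCKET_USER"] = "your_id"
os.environ["BITBUCKET_PASSWORD"] = "your_pass"

# Fail early if either variable is missing or empty.
required = ("BITBUCKET_USER", "BITBUCKET_PASSWORD")
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```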

Both regular URLs and raw links are supported:

a.schema = "https://bitbucket.org/user/repo/raw/HEAD/schema.json"

or

a.schema = "https://bitbucket.org/user/repo/src/HEAD/schema.json"
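Both link forms carry the same information: the user, the repository, and the revision plus file path. The sketch below shows one way to extract those parts. `parse_bitbucket_url` and `BitbucketFile` are hypothetical names used for illustration and are not necessarily how arche resolves the links internally.

```python
import re
from typing import NamedTuple


class BitbucketFile(NamedTuple):
    user: str
    repo: str
    path: str  # revision followed by the file path, e.g. "HEAD/schema.json"


def parse_bitbucket_url(url: str, netloc: str = "bitbucket.org") -> BitbucketFile:
    """Hypothetical helper: split a regular (src) or raw link into its parts."""
    match = re.fullmatch(
        rf"https://{re.escape(netloc)}/([^/]+)/([^/]+)/(?:raw|src)/(.+)", url
    )
    if match is None:
        raise ValueError(f"Not a recognised Bitbucket file URL: {url}")
    return BitbucketFile(*match.groups())


raw = parse_bitbucket_url("https://bitbucket.org/user/repo/raw/HEAD/schema.json")
src = parse_bitbucket_url("https://bitbucket.org/user/repo/src/HEAD/schema.json")
print(raw == src)  # True: both links identify the same file
```

The `netloc` parameter is where a self-hosted host name would go in this sketch.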

Optionally, you can set BITBUCKET_NETLOC and BITBUCKET_API_NETLOC to access files from a self-hosted Bitbucket server. For example:

[1]:
%env BITBUCKET_NETLOC=bitbucket.org
%env BITBUCKET_API_NETLOC=api.bitbucket.org
env: BITBUCKET_NETLOC=bitbucket.org
env: BITBUCKET_API_NETLOC=api.bitbucket.org

From AWS S3

To get schemas from a private S3 bucket, you need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

[ ]:
%env AWS_ACCESS_KEY_ID=your_id
%env AWS_SECRET_ACCESS_KEY=your_key

Then specify the S3 link:

a.schema = "s3://bucket/schema.json"
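An S3 link of this form consists of a bucket name and an object key. The following sketch splits the two with the standard library; `split_s3_url` is a hypothetical helper for illustration and makes no claim about how arche fetches the object.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_url(url: str) -> Tuple[str, str]:
    """Hypothetical helper: return (bucket, key) for an s3:// link."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URL: {url}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url("s3://bucket/schema.json")
print(bucket, key)  # bucket schema.json
```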

Properties

[ ]:
a.schema.tags
[ ]:
a.schema.enums
[ ]:
a.schema.raw
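As a rough illustration of what tag information in a schema looks like, the sketch below collects the "tag" entries from a schema dict like the one in the dict example. `collect_tags` is a hypothetical helper, not arche's implementation, and the actual shape of a.schema.tags may differ.

```python
from collections import defaultdict
from typing import Any, Dict, List


def collect_tags(schema: Dict[str, Any]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each tag to the properties that declare it."""
    tags = defaultdict(list)
    for name, definition in schema.get("properties", {}).items():
        for tag in definition.get("tag", []):
            tags[tag].append(name)
    return dict(tags)


example = {
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        "title": {"type": "string", "tag": ["unique"]},
        "description": {"type": "string"},
    }
}
print(collect_tags(example))  # {'category': ['category'], 'unique': ['title']}
```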