Schema¶
[ ]:
import arche
from arche import *
Creating¶
A schema can be inferred from a job item. basic_json_schema()
returns a Python dict representation.
[ ]:
schema = basic_json_schema("381798/1/3"); schema
[ ]:
schema.raw
There is also a json()
method; notice the difference in boolean values and regex escaping.
[ ]:
schema.json()
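The difference between the two views is the usual one between a Python literal and JSON text. As a minimal sketch, assuming nothing about how arche implements json() and using only the standard library with a made-up schema_dict:

```python
import json

# A small stand-in schema (hypothetical, not the one inferred from the job above).
schema_dict = {
    "additionalProperties": False,
    "properties": {"price": {"type": "string", "pattern": r"^\d+\.\d{2}$"}},
}

# The Python view uses the False literal; the JSON view uses false
# and escapes every backslash in the regex.
as_json = json.dumps(schema_dict, indent=4)
print(schema_dict)
print(as_json)

# Parsing the JSON text gives back the original dict.
round_trip = json.loads(as_json)
```

The JSON text is what you would store in a schema file, while the dict form is what you would paste into a notebook cell.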
Setting¶
[ ]:
a = Arche("381798/1/3")
You can set a JSON schema in different ways: by passing a schema
argument when creating an Arche
instance, or by setting the schema
property.
From a dict¶
[ ]:
a.schema = {
"$schema": "http://json-schema.org/draft-07/schema#",
"definitions": {
"float": {
"pattern": "^-?[0-9]+\\.[0-9]{2}$"
},
"url": {
"pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
}
},
"additionalProperties": False,
"type": "object",
"properties": {
"category": {"type": "string", "tag": ["category"]},
"price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
"description": {"type": "string"},
"title": {"type": "string", "tag": ["unique"]},
},
"required": [
"category",
"description",
"price",
"title"
]
}
a.schema
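Before relying on a pattern in a schema, it can help to try it against a few sample values. The sketch below checks a price pattern with the standard re module, assuming the dot is meant as a literal decimal point; the sample values are made up for illustration.

```python
import re

# Price pattern with an escaped, literal decimal point (an assumption about intent).
price_pattern = re.compile(r"^£\d{2}\.\d{2}$")

# Hypothetical sample values, not taken from a real job.
samples = ["£51.77", "£5.00", "51.77", "£51,77"]

# JSON Schema "pattern" is a search, not a full match, so use search() here;
# the ^ and $ anchors in the pattern make it match the whole string anyway.
results = {value: bool(price_pattern.search(value)) for value in samples}
print(results)
```

Only values with the pound sign, exactly two digits, a dot, and two more digits pass.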
From a url¶
[ ]:
a.schema = "https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/books.json"
a.schema, a.schema_source
From a private repo¶
Github¶
For GitHub, specify the raw link, which contains a token at the end. The token expires after 5 minutes.
a.schema = "https://raw.githubusercontent.com/manycoding/repo/master/schema.json?token=AJ6jjTtZtWZr5zyw7DuWduieMJ2ms1ks5ctRC6wA%3%3D"
Bitbucket¶
For Bitbucket, you have to set the BITBUCKET_USER
and BITBUCKET_PASSWORD
environment variables. For example, in Jupyter it looks like:
[ ]:
%env BITBUCKET_USER=your_id
%env BITBUCKET_PASSWORD=your_pass
Besides the account username and password, you can also use Bitbucket app passwords.
Both regular URLs and raw links are supported:
a.schema = "https://bitbucket.org/user/repo/raw/HEAD/schema.json"
or
a.schema = "https://bitbucket.org/user/repo/src/HEAD/schema.json"
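The two link forms differ only in one path segment. The sketch below shows that relationship with a hypothetical helper, to_raw_url; it is an illustration built on urllib.parse, not a description of how arche resolves Bitbucket links internally.

```python
from urllib.parse import urlsplit, urlunsplit


def to_raw_url(url: str) -> str:
    """Turn a regular /src/ link into its /raw/ form (hypothetical helper)."""
    parts = urlsplit(url)
    # Path looks like /user/repo/src/HEAD/schema.json, so index 3 is the view segment.
    segments = parts.path.split("/")
    if len(segments) > 3 and segments[3] == "src":
        segments[3] = "raw"
    return urlunsplit(parts._replace(path="/".join(segments)))


regular = "https://bitbucket.org/user/repo/src/HEAD/schema.json"
raw = to_raw_url(regular)
print(raw)
```

A link that already uses the raw form is returned unchanged.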
Optionally, you can set BITBUCKET_NETLOC
and BITBUCKET_API_NETLOC
to access files from a self-hosted Bitbucket server. For example:
[1]:
%env BITBUCKET_NETLOC=bitbucket.org
%env BITBUCKET_API_NETLOC=api.bitbucket.org
env: BITBUCKET_NETLOC=bitbucket.org
env: BITBUCKET_API_NETLOC=api.bitbucket.org
From AWS S3¶
To get schemas from a private S3 bucket, you need to set AWS_ACCESS_KEY_ID
and AWS_SECRET_ACCESS_KEY:
[ ]:
%env AWS_ACCESS_KEY_ID=your_id
%env AWS_SECRET_ACCESS_KEY=your_key
Then specify the S3 link:
a.schema = "s3://bucket/schema.json"
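The %env magic only works inside Jupyter or IPython. In a plain Python script, the same variables can be set through os.environ before assigning the schema. This is a minimal sketch with placeholder values; in practice the credentials would normally come from the shell or a secrets store rather than from source code.

```python
import os

# Placeholder credentials, mirroring the %env cells above.
os.environ["AWS_ACCESS_KEY_ID"] = "your_id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_key"

# Check that everything needed is present before touching a private bucket.
required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```

The same approach applies to the Bitbucket variables described earlier.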