{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Schema" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import arche\n", "from arche import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A schema can be inferred from a job item. `basic_json_schema()` returns Python dict representacion." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema = basic_json_schema(\"381798/1/3\"); schema" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema.raw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But there's also a `json()` method, notice the difference in boolean values and regex." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "schema.json()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setting" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = Arche(\"381798/1/3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can set JSON schemas by different ways, by passing a `schema` argument to `Arche` instance or by setting `schema` property" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From a dict" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.schema = {\n", " \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n", " \"definitions\": {\n", " \"float\": {\n", " \"pattern\": \"^-?[0-9]+\\\\.[0-9]{2}$\"\n", " },\n", " \"url\": {\n", " \"pattern\": \"^https?://(www\\\\.)?[a-z0-9.-]*\\\\.[a-z]{2,}([^<>%\\\\x20\\\\x00-\\\\x1f\\\\x7F]|%[0-9a-fA-F]{2})*$\"\n", " }\n", " },\n", " \"additionalProperties\": False,\n", " \"type\": \"object\",\n", " \"properties\": {\n", " \"category\": {\"type\": \"string\", \"tag\": [\"category\"]},\n", " \"price\": {\"type\": \"string\", \"pattern\": \"^£\\d{2}.\\d{2}$\"},\n", " \"description\": {\"type\": \"string\"},\n", " \"title\": {\"type\": \"string\", \"tag\": [\"unique\"]},\n", " },\n", " \"required\": [\n", " \"category\",\n", " \"description\",\n", " \"price\",\n", " \"title\"\n", " ]\n", "}\n", "a.schema" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From a url" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.schema = \"https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/books.json\"\n", "a.schema, a.schema_source" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From a private repo" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Github\n", "For github, you just specify the raw link which will contain a token on the end. The token expires after 5 minutes.\n", "\n", "```a.schema = \"https://raw.githubusercontent.com/manycoding/repo/master/schema.json?token=AJ6jjTtZtWZr5zyw7DuWduieMJ2ms1ks5ctRC6wA%3%3D\"```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Bitbucket\n", "For bitbucket, you have to set up `BITBUCKET_USER` and `BITBUCKET_PASSWORD` environment variables.\n", "For example, in Jupyter it looks like:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env BITBUCKET_USER=your_id\n", "%env BITBUCKET_PASSWORD=your_pass" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides the user's username and password, you can use [Bitbucket's app passwords](https://confluence.atlassian.com/bitbucket/app-passwords-828781300.html).\n", "\n", "It supports both regular URL am raw links:\n", "```\n", "a.schema = \"https://bitbucket.org/user/repo/raw/HEAD/schema.json\"\n", "```\n", "or\n", "```\n", "a.schema = \"https://bitbucket.org/user/repo/src/HEAD/schema.json\"\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optionally, you can set `BITBUCKET_NETLOC` and `BITBUCKET_API_NETLOC` when you wish to access files from a self-hosted Bitbucket server. Eg.:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "env: BITBUCKET_NETLOC=bitbucket.org\n", "env: BITBUCKET_API_NETLOC=api.bitbucket.org\n" ] } ], "source": [ "%env BITBUCKET_NETLOC=bitbucket.org\n", "%env BITBUCKET_API_NETLOC=api.bitbucket.org" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From AWS S3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get schemas from private s3 bucket, you need to set `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%env AWS_ACCESS_KEY_ID=your_id\n", "%env AWS_SECRET_ACCESS_KEY=your_key" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And then just specify s3 link\n", "\n", "```a.schema = \"s3://bucket/schema.json\"```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Properties" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.schema.tags" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.schema.enums" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.schema.raw" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }