{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-03-22T21:01:59.316035Z", "start_time": "2019-03-22T21:01:45.087Z" } }, "outputs": [], "source": [ "import arche\n", "import pandas as pd\n", "from arche import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The only required parameter is `source`, which accepts various inputs. See the signature (`?Arche`) or the examples below." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Sources\n", "`Arche`, together with the `pandas` API, can read data from various places and formats.\n", "\n", "### `*.json` as iterable" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "with open(\"data/items_books_1.json\") as f:\n", "    raw_items = json.load(f)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = Arche(source=raw_items)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `*.jl.gz` and pandas API" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "url = \"https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl.gz\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df = pd.read_json(url, lines=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `jsonlines` and `json` formats are not memory efficient if the data contains nested objects. If no other format is available, you can read compressed jsonlines in chunks." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chunks = pd.read_json(url, lines=True, chunksize=500)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfs = [df for df in chunks]\n", "df = pd.concat(dfs, sort=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Uncompressed jsonlines files, however, need to be downloaded first." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "raw_json = arche.tools.s3.get_contents(\"https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "chunks = pd.read_json(raw_json, lines=True, chunksize=500)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dfs = [df for df in chunks]\n", "df = pd.concat(dfs, sort=False)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = Arche(source=df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scrapy Cloud keys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access data from a Scrapy Cloud job using the job key." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: To access Scrapy Cloud data, you need to set your [Scrapinghub API key](https://app.scrapinghub.com/account/apikey) in the `SH_APIKEY` environment variable." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a = Arche(source=\"381798/1/3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a full report on the data, Arche provides the `report_all()` method." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-03-22T21:01:50.955877Z", "start_time": "2019-03-22T21:01:39.951101Z" } }, "outputs": [], "source": [ "a.report_all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This method runs a predefined set of [rules](https://arche.readthedocs.io/en/latest/nbs/Rules.html). Which rules are executed depends on the input parameters: for example, if both `source` and `target` are given, the comparison rules are executed too. Some rules are not part of `report_all()`; see [rules](https://arche.readthedocs.io/en/latest/nbs/Rules.html) for more information. The validation can be improved by adding a JSON schema, so let's infer one from the data we already have." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## JSON Schema" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-03-22T21:01:59.279211Z", "start_time": "2019-03-22T21:01:50.965889Z" } }, "outputs": [], "source": [ "basic_json_schema(\"381798/1/3\")" ] }, { "cell_type": "markdown", "metadata": { "ExecuteTime": { "end_time": "2019-03-22T21:01:59.295022Z", "start_time": "2019-03-22T21:01:59.285607Z" } }, "source": [ "By itself, a basic schema is not very helpful, but you can update it." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-03-22T21:02:14.500232Z", "start_time": "2019-03-22T21:02:14.452833Z" } }, "outputs": [], "source": [ "a.source_items.df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It looks like `price` can be checked with a regex. 
Let's also add the `category` tag, which helps to see the distribution in categorical data, and the `unique` tag to `title` to ensure there are no duplicates." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-03-22T21:02:15.116277Z", "start_time": "2019-03-22T21:02:15.097663Z" } }, "outputs": [], "source": [ "a.schema = {\n", "    \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n", "    \"definitions\": {\n", "        \"float\": {\n", "            \"pattern\": \"^-?[0-9]+\\\\.[0-9]{2}$\"\n", "        },\n", "        \"url\": {\n", "            \"pattern\": \"^https?://(www\\\\.)?[a-z0-9.-]*\\\\.[a-z]{2,}([^<>%\\\\x20\\\\x00-\\\\x1f\\\\x7F]|%[0-9a-fA-F]{2})*$\"\n", "        }\n", "    },\n", "    \"additionalProperties\": False,\n", "    \"type\": \"object\",\n", "    \"properties\": {\n", "        \"category\": {\"type\": \"string\", \"tag\": [\"category\"]},\n", "        \"price\": {\"type\": \"string\", \"pattern\": \"^£\\\\d{2}\\\\.\\\\d{2}$\"},\n", "        \"_type\": {\"type\": \"string\"},\n", "        \"description\": {\"type\": \"string\"},\n", "        \"title\": {\"type\": \"string\", \"tag\": [\"unique\"]},\n", "        \"_key\": {\"type\": \"string\"}\n", "    },\n", "    \"required\": [\n", "        \"_key\",\n", "        \"_type\",\n", "        \"category\",\n", "        \"description\",\n", "        \"price\",\n", "        \"title\"\n", "    ]\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-03-22T21:02:15.808566Z", "start_time": "2019-03-22T21:02:15.625333Z" } }, "outputs": [], "source": [ "a.validate_with_json_schema()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or, if your job is really big, you can use an almost 100x faster [backend](https://github.com/horejsek/python-fastjsonschema):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.glance()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We already got something! Let's run the full report again to see how the `category` tag works." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "ExecuteTime": { "end_time": "2019-03-22T21:02:20.766381Z", "start_time": "2019-03-22T21:02:18.838290Z" }, "scrolled": true }, "outputs": [], "source": [ "a.report_all()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing Results Data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.report.results.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "a.report.results.get(\"Coverage For Scraped Categories\").stats" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.2" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 4 }