{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Basics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-22T21:01:59.316035Z",
     "start_time": "2019-03-22T21:01:45.087Z"
    }
   },
   "outputs": [],
   "source": [
    "import arche\n",
    "from arche import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The only required parameter is `source`, which accepts various inputs - see signature (`?Arche`) or examples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Sources\n",
    "`Arche` with `pandas` API provide ability to read data from various places and formats.\n",
    "\n",
    "### `*.json` as iterable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "with open(\"data/items_books_1.json\") as f:\n",
    "    raw_items = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = Arche(source=raw_items)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `*.jl.gz` and pandas API"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl.gz\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_json(url,lines=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`jsonlines` and `json` are not memory efficient if data contains nested objects. If other types are not available, you can read compressed jsonline in chunks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chunks = pd.read_json(url, lines=True, chunksize=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfs = [df for df in chunks]\n",
    "df = pd.concat(dfs, sort=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Uncompressed jsonline files however need to be downloaded first"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raw_json = arche.tools.s3.get_contents(\"https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chunks = pd.read_json(raw_json, lines=True, chunksize=500)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dfs = [df for df in chunks]\n",
    "df = pd.concat(dfs, sort=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = Arche(source=df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Scrapy Cloud keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can access data from a job at Scrapy Cloud using the job key."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note: To access Scrapy Cloud Data, you need to set [Scrapinghub API key](https://app.scrapinghub.com/account/apikey) in `SH_APIKEY` environment variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = Arche(source=\"381798/1/3\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a full report of the data, arche provides the `report_all()` function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-22T21:01:50.955877Z",
     "start_time": "2019-03-22T21:01:39.951101Z"
    }
   },
   "outputs": [],
   "source": [
    "a.report_all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method runs a determined set of [rules](https://arche.readthedocs.io/en/latest/nbs/Rules.html). Which rules to execute is dependent from input parameters - i.e. if we have both `source` and `target` then [comparison] rules will be executed too. Some of them is not part of `report_all()`, see [rules](https://arche.readthedocs.io/en/latest/nbs/Rules.html) for more information. The validation can be improved by adding a json schema, so let's infer one from the data we already have."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## JSON Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-22T21:01:59.279211Z",
     "start_time": "2019-03-22T21:01:50.965889Z"
    }
   },
   "outputs": [],
   "source": [
    "basic_json_schema(\"381798/1/3\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-22T21:01:59.295022Z",
     "start_time": "2019-03-22T21:01:59.285607Z"
    }
   },
   "source": [
    "By itself a basic schema is not very helpful, but you can update it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-22T21:02:14.500232Z",
     "start_time": "2019-03-22T21:02:14.452833Z"
    }
   },
   "outputs": [],
   "source": [
    "a.source_items.df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks like `price` can be checked with regex. Let's also add `category` tag which helps to see the distribution in categoric data and `unique` tag to title to ensure there are no duplicates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-22T21:02:15.116277Z",
     "start_time": "2019-03-22T21:02:15.097663Z"
    }
   },
   "outputs": [],
   "source": [
    "a.schema = {\n",
    "    \"$schema\": \"http://json-schema.org/draft-07/schema#\",\n",
    "    \"definitions\": {\n",
    "        \"float\": {\n",
    "            \"pattern\": \"^-?[0-9]+\\\\.[0-9]{2}$\"\n",
    "        },\n",
    "        \"url\": {\n",
    "            \"pattern\": \"^https?://(www\\\\.)?[a-z0-9.-]*\\\\.[a-z]{2,}([^<>%\\\\x20\\\\x00-\\\\x1f\\\\x7F]|%[0-9a-fA-F]{2})*$\"\n",
    "        }\n",
    "    },\n",
    "    \"additionalProperties\": False,\n",
    "    \"type\": \"object\",\n",
    "    \"properties\": {\n",
    "        \"category\": {\"type\": \"string\", \"tag\": [\"category\"]},\n",
    "        \"price\": {\"type\": \"string\", \"pattern\": \"^£\\d{2}.\\d{2}$\"},\n",
    "        \"_type\": {\"type\": \"string\"},\n",
    "        \"description\": {\"type\": \"string\"},\n",
    "        \"title\": {\"type\": \"string\", \"tag\": [\"unique\"]},\n",
    "        \"_key\": {\"type\": \"string\"}\n",
    "    },\n",
    "    \"required\": [\n",
    "        \"_key\",\n",
    "        \"_type\",\n",
    "        \"category\",\n",
    "        \"description\",\n",
    "        \"price\",\n",
    "        \"title\"\n",
    "    ]\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-22T21:02:15.808566Z",
     "start_time": "2019-03-22T21:02:15.625333Z"
    }
   },
   "outputs": [],
   "source": [
    "a.validate_with_json_schema()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or if your job is really big you can use almost 100x faster [backend](https://github.com/horejsek/python-fastjsonschema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a.glance()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We already got something! Let's execute the whole thing again to see how `category` tag works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2019-03-22T21:02:20.766381Z",
     "start_time": "2019-03-22T21:02:18.838290Z"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "a.report_all()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Accessing Results Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a.report.results.keys()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a.report.results.get(\"Coverage For Scraped Categories\").stats"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {},
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}