Basics¶

[1]:

import arche
import pandas as pd  # used as `pd` in the cells below
from arche import *

The only required parameter is source, which accepts various inputs; see the signature (?Arche) or the examples.

Data Sources¶

Together with the pandas API, Arche can read data from various places and formats.

`*.json` as iterable¶

[2]:

import json
with open("data/items_books_1.json") as f:
    raw_items = json.load(f)

[3]:

a = Arche(source=raw_items)

`*.jl.gz` and pandas API¶

[4]:

url = "https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl.gz"

[5]:

df = pd.read_json(url, lines=True)

JSON Lines and JSON are not memory efficient if the data contains nested objects. If no other format is available, you can read a compressed JSON Lines file in chunks.

[6]:

chunks = pd.read_json(url, lines=True, chunksize=500)

[7]:

dfs = [df for df in chunks]
df = pd.concat(dfs, sort=False)

[8]:

df.shape

[8]:

(1000, 5)
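To make the idea of chunked reading concrete, here is a minimal standard-library sketch. It does not use pandas or Arche; `iter_chunks` is a hypothetical helper and the records are made up. It only illustrates that a compressed JSON Lines stream can be parsed a few records at a time instead of all at once.

```python
import gzip
import io
import json
from itertools import islice


def iter_chunks(fileobj, chunksize):
    """Yield lists of at most `chunksize` parsed JSON Lines records."""
    lines = (line for line in fileobj if line.strip())  # skip blank lines
    while True:
        chunk = [json.loads(line) for line in islice(lines, chunksize)]
        if not chunk:
            return
        yield chunk


# Build a tiny gzip-compressed JSON Lines payload in memory (made-up data).
records = [{"title": f"Book {i}", "price": f"£{10 + i}.00"} for i in range(5)]
payload = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))

with gzip.open(io.BytesIO(payload), mode="rt", encoding="utf-8") as f:
    chunk_sizes = [len(chunk) for chunk in iter_chunks(f, chunksize=2)]

print(chunk_sizes)  # [2, 2, 1]
```

Only one chunk of parsed records is held in memory at a time, which is the same trade-off the chunksize argument of pd.read_json offers.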

Uncompressed JSON Lines files, however, need to be downloaded first.

[9]:

raw_json = arche.tools.s3.get_contents("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl")

[10]:

chunks = pd.read_json(raw_json, lines=True, chunksize=500)

[11]:

dfs = [df for df in chunks]
df = pd.concat(dfs, sort=False)

[12]:

df.shape

[12]:

(1000, 5)

[13]:

a = Arche(source=df)

WARNING
Pandas stores `NA` (missing) data differently, which might affect schema validation. Should you care, consider passing raw data in array-like types.
For more details, see https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
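The type promotion mentioned in the warning can be reproduced with pandas alone. The items below are made up and the sketch does not involve Arche; it only shows that an integer field with a missing value ends up as a float column, which is the kind of change that may matter for a schema expecting integers.

```python
import pandas as pd

# Made-up items: the second one has no "pages" value.
items = [{"title": "A", "pages": 100}, {"title": "B"}]

df = pd.DataFrame(items)
print(df["pages"].dtype)     # float64: the integer column was promoted
print(df["pages"].tolist())  # [100.0, nan]
```

Passing the raw list of dicts instead of the DataFrame keeps 100 as an integer and leaves the missing key missing.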

Scrapy Cloud keys¶

You can access data from a job at Scrapy Cloud using the job key.

Note: To access Scrapy Cloud data, you need to set your Scrapinghub API key in the SH_APIKEY environment variable.
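A quick way to check the variable before creating an Arche object is sketched below. `has_api_key` is a hypothetical helper, not part of Arche, and the key value is a placeholder; in a notebook the variable could also be set with os.environ["SH_APIKEY"] = "..." before the first request.

```python
import os


def has_api_key(environ=os.environ):
    """Return True if SH_APIKEY is set to a non-empty value."""
    return bool(environ.get("SH_APIKEY", "").strip())


# Demonstrated against plain dicts so no real key is needed.
print(has_api_key({"SH_APIKEY": "0123abcd"}))  # True
print(has_api_key({}))                         # False
```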

[14]:

a = Arche(source="381798/1/3")

To get a full report on the data, Arche provides the report_all() method.

[15]:

a.report_all()

This method runs a predefined set of rules. Which rules are executed depends on the input parameters; for example, if both source and target are given, the comparison rules are executed too. Some rules are not part of report_all(); see rules for more information. The validation can be improved by adding a JSON schema, so let’s infer one from the data we already have.

JSON Schema¶

[16]:

basic_json_schema("381798/1/3")

[16]:

{'$schema': 'http://json-schema.org/draft-07/schema#',
 'additionalProperties': False,
 'definitions': {'url': {'pattern': '^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$'}},
 'properties': {'category': {'type': 'string'},
                'description': {'type': 'string'},
                'price': {'type': 'string'},
                'title': {'type': 'string'}},
 'required': ['category', 'description', 'price', 'title'],
 'type': 'object'}

By itself a basic schema is not very helpful, but you can update it.

[17]:

a.source_items.df.head()

[17]:

	title	price	category	description
https://app.scrapinghub.com/p/381798/1/3/item/0	It's Only the Himalayas	£45.17	Travel	“Wherever you go, whatever you do, just . . . ...
https://app.scrapinghub.com/p/381798/1/3/item/1	Libertarianism for Beginners	£51.33	Politics	Libertarianism isn't about winning elections; ...
https://app.scrapinghub.com/p/381798/1/3/item/2	Mesaerion: The Best Science Fiction Stories 18...	£37.59	Science Fiction	Andrew Barger, award-winning author and engine...
https://app.scrapinghub.com/p/381798/1/3/item/3	Olio	£23.88	Poetry	Part fact, part fiction, Tyehimba Jess's much ...
https://app.scrapinghub.com/p/381798/1/3/item/4	Our Band Could Be Your Life: Scenes from the A...	£57.25	Music	This is the never-before-told story of the mus...

It looks like price can be checked with a regex. Let’s also add the category tag to category, which helps to see the distribution of categorical data, and the unique tag to title to ensure there are no duplicates.

[18]:

a.schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": False,
    "type": "object",
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        "price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
        "_type": {"type": "string"},
        "description": {"type": "string"},
        "title": {"type": "string", "tag": ["unique"]},
        "_key": {"type": "string"}
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
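A price pattern of this kind can be tried out with Python's re module before it goes into a schema. The sketch below compares a pattern with an escaped dot against one with an unescaped dot; the sample strings are made up, and Python's regex dialect is close to, but not identical to, the ECMA 262 dialect that JSON Schema refers to.

```python
import re

strict = re.compile(r"^£\d{2}\.\d{2}$")  # literal dot between pounds and pence
loose = re.compile(r"^£\d{2}.\d{2}$")    # unescaped dot matches any character

prices = ["£45.17", "£51.33", "£45x17", "45.17", "£5.00"]

print([p for p in prices if strict.search(p)])  # ['£45.17', '£51.33']
print([p for p in prices if loose.search(p)])   # ['£45.17', '£51.33', '£45x17']
```

Both patterns require exactly two digits before the separator, so a price such as £5.00 or £100.00 would be reported as invalid.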

[19]:

a.validate_with_json_schema()

Or, if your job is really big, you can use a backend that is almost 100x faster.

[20]:

a.glance()

We already got something! Let’s execute the whole thing again to see how the category tag works.

[21]:

a.report_all()

Accessing Results Data¶

[22]:

a.report.results.keys()

[22]:

dict_keys(['Job Outcome', 'Job Errors', 'Garbage Symbols', 'Fields Coverage', 'Categories', 'JSON Schema Validation', 'Tags', 'Compare Price Was And Now', 'Duplicates', 'Coverage For Scraped Categories'])

[23]:

a.report.results.get("Coverage For Scraped Categories").stats

[23]:

[Paranormal              1
 Cultural                1
 Novels                  1
 Parenting               1
 Academic                1
 Suspense                1
 Short Stories           1
 Crime                   1
 Erotica                 1
 Adult Fiction           1
 Historical              2
 Christian               3
 Contemporary            3
 Politics                3
 Health                  4
 Self Help               5
 Biography               5
 Sports and Games        5
 Spirituality            6
 Christian Fiction       6
 New Adult               6
 Religion                7
 Psychology              7
 Art                     8
 Autobiography           9
 Humor                  10
 Thriller               11
 Philosophy             11
 Travel                 11
 Business               12
 Music                  13
 Science                14
 Science Fiction        16
 Womens Fiction         17
 Horror                 17
 History                18
 Classics               19
 Poetry                 19
 Historical Fiction     26
 Childrens              29
 Food and Drink         30
 Mystery                32
 Romance                35
 Fantasy                48
 Young Adult            54
 Fiction                65
 Add a comment          67
 Sequential Art         75
 Nonfiction            110
 Default               152
 Name: category, dtype: int64]
