Basics¶
[1]:
import arche
import pandas as pd  # used as pd in the cells below
from arche import *
The only required parameter is source, which accepts various inputs; see the signature (?Arche) or the examples below.
Data Sources¶
Arche, together with the pandas API, can read data from various places and formats.
*.json as iterable¶
[2]:
import json
with open("data/items_books_1.json") as f:
    raw_items = json.load(f)
[3]:
a = Arche(source=raw_items)
*.jl.gz and pandas API¶
[4]:
url = "https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl.gz"
[5]:
df = pd.read_json(url, lines=True)
jsonlines and json are not memory efficient if the data contains nested objects. If no other format is available, you can read compressed jsonlines in chunks.
[6]:
chunks = pd.read_json(url, lines=True, chunksize=500)
[7]:
dfs = [df for df in chunks]
df = pd.concat(dfs, sort=False)
[8]:
df.shape
[8]:
(1000, 5)
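When pandas is not an option, the same chunked-reading idea can be sketched with the standard library alone. This is a minimal sketch and not part of arche: iter_jsonlines and chunked are hypothetical helpers, and the sample file is generated on the fly with made-up data so the block runs on its own.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def iter_jsonlines(path):
    """Yield one parsed item per non-empty line of a gzip-compressed jsonlines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def chunked(iterable, size):
    """Group an iterable into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Build a small sample file (made-up data, for illustration only).
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for i in range(7):
        f.write(json.dumps({"title": f"Book {i}", "meta": {"rank": i}}) + "\n")

# Only one chunk of parsed items is held in memory at a time.
chunk_sizes = [len(chunk) for chunk in chunked(iter_jsonlines(sample_path), 3)]
print(chunk_sizes)  # [3, 3, 1]
```

Each chunk is a plain list of dicts, so it could be processed or passed on without first loading the whole file.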
Uncompressed jsonlines files, however, need to be downloaded first.
[9]:
raw_json = arche.tools.s3.get_contents("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_books_1.jl")
[10]:
chunks = pd.read_json(raw_json, lines=True, chunksize=500)
[11]:
dfs = [df for df in chunks]
df = pd.concat(dfs, sort=False)
[12]:
df.shape
[12]:
(1000, 5)
[13]:
a = Arche(source=df)
WARNING
Pandas stores `NA` (missing) data differently, which might affect schema validation. Should you care, consider passing raw data in array-like types.
For more details, see https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#nan-integer-na-values-and-na-type-promotions
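The type promotion mentioned in the warning can be reproduced with a small sketch. The items below are made up for illustration: a field missing from one item becomes NaN in the DataFrame, and the integer column that contains it is promoted to float64.

```python
import pandas as pd

# Made-up items; the second one has no "pages" field.
raw_items = [
    {"title": "A", "pages": 120},
    {"title": "B"},
]

df = pd.DataFrame(raw_items)

print(df["pages"].dtype)            # float64, although the raw value is an int
print(df["pages"].isna().tolist())  # [False, True]
```

In the raw list the first item still holds the integer 120, which is one reason array-like raw data may validate differently from the same data loaded into a DataFrame.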
Scrapy Cloud keys¶
You can access data from a job at Scrapy Cloud using the job key.
Note: To access Scrapy Cloud data, you need to set your Scrapinghub API key in the SH_APIKEY environment variable.
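One way to fail early with a clear message is to check the environment before creating an Arche instance. This is a minimal sketch; require_api_key is a hypothetical helper and not part of arche, and the key value shown is a placeholder.

```python
import os


def require_api_key(env=None, name="SH_APIKEY"):
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Example with an explicit mapping; the value is a placeholder, not a real key.
print(require_api_key({"SH_APIKEY": "your-api-key"}))
```

Called without arguments, the helper reads os.environ, so it can be placed at the top of a notebook.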
[14]:
a = Arche(source="381798/1/3")
To get a full report of the data, arche provides the report_all() method.
[15]:
a.report_all()
This method runs a predefined set of rules. Which rules are executed depends on the input parameters: for example, if both source and target are given, the comparison rules are executed too. Some rules are not part of report_all(); see rules for more information. The validation can be improved by adding a JSON schema, so let’s infer one from the data we already have.
JSON Schema¶
[16]:
basic_json_schema("381798/1/3")
[16]:
{'$schema': 'http://json-schema.org/draft-07/schema#',
 'additionalProperties': False,
 'definitions': {'url': {'pattern': '^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$'}},
 'properties': {'category': {'type': 'string'},
                'description': {'type': 'string'},
                'price': {'type': 'string'},
                'title': {'type': 'string'}},
 'required': ['category', 'description', 'price', 'title'],
 'type': 'object'}
By itself a basic schema is not very helpful, but you can update it.
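To get a feel for what a basic schema captures, the inference can be approximated in a few lines. This is a simplified sketch and not how basic_json_schema is implemented: infer_basic_schema is a hypothetical helper that only handles flat items, and the sample items are partly made up.

```python
# Mapping from Python types to JSON Schema type names.
JSON_TYPES = {
    str: "string", bool: "boolean", int: "integer", float: "number",
    list: "array", dict: "object", type(None): "null",
}


def infer_basic_schema(items):
    """Infer property types and required fields from flat dict items."""
    properties = {}
    required = None
    for item in items:
        for field, value in item.items():
            properties.setdefault(field, set()).add(JSON_TYPES[type(value)])
        keys = set(item)
        # A field is required only if every item has it.
        required = keys if required is None else required & keys
    return {
        "type": "object",
        "additionalProperties": False,
        "properties": {
            field: {"type": sorted(types)[0] if len(types) == 1 else sorted(types)}
            for field, types in sorted(properties.items())
        },
        "required": sorted(required or []),
    }


items = [
    {"title": "Olio", "price": "£23.88", "category": "Poetry"},
    {"title": "Some Book", "price": "£10.00"},  # made-up item without a category
]
schema = infer_basic_schema(items)
print(schema["properties"])
print(schema["required"])
```

Such a schema only records types and presence, which is why adding patterns and tags by hand makes the validation more useful.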
[17]:
a.source_items.df.head()
[17]:
 | title | price | category | description |
---|---|---|---|---|
https://app.scrapinghub.com/p/381798/1/3/item/0 | It's Only the Himalayas | £45.17 | Travel | “Wherever you go, whatever you do, just . . . ... |
https://app.scrapinghub.com/p/381798/1/3/item/1 | Libertarianism for Beginners | £51.33 | Politics | Libertarianism isn't about winning elections; ... |
https://app.scrapinghub.com/p/381798/1/3/item/2 | Mesaerion: The Best Science Fiction Stories 18... | £37.59 | Science Fiction | Andrew Barger, award-winning author and engine... |
https://app.scrapinghub.com/p/381798/1/3/item/3 | Olio | £23.88 | Poetry | Part fact, part fiction, Tyehimba Jess's much ... |
https://app.scrapinghub.com/p/381798/1/3/item/4 | Our Band Could Be Your Life: Scenes from the A... | £57.25 | Music | This is the never-before-told story of the mus... |
It looks like price can be checked with a regex. Let’s also add a category tag, which helps to see the distribution of categorical data, and a unique tag to title to ensure there are no duplicates.
[18]:
a.schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "definitions": {
        "float": {
            "pattern": "^-?[0-9]+\\.[0-9]{2}$"
        },
        "url": {
            "pattern": "^https?://(www\\.)?[a-z0-9.-]*\\.[a-z]{2,}([^<>%\\x20\\x00-\\x1f\\x7F]|%[0-9a-fA-F]{2})*$"
        }
    },
    "additionalProperties": False,
    "type": "object",
    "properties": {
        "category": {"type": "string", "tag": ["category"]},
        # Escape the backslashes and the dot so only a literal decimal point matches
        "price": {"type": "string", "pattern": "^£\\d{2}\\.\\d{2}$"},
        "_type": {"type": "string"},
        "description": {"type": "string"},
        "title": {"type": "string", "tag": ["unique"]},
        "_key": {"type": "string"}
    },
    "required": [
        "_key",
        "_type",
        "category",
        "description",
        "price",
        "title"
    ]
}
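Before relying on a pattern in the schema, it can help to try it on a few values with the re module. A minimal sketch, using two prices from the table above plus two made-up invalid values; the dot is escaped here so that it matches only a literal decimal point.

```python
import re

# Pound sign, exactly two digits, a literal dot, exactly two digits.
PRICE_PATTERN = re.compile(r"^£\d{2}\.\d{2}$")

samples = ["£45.17", "£23.88", "£45x17", "45.17"]
results = {value: bool(PRICE_PATTERN.match(value)) for value in samples}
print(results)
# {'£45.17': True, '£23.88': True, '£45x17': False, '45.17': False}
```

Note that this pattern accepts exactly two digits before the decimal point, so a value such as £100.00 would be reported as invalid.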
[19]:
a.validate_with_json_schema()
Or, if your job is really big, you can use an almost 100x faster backend:
[20]:
a.glance()
We already got something! Let’s execute the whole thing again to see how the category tag works.
[21]:
a.report_all()
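Conceptually, the category tag reports how values are distributed and the unique tag flags repeated values. The idea can be sketched with the standard library; this illustrates the concept only and is not arche's implementation, and the items are made up.

```python
from collections import Counter

# Made-up items with one duplicated title.
items = [
    {"title": "Olio", "category": "Poetry"},
    {"title": "Olio", "category": "Poetry"},
    {"title": "Libertarianism for Beginners", "category": "Politics"},
]

# Distribution of values in a field tagged as "category".
category_counts = Counter(item["category"] for item in items)

# Values that occur more than once in a field tagged as "unique".
title_counts = Counter(item["title"] for item in items)
duplicate_titles = sorted(title for title, n in title_counts.items() if n > 1)

print(category_counts.most_common())  # [('Poetry', 2), ('Politics', 1)]
print(duplicate_titles)               # ['Olio']
```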
Accessing Results Data¶
[22]:
a.report.results.keys()
[22]:
dict_keys(['Job Outcome', 'Job Errors', 'Garbage Symbols', 'Fields Coverage', 'Categories', 'JSON Schema Validation', 'Tags', 'Compare Price Was And Now', 'Duplicates', 'Coverage For Scraped Categories'])
[23]:
a.report.results.get("Coverage For Scraped Categories").stats
[23]:
[Paranormal 1
Cultural 1
Novels 1
Parenting 1
Academic 1
Suspense 1
Short Stories 1
Crime 1
Erotica 1
Adult Fiction 1
Historical 2
Christian 3
Contemporary 3
Politics 3
Health 4
Self Help 5
Biography 5
Sports and Games 5
Spirituality 6
Christian Fiction 6
New Adult 6
Religion 7
Psychology 7
Art 8
Autobiography 9
Humor 10
Thriller 11
Philosophy 11
Travel 11
Business 12
Music 13
Science 14
Science Fiction 16
Womens Fiction 17
Horror 17
History 18
Classics 19
Poetry 19
Historical Fiction 26
Childrens 29
Food and Drink 30
Mystery 32
Romance 35
Fantasy 48
Young Adult 54
Fiction 65
Add a comment 67
Sequential Art 75
Nonfiction 110
Default 152
Name: category, dtype: int64]
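Since stats holds pandas objects, the usual pandas tooling applies to them. The sketch below does not call arche: it uses a small hand-built Series with a few counts copied from the output above (an assumption about the shape of the data) and picks out the categories with fewer than five items.

```python
import pandas as pd

# A few counts copied from the output above, for illustration.
counts = pd.Series(
    {"Paranormal": 1, "Historical": 2, "Health": 4, "Fantasy": 48, "Default": 152},
    name="category",
)

rare = counts[counts < 5]   # categories with very few items
largest = counts.idxmax()   # the most common category

print(rare.index.tolist())  # ['Paranormal', 'Historical', 'Health']
print(largest)              # Default
```

A large Default bucket like the one above is often worth a closer look, since it may point to items whose category was not extracted.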