Rules
This notebook contains the rules used in the library, with examples. Some rules are executed during Arche.report_all(), and some are meant to be executed separately.
Some definitions here are used interchangeably:
Rule - a test case for data. As a test case, it can fail, pass or be skipped. Some rules only output information, for example Category fields.
df - a dataframe which holds input data (from a job, collection or other source)
Scrapy Cloud item - a row in a df
Item fields - columns in a df
[1]:
import arche
import pandas as pd  # used below as pd.read_csv
from arche import *
from arche.readers.items import Items
[2]:
items = Items.from_df(pd.read_csv("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_products_8.csv"))
target_items = Items.from_df(pd.read_csv("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_products_7.csv"))
[4]:
df = items.df.drop(columns=["_type"])
target_df = target_items.df
Accessing Graphs Data
The data behind the graphs is stored in the stats attribute of a rule's result. See the Result class for more details.
[11]:
arche.rules.coverage.check_fields_coverage(df).stats
Coverage
Fields coverage on input data
[12]:
help(arche.rules.coverage.check_fields_coverage)
[13]:
arche.rules.coverage.check_fields_coverage(df).show()
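To make the idea of fields coverage concrete, here is a minimal plain-Python sketch that counts, for each field, how many items carry a non-empty value. It is an illustration only and not arche's implementation: the function name fields_coverage, the sample data, and the choice to treat None and empty strings as missing are all assumptions.

```python
def fields_coverage(items):
    """Return {field: (filled_count, filled_share)} for a list of dict items.

    Hypothetical helper for illustration; arche's own rule works on a df.
    A value counts as missing when the key is absent, None, or "".
    """
    fields = sorted({field for item in items for field in item})
    total = len(items)
    coverage = {}
    for field in fields:
        filled = sum(1 for item in items if item.get(field) not in (None, ""))
        coverage[field] = (filled, filled / total if total else 0.0)
    return coverage


# Made-up sample data: each dict stands for one item (a row in a df).
sample_items = [
    {"name": "Lamp", "price": "10.5", "category": "home"},
    {"name": "Desk", "price": None, "category": "home"},
    {"name": "Pen", "category": "office"},
    {"name": "Mug", "price": "", "category": "home"},
]

coverage = fields_coverage(sample_items)
for field, (filled, share) in coverage.items():
    print(f"{field}: {filled} of {len(sample_items)} items ({share:.0%})")
```

In this sample only one of four items has a usable price, so price is the field a coverage check would draw attention to.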
Anomalies
[14]:
help(arche.rules.coverage.anomalies)
[15]:
res = arche.rules.coverage.anomalies(target="381798/2/4", sample=["381798/2/8", "381798/2/7", "381798/2/6"])
res.show()
Categories
Category fields
[ ]:
help(arche.rules.category.get_categories)
[ ]:
arche.rules.category.get_categories(df, max_uniques=200).show()
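The sketch below shows one way to think about category fields: a field qualifies when its number of distinct values does not exceed a cap. It assumes that max_uniques in the call above plays this role; the helper category_fields and the sample data are hypothetical and written in plain Python, so the details may differ from what arche does.

```python
def category_fields(items, max_uniques=10):
    """Return {field: sorted distinct values} for fields with few distinct values.

    Hypothetical helper for illustration, not arche's implementation.
    Missing keys and None values are ignored here.
    """
    values = {}
    for item in items:
        for field, value in item.items():
            if value is not None:
                values.setdefault(field, set()).add(value)
    return {
        field: sorted(uniques)
        for field, uniques in values.items()
        if len(uniques) <= max_uniques
    }


# Made-up sample data: "name" is unique per item, the other fields repeat.
products = [
    {"name": "Lamp", "category": "home", "in_stock": "yes"},
    {"name": "Desk", "category": "home", "in_stock": "no"},
    {"name": "Pen", "category": "office", "in_stock": "yes"},
    {"name": "Mug", "category": "home", "in_stock": "yes"},
]

categories = category_fields(products, max_uniques=2)
print(categories)
```

With a cap of 2, name is excluded because every item has a different value, while category and in_stock are kept.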
Category coverage
In report_all(), these rules use the category tag.
[ ]:
help(arche.rules.category.get_coverage_per_category)
[ ]:
arche.rules.category.get_coverage_per_category(df, ["category"]).show()
[ ]:
help(arche.rules.category.get_difference)
[ ]:
arche.rules.category.get_difference(df, target_df, ["category"]).show()
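As a rough illustration of comparing category coverage between source and target data, the following plain-Python sketch computes the share of each category value in both datasets and subtracts them. The helpers value_shares and share_difference and the sample data are hypothetical, and no thresholds are applied; how arche scores or reports these differences is not shown here.

```python
from collections import Counter


def value_shares(items, field):
    """Share of each value of `field` among the items (value count / item count)."""
    counts = Counter(item.get(field) for item in items)
    total = len(items)
    return {value: n / total for value, n in counts.items()}


def share_difference(source, target, field):
    """Source share minus target share for every value seen in either dataset."""
    source_shares = value_shares(source, field)
    target_shares = value_shares(target, field)
    return {
        value: source_shares.get(value, 0.0) - target_shares.get(value, 0.0)
        for value in set(source_shares) | set(target_shares)
    }


# Made-up sample data standing in for df and target_df.
source_items = [{"category": c} for c in ["home", "home", "home", "office"]]
target_items_sample = [{"category": c} for c in ["home", "home", "office", "garden"]]

diff = share_difference(source_items, target_items_sample, "category")
for value, delta in sorted(diff.items()):
    print(f"{value}: {delta:+.2f}")
```

A positive number means the value is more common in the source than in the target; a value that exists only in the target shows up as a negative number.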