Rules¶

This notebook lists the rules used in the library, with examples. Some rules are executed during Arche.report_all(), and some are meant to be executed separately.

Some definitions here are used interchangeably:

Rule - a test case for data. As a test case, it can fail, pass or be skipped. Some rules only output information, for example Category fields.
df - a dataframe which holds input data (from a job, collection or other source)
Scrapy Cloud item - a row in a df
Item fields - columns in a df
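
To make the terminology concrete, here is a minimal, library-independent sketch of a rule as a test case with three outcomes. `Outcome`, `check_field_present` and the sample items are hypothetical names and data made up for illustration. They are not part of arche, and this is not how the library implements its rules.

```python
from enum import Enum
from typing import Dict, List, Optional


class Outcome(Enum):
    """Hypothetical outcomes of a rule, mirroring a test case."""

    PASSED = "passed"
    FAILED = "failed"
    SKIPPED = "skipped"


def check_field_present(
    items: List[Dict[str, Optional[str]]], field: str
) -> Outcome:
    """Toy rule: every item must have a non-empty value for `field`."""
    if not items:
        # Nothing to test, so the rule is skipped rather than passed.
        return Outcome.SKIPPED
    missing = [item for item in items if not item.get(field)]
    return Outcome.FAILED if missing else Outcome.PASSED


# Each dict plays the role of one item (a row); its keys are fields (columns).
items_sample = [
    {"name": "Bolt", "uom": "EA"},
    {"name": "Nut", "uom": None},
]
name_outcome = check_field_present(items_sample, "name")
uom_outcome = check_field_present(items_sample, "uom")
empty_outcome = check_field_present([], "name")
```

Here `name` is present in every item, `uom` is missing in one item, and the empty input gives the rule nothing to check.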

[1]:

import arche
import pandas as pd  # used explicitly below as pd.read_csv
from arche import *
from arche.readers.items import Items

[2]:

items = Items.from_df(pd.read_csv("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_products_8.csv"))
target_items = Items.from_df(pd.read_csv("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_products_7.csv"))

[4]:

df = items.df.drop(columns=["_type"])
target_df = target_items.df

Accessing Graphs Data¶

The data behind the graphs is available in the stats attribute of a rule result. See the Result class for more details.

[11]:

arche.rules.coverage.check_fields_coverage(df).stats

Coverage¶

Fields coverage on input data¶

[12]:

help(arche.rules.coverage.check_fields_coverage)

[13]:

arche.rules.coverage.check_fields_coverage(df).show()
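
As a rough illustration of what field coverage means, the sketch below counts non-null values per column with plain pandas. `toy_df` is made-up data standing in for `df`, and the calculation is an assumption about the general idea, not the implementation of `check_fields_coverage`.

```python
import pandas as pd

# Toy data: a hypothetical stand-in for the notebook's df.
toy_df = pd.DataFrame(
    {
        "name": ["Bolt", "Nut", "Washer", "Screw"],
        "uom": ["EA", None, "EA", None],
        "price": [None, None, None, 1.5],
    }
)

# Number of non-null values per field (column).
counts = toy_df.count()
# Share of items (rows) that have a value for each field, in percent.
coverage_pct = (counts / len(toy_df) * 100).round(1)
# Fields with no missing values are fully covered.
fully_covered = coverage_pct[coverage_pct == 100].index.tolist()
```

A field with coverage below 100% has at least one item without a value for it.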

Anomalies¶

[14]:

help(arche.rules.coverage.anomalies)

[15]:

res = arche.rules.coverage.anomalies(target="381798/2/4", sample=["381798/2/8", "381798/2/7", "381798/2/6"])
res.show()
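
The call above works on job keys, so it cannot be reproduced without access to those jobs. The sketch below illustrates one common way to flag a deviation: compare the target's field coverage with the mean of the sample and flag fields that lie more than a chosen number of standard deviations away. `flag_deviations`, the `n_sigmas` threshold and all numbers are assumptions for illustration, not the library's method or data.

```python
from statistics import mean, stdev
from typing import Dict, List


def flag_deviations(
    target: Dict[str, float],
    sample: List[Dict[str, float]],
    n_sigmas: float = 2.0,
) -> Dict[str, float]:
    """Return fields whose target coverage is far from the sample mean.

    Assumption: "far" means more than `n_sigmas` sample standard deviations.
    """
    flagged = {}
    for field, value in target.items():
        history = [job[field] for job in sample if field in job]
        if len(history) < 2:
            # stdev needs at least two data points.
            continue
        deviation = abs(value - mean(history))
        if deviation > n_sigmas * stdev(history):
            flagged[field] = round(deviation, 3)
    return flagged


# Made-up coverage ratios per field for three sample jobs and one target job.
sample_jobs = [
    {"name": 1.00, "price": 0.90},
    {"name": 1.00, "price": 0.92},
    {"name": 1.00, "price": 0.91},
]
target_job = {"name": 1.00, "price": 0.40}
flagged_fields = flag_deviations(target_job, sample_jobs)
```

In this toy data only `price` is flagged, because its coverage in the target job drops well below what the sample jobs show.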

Categories¶

Category fields¶

[ ]:

help(arche.rules.category.get_categories)

[ ]:

arche.rules.category.get_categories(df, max_uniques=200).show()
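
One plausible reading of `max_uniques` is that a field counts as a category field when it holds at most that many distinct values. The pandas sketch below shows that idea on made-up data; the threshold logic is an assumption, not a description of how `get_categories` is implemented.

```python
import pandas as pd

# Toy data: a hypothetical stand-in for the notebook's df.
toy_df = pd.DataFrame(
    {
        "part_number": ["A1", "A2", "A3", "A4"],
        "uom": ["EA", "EA", "BOX", "EA"],
        "category": ["tools", "tools", "tools", "tools"],
    }
)

max_uniques = 3  # hypothetical threshold chosen for this toy data

# Count distinct values per field.
unique_counts = toy_df.nunique()
# Keep fields with a small number of distinct values.
category_fields = unique_counts[unique_counts <= max_uniques].index.tolist()
# Value frequencies for one of the detected fields.
uom_counts = toy_df["uom"].value_counts().to_dict()
```

`part_number` is unique per item and is excluded, while `uom` and `category` repeat a few values and are kept.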

Category coverage¶

In Arche.report_all(), these rules use fields marked with the category tag.

[ ]:

help(arche.rules.category.get_coverage_per_category)

[ ]:

arche.rules.category.get_coverage_per_category(df, ["category"]).show()

[ ]:

help(arche.rules.category.get_difference)

[ ]:

arche.rules.category.get_difference(df, target_df, ["category"]).show()
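
To give an intuition for a category difference between two datasets, the sketch below compares the share of items per category value in plain pandas. The two series are made-up data, and the percentage-point calculation is an assumption for illustration rather than the exact output of `get_difference`.

```python
import pandas as pd

# Made-up category values for a source and a target dataset.
source = pd.Series(["tools", "tools", "garden", "garden"], name="category")
target = pd.Series(["tools", "tools", "tools", "garden"], name="category")

# Share of items per category value in each dataset.
source_share = source.value_counts(normalize=True)
target_share = target.value_counts(normalize=True)

# Align on category values; a value missing on one side counts as a 0 share.
# The result is expressed in percentage points (source minus target).
difference = source_share.sub(target_share, fill_value=0).mul(100).round(1)
```

A positive value means the category is more frequent in the source than in the target, a negative value the opposite.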

Compare¶

Fields¶

[ ]:

help(arche.rules.compare.fields)

[ ]:

arche.rules.compare.fields(df, target_df, ["part_number", "name", "uom"]).show()
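
A simple way to picture a field comparison is as set operations on the values of one column in each dataframe. The sketch below uses made-up frames and hypothetical variable names; it shows the general idea only and does not reproduce the output format of `arche.rules.compare.fields`.

```python
import pandas as pd

# Made-up source and target data with one shared field.
source_toy = pd.DataFrame({"part_number": ["A1", "A2", "A3"]})
target_toy = pd.DataFrame({"part_number": ["A2", "A3", "A4"]})

# Distinct non-null values of the field in each dataframe.
source_values = set(source_toy["part_number"].dropna())
target_values = set(target_toy["part_number"].dropna())

# Values present in both, only in the target, and only in the source.
same_values = sorted(source_values & target_values)
only_in_target = sorted(target_values - source_values)
only_in_source = sorted(source_values - target_values)
```

Repeating this for each field name in the list gives a per-field view of which values match and which differ.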

Duplicates¶

Find duplicates by any combination of columns (fields)¶

This rule is executed when uniques is passed to Arche.report_all().

[5]:

help(arche.rules.duplicates.find_by)

[8]:

arche.rules.duplicates.find_by(df, ["uom", ["name", "part_number"]]).show(short=True)
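
The argument above mixes a single field with a combination of fields. The sketch below shows how such a list could be checked with `DataFrame.duplicated` in plain pandas. `toy_df` and the `duplicates` dictionary are made up for illustration, and the loop is an assumption about the general approach, not the implementation of `find_by`.

```python
import pandas as pd

# Toy data: a hypothetical stand-in for the notebook's df.
toy_df = pd.DataFrame(
    {
        "name": ["Bolt", "Bolt", "Nut", "Nut"],
        "part_number": ["A1", "A1", "B1", "B2"],
        "uom": ["EA", "BOX", "KG", "PK"],
    }
)

# Each entry is either one field or a combination of fields,
# mirroring the shape of the argument passed to find_by above.
uniques = ["uom", ["name", "part_number"]]

duplicates = {}
for entry in uniques:
    columns = [entry] if isinstance(entry, str) else list(entry)
    # keep=False marks every row of a duplicated group, not only the repeats.
    mask = toy_df.duplicated(subset=columns, keep=False)
    duplicates[", ".join(columns)] = toy_df[mask].index.tolist()
```

In this toy data `uom` has no duplicates, while the first two rows share the same `name` and `part_number` combination.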
