Rules

This notebook describes the rules used in the library, with examples. Some rules are executed during Arche.report_all(), and some are meant to be executed separately.

Some terms are used interchangeably here:

  • Rule - a test case for data. As a test case, it can be failed, passed or skipped. Some rules only output information, such as Category fields.

  • df - a dataframe which holds input data (from a job, collection or other source)

  • Scrapy Cloud item - a row in a df

  • Item fields - columns in a df
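To make the terms concrete, here is a minimal sketch of a rule with its three possible outcomes. The function check_no_missing and the sample data are hypothetical and not part of arche; they only illustrate how a rule relates to items (rows) and fields (columns).

```python
import pandas as pd


# Hypothetical mini rule, not part of arche: it only illustrates the
# three outcomes a rule can have.
def check_no_missing(df: pd.DataFrame, field: str) -> str:
    if field not in df.columns:
        return "skipped"  # the field is absent, so there is nothing to test
    if df[field].isna().any():
        return "failed"  # at least one item has no value for the field
    return "passed"


# Each row is an item, each column is an item field.
sample_df = pd.DataFrame({"name": ["bolt", "nut", None], "price": [1.5, 0.2, 3.0]})

outcomes = {
    field: check_no_missing(sample_df, field) for field in ["name", "price", "uom"]
}
print(outcomes)
```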

[1]:
import arche
import pandas as pd  # explicit import for the pd.read_csv calls below
from arche import *
from arche.readers.items import Items
[2]:
items = Items.from_df(pd.read_csv("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_products_8.csv"))
target_items = Items.from_df(pd.read_csv("https://raw.githubusercontent.com/scrapinghub/arche/master/docs/source/nbs/data/items_products_7.csv"))
[4]:
df = items.df.drop(columns=["_type"])
target_df = target_items.df

Accessing Graphs Data

The data behind the graphs is stored in the stats attribute of a rule result. See the Result class for more details.

[11]:
arche.rules.coverage.check_fields_coverage(df).stats

Coverage

Fields coverage on input data

[12]:
help(arche.rules.coverage.check_fields_coverage)
[13]:
arche.rules.coverage.check_fields_coverage(df).show()
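As a rough illustration of what field coverage means, the sketch below counts non-null values per field with plain pandas. It is not arche's implementation, and the sample data is invented.

```python
import pandas as pd

# Invented sample data: "uom" is never filled, "price" is filled half the time.
sample_df = pd.DataFrame(
    {
        "name": ["bolt", "nut", "washer", "screw"],
        "price": [1.5, None, 0.3, None],
        "uom": [None, None, None, None],
    }
)

# Number of non-null values per field, highest first.
coverage = sample_df.count().sort_values(ascending=False)
# Share of items that have each field filled, in percent.
coverage_pct = (coverage / len(sample_df) * 100).round(1)
# Fields without a single value.
empty_fields = coverage[coverage == 0].index.tolist()

print(coverage_pct)
print(empty_fields)
```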

Anomalies

[14]:
help(arche.rules.coverage.anomalies)
[15]:
res = arche.rules.coverage.anomalies(target="381798/2/4", sample=["381798/2/8", "381798/2/7", "381798/2/6"])
res.show()
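The idea of comparing a target job's field coverage with a set of sample jobs can be sketched with plain pandas. This is not arche's implementation: the coverage ratios are invented, and the threshold of two standard deviations is an assumption made for the example.

```python
import pandas as pd

fields = ["name", "price", "uom"]

# Invented coverage ratios (share of items with the field filled);
# rows are fields, columns are sample jobs.
sample_coverage = pd.DataFrame(
    {
        "job_a": [1.00, 0.90, 0.50],
        "job_b": [1.00, 0.92, 0.52],
        "job_c": [1.00, 0.88, 0.48],
    },
    index=fields,
)
target_coverage = pd.Series([1.00, 0.91, 0.10], index=fields)

mean = sample_coverage.mean(axis=1)
std = sample_coverage.std(axis=1)
deviation = (target_coverage - mean).abs()

# Assumed threshold: flag a field when the target deviates from the
# sample mean by more than two sample standard deviations.
anomalous_fields = deviation[deviation > 2 * std].index.tolist()
print(anomalous_fields)
```

Here only uom is flagged, because its coverage in the target is far below every sample job.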

Categories

Category fields

[ ]:
help(arche.rules.category.get_categories)
[ ]:
arche.rules.category.get_categories(df, max_uniques=200).show()
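A minimal sketch of how category fields could be detected: a field is treated as a category when it has few distinct values. Whether arche compares with "less than" or "less than or equal to" max_uniques is an assumption here, and the data is invented.

```python
import pandas as pd

products = pd.DataFrame(
    {
        "name": ["bolt M6", "bolt M8", "nut M6", "nut M8"],
        "category": ["bolts", "bolts", "nuts", "nuts"],
        "uom": ["each", "each", "each", "box"],
    }
)

max_uniques = 2
# Number of distinct non-null values per field.
unique_counts = products.nunique()
# Assumption: a field qualifies when it has at most max_uniques distinct values.
category_fields = unique_counts[unique_counts <= max_uniques].index.tolist()
print(category_fields)
```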

Category coverage

In Arche.report_all(), these rules are applied to fields marked with the category tag.

[ ]:
help(arche.rules.category.get_coverage_per_category)
[ ]:
arche.rules.category.get_coverage_per_category(df, ["category"]).show()
[ ]:
help(arche.rules.category.get_difference)
[ ]:
arche.rules.category.get_difference(df, target_df, ["category"]).show()
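The comparison of category coverage between two dataframes can be approximated with normalized value counts. This is a sketch with invented data, not the code behind get_difference.

```python
import pandas as pd

source = pd.DataFrame({"category": ["bolts", "bolts", "nuts", "washers"]})
target = pd.DataFrame({"category": ["bolts", "nuts", "nuts", "nuts"]})

# Share of items per category value in each dataframe.
source_share = source["category"].value_counts(normalize=True)
target_share = target["category"].value_counts(normalize=True)

# Align on the category value; a value absent from one side counts as 0.
difference = source_share.sub(target_share, fill_value=0).round(2)
print(difference)
```

A positive number means the value is more common in the source, a negative one that it is more common in the target.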

Compare

Fields

[ ]:
help(arche.rules.compare.fields)
[ ]:
arche.rules.compare.fields(df, target_df, ["part_number", "name", "uom"]).show()
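One way to picture a field-by-field comparison is to look for values that appear in only one of the two dataframes. The sketch below does that with Python sets; it is an illustration on invented data and may differ from what compare.fields reports.

```python
import pandas as pd

source = pd.DataFrame(
    {"part_number": ["A1", "A2", "A3"], "uom": ["each", "each", "box"]}
)
target = pd.DataFrame(
    {"part_number": ["A2", "A3", "A4"], "uom": ["each", "box", "box"]}
)

missing = {}
for field in ["part_number", "uom"]:
    # Distinct non-null values of the field on each side.
    source_values = set(source[field].dropna())
    target_values = set(target[field].dropna())
    missing[field] = {
        "only_in_source": sorted(source_values - target_values),
        "only_in_target": sorted(target_values - source_values),
    }
print(missing)
```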

Duplicates

Find duplicates by any combination of columns (fields)

This rule is executed when uniques is passed to Arche.report_all().

[5]:
help(arche.rules.duplicates.find_by)
[8]:
arche.rules.duplicates.find_by(df, ["uom", ["name", "part_number"]]).show(short=True)
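The argument above mixes a single field with a combination of fields. A plain pandas sketch of the same idea, using DataFrame.duplicated on invented data, might look like this. It is not arche's implementation.

```python
import pandas as pd

parts = pd.DataFrame(
    {
        "name": ["bolt", "bolt", "nut", "washer"],
        "part_number": ["A1", "A1", "A2", "A3"],
        "uom": ["each", "box", "pack", "each"],
    }
)

# Same shape as the argument above: a single field or a list of fields.
uniques = ["uom", ["name", "part_number"]]

duplicates = {}
for columns in uniques:
    subset = [columns] if isinstance(columns, str) else columns
    # keep=False marks every member of a duplicate group, not just the repeats.
    mask = parts.duplicated(subset=subset, keep=False)
    duplicates[", ".join(subset)] = parts.index[mask].tolist()
print(duplicates)
```

Items 0 and 3 share a uom, while items 0 and 1 share both name and part_number.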