`arche.rules.category`

Module Contents

arche.rules.category.get_difference(source_df: pd.DataFrame, target_df: pd.DataFrame, category_names: List[str], err_thr: float = 0.2, warn_thr: float = 0.1) → Result

Find and show differences in category coverage between two dataframes, including NaN values. Coverage means the value counts divided by the total size.

Parameters

source_df – the data you want to compare
target_df – the data you want to compare it with
category_names – list of columns whose values to compare
err_thr – error threshold
warn_thr – warning threshold

Returns

A Result instance with messages describing the significant differences as defined by the thresholds, a dataframe showing all normalized value counts as percentages, and a series containing the significant differences.
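
The coverage comparison described above can be illustrated with plain pandas. This is a minimal sketch, not the arche implementation: the helper names `coverage` and `coverage_difference` are hypothetical, and both the use of an absolute difference and the strict `>` threshold comparisons are assumptions.

```python
import pandas as pd


def coverage(series: pd.Series) -> pd.Series:
    """Share of each value in the column, with NaN counted as a value."""
    return series.value_counts(normalize=True, dropna=False)


def coverage_difference(source: pd.Series, target: pd.Series) -> pd.Series:
    """Absolute coverage difference per value (assumed metric).

    A value present on only one side is treated as having 0 coverage
    on the other side.
    """
    return coverage(source).sub(coverage(target), fill_value=0).abs()


source = pd.Series(["a", "a", "a", "b"])  # coverage: a=0.75, b=0.25
target = pd.Series(["a", "a", "b", "b"])  # coverage: a=0.50, b=0.50
diff = coverage_difference(source, target)

# Default thresholds taken from the get_difference signature.
err_thr, warn_thr = 0.2, 0.1
errors = diff[diff > err_thr]
warnings = diff[(diff > warn_thr) & (diff <= err_thr)]
```

Here both values differ by 0.25 in coverage, so under these assumptions both would fall above the error threshold and none would land in the warning band.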

arche.rules.category.get_coverage_per_category(df: pd.DataFrame, category_names: List[str]) → Result

Get value counts per column, excluding nan.

Parameters

df – the source data to assess
category_names – list of columns whose value counts to show

Returns

The number of categories per field and a value counts series for each field.
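
As a rough illustration of per-column value counts that exclude NaN, the same quantities can be computed directly with pandas. The sample data and the variable names `counts` and `n_categories` are made up for this sketch; it does not show how arche packs them into a Result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "color": ["red", "red", "blue", None],
        "size": ["S", "M", "M", "M"],
    }
)
category_names = ["color", "size"]

# One value counts series per requested column; dropna=True skips NaN.
counts = {name: df[name].value_counts(dropna=True) for name in category_names}

# Number of distinct non-NaN categories per column.
n_categories = {name: len(value_counts) for name, value_counts in counts.items()}
```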

arche.rules.category.get_categories(df: pd.DataFrame, max_uniques: int = 10) → Result

Find category columns. A category column is a column which holds a limited number of possible values, including NaN.

Parameters

df – data
max_uniques – filter which determines which columns to use. Only columns with the number of unique values less than or equal to max_uniques are category columns.

Returns

A result with stats containing value counts of categorical columns.
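
The `max_uniques` rule can be sketched as a simple filter on the number of unique values per column. The helper `category_columns` is hypothetical and only mirrors the rule as documented (NaN counts as a value); it is not the library's code.

```python
import pandas as pd


def category_columns(df: pd.DataFrame, max_uniques: int = 10) -> list:
    """Columns whose number of unique values (NaN included) is <= max_uniques."""
    return [
        column
        for column in df.columns
        if df[column].nunique(dropna=False) <= max_uniques
    ]


df = pd.DataFrame(
    {
        "id": range(20),  # 20 unique values, above the limit
        "status": ["new", "used", None, "new"] * 5,  # new, used, NaN
    }
)
cats = category_columns(df, max_uniques=10)
```

With this data only `status` passes the filter, since `id` holds 20 distinct values.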

arche.rules.category.find_likely_cats(df: pd.DataFrame, max_uniques: int, sample_size: int = 5000) → List[str]

Find columns which are probably categorical, including nested data. In effect, this filters out the columns which are certainly not categorical, based on a sample of the given sample_size. Useful for big datasets and nested data, since value_counts performance degrades significantly (100x-10000x) in such cases.

Parameters

df – the data to search
max_uniques – threshold on the number of unique values used to decide what is a categorical column
sample_size – the sample we look in. Defaults to 5000, since for bigger data (nested) values make value_counts really slow after this number.

Returns

List of potential categorical column names.
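
The idea of pre-filtering on a sample can be sketched as follows: if a sample alone already contains more than `max_uniques` distinct values, the full column cannot be categorical, so it is dropped cheaply. The helper `likely_categorical` is hypothetical, and converting values to strings to handle unhashable nested data is an assumption of this sketch, not necessarily what arche does.

```python
import pandas as pd


def likely_categorical(
    df: pd.DataFrame, max_uniques: int, sample_size: int = 5000
) -> list:
    """Columns that are not ruled out as categorical by a random sample."""
    # Never ask for more rows than the dataframe has.
    sample = df.sample(min(sample_size, len(df)), random_state=0)
    likely = []
    for column in sample.columns:
        # Assumption: stringify values so lists and dicts can be counted.
        n_unique = sample[column].astype(str).nunique()
        if n_unique <= max_uniques:
            likely.append(column)
    return likely


df = pd.DataFrame(
    {
        "id": range(100),  # every sampled value is distinct
        "tags": [["a"], ["b"]] * 50,  # nested data with two distinct values
    }
)
likely = likely_categorical(df, max_uniques=5, sample_size=20)
```

A column that survives this filter is only a candidate: the sample may miss rare values, so the full unique count still needs to be checked afterwards.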
