arche.rules.category
Module Contents
-
arche.rules.category.
get_difference
(source_df: pd.DataFrame, target_df: pd.DataFrame, category_names: List[str], err_thr: float = 0.2, warn_thr: float = 0.1) → Result
Find and show differences in category coverage between two datasets, including NaN values. Coverage means the value counts divided by the total size.
- Parameters
source_df – the data you want to compare
target_df – the data you want to compare it with
category_names – list of columns whose values are compared
err_thr – error threshold
warn_thr – warning threshold
- Returns
A result instance with messages describing the significant differences as defined by the thresholds, a dataframe showing all normalized value counts in percent, and a series containing the significant differences.
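The coverage comparison described above can be sketched with plain pandas. This is not arche's implementation: the helper names `coverage` and `coverage_difference` are hypothetical, and the sketch assumes the thresholds apply to the absolute difference in coverage, which this page does not state.

```python
import pandas as pd


def coverage(series: pd.Series) -> pd.Series:
    # Share of each value in the column; NaN is counted as its own value.
    return series.value_counts(dropna=False, normalize=True)


def coverage_difference(source: pd.Series, target: pd.Series) -> pd.Series:
    # A value present on only one side is treated as 0 coverage on the other.
    return coverage(source).sub(coverage(target), fill_value=0).abs()


source = pd.Series(["a", "a", "b", "b"])
target = pd.Series(["a", "a", "a", "b"])

diff = coverage_difference(source, target)

# Assumed interpretation of err_thr=0.2 and warn_thr=0.1 (the defaults above).
errors = diff[diff > 0.2]
warnings = diff[(diff > 0.1) & (diff <= 0.2)]
```

Here "a" covers 50% of `source` and 75% of `target`, so both values differ by 0.25 and land in `errors`.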
-
arche.rules.category.
get_coverage_per_category
(df: pd.DataFrame, category_names: List[str]) → Result
Get value counts per column, excluding NaN.
- Parameters
df – the source data to assess
category_names – list of columns whose value counts to show
- Returns
The number of categories per field and a value counts series for each field.
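As a rough illustration of what "value counts per column, excluding NaN" means, the following sketch uses pandas directly. The dataframe and the variable names `counts` and `n_categories` are made up for the example; how arche packs these into a Result is not shown here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "color": ["red", "red", "blue", None],
        "size": ["S", "M", "M", "M"],
    }
)
category_names = ["color", "size"]

# One value counts series per requested column; dropna=True skips NaN/None.
counts = {name: df[name].value_counts(dropna=True) for name in category_names}

# Number of distinct non-NaN categories per field.
n_categories = {name: len(vc) for name, vc in counts.items()}
```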
-
arche.rules.category.
get_categories
(df: pd.DataFrame, max_uniques: int = 10) → Result
Find category columns. A category column is a column that holds a limited number of possible values, including NaN.
- Parameters
df – data
max_uniques – filter that determines which columns to use. Only columns with the number of unique values less than or equal to max_uniques are category columns.
- Returns
A result with stats containing value counts of categorical columns.
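The `max_uniques` rule can be sketched as a simple filter on the number of unique values per column. The function `category_columns` is a hypothetical helper, not part of arche, and counting NaN as a value is an assumption based on the "including NaN" wording above.

```python
import pandas as pd


def category_columns(df: pd.DataFrame, max_uniques: int = 10) -> list:
    # Keep columns whose number of unique values (NaN included) fits the limit.
    return [
        col for col in df.columns
        if df[col].nunique(dropna=False) <= max_uniques
    ]


df = pd.DataFrame(
    {
        "id": range(20),                 # 20 unique values, not a category
        "status": ["new", "used"] * 10,  # 2 unique values, a category
    }
)

cats = category_columns(df, max_uniques=10)
```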
-
arche.rules.category.
find_likely_cats
(df: pd.DataFrame, max_uniques: int, sample_size: int = 5000) → List[str]
Find columns which are probably categorical, including nested data. In practice, this filters out columns which are certainly not categorical, judging by a sample of sample_size rows. Useful with big datasets and nested data, since value_counts performance degrades significantly (100x-10000x) in such cases.
- Parameters
df – the data to search
max_uniques – the limit used to decide what counts as a categorical column
sample_size – size of the sample to look in. Defaults to 5000, since with bigger data nested values make value_counts really slow beyond this number.
- Returns
List of potential categorical column names.
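The sampling idea can be sketched as follows: if a sample of a column already holds more than `max_uniques` distinct values, the full column certainly does too, so it can be ruled out cheaply. The helper `likely_categorical` is hypothetical, and converting values to strings so that nested lists and dicts become hashable is an assumption of this sketch, not necessarily how arche handles nested data.

```python
import pandas as pd


def likely_categorical(df: pd.DataFrame, max_uniques: int, sample_size: int = 5000) -> list:
    # Never ask for more rows than the dataframe has.
    sample = df.sample(n=min(sample_size, len(df)), random_state=0)
    likely = []
    for col in sample.columns:
        # Stringify so nested values (lists, dicts) can be counted.
        if sample[col].astype(str).nunique() <= max_uniques:
            likely.append(col)
    return likely


df = pd.DataFrame(
    {
        "id": range(100),
        "tags": [["a"], ["b"]] * 50,  # nested values
        "kind": ["x", "y"] * 50,
    }
)

candidates = likely_categorical(df, max_uniques=5, sample_size=50)
```

Columns that pass this filter are only candidates. A column with more than `max_uniques` values in total can still look categorical in a small sample, so a full check on the remaining columns is still needed.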