arche.rules.category

Module Contents

arche.rules.category.get_difference(source_df: pd.DataFrame, target_df: pd.DataFrame, category_names: List[str], err_thr: float = 0.2, warn_thr: float = 0.1) → Result

Find and show differences in category coverage between two dataframes, including NaN values. Coverage is the value counts divided by the total size.

Parameters
  • source_df – the data you want to compare

  • target_df – the data you want to compare it with

  • category_names – list of columns whose values to compare

  • err_thr – error threshold

  • warn_thr – warning threshold

Returns

A result instance with messages describing the significant differences as defined by the thresholds, a dataframe showing all normalized value counts as percentages, and a series containing the significant differences.
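The coverage comparison described above can be sketched in plain pandas. This is not arche's implementation: the helper names coverage and coverage_difference are hypothetical, and measuring the difference as an absolute gap in shares with strict threshold comparisons is an assumption made for illustration.

```python
import pandas as pd


def coverage(series: pd.Series) -> pd.Series:
    """Share of each value in the column, NaN included."""
    return series.value_counts(normalize=True, dropna=False)


def coverage_difference(source_df, target_df, column, err_thr=0.2, warn_thr=0.1):
    # Hypothetical helper: put both coverage series side by side;
    # a value missing on one side counts as zero coverage.
    cov = pd.concat(
        [coverage(source_df[column]), coverage(target_df[column])],
        axis=1,
        keys=["source", "target"],
    ).fillna(0)
    # Assumption: the difference is the absolute gap between the two shares.
    diff = (cov["source"] - cov["target"]).abs()
    errors = diff[diff > err_thr]
    warns = diff[(diff > warn_thr) & (diff <= err_thr)]
    return cov, errors, warns


source = pd.DataFrame({"color": ["red", "red", "red", "blue"]})
target = pd.DataFrame({"color": ["red", "blue", "blue", "blue"]})
cov, errors, warns = coverage_difference(source, target, "color")
print(cov)
print(errors)
```

Here "red" covers 75% of the source but only 25% of the target, so under the default thresholds both values land in errors and warns stays empty.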

arche.rules.category.get_coverage_per_category(df: pd.DataFrame, category_names: List[str]) → Result

Get value counts per column, excluding NaN.

Parameters
  • df – the source data to assess

  • category_names – list of columns whose value counts to show

Returns

The number of categories per field and a value counts series for each field.
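As a rough illustration of what "value counts per column, excluding NaN" means, the following uses only pandas. It does not call arche, and the column names and data are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "size": ["S", "M", "M", None],
        "color": ["red", "red", None, None],
    }
)

# Value counts per column; dropna=True leaves missing values out.
counts = {col: df[col].value_counts(dropna=True) for col in ["size", "color"]}

# Number of distinct categories found in each field.
categories_per_field = {col: len(vc) for col, vc in counts.items()}

print(categories_per_field)
print(counts["size"])
```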

arche.rules.category.get_categories(df: pd.DataFrame, max_uniques: int = 10) → Result

Find category columns. A category column is a column that holds a limited number of possible values, including NaN.

Parameters
  • df – data

  • max_uniques – filter which determines which columns to use. Only columns with the number of unique values less than or equal to max_uniques are category columns.

Returns

A result with stats containing value counts of categorical columns.
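The max_uniques rule can be approximated with pandas alone. The sketch below is an assumption about the selection logic, not arche's code, and category_columns is a hypothetical name; NaN is counted as a value, as the description states.

```python
import pandas as pd


def category_columns(df: pd.DataFrame, max_uniques: int = 10) -> list:
    # Hypothetical helper: keep columns whose number of unique values,
    # NaN counted as one of them, does not exceed max_uniques.
    return [
        col for col in df.columns
        if df[col].nunique(dropna=False) <= max_uniques
    ]


df = pd.DataFrame(
    {
        "id": range(20),
        "status": ["new", "used", None, "new"] * 5,
    }
)

default_cats = category_columns(df)
strict_cats = category_columns(df, max_uniques=2)
print(default_cats, strict_cats)
```

With the default limit only status qualifies, since id has 20 unique values. With max_uniques=2 nothing qualifies, because status holds three values once NaN is counted.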

arche.rules.category.find_likely_cats(df: pd.DataFrame, max_uniques: int, sample_size: int = 5000) → List[str]

Find columns that are probably categorical, including columns with nested data. In practice, columns that are certainly not categorical are filtered out based on a sample of sample_size rows. Useful for big datasets and nested data, since value_counts performance degrades significantly (100x-10000x) in such cases.

Parameters
  • df – the data to search

  • max_uniques – the threshold used to decide what is a categorical column

  • sample_size – size of the sample to look in. Defaults to 5000, since for bigger data (nested) values make value_counts really slow after this number.

Returns

List of potential categorical column names.
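The sampling idea can be sketched as follows. This is an illustration under stated assumptions, not arche's implementation: likely_cats is a hypothetical name, and converting values to strings so that nested values (lists, dicts) can be counted is a choice made here for simplicity. A sample can only rule columns out: if the sample already holds more than max_uniques distinct values, the full column does too.

```python
import pandas as pd


def likely_cats(df: pd.DataFrame, max_uniques: int, sample_size: int = 5000) -> list:
    # Hypothetical helper: inspect a sample instead of the whole frame.
    sample = df.sample(n=min(sample_size, len(df)), random_state=0)
    likely = []
    for col in sample.columns:
        # Assumption: stringify values so unhashable nested data can be counted.
        if sample[col].astype(str).nunique() <= max_uniques:
            likely.append(col)
    return likely


df = pd.DataFrame(
    {
        "id": range(1000),
        "tags": [["a", "b"], ["c"]] * 500,
        "status": ["new", "used"] * 500,
    }
)

cols = likely_cats(df, max_uniques=10, sample_size=100)
print(cols)
```

The id column is dropped because the 100-row sample already contains 100 distinct values, while tags and status remain as candidates for a full value count.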