:mod:`arche.rules.category`
===========================

.. py:module:: arche.rules.category


Module Contents
---------------

.. function:: get_difference(source_df: pd.DataFrame, target_df: pd.DataFrame, category_names: List[str], err_thr: float = 0.2, warn_thr: float = 0.1) -> Result

   Find and show differences between category coverage, including `nan` values.
   Coverage means value counts divided by the total size.

   :param source_df: the data you want to compare
   :param target_df: the data you want to compare with
   :param category_names: list of columns whose values to compare
   :param err_thr: error threshold
   :param warn_thr: warning threshold

   :returns: A result instance with messages containing the significant differences defined by the
             thresholds, a dataframe showing all normalized value counts in percent, and a series
             containing the significant differences.


.. function:: get_coverage_per_category(df: pd.DataFrame, category_names: List[str]) -> Result

   Get value counts per column, excluding `nan`.

   :param df: the source data to assess
   :param category_names: list of columns whose value counts to show

   :returns: Number of categories per field, value counts series for each field.


.. function:: get_categories(df: pd.DataFrame, max_uniques: int = 10) -> Result

   Find category columns. A category column is a column which holds a limited number
   of possible values, including `nan`.

   :param df: data
   :param max_uniques: filter which determines which columns to use. Only columns with
                       a number of unique values less than or equal to `max_uniques`
                       are category columns.

   :returns: A result with stats containing value counts of categorical columns.


.. function:: find_likely_cats(df: pd.DataFrame, max_uniques: int, sample_size: int = 5000) -> List[str]

   Find columns which are probably categorical, including nested data. In practice, columns which
   are certainly not categorical are filtered out by checking a sample of `sample_size` rows.
   Useful with big datasets and nested data, since `value_counts` performance degrades
   significantly (100x-10000x) in such cases.

   :param df: the data to search
   :param max_uniques: the maximum number of unique values a categorical column may hold
   :param sample_size: size of the sample to look in. Defaults to 5000, since with bigger data
                       nested values make `value_counts` really slow beyond this number.

   :returns: List of potential categorical column names.
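
The two ideas this module relies on, coverage and the sample-based prefilter, can be sketched in plain pandas. This is a minimal illustration and not the implementation used by `arche`: the names `coverage` and `likely_cats_sketch`, the fixed `random_state`, and the `astype(str)` conversion for nested values are all assumptions made for the example.

.. code-block:: python

   from typing import List

   import pandas as pd


   def coverage(series: pd.Series) -> pd.Series:
       """Value counts divided by the total size, with missing values counted too."""
       return series.value_counts(normalize=True, dropna=False)


   def likely_cats_sketch(
       df: pd.DataFrame, max_uniques: int, sample_size: int = 5000
   ) -> List[str]:
       """Hypothetical prefilter: keep columns whose sample has few unique values.

       If a sample already holds more than `max_uniques` distinct values,
       the full column must too, so it is certainly not categorical.
       Columns that pass are only *likely* categorical.
       """
       # Assumption: a fixed seed keeps the illustration reproducible.
       sample = df.sample(min(sample_size, len(df)), random_state=0)
       likely = []
       for column in sample.columns:
           # Assumption: stringify so unhashable nested values (lists, dicts)
           # can be counted; distinct strings imply distinct values.
           if sample[column].astype(str).nunique(dropna=False) <= max_uniques:
               likely.append(column)
       return likely


   df = pd.DataFrame(
       {
           "color": ["red", "blue", None, "red"] * 50,
           "id": range(200),
           "tags": [["a"], ["b"]] * 100,
       }
   )
   color_coverage = coverage(df["color"])
   candidates = likely_cats_sketch(df, max_uniques=10)

Here `id` is dropped because its sample already exceeds `max_uniques`, while `color` and the nested `tags` column remain as candidates for a full value count.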