arche.rules.coverage

Module Contents

arche.rules.coverage.check_fields_coverage(df: pd.DataFrame) → Result

Get field coverage from df. Coverage is the percentage of real values (excluding NaN) per column.

Parameters

df – the data to compute coverage for

Returns

A Result with coverage for all columns in the provided df. A column that contains only NaN values is reported as an error.
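The coverage calculation described above can be illustrated with plain pandas. This is a minimal sketch, not arche's implementation: the helper name `fields_coverage` and the sample data are hypothetical, and it returns a Series instead of a Result.

```python
import numpy as np
import pandas as pd


def fields_coverage(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: percentage of non-NaN values per column."""
    if df.empty:
        # No rows, so there is nothing to compute coverage for
        return pd.Series(dtype=float)
    # DataFrame.count() counts non-NA cells in each column
    return df.count() / len(df) * 100


# Illustrative data: "sku" holds only NaN values
df = pd.DataFrame(
    {
        "name": ["a", "b", None, "d"],
        "price": [1.0, np.nan, np.nan, 4.0],
        "sku": [np.nan] * 4,
    }
)
coverage = fields_coverage(df)
# Columns with zero coverage are the ones the rule would flag as errors
empty_columns = coverage[coverage == 0].index.tolist()
```

Here `name` has 75% coverage, `price` has 50%, and `sku` ends up in `empty_columns`.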

arche.rules.coverage.get_difference(source_job: Job, target_job: Job, err_thr: float = 0.1, warn_thr: float = 0.05) → Result

Get the difference between the field coverage of two jobs. Coverage is each field's count divided by the job size.

Parameters
  • source_job – the base job; the difference is calculated from it

  • target_job – a job to compare

  • err_thr – a threshold for errors

  • warn_thr – a threshold for warnings

Returns

A Result instance that reports large differences, with stats containing the field count coverage and the differences
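The threshold logic can be sketched in plain Python. This is only an illustration under stated assumptions: the helpers `coverage` and `classify_differences` are hypothetical, the difference is taken as source minus target, and a field is flagged when the absolute difference is strictly greater than the threshold. The real function works on Job objects and returns a Result.

```python
from typing import Dict, Tuple


def coverage(field_counts: Dict[str, int], job_size: int) -> Dict[str, float]:
    """Hypothetical helper: field counts divided by the job size."""
    return {field: count / job_size for field, count in field_counts.items()}


def classify_differences(
    source_cov: Dict[str, float],
    target_cov: Dict[str, float],
    err_thr: float = 0.1,
    warn_thr: float = 0.05,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split coverage differences into errors and warnings (assumed semantics)."""
    errors, warnings = {}, {}
    for field in sorted(set(source_cov) | set(target_cov)):
        # Assumption: a field absent from one job has zero coverage there
        diff = source_cov.get(field, 0.0) - target_cov.get(field, 0.0)
        if abs(diff) > err_thr:
            errors[field] = diff
        elif abs(diff) > warn_thr:
            warnings[field] = diff
    return errors, warnings


# Illustrative counts for a 100-item source job and a 200-item target job
source_cov = coverage({"name": 100, "price": 90, "sku": 50}, job_size=100)
target_cov = coverage({"name": 200, "price": 166, "sku": 40}, job_size=200)
errors, warnings = classify_differences(source_cov, target_cov)
```

With the default thresholds, `sku` (difference of about 0.3) lands in `errors` and `price` (about 0.07) in `warnings`, while `name` is not reported.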

arche.rules.coverage.compare_scraped_fields(source_df: pd.DataFrame, target_df: pd.DataFrame) → Result

Find new or missing columns between source_df and target_df.
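A column comparison of this kind can be sketched with `pandas.Index.difference`. The helper name `column_differences` is hypothetical, and the direction is an assumption: "new" means present only in source_df, "missing" means present only in target_df.

```python
from typing import List, Tuple

import pandas as pd


def column_differences(
    source_df: pd.DataFrame, target_df: pd.DataFrame
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: return (new, missing) column names."""
    # Assumption: columns only in source are "new"
    new = source_df.columns.difference(target_df.columns).tolist()
    # Assumption: columns only in target are "missing" from source
    missing = target_df.columns.difference(source_df.columns).tolist()
    return new, missing


source_df = pd.DataFrame({"name": ["a"], "price": [1.0], "rating": [5]})
target_df = pd.DataFrame({"name": ["b"], "price": [2.0], "sku": ["x1"]})
new, missing = column_differences(source_df, target_df)
```

In this example `rating` is reported as new and `sku` as missing.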

arche.rules.coverage.anomalies(target: str, sample: List[str]) → Result

Find fields with significant deviation. A deviation is significant when dev > 2 * std().

Parameters
  • target – where to look for anomalies

  • sample – a list of job keys to infer metadata from

Returns

A Result with a dataframe of significant deviations
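The dev > 2 * std() criterion can be sketched with the standard library. Several details here are assumptions, not arche's confirmed behavior: the helper `significant_deviations` is hypothetical, it takes precomputed coverage dictionaries instead of job keys, the deviation is measured from the sample mean, and the sample standard deviation is used.

```python
from statistics import mean, stdev
from typing import Dict, List


def significant_deviations(
    target_cov: Dict[str, float], sample_covs: List[Dict[str, float]]
) -> Dict[str, float]:
    """Hypothetical helper: fields where |target - sample mean| > 2 * std."""
    deviations = {}
    for field, target_value in target_cov.items():
        # Assumption: a field absent from a sample job has zero coverage there
        values = [cov.get(field, 0.0) for cov in sample_covs]
        dev = abs(target_value - mean(values))
        # stdev() needs at least two sample jobs
        if dev > 2 * stdev(values):
            deviations[field] = dev
    return deviations


# Illustrative coverage for three sample jobs and one target job
sample_covs = [
    {"name": 1.0, "price": 0.90, "sku": 0.5},
    {"name": 1.0, "price": 0.92, "sku": 0.6},
    {"name": 1.0, "price": 0.94, "sku": 0.7},
]
target_cov = {"name": 1.0, "price": 0.5, "sku": 0.62}
deviations = significant_deviations(target_cov, sample_covs)
```

Only `price` is flagged: its coverage of 0.5 is about 0.42 away from the sample mean of 0.92, far more than twice the sample standard deviation of 0.02.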