whylogs.features.autosegmentation

Module Contents

Functions

_entropy(series: pandas.Series, normalized: bool = True) → numpy.float64

Entropy calculation. If normalized, the log of the cardinality (number of distinct values) is used to normalize the result.

_weighted_entropy(df: pandas.DataFrame, split_columns: List[Optional[str]], target_column_name: str, normalized: bool = True)

Weighted entropy calculation for the target column over the given split columns. If normalized, use log cardinality.

_information_gain_ratio(df: pandas.DataFrame, prev_split_columns: List[Optional[str]], column_name: str, target_column_name: str, normalized: bool = True)

Information gain ratio calculation for splitting on column_name, given the previous split columns. If normalized, use log cardinality.

_find_best_split(df: pandas.DataFrame, prev_split_columns: List[str], valid_column_names: List[str], target_column_name: str)

_estimate_segments(df: pandas.DataFrame, target_field: str = None, max_segments: int = 30) → Optional[Union[List[Dict], List[str]]]

Estimates the most important features and values on which to segment data profiling using entropy-based methods.

whylogs.features.autosegmentation._entropy(series: pandas.Series, normalized: bool = True) → numpy.float64

Entropy calculation. If normalized, the log of the cardinality (number of distinct values) is used to normalize the result.
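The normalization mentioned above can be illustrated with a minimal sketch. It is not the whylogs implementation: it uses only the Python standard library instead of pandas, the function name `entropy` is hypothetical, and it assumes that "use log cardinality" means dividing the Shannon entropy by the log of the number of distinct values, so the result falls between 0 and 1.

```python
import math
from collections import Counter


def entropy(values, normalized=True):
    """Shannon entropy (natural log) of a sequence of hashable values.

    Assumption: normalization divides by log(number of distinct values).
    """
    counts = Counter(values)
    total = sum(counts.values())
    # A constant (or empty) sequence carries no information.
    if len(counts) <= 1:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    if normalized:
        h /= math.log(len(counts))
    return h


balanced = entropy(["a", "b", "a", "b"])        # close to 1.0: two equally likely values
skewed = entropy(["a", "a", "a", "b"])          # strictly between 0 and 1
constant = entropy(["a", "a", "a"])             # 0.0: only one distinct value
```

Normalizing this way makes columns with different numbers of distinct values comparable, since an unnormalized entropy grows with cardinality.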

whylogs.features.autosegmentation._weighted_entropy(df: pandas.DataFrame, split_columns: List[Optional[str]], target_column_name: str, normalized: bool = True)

Weighted entropy calculation for the target column over the given split columns. If normalized, use log cardinality.
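As a rough illustration of what a weighted entropy over split columns usually means, the sketch below groups rows by the values of the split columns and averages the target entropy of each group, weighted by group size. This is an assumption based on the function name and parameters, not the whylogs source. Rows are plain dicts rather than a DataFrame, the names `entropy` and `weighted_entropy` are hypothetical, entropies are left unnormalized for brevity, and `None` entries in `split_columns` are simply skipped (also an assumption).

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Unnormalized Shannon entropy (natural log) of a sequence."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def weighted_entropy(rows, split_columns, target_column):
    """Size-weighted average of the target entropy within each split group."""
    columns = [c for c in split_columns if c is not None]  # assumption: ignore None
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in columns)
        groups[key].append(row[target_column])
    n = len(rows)
    return sum(len(targets) / n * entropy(targets) for targets in groups.values())


rows = [
    {"region": "eu", "label": "yes"},
    {"region": "eu", "label": "yes"},
    {"region": "us", "label": "no"},
    {"region": "us", "label": "no"},
]
pure = weighted_entropy(rows, ["region"], "label")   # region fully determines label
unsplit = weighted_entropy(rows, [], "label")        # no split: plain entropy of label
```

A lower weighted entropy means the split columns leave less uncertainty about the target.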

whylogs.features.autosegmentation._information_gain_ratio(df: pandas.DataFrame, prev_split_columns: List[Optional[str]], column_name: str, target_column_name: str, normalized: bool = True)

Information gain ratio calculation for splitting on column_name, given the previous split columns. If normalized, use log cardinality.
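For orientation, here is a minimal sketch of the textbook (C4.5-style) information gain ratio: the reduction in target entropy obtained by adding one more split column, divided by the entropy of that column. Whether whylogs computes exactly this quantity is not confirmed by this page. The helper names are hypothetical, rows are plain dicts rather than a DataFrame, and entropies are unnormalized.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Unnormalized Shannon entropy (natural log) of a sequence."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def conditional_entropy(rows, split_columns, target_column):
    """Target entropy averaged over the groups induced by split_columns."""
    columns = [c for c in split_columns if c is not None]  # assumption: ignore None
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in columns)].append(row[target_column])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def information_gain_ratio(rows, prev_split_columns, column_name, target_column):
    """Gain from adding column_name to the split, scaled by the column's own entropy."""
    before = conditional_entropy(rows, prev_split_columns, target_column)
    after = conditional_entropy(rows, prev_split_columns + [column_name], target_column)
    split_info = entropy([row[column_name] for row in rows])
    if split_info == 0:
        return 0.0  # a constant column cannot split the data
    return (before - after) / split_info


rows = [
    {"region": "eu", "noise": "a", "label": "yes"},
    {"region": "eu", "noise": "b", "label": "yes"},
    {"region": "us", "noise": "a", "label": "no"},
    {"region": "us", "noise": "b", "label": "no"},
]
informative = information_gain_ratio(rows, [], "region", "label")
uninformative = information_gain_ratio(rows, [], "noise", "label")
```

Dividing by the column's own entropy penalizes high-cardinality columns, which would otherwise look attractive simply because they fragment the data.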

whylogs.features.autosegmentation._find_best_split(df: pandas.DataFrame, prev_split_columns: List[str], valid_column_names: List[str], target_column_name: str)

whylogs.features.autosegmentation._estimate_segments(df: pandas.DataFrame, target_field: str = None, max_segments: int = 30) → Optional[Union[List[Dict], List[str]]]

Estimates the most important features and values on which to segment data profiling using entropy-based methods.

If no target column is provided, the column with the maximum entropy is used as the target instead.

Parameters
  • df – the dataframe of data to profile

  • target_field – target field (optional)

  • max_segments – upper threshold for total combinations of segments, default 30

Returns
  a list of segmentation feature names
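Two ideas from the description above can be sketched in isolation: falling back to the maximum-entropy column when no target is given, and counting the combinations of segment values that `max_segments` is meant to bound. This is an illustrative sketch only. The helpers `pick_target` and `count_segments` are hypothetical and not part of whylogs, rows are plain dicts rather than a DataFrame, and unnormalized entropy is assumed for the fallback.

```python
import math
from collections import Counter


def entropy(values):
    """Unnormalized Shannon entropy (natural log) of a sequence."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())


def pick_target(rows, target_field=None):
    """Return target_field if given, otherwise the column with the highest entropy."""
    if target_field is not None:
        return target_field
    columns = list(rows[0].keys())
    return max(columns, key=lambda c: entropy([row[c] for row in rows]))


def count_segments(rows, split_columns):
    """Number of distinct value combinations across the chosen split columns."""
    return len({tuple(row[c] for c in split_columns) for row in rows})


rows = [
    {"region": "eu", "plan": "free", "device": "ios"},
    {"region": "eu", "plan": "paid", "device": "android"},
    {"region": "us", "plan": "free", "device": "web"},
    {"region": "us", "plan": "paid", "device": "ios"},
]
target = pick_target(rows)                       # "device": three distinct values, highest entropy
segments = count_segments(rows, ["region", "plan"])
within_budget = segments <= 30                   # 30 mirrors the documented default of max_segments
```

In a greedy search, a check like `within_budget` would typically decide whether another split column can still be added without exceeding the segment limit.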