whylogs.core.summaryconverters

Library module defining functions for generating summaries

Module Contents

Functions

from_sketch(sketch: datasketches.update_theta_sketch, num_std_devs: float = 1)

Generate a protobuf summary message from a datasketches theta sketch

from_string_sketch(sketch: datasketches.frequent_strings_sketch)

Generate a protobuf summary message from a string sketch

quantiles_from_sketch(sketch: datasketches.kll_floats_sketch, quantiles=None)

Calculate quantiles from a data sketch

single_quantile_from_sketch(sketch: datasketches.kll_floats_sketch, quantile: float)

Calculate the specified quantile from a data sketch

_calculate_bins(end: float, start: float, n: int, avg_per_bucket: float, max_buckets: int)

histogram_from_sketch(sketch: datasketches.kll_floats_sketch, max_buckets: int = None, avg_per_bucket: int = None)

Generate a summary of a kll_floats_sketch, including a histogram

entropy_from_column_summary(summary: whylogs.proto.ColumnSummary, histogram: datasketches.kll_floats_sketch)

Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary

ks_test_compute_p_value(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)

Compute the Kolmogorov-Smirnov test p-value of two continuous distributions.

compute_kl_divergence(target_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage], reference_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage])

Calculates the KL divergence between a target feature and a reference feature.

_compute_kl_divergence_continuous_distributions(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)

Calculates the estimated KL divergence for two continuous distributions.

_compute_kl_divergence_discrete_distributions(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)

Calculates the estimated KL divergence for two discrete distributions.

compute_chi_squared_test_p_value(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)

Calculates the Chi-Squared test p-value for two discrete distributions.

Attributes

MAX_HIST_BUCKETS

HIST_AVG_NUMBER_PER_BUCKET

QUANTILES

logger

whylogs.core.summaryconverters.MAX_HIST_BUCKETS = 30
whylogs.core.summaryconverters.HIST_AVG_NUMBER_PER_BUCKET = 4.0
whylogs.core.summaryconverters.QUANTILES = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]
whylogs.core.summaryconverters.logger
whylogs.core.summaryconverters.from_sketch(sketch: datasketches.update_theta_sketch, num_std_devs: float = 1)

Generate a protobuf summary message from a datasketches theta sketch

Parameters
  • sketch – Theta sketch to summarize

  • num_std_devs – Number of standard deviations for calculating bounds

Returns

summary

Return type

UniqueCountSummary

whylogs.core.summaryconverters.from_string_sketch(sketch: datasketches.frequent_strings_sketch)

Generate a protobuf summary message from a string sketch

Parameters

sketch – Frequent strings sketch

Returns

summary

Return type

FrequentStringsSummary

whylogs.core.summaryconverters.quantiles_from_sketch(sketch: datasketches.kll_floats_sketch, quantiles=None)

Calculate quantiles from a data sketch

Parameters
  • sketch (kll_floats_sketch) – Data sketch

  • quantiles (list-like) – Override the default quantiles. Should be a list of values from 0 to 1 inclusive.

whylogs.core.summaryconverters.single_quantile_from_sketch(sketch: datasketches.kll_floats_sketch, quantile: float)

Calculate the specified quantile from a data sketch

Parameters
  • sketch (kll_floats_sketch) – Data sketch

  • quantile (float) – The single quantile to calculate, used in place of the default quantiles. Should be a value from 0 to 1 inclusive.

Return type

Anonymous object with one field equal to the quantile value

whylogs.core.summaryconverters._calculate_bins(end: float, start: float, n: int, avg_per_bucket: float, max_buckets: int)
whylogs.core.summaryconverters.histogram_from_sketch(sketch: datasketches.kll_floats_sketch, max_buckets: int = None, avg_per_bucket: int = None)

Generate a summary of a kll_floats_sketch, including a histogram

Parameters
  • sketch (kll_floats_sketch) – Data sketch

  • max_buckets (int) – Override the default maximum number of buckets

  • avg_per_bucket (int) – Override the default target number of items per bucket.

Returns

histogram – Protobuf histogram message

Return type

HistogramSummary
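How the two overrides might interact can be sketched as follows. This is an illustration only: the helper name sketch_bin_edges and the rule min(ceil(n / avg_per_bucket), max_buckets) are assumptions and are not taken from the whylogs source. The defaults mirror MAX_HIST_BUCKETS and HIST_AVG_NUMBER_PER_BUCKET listed above.

```python
import math

# Defaults copied from the module attributes documented above.
MAX_HIST_BUCKETS = 30
HIST_AVG_NUMBER_PER_BUCKET = 4.0


def sketch_bin_edges(start, end, n, avg_per_bucket=None, max_buckets=None):
    """Hypothetical helper: equal-width bin edges for n items in [start, end]."""
    if avg_per_bucket is None:
        avg_per_bucket = HIST_AVG_NUMBER_PER_BUCKET
    if max_buckets is None:
        max_buckets = MAX_HIST_BUCKETS
    # Assumed rule: aim for avg_per_bucket items per bin, capped at max_buckets.
    n_buckets = max(1, min(math.ceil(n / avg_per_bucket), max_buckets))
    width = (end - start) / n_buckets
    return [start + i * width for i in range(n_buckets + 1)]


edges_small = sketch_bin_edges(0.0, 10.0, n=100)      # 100 / 4.0 gives 25 bins
edges_large = sketch_bin_edges(0.0, 10.0, n=10_000)   # capped at 30 bins
print(len(edges_small) - 1, len(edges_large) - 1)
```

Under these assumptions, small sketches get fewer, wider bins, while large sketches stop at the bucket cap.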

whylogs.core.summaryconverters.entropy_from_column_summary(summary: whylogs.proto.ColumnSummary, histogram: datasketches.kll_floats_sketch)

Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary. Can be used for both continuous and discrete types of data.

Parameters
  • summary (ColumnSummary) – Protobuf summary message

  • histogram (datasketches.kll_floats_sketch) – Data sketch for quantiles

Returns

entropy – Estimated entropy value, np.nan if the inferred data type of the column is not categorical or numeric

Return type

float
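The quantity being estimated can be illustrated with a plain Shannon entropy over counts, where the counts stand in for either frequent-item counts (discrete data) or histogram bin counts (continuous data). This is a minimal sketch and not the library's code: the function name shannon_entropy, the natural-log base, and the NaN result for empty input are all assumptions.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in nats) of a list of non-negative counts."""
    total = sum(counts)
    if total <= 0:
        return float("nan")  # assumed convention when there is nothing to measure
    entropy = 0.0
    for c in counts:
        if c > 0:  # empty bins contribute nothing
            p = c / total
            entropy -= p * math.log(p)
    return entropy


uniform = shannon_entropy([25, 25, 25, 25])  # four equally likely bins
skewed = shannon_entropy([97, 1, 1, 1])      # mass concentrated in one bin
print(uniform, skewed)
```

A uniform distribution over k bins gives the maximum value, log(k); concentrating the mass in fewer bins lowers the entropy.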

whylogs.core.summaryconverters.ks_test_compute_p_value(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)

Compute the Kolmogorov-Smirnov test p-value of two continuous distributions. Uses the quantile values and the corresponding CDFs to calculate the approximate KS statistic. Only applicable to continuous distributions. The null hypothesis expects the samples to come from the same distribution.

Parameters
  • target_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the target distribution’s values

  • reference_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the reference (expected) distribution’s values. Can be generated from a theoretical distribution, or another sample for the same feature.

Returns

p_value (float) – The estimated p-value from the parametrized KS test, applied on the target and reference distributions’ kll_floats_sketch summaries
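The idea behind the test can be sketched on raw samples instead of sketches. The example below computes the exact two-sample KS statistic (the largest gap between the two empirical CDFs) and the standard asymptotic p-value from the Kolmogorov distribution. It is a stand-in under stated assumptions: the function names are hypothetical, exact empirical CDFs replace the sketch-based approximate CDFs, and the whylogs implementation may differ in its details.

```python
import bisect
import math


def ks_statistic(target, reference):
    """Largest absolute gap between the two empirical CDFs."""
    t, r = sorted(target), sorted(reference)
    d = 0.0
    for x in t + r:  # the supremum is attained at an observed value
        cdf_t = bisect.bisect_right(t, x) / len(t)
        cdf_r = bisect.bisect_right(r, x) / len(r)
        d = max(d, abs(cdf_t - cdf_r))
    return d


def ks_p_value(d, n, m):
    """Asymptotic two-sample p-value from the Kolmogorov distribution."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series converges poorly here; the true value is ~1
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * total))


same = list(range(50))
shifted = [x + 100 for x in same]

d_same = ks_statistic(same, same)
d_shifted = ks_statistic(same, shifted)
p_same = ks_p_value(d_same, len(same), len(same))
p_shifted = ks_p_value(d_shifted, len(same), len(shifted))
print(d_same, p_same, d_shifted, p_shifted)
```

A small p-value is evidence against the null hypothesis that both samples come from the same distribution.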

whylogs.core.summaryconverters.compute_kl_divergence(target_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage], reference_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage])

Calculates the KL divergence between a target feature and a reference feature. Applicable to both continuous and discrete distributions. Uses the pmf and the datasketches.kll_floats_sketch to calculate the KL divergence in the continuous case. Uses the top frequent items to calculate the KL divergence in the discrete case.

Parameters
  • target_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The target distribution. Should be a datasketches.kll_floats_sketch if the target distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the target distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution must be of the same type.

  • reference_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The reference distribution. Should be a datasketches.kll_floats_sketch if the reference distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the reference distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution must be of the same type.

Returns

kl_divergence (float) – The estimated value of the KL divergence between the target and the reference feature

whylogs.core.summaryconverters._compute_kl_divergence_continuous_distributions(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)

Calculates the estimated KL divergence for two continuous distributions. Uses the datasketches.kll_floats_sketch sketch to calculate the KL divergence based on the PMFs. Only applicable to continuous distributions.

Parameters
  • target_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the target feature’s distribution.

  • reference_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the reference feature’s distribution.

Returns

kl_divergence (float) – The estimated KL divergence between two continuous features.
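A PMF-based estimate of this kind can be illustrated on raw samples: bin both samples over a shared set of equal-width bins, turn the counts into PMFs, and sum p * log(p / q). This is a sketch under assumptions and not the library's code: the function names, the bin count of 5, and the handling of empty reference bins (returning infinity) are illustrative choices, and plain lists stand in for kll_floats_sketch objects.

```python
import math


def binned_pmf(samples, lo, hi, n_bins):
    """Fraction of samples in each of n_bins equal-width bins over [lo, hi]."""
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for x in samples:
        # Clamp so that x == hi falls in the last bin.
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(samples) for c in counts]


def kl_divergence(p, q):
    """KL(p || q) in nats for two PMFs defined over the same bins."""
    kl = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # 0 * log(0 / q) is taken as 0
        if q_i == 0:
            return math.inf  # target has mass where the reference has none
        kl += p_i * math.log(p_i / q_i)
    return kl


reference = [i / 10 for i in range(100)]           # evenly spread over [0, 9.9]
target = reference + [i / 10 for i in range(50)]   # extra mass in the lower half

lo, hi = min(reference + target), max(reference + target)
p = binned_pmf(target, lo, hi, n_bins=5)
q = binned_pmf(reference, lo, hi, n_bins=5)
kl = kl_divergence(p, q)
print(kl)
```

The result is zero when both PMFs agree and grows as the target shifts mass away from the reference. It is not symmetric in its two arguments.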

whylogs.core.summaryconverters._compute_kl_divergence_discrete_distributions(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)

Calculates the estimated KL divergence for two discrete distributions. Uses the frequent items summary to calculate the estimated frequencies of items in each distribution. Only applicable to discrete distributions.

Parameters
  • target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

  • reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

Returns

kl_divergence (float) – The estimated KL divergence between two discrete features.
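For the discrete case, the calculation can be sketched with plain dictionaries of item counts standing in for the frequent items summaries. The function name and the choice to return infinity when the reference lacks an item seen in the target are assumptions made for this illustration; the library may handle unseen items differently.

```python
import math


def kl_divergence_discrete(target_counts, reference_counts):
    """KL(target || reference) in nats from two {item: count} mappings."""
    t_total = sum(target_counts.values())
    r_total = sum(reference_counts.values())
    kl = 0.0
    for item, count in target_counts.items():
        if count == 0:
            continue
        p = count / t_total
        q = reference_counts.get(item, 0) / r_total
        if q == 0:
            return math.inf  # item never observed in the reference
        kl += p * math.log(p / q)
    return kl


target = {"a": 50, "b": 50}
reference = {"a": 25, "b": 75}
kl = kl_divergence_discrete(target, reference)
print(kl)
```

Because the counts are normalized by each distribution's own total, the two inputs do not need to contain the same number of observations.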

whylogs.core.summaryconverters.compute_chi_squared_test_p_value(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)

Calculates the Chi-Squared test p-value for two discrete distributions. Uses the top frequent items summary, unique count estimate and total count estimate for each feature, to calculate the estimated Chi-Squared statistic. Applicable only to discrete distributions.

Parameters
  • target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

  • reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

Returns

p_value (float) – The estimated p-value from the Chi-Squared test, applied on the target and reference distributions’ frequent and unique items summaries
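The statistic underlying the test can be sketched with plain count dictionaries in place of the summary messages. Expected counts are taken from the reference frequencies, scaled to the target's total. The function name is hypothetical, and the sketch ignores items that appear only in the target, which is a simplification and not necessarily what the library does.

```python
def chi_squared_statistic(target_counts, reference_counts):
    """Return (statistic, degrees_of_freedom) for observed vs. expected counts."""
    t_total = sum(target_counts.values())
    r_total = sum(reference_counts.values())
    statistic = 0.0
    categories = 0
    for item, ref_count in reference_counts.items():
        if ref_count <= 0:
            continue  # no expectation can be formed for this item
        expected = ref_count / r_total * t_total
        observed = target_counts.get(item, 0)
        statistic += (observed - expected) ** 2 / expected
        categories += 1
    return statistic, max(categories - 1, 0)


target = {"a": 60, "b": 40}
reference = {"a": 500, "b": 500}
statistic, dof = chi_squared_statistic(target, reference)
print(statistic, dof)
```

The p-value is the upper-tail probability of a chi-squared distribution with dof degrees of freedom at the statistic. If SciPy is available, scipy.stats.chi2.sf(statistic, dof) computes it.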