`whylogs.core.summaryconverters`¶

Library module defining functions for generating summaries

Module Contents¶

Functions¶

`from_sketch`(sketch: datasketches.update_theta_sketch, num_std_devs: float = 1)	Generate a protobuf summary message from a datasketches theta sketch
`from_string_sketch`(sketch: datasketches.frequent_strings_sketch)	Generate a protobuf summary message from a string sketch
`quantiles_from_sketch`(sketch: datasketches.kll_floats_sketch, quantiles=None)	Calculate quantiles from a data sketch
`single_quantile_from_sketch`(sketch: datasketches.kll_floats_sketch, quantile: float)	Calculate the specified quantile from a data sketch
`_calculate_bins`(end: float, start: float, n: int, avg_per_bucket: float, max_buckets: int)
`histogram_from_sketch`(sketch: datasketches.kll_floats_sketch, max_buckets: int = None, avg_per_bucket: int = None)	Generate a summary of a kll_floats_sketch, including a histogram
`entropy_from_column_summary`(summary: whylogs.proto.ColumnSummary, histogram: datasketches.kll_floats_sketch)	Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary
`ks_test_compute_p_value`(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)	Compute the Kolmogorov-Smirnov test p-value of two continuous distributions.
`compute_kl_divergence`(target_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage], reference_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage])	Calculates the KL divergence between a target feature and a reference feature.
`_compute_kl_divergence_continuous_distributions`(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)	Calculates the estimated KL divergence for two continuous distributions.
`_compute_kl_divergence_discrete_distributions`(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)	Calculates the estimated KL divergence for two discrete distributions.
`compute_chi_squared_test_p_value`(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)	Calculates the Chi-Squared test p-value for two discrete distributions.

Attributes¶

`MAX_HIST_BUCKETS`
`HIST_AVG_NUMBER_PER_BUCKET`
`QUANTILES`
`logger`

whylogs.core.summaryconverters.MAX_HIST_BUCKETS = 30¶

whylogs.core.summaryconverters.HIST_AVG_NUMBER_PER_BUCKET = 4.0¶

whylogs.core.summaryconverters.QUANTILES = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]¶

whylogs.core.summaryconverters.logger¶

whylogs.core.summaryconverters.from_sketch(sketch: datasketches.update_theta_sketch, num_std_devs: float = 1)¶

Generate a protobuf summary message from a datasketches theta sketch

Parameters

sketch – Theta sketch to summarize
num_std_devs – Number of standard deviations for calculating bounds

Returns

summary

Return type

UniqueCountSummary

whylogs.core.summaryconverters.from_string_sketch(sketch: datasketches.frequent_strings_sketch)¶

Generate a protobuf summary message from a string sketch

Parameters

sketch – Frequent strings sketch

Returns

summary

Return type

FrequentStringsSummary

whylogs.core.summaryconverters.quantiles_from_sketch(sketch: datasketches.kll_floats_sketch, quantiles=None)¶

Calculate quantiles from a data sketch

Parameters

sketch (kll_floats_sketch) – Data sketch
quantiles (list-like) – Override the default quantiles. Should be a list of values from 0 to 1 inclusive.
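
To make the role of the `quantiles` override concrete, here is a minimal sketch that computes exact quantiles from a small in-memory sample. It is only a stand-in: the documented function reads approximate quantiles from a `kll_floats_sketch`, whereas the helper `exact_quantiles` and its nearest-rank rule are assumptions made for this illustration.

```python
import math

# Default quantile fractions, copied from the module attribute QUANTILES.
QUANTILES = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]


def exact_quantiles(values, quantiles=None):
    """Hypothetical helper: nearest-rank quantiles of an in-memory sample."""
    if quantiles is None:
        quantiles = QUANTILES
    if any(q < 0.0 or q > 1.0 for q in quantiles):
        raise ValueError("quantiles must lie in [0, 1]")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    n = len(ordered)
    # Nearest-rank rule: take the element at index ceil(q * n) - 1, clamped to the list.
    return [ordered[min(n - 1, max(0, math.ceil(q * n) - 1))] for q in quantiles]


sample = list(range(1, 101))                            # the integers 1..100
all_defaults = exact_quantiles(sample)                  # one value per default quantile
median_only = exact_quantiles(sample, quantiles=[0.5])  # override with a single quantile
```

Passing `quantiles=None` falls back to the nine default fractions, which mirrors the override behaviour described for the `quantiles` parameter.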

whylogs.core.summaryconverters.single_quantile_from_sketch(sketch: datasketches.kll_floats_sketch, quantile: float)¶

Calculate the specified quantile from a data sketch

Parameters

sketch (kll_floats_sketch) – Data sketch
quantile (float) – Override the default quantiles to a single quantile. Should be a value from 0 to 1 inclusive.

Return type

Anonymous object with one field equal to the quantile value

whylogs.core.summaryconverters._calculate_bins(end: float, start: float, n: int, avg_per_bucket: float, max_buckets: int)¶

whylogs.core.summaryconverters.histogram_from_sketch(sketch: datasketches.kll_floats_sketch, max_buckets: int = None, avg_per_bucket: int = None)¶

Generate a summary of a kll_floats_sketch, including a histogram

Parameters

sketch (kll_floats_sketch) – Data sketch
max_buckets (int) – Override the default maximum number of buckets
avg_per_bucket (int) – Override the default target number of items per bucket.

Returns

histogram – Protobuf histogram message

Return type

HistogramSummary
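
This page does not spell out how `max_buckets` and `avg_per_bucket` interact. One plausible reading, shown below as an assumption rather than the library's actual logic, is that the bucket count targets roughly `avg_per_bucket` items per bucket and is then capped at `max_buckets`. The helper name `sketch_bin_edges` and its argument order are hypothetical.

```python
import math

# Defaults copied from the module attributes listed on this page.
MAX_HIST_BUCKETS = 30
HIST_AVG_NUMBER_PER_BUCKET = 4.0


def sketch_bin_edges(start, end, n, avg_per_bucket=None, max_buckets=None):
    """Hypothetical helper: equal-width bin edges for n items spread over [start, end]."""
    if avg_per_bucket is None:
        avg_per_bucket = HIST_AVG_NUMBER_PER_BUCKET
    if max_buckets is None:
        max_buckets = MAX_HIST_BUCKETS
    if n <= 0 or end < start:
        raise ValueError("need n > 0 and end >= start")
    # Aim for about avg_per_bucket items per bucket, but never exceed max_buckets.
    n_buckets = max(1, min(math.ceil(n / avg_per_bucket), max_buckets))
    width = (end - start) / n_buckets
    # n_buckets buckets need n_buckets + 1 edges.
    return [start + i * width for i in range(n_buckets + 1)]


edges_small = sketch_bin_edges(0.0, 10.0, n=20)      # few items: bucket count driven by avg_per_bucket
edges_large = sketch_bin_edges(0.0, 10.0, n=10_000)  # many items: bucket count hits the max_buckets cap
```

Under these assumptions a small sample yields a handful of wide buckets, while a large one saturates at the default maximum of 30.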

whylogs.core.summaryconverters.entropy_from_column_summary(summary: whylogs.proto.ColumnSummary, histogram: datasketches.kll_floats_sketch)¶

Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary. Can be used for both continuous and discrete types of data.

Parameters

summary (ColumnSummary) – Protobuf summary message
histogram (datasketches.kll_floats_sketch) – Data sketch for quantiles

Returns

entropy – Estimated entropy value, np.nan if the inferred data type of the column is not categorical or numeric

Return type

float
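
As a rough illustration of the quantity being estimated, the sketch below computes Shannon entropy from a list of bucket (or category) counts. The helper `shannon_entropy`, the use of natural logarithms, and the `nan` result for empty input are assumptions for this example; the page does not state which logarithm base the library uses.

```python
import math


def shannon_entropy(counts):
    """Hypothetical helper: Shannon entropy (in nats) of a list of bucket counts."""
    total = sum(counts)
    if total <= 0:
        return float("nan")
    entropy = 0.0
    for count in counts:
        if count > 0:  # empty buckets contribute nothing, since p * log(p) -> 0 as p -> 0
            p = count / total
            entropy -= p * math.log(p)
    return entropy


uniform = shannon_entropy([25, 25, 25, 25])  # four equally likely buckets
skewed = shannon_entropy([97, 1, 1, 1])      # almost all mass in one bucket
```

A uniform distribution over four buckets gives the maximum possible entropy for four outcomes, and concentrating the mass in one bucket lowers it.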

whylogs.core.summaryconverters.ks_test_compute_p_value(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)¶

Compute the Kolmogorov-Smirnov test p-value of two continuous distributions. Uses the quantile values and the corresponding CDFs to calculate the approximate KS statistic. Only applicable to continuous distributions. The null hypothesis expects the samples to come from the same distribution.

Parameters

target_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the target distribution’s values
reference_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the reference (expected) distribution’s values. Can be generated from a theoretical distribution, or another sample for the same feature.

Returns

p_value (float)
The estimated p-value from the parametrized KS test, applied on the target and reference distributions’ kll_floats_sketch summaries
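
The idea of comparing two CDFs can be sketched on raw samples. The example below uses exact empirical CDFs instead of sketch-derived ones, and the textbook asymptotic Kolmogorov series for the p-value; whether the library uses this same approximation is not stated here, so treat it as an assumption. The function names are hypothetical.

```python
import bisect
import math


def ks_statistic(target, reference):
    """Largest gap between the two empirical CDFs, checked at every observed value."""
    if not target or not reference:
        raise ValueError("both samples must be non-empty")
    t_sorted, r_sorted = sorted(target), sorted(reference)
    d = 0.0
    for x in t_sorted + r_sorted:
        cdf_t = bisect.bisect_right(t_sorted, x) / len(t_sorted)
        cdf_r = bisect.bisect_right(r_sorted, x) / len(r_sorted)
        d = max(d, abs(cdf_t - cdf_r))
    return d


def ks_p_value(target, reference, terms=100):
    """Asymptotic two-sample p-value from the Kolmogorov distribution."""
    n, m = len(target), len(reference)
    lam = math.sqrt(n * m / (n + m)) * ks_statistic(target, reference)
    if lam < 0.2:
        return 1.0  # the p-value is effectively 1 here, and the series converges slowly near 0
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2) for j in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * series))


same = [0.1 * i for i in range(100)]
shifted = [x + 5.0 for x in same]
p_same = ks_p_value(same, same)        # identical samples: no evidence against the null
p_shifted = ks_p_value(same, shifted)  # strongly shifted sample: very small p-value
```

A small p-value is evidence against the null hypothesis that both samples come from the same distribution.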

whylogs.core.summaryconverters.compute_kl_divergence(target_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage], reference_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage])¶

Calculates the KL divergence between a target feature and a reference feature. Applicable to both continuous and discrete distributions. Uses the PMF and the datasketches.kll_floats_sketch to calculate the KL divergence in the continuous case. Uses the top frequent items to calculate the KL divergence in the discrete case.

Parameters

target_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The target distribution. Should be a datasketches.kll_floats_sketch if the target distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the target distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution, must be of the same type.
reference_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The reference distribution. Should be a datasketches.kll_floats_sketch if the reference distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the reference distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution, must be of the same type.

Returns

kl_divergence (float)
The estimated value of the KL divergence between the target and the reference feature

whylogs.core.summaryconverters._compute_kl_divergence_continuous_distributions(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)¶

Calculates the estimated KL divergence for two continuous distributions. Uses the datasketches.kll_floats_sketch sketch to calculate the KL divergence based on the PMFs. Only applicable to continuous distributions.

Parameters

target_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the target feature’s distribution.
reference_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the reference feature’s distribution.

Returns

kl_divergence (float)
The estimated KL divergence between two continuous features.
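
A PMF-based estimate for continuous data can be sketched by binning both samples on shared edges and comparing the resulting bin probabilities. The example below works on raw values instead of `kll_floats_sketch` objects, and the helper names, the half-open bins, and the choice to return infinity when the reference has an empty bin are all assumptions for illustration.

```python
import math


def binned_pmf(values, edges):
    """Fraction of values falling in each half-open bin [edges[i], edges[i + 1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the bin edges")
    return [c / total for c in counts]


def kl_from_pmfs(target_pmf, reference_pmf):
    """KL(target || reference) over matching bins."""
    kl = 0.0
    for p, q in zip(target_pmf, reference_pmf):
        if p == 0:
            continue             # bins with no target mass contribute nothing
        if q == 0:
            return math.inf      # target mass where the reference has none
        kl += p * math.log(p / q)
    return kl


edges = [0.0, 1.0, 2.0, 3.0, 4.0]  # both samples must share the same bin edges
target = [0.5, 0.5, 1.5, 2.5, 3.5, 3.5, 3.5, 3.5]
reference = [0.5, 0.5, 1.5, 1.5, 2.5, 2.5, 3.5, 3.5]
kl = kl_from_pmfs(binned_pmf(target, edges), binned_pmf(reference, edges))
```

The estimate depends on the binning, so both distributions have to be evaluated on identical edges for the comparison to be meaningful.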

whylogs.core.summaryconverters._compute_kl_divergence_discrete_distributions(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)¶

Calculates the estimated KL divergence for two discrete distributions. Uses the frequent items summary to calculate the estimated frequencies of items in each distribution. Only applicable to discrete distributions.

Parameters

target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

Returns

kl_divergence (float)
The estimated KL divergence between two discrete features.
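
For the discrete case, the estimated item frequencies can be compared directly. The sketch below takes plain `dict` objects mapping items to counts in place of `ReferenceDistributionDiscreteMessage`; the helper name and the crude `smoothing` floor for items missing from the reference are assumptions, not the library's documented behaviour.

```python
import math


def kl_divergence_from_counts(target_counts, reference_counts, smoothing=1e-9):
    """Hypothetical helper: KL(target || reference) from item -> count mappings."""
    t_total = sum(target_counts.values())
    r_total = sum(reference_counts.values())
    if t_total <= 0 or r_total <= 0:
        raise ValueError("both distributions need a positive total count")
    kl = 0.0
    for item in set(target_counts) | set(reference_counts):
        p = target_counts.get(item, 0) / t_total
        # Floor the reference frequency so an unseen item does not produce log(p / 0).
        q = max(reference_counts.get(item, 0) / r_total, smoothing)
        if p > 0:
            kl += p * math.log(p / q)
    return kl


reference = {"a": 50, "b": 30, "c": 20}
drifted = {"a": 20, "b": 30, "c": 50}
kl_same = kl_divergence_from_counts(reference, reference)
kl_drift = kl_divergence_from_counts(drifted, reference)
```

KL divergence is not symmetric, so swapping the target and reference arguments generally gives a different value.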

whylogs.core.summaryconverters.compute_chi_squared_test_p_value(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)¶

Calculates the Chi-Squared test p-value for two discrete distributions. Uses the top frequent items summary, unique count estimate and total count estimate for each feature, to calculate the estimated Chi-Squared statistic. Applicable only to discrete distributions.

Parameters

target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

Returns

p_value (float)
The estimated p-value from the Chi-Squared test, applied on the target and reference distributions’ frequent and unique items summaries
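
The shape of a Chi-Squared comparison can be sketched with plain item counts. This example treats the reference frequencies as the expected distribution, ignores target items absent from the reference, and evaluates the p-value with a simple series for the incomplete gamma function that is not meant for extreme tails. All of these choices, and the helper names, are assumptions for illustration only.

```python
import math


def chi_squared_sf(x, df):
    """Survival function of the chi-squared distribution, via a series for P(a, z)."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    # P(a, z) = z^a * e^-z / Gamma(a + 1) * sum_k z^k / ((a + 1) ... (a + k))
    term = total = 1.0
    for k in range(1, 1000):
        term *= z / (a + k)
        total += term
        if term < 1e-15 * total:
            break
    lower = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0)) * total
    return min(1.0, max(0.0, 1.0 - lower))


def chi_squared_p_value(target_counts, reference_counts):
    """Hypothetical helper: goodness of fit of target counts to reference frequencies."""
    df = len(reference_counts) - 1
    if df < 1:
        raise ValueError("need at least two reference items")
    t_total = sum(target_counts.values())
    r_total = sum(reference_counts.values())
    statistic = 0.0
    for item, r_count in reference_counts.items():
        if r_count == 0:
            continue  # no expected mass, so the term is undefined; skipped in this sketch
        expected = t_total * r_count / r_total
        observed = target_counts.get(item, 0)
        statistic += (observed - expected) ** 2 / expected
    return chi_squared_sf(statistic, df)


reference = {"a": 50, "b": 30, "c": 20}
drifted = {"a": 20, "b": 30, "c": 50}
p_same = chi_squared_p_value(reference, reference)
p_drift = chi_squared_p_value(drifted, reference)
```

With two degrees of freedom the survival function reduces to `exp(-x / 2)`, which gives a convenient sanity check for the series.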
