whylogs.core.summaryconverters
Library module defining functions for generating summaries
Module Contents
Functions
| from_sketch | Generate a protobuf summary message from a datasketches theta sketch |
| from_string_sketch | Generate a protobuf summary message from a string sketch |
| quantiles_from_sketch | Calculate quantiles from a data sketch |
| single_quantile_from_sketch | Calculate the specified quantile from a data sketch |
| _calculate_bins | |
| histogram_from_sketch | Generate a summary of a kll_floats_sketch, including a histogram |
| entropy_from_column_summary | Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary |
| ks_test_compute_p_value | Compute the Kolmogorov-Smirnov test p-value of two continuous distributions. |
| compute_kl_divergence | Calculates the KL divergence between a target feature and a reference feature. |
| _compute_kl_divergence_continuous_distributions | Calculates the estimated KL divergence for two continuous distributions. |
| _compute_kl_divergence_discrete_distributions | Calculates the estimated KL divergence for two discrete distributions. |
| compute_chi_squared_test_p_value | Calculates the Chi-Squared test p-value for two discrete distributions. |
Attributes
- whylogs.core.summaryconverters.MAX_HIST_BUCKETS = 30
- whylogs.core.summaryconverters.HIST_AVG_NUMBER_PER_BUCKET = 4.0
- whylogs.core.summaryconverters.QUANTILES = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]
- whylogs.core.summaryconverters.logger
- whylogs.core.summaryconverters.from_sketch(sketch: datasketches.update_theta_sketch, num_std_devs: float = 1)
Generate a protobuf summary message from a datasketches theta sketch
- Parameters
sketch – Theta sketch to summarize
num_std_devs – Number of standard deviations for calculating bounds
- Returns
summary
- Return type
UniqueCountSummary
- whylogs.core.summaryconverters.from_string_sketch(sketch: datasketches.frequent_strings_sketch)
Generate a protobuf summary message from a string sketch
- Parameters
sketch – Frequent strings sketch
- Returns
summary
- Return type
FrequentStringsSummary
- whylogs.core.summaryconverters.quantiles_from_sketch(sketch: datasketches.kll_floats_sketch, quantiles=None)
Calculate quantiles from a data sketch
- Parameters
sketch (kll_floats_sketch) – Data sketch
quantiles (list-like) – Override the default quantiles. Should be a list of values from 0 to 1 inclusive.
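As a rough illustration of what the default QUANTILES list asks for, the sketch below computes the same nine quantiles exactly from a small in-memory sample, using only the standard library. It does not call datasketches: `exact_quantiles` is a hypothetical helper, the nearest-rank rule is an assumption chosen for simplicity, and a kll_floats_sketch would return approximations of these values rather than exact ones.

```python
import math

# Same values as the module's documented QUANTILES attribute.
QUANTILES = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]


def exact_quantiles(values, quantiles=None):
    """Hypothetical helper: nearest-rank quantiles of a list of numbers."""
    if quantiles is None:
        quantiles = QUANTILES
    if any(q < 0.0 or q > 1.0 for q in quantiles):
        raise ValueError("quantiles must lie in [0, 1]")
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for q in quantiles:
        # Nearest rank: the smallest value with at least q * n items at or below it.
        rank = max(1, math.ceil(q * n))
        result.append(ordered[rank - 1])
    return result


sample = list(range(1, 101))  # 1, 2, ..., 100
print(exact_quantiles(sample))                        # default quantiles
print(exact_quantiles(sample, quantiles=[0.1, 0.9]))  # overridden quantiles
```

Passing `quantiles` mirrors the documented override: any list of values between 0 and 1 inclusive replaces the defaults.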
- whylogs.core.summaryconverters.single_quantile_from_sketch(sketch: datasketches.kll_floats_sketch, quantile: float)
Calculate the specified quantile from a data sketch
- Parameters
sketch (kll_floats_sketch) – Data sketch
quantile (float) – The single quantile to compute, used instead of the default quantiles. Should be a value from 0 to 1 inclusive.
- Return type
Anonymous object with one field equal to the quantile value
- whylogs.core.summaryconverters._calculate_bins(end: float, start: float, n: int, avg_per_bucket: float, max_buckets: int)
- whylogs.core.summaryconverters.histogram_from_sketch(sketch: datasketches.kll_floats_sketch, max_buckets: int = None, avg_per_bucket: int = None)
Generate a summary of a kll_floats_sketch, including a histogram
- Parameters
sketch (kll_floats_sketch) – Data sketch
max_buckets (int) – Override the default maximum number of buckets
avg_per_bucket (int) – Override the default target number of items per bucket.
- Returns
histogram – Protobuf histogram message
- Return type
HistogramSummary
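The two tuning parameters plausibly interact in a simple way: the bucket count grows with the number of items until it reaches the cap. The sketch below is one reading of that rule, assuming equal-width buckets. `choose_bin_edges` is a hypothetical helper, and the actual _calculate_bins logic is not shown in this documentation, so treat this as an illustration only.

```python
import math

MAX_HIST_BUCKETS = 30             # documented module default
HIST_AVG_NUMBER_PER_BUCKET = 4.0  # documented module default


def choose_bin_edges(start, end, n,
                     avg_per_bucket=HIST_AVG_NUMBER_PER_BUCKET,
                     max_buckets=MAX_HIST_BUCKETS):
    """Hypothetical helper: equal-width bin edges for n items in [start, end]."""
    # Aim for avg_per_bucket items per bucket, but never exceed max_buckets.
    n_buckets = max(1, min(math.ceil(n / avg_per_bucket), max_buckets))
    width = (end - start) / n_buckets
    # n_buckets buckets need n_buckets + 1 edges.
    return [start + i * width for i in range(n_buckets + 1)]


print(len(choose_bin_edges(0.0, 10.0, n=40)) - 1)    # 10 buckets
print(len(choose_bin_edges(0.0, 10.0, n=1000)) - 1)  # 30 buckets (capped)
```

With the defaults, a column needs more than 116 items before the 30-bucket cap takes effect under this reading.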
- whylogs.core.summaryconverters.entropy_from_column_summary(summary: whylogs.proto.ColumnSummary, histogram: datasketches.kll_floats_sketch)
Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary. Can be used for both continuous and discrete types of data.
- Parameters
summary (ColumnSummary) – Protobuf summary message
histogram (datasketches.kll_floats_sketch) – Data sketch for quantiles
- Returns
entropy – Estimated entropy value, or np.nan if the inferred data type of the column is neither categorical nor numeric
- Return type
float
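For intuition, an entropy estimate of this kind reduces to the Shannon entropy of bucket (or category) probabilities. The following sketch assumes the counts are already available as a plain list and uses the natural logarithm. The helper name, the log base, and the handling of empty input are assumptions, not details confirmed by this documentation.

```python
import math


def entropy_from_counts(counts):
    """Hypothetical helper: Shannon entropy (in nats) of a list of counts."""
    total = sum(counts)
    if total <= 0:
        return math.nan  # nothing to estimate from
    entropy = 0.0
    for count in counts:
        if count > 0:  # empty buckets contribute nothing (0 * log 0 is taken as 0)
            p = count / total
            entropy -= p * math.log(p)
    return entropy


uniform = entropy_from_counts([25, 25, 25, 25])  # approximately log(4)
skewed = entropy_from_counts([97, 1, 1, 1])      # lower: mass is concentrated
print(uniform, skewed)
```

The same function covers both cases named above: for continuous data the counts would come from histogram buckets, for discrete data from per-category frequencies.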
- whylogs.core.summaryconverters.ks_test_compute_p_value(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)
Compute the Kolmogorov-Smirnov test p-value of two continuous distributions. Uses the quantile values and the corresponding CDFs to calculate the approximate KS statistic. Only applicable to continuous distributions. The null hypothesis expects the samples to come from the same distribution.
- Parameters
target_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the target distribution’s values
reference_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the reference (expected) distribution’s values. Can be generated from a theoretical distribution, or another sample for the same feature.
- Returns
p_value (float)
The estimated p-value from the parametrized KS test, applied on the target and reference distributions’ kll_floats_sketch summaries
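The idea behind the test can be sketched on raw samples: the KS statistic is the largest gap between the two empirical CDFs, and the p-value follows from the asymptotic Kolmogorov distribution. This is a minimal standard-library sketch, not the library's implementation. It works on plain lists rather than kll_floats_sketch objects, and the helper names `ks_statistic` and `ks_p_value` are hypothetical.

```python
import bisect
import math


def ks_statistic(target, reference):
    """Hypothetical helper: largest gap between the two empirical CDFs."""
    t, r = sorted(target), sorted(reference)
    d = 0.0
    for x in t + r:
        cdf_t = bisect.bisect_right(t, x) / len(t)
        cdf_r = bisect.bisect_right(r, x) / len(r)
        d = max(d, abs(cdf_t - cdf_r))
    return d


def ks_p_value(d, n, m):
    """Asymptotic two-sample p-value from the Kolmogorov distribution."""
    lam = d * math.sqrt(n * m / (n + m))
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-12:
            return min(max(total, 0.0), 1.0)
    # The series does not converge for very small lam, where the p-value is ~1.
    return 1.0


same = [float(i) for i in range(50)]
shifted = [x + 100.0 for x in same]
d_same = ks_statistic(same, same)
d_shift = ks_statistic(same, shifted)
print(d_same, ks_p_value(d_same, 50, 50))    # D = 0, p-value 1.0
print(d_shift, ks_p_value(d_shift, 50, 50))  # D = 1, p-value near 0
```

A small p-value is evidence against the null hypothesis that both samples come from the same distribution.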
- whylogs.core.summaryconverters.compute_kl_divergence(target_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage], reference_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage])
Calculates the KL divergence between a target feature and a reference feature. Applicable to both continuous and discrete distributions. Uses the pmf and the datasketches.kll_floats_sketch to calculate the KL divergence in the continuous case. Uses the top frequent items to calculate the KL divergence in the discrete case.
- Parameters
target_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The target distribution. Should be a datasketches.kll_floats_sketch if the target distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the target distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution must be of the same type.
reference_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The reference distribution. Should be a datasketches.kll_floats_sketch if the reference distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the reference distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution must be of the same type.
- Returns
kl_divergence (float)
The estimated value of the KL divergence between the target and the reference feature
- whylogs.core.summaryconverters._compute_kl_divergence_continuous_distributions(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)
Calculates the estimated KL divergence for two continuous distributions. Uses the datasketches.kll_floats_sketch sketch to calculate the KL divergence based on the PMFs. Only applicable to continuous distributions.
- Parameters
target_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the target feature’s distribution.
reference_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the reference feature’s distribution.
- Returns
kl_divergence (float)
The estimated KL divergence between two continuous features.
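A PMF-based estimate of this kind can be sketched by binning both samples on shared edges and summing p * log(p / q) over the bins. The example below is an assumption-laden illustration on plain lists, not the library's code. The helper names, the bin count of 10, the equal-width bins, and the choice to skip bins that are empty in either sample are all hypothetical.

```python
import math


def pmf_on_shared_bins(values, start, end, n_bins):
    """Fraction of values falling in each equal-width bin on [start, end]."""
    counts = [0] * n_bins
    width = (end - start) / n_bins
    for v in values:
        # Clamp so that v == end lands in the last bin.
        idx = min(int((v - start) / width), n_bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def kl_divergence_binned(target, reference, n_bins=10):
    """Hypothetical helper: KL(target || reference) from binned PMFs, in nats."""
    start = min(min(target), min(reference))
    end = max(max(target), max(reference))
    if start == end:
        return 0.0  # both samples are a single repeated value
    p = pmf_on_shared_bins(target, start, end, n_bins)
    q = pmf_on_shared_bins(reference, start, end, n_bins)
    # Assumption: bins that are empty in either sample are skipped.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)


uniform = [i / 100 for i in range(100)]
skewed = [(i / 100) ** 2 for i in range(100)]
print(kl_divergence_binned(uniform, uniform))  # 0.0 for identical samples
print(kl_divergence_binned(uniform, skewed))   # positive for differing samples
```

Skipping empty bins keeps the estimate finite but can understate the divergence when the two samples barely overlap.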
- whylogs.core.summaryconverters._compute_kl_divergence_discrete_distributions(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)
Calculates the estimated KL divergence for two discrete distributions. Uses the frequent items summary to calculate the estimated frequencies of items in each distribution. Only applicable to discrete distributions.
- Parameters
target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
- Returns
kl_divergence (float)
The estimated KL divergence between two discrete features.
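For the discrete case, the calculation can be sketched with plain item-to-count dictionaries standing in for the frequent items summaries. This is a minimal sketch under stated assumptions: `kl_divergence_discrete` is a hypothetical helper, frequencies are simple count ratios, and an item present in the target but absent from the reference yields infinity, which may differ from how the library treats such items.

```python
import math


def kl_divergence_discrete(target_counts, reference_counts):
    """Hypothetical helper: KL(target || reference) from item -> count dicts."""
    t_total = sum(target_counts.values())
    r_total = sum(reference_counts.values())
    kl = 0.0
    for item, t_count in target_counts.items():
        if t_count == 0:
            continue  # zero-probability items contribute nothing
        r_count = reference_counts.get(item, 0)
        if r_count == 0:
            return math.inf  # target has mass where the reference has none
        p = t_count / t_total
        q = r_count / r_total
        kl += p * math.log(p / q)
    return kl


reference = {"cat": 50, "dog": 50}
print(kl_divergence_discrete({"cat": 50, "dog": 50}, reference))  # 0.0
print(kl_divergence_discrete({"cat": 75, "dog": 25}, reference))  # positive
```

KL divergence is not symmetric, so swapping the target and reference arguments generally changes the result.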
- whylogs.core.summaryconverters.compute_chi_squared_test_p_value(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)
Calculates the Chi-Squared test p-value for two discrete distributions. Uses the top frequent items summary, unique count estimate and total count estimate for each feature, to calculate the estimated Chi-Squared statistic. Applicable only to discrete distributions.
- Parameters
target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
- Returns
p_value (float)
The estimated p-value from the Chi-Squared test, applied on the target and reference distributions’ frequent and unique items summaries
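The underlying test can be sketched with plain count dictionaries: compute Pearson's statistic against expected counts derived from the reference, then convert it to a p-value with the chi-squared survival function. Everything below is an illustrative assumption rather than the library's method. The helper names are hypothetical, only categories seen in the reference are compared, the degrees of freedom are taken as the number of categories minus one, and the survival function uses a simple incomplete-gamma series that suits moderate statistic values.

```python
import math


def chi_squared_statistic(target_counts, reference_counts):
    """Pearson statistic; expected counts are the reference rescaled to the target total."""
    t_total = sum(target_counts.values())
    r_total = sum(reference_counts.values())
    stat = 0.0
    # Assumption: only categories seen in the reference are compared.
    for item, r_count in reference_counts.items():
        if r_count == 0:
            continue
        expected = r_count * t_total / r_total
        observed = target_counts.get(item, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


def chi_squared_sf(stat, df):
    """P(X >= stat) for a chi-squared variable with df > 0 degrees of freedom."""
    if stat <= 0:
        return 1.0
    a, x = df / 2.0, stat / 2.0
    # Series for the regularized lower incomplete gamma function P(a, x).
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= x / (a + n)
        total += term
        if term < 1e-15 * total:
            break
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return min(max(1.0 - lower, 0.0), 1.0)


reference = {"cat": 90, "dog": 10}
target = {"cat": 50, "dog": 50}
stat = chi_squared_statistic(target, reference)
p_value = chi_squared_sf(stat, df=len(reference) - 1)
print(stat, p_value)  # large statistic, p-value near 0
```

As with the KS test, a small p-value suggests the target's category frequencies differ from the reference's.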