whylogs.core.statistics.constraints
Module Contents¶
Classes¶
ValueConstraints express a binary boolean relationship between an implied numeric value and a literal. |
|
Summary constraints specify a relationship between a summary field and a static value, |
|
ValueConstraints express a binary boolean relationship between an implied numeric value and a literal. |
|
Functions¶
|
Return whether the string is in a strftime format. |
|
Return whether the string can be interpreted as a date. |
|
Return whether the string can be interpreted as json. |
|
Return whether the provided json matches the provided schema. |
|
Defines a summary constraint on the standard deviation of a feature. The standard deviation can be defined to be |
|
Defines a summary constraint on the mean (average) of a feature. The mean can be defined to be |
|
Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be |
|
Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be |
|
Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be |
|
Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be |
|
Defines a summary constraint on the distinct values of a feature. All of the distinct values should |
|
Defines a summary constraint on the distinct values of a feature. The set of the distinct values should |
|
Defines a summary constraint on the distinct values of a feature. The set of user-supplied reference values, |
|
Defines a value constraint with set operations on the values of a single feature. |
|
Defines a value constraint with email regex matching operations on the values of a single feature. |
|
Defines a value constraint with credit card number regex matching operations on the values of a single feature. |
|
Defines a value constraint which checks if the values of a single feature |
|
Defines a value constraint which checks if the values of a single feature |
|
Defines a value constraint which checks if the values of a single feature |
|
Defines a value constraint which checks if the values of a single feature |
|
Defines a value constraint with social security number (SSN) matching operations |
|
Defines a value constraint with URL regex matching operations on the values of a single feature. |
|
Defines a value constraint which checks if the string values of a single feature |
|
Defines a value constraint which checks if the string values' length of a single feature |
|
Defines a summary constraint on the n-th quantile value of a numeric feature. |
|
Defines a summary constraint on the cardinality of a specific feature. |
|
Defines a summary constraint on the proportion of unique values of a specific feature. |
|
Defines a constraint on the data set schema. |
|
Defines a constraint on the data set schema. |
|
Defines a constraint on the data set schema. |
|
Defines a summary constraint on the most common value of a feature. |
|
Defines a non-null summary constraint on the value of a feature. |
|
Defines a summary constraint on the proportion of missing values of a specific feature. |
|
Defines a summary constraint on the type of the feature values. |
|
Defines a summary constraint on the type of the feature values. |
|
Defines a summary constraint specifying the expected interval of the features estimated entropy. |
|
Defines a summary constraint specifying the expected |
|
Defines a summary constraint specifying the expected |
|
Defines a summary constraint specifying the expected |
|
Defines a multi-column value constraint which specifies that each value in column A, |
|
Defines a multi-column value constraint which specifies that each value in column A, |
|
Defines a multi-column value constraint which specifies that each value in column A, |
|
Defines a multi-column value constraint which specifies that each value in column A, |
|
Defines a multi-column value constraint which specifies that each value in column A, |
|
Defines a multi-column value constraint which specifies that each value in column A, |
|
Defines a multi-column value constraint which specifies that the sum of the values in each row |
|
Defines a multi-column value constraint which specifies that the pair of values of columns A and B, |
|
Defines a multi-column value constraint which specifies that the values of column A |
Attributes¶
Dict indexed by constraint operator. |
|
- whylogs.core.statistics.constraints.TYPES¶
- whylogs.core.statistics.constraints.logger¶
- whylogs.core.statistics.constraints._try_parse_strftime_format(strftime_val: str, format: str) Optional[datetime.datetime] ¶
Attempt to parse the string using a strftime format.
- Parameters
strftime_val (str) – String to check for a date.
format (str) – Format against which strftime_val is parsed.
- Returns
None if the string cannot be parsed, otherwise the parsed datetime.datetime object.
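The try-parse behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the whylogs implementation; the name try_parse_strftime is hypothetical, and the sketch assumes that any parsing error should map to None.

```python
from datetime import datetime
from typing import Optional


def try_parse_strftime(text: str, fmt: str) -> Optional[datetime]:
    """Hypothetical helper: return the parsed datetime, or None on failure."""
    try:
        return datetime.strptime(text, fmt)
    except (ValueError, TypeError):
        # ValueError: text does not match fmt; TypeError: text is not a string.
        return None


parsed = try_parse_strftime("2021-03-04", "%Y-%m-%d")
failed = try_parse_strftime("04/03/2021", "%Y-%m-%d")
```

Returning the parsed object instead of a bare boolean lets a caller reuse the datetime without parsing the string a second time.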
- whylogs.core.statistics.constraints._try_parse_dateutil(dateutil_val: str, ref_val=None) Optional[datetime.datetime] ¶
Attempt to interpret the string as a date.
- Parameters
dateutil_val (str) – String to check for a date.
ref_val (any) – Not used; present only because the shared interface requires it.
- Returns
None if the string cannot be parsed, otherwise the parsed datetime.datetime object.
- whylogs.core.statistics.constraints._try_parse_json(json_string: str, ref_val=None) Optional[dict] ¶
Attempt to interpret the string as JSON.
- Parameters
json_string (str) – String to check for JSON.
ref_val (any) – Not used; present only because the shared interface requires it.
- Returns
None if the string cannot be parsed, otherwise the parsed JSON object.
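A comparable sketch for the JSON case, again using only the standard library. The name try_parse_json is hypothetical and the unused ref_val argument is omitted; this is an assumption-laden illustration, not the library's code.

```python
import json
from typing import Any, Optional


def try_parse_json(text: str) -> Optional[Any]:
    """Hypothetical helper: return the decoded JSON value, or None on failure."""
    try:
        return json.loads(text)
    except (ValueError, TypeError):
        # json.JSONDecodeError is a subclass of ValueError.
        return None


good = try_parse_json('{"a": 1}')
bad = try_parse_json("{not json")
```

One caveat of this shape: the valid JSON document `null` also decodes to None, so a helper written this way cannot tell it apart from a parse failure.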
- whylogs.core.statistics.constraints._matches_json_schema(json_data: Union[str, dict], json_schema: Union[str, dict]) bool ¶
Return whether the provided JSON matches the provided schema.
- Parameters
json_data (Union[str, dict]) – JSON object to check.
json_schema (Union[str, dict]) – Schema that the JSON object is checked against.
- Returns
True if the JSON data matches the schema, False otherwise.
- whylogs.core.statistics.constraints.MAX_SET_DISPLAY_MESSAGE_LENGTH = 20¶
Dict indexed by constraint operator.
These dictionaries translate from the constraint schema to language-specific functions that are faster to evaluate. This is a form of currying in which the boolean comparison operator is bound first.
- whylogs.core.statistics.constraints._value_funcs¶
- whylogs.core.statistics.constraints._summary_funcs1¶
- whylogs.core.statistics.constraints._summary_funcs2¶
- whylogs.core.statistics.constraints._multi_column_value_funcs¶
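The currying idea behind these operator-indexed dictionaries can be sketched as follows. The Op enum here is a hypothetical stand-in for whylogs.proto.Op, and value_funcs only illustrates the shape of such a dictionary; the actual contents of _value_funcs are not shown on this page.

```python
from enum import Enum


class Op(Enum):
    """Hypothetical stand-in for whylogs.proto.Op."""
    LT = 1
    LE = 2
    EQ = 3
    GE = 4
    GT = 5


# Looking up the dict binds the comparison operator first; calling the
# result binds the static reference value and yields a one-argument
# predicate that can be applied to every incoming value.
value_funcs = {
    Op.LT: lambda ref: lambda v: v < ref,
    Op.LE: lambda ref: lambda v: v <= ref,
    Op.EQ: lambda ref: lambda v: v == ref,
    Op.GE: lambda ref: lambda v: v >= ref,
    Op.GT: lambda ref: lambda v: v > ref,
}

is_lt_10 = value_funcs[Op.LT](10)
results = [is_lt_10(v) for v in (3, 10, 42)]
```

Resolving the operator once, when the constraint is built, avoids a branch on the operator for every value in the stream.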
- class whylogs.core.statistics.constraints.ValueConstraint(op: whylogs.proto.Op, value=None, regex_pattern: str = None, apply_function=None, name: str = None, verbose=False)¶
ValueConstraints express a binary boolean relationship between an implied numeric value and a literal. When associated with a ColumnProfile, the relation is evaluated for every incoming value that is processed by whylogs.
- Parameters
op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between static value and incoming stream. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.
value ((one-of)) – When value is provided, regex_pattern must be None. Static value to compare against incoming stream using operator specified in op.
regex_pattern ((one-of)) – When regex_pattern is provided, value must be None. Regex pattern to use when MATCH or NOMATCH operations are used.
apply_function – To be supplied only when using the APPLY_FUNC operation. If apply_function requires an argument, supply it in the value parameter.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- property name(self)¶
- update(self, v) bool ¶
- apply_func_validate(self, value) str ¶
- merge(self, other) ValueConstraint ¶
- static from_protobuf(msg: whylogs.proto.ValueConstraintMsg) ValueConstraint ¶
- to_protobuf(self) whylogs.proto.ValueConstraintMsg ¶
- report(self)¶
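The per-value evaluation described for ValueConstraint can be pictured with a small stand-alone class. SimpleValueConstraint is hypothetical, and the tuple returned by report() is an assumed format chosen for illustration; neither is taken from whylogs.

```python
from typing import Any, Callable, Tuple


class SimpleValueConstraint:
    """Hypothetical constraint that counts evaluations and failures."""

    def __init__(self, name: str, predicate: Callable[[Any], bool]):
        self.name = name
        self.predicate = predicate
        self.total = 0
        self.failures = 0

    def update(self, value: Any) -> bool:
        # Evaluate the relation for one incoming value and record the outcome.
        self.total += 1
        ok = bool(self.predicate(value))
        if not ok:
            self.failures += 1
        return ok

    def report(self) -> Tuple[str, int, int]:
        # Assumed format: (name, values seen, values that failed).
        return (self.name, self.total, self.failures)


constraint = SimpleValueConstraint("value LT 10", lambda v: v < 10)
for v in [1, 5, 12, 9, 30]:
    constraint.update(v)
summary_row = constraint.report()
```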
- class whylogs.core.statistics.constraints.SummaryConstraint(first_field: str, op: whylogs.proto.Op, value=None, upper_value=None, quantile_value: Union[int, float] = None, second_field: str = None, third_field: str = None, reference_set: Union[List[Any], Set[Any], datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage] = None, name: str = None, verbose=False)¶
Summary constraints specify a relationship between a summary field and a static value, or between two summary fields, e.g. ‘min’ < 6, ‘std_dev’ < 2.17, or ‘min’ > ‘avg’.
- Parameters
first_field (str) – Name of field in NumberSummary that will be compared against either a second field or a static value.
op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between summary values. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.
value ((one-of)) – Static value to be compared against summary field specified in first_field. Only one of value or second_field should be supplied.
upper_value ((one-of)) – Only to be supplied when using Op.BTWN. Static upper boundary value to be compared against summary field specified in first_field. Only one of upper_value or third_field should be supplied.
second_field ((one-of)) – Name of second field in NumberSummary to be compared against summary field specified in first_field. Only one of value or second_field should be supplied.
third_field ((one-of)) – Only to be supplied when op == Op.BTWN. Name of third field in NumberSummary, used as an upper boundary, to be compared against summary field specified in first_field. Only one of upper_value or third_field should be supplied.
reference_set ((one-of)) – Only to be supplied when using set operations or distributional measures. For set operations, a list or set of reference values to be compared with the distinct values of the column. For constraints on distributional measures, such as the KS test, KL divergence, and Chi-Squared test, an instance of datasketches.kll_floats_sketch or ReferenceDistributionDiscreteMessage.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- property name(self)¶
- _get_field_name(self)¶
- _get_value_or_field(self)¶
- _get_constraint_type(self)¶
- _check_and_init_table_shape_constraint(self, reference_set)¶
- _check_and_init_valid_set_constraint(self, reference_set)¶
- _check_and_init_distributional_measure_constraint(self, reference_set)¶
- _check_and_init_between_constraint(self)¶
- _get_str_from_ref_set(self) str ¶
- _try_cast_set(self) Set[Any] ¶
- _get_string_and_numbers_sets(self)¶
- _create_theta_sketch(self, ref_set: set = None)¶
- update(self, update_summary: object) bool ¶
- merge(self, other) SummaryConstraint ¶
- _check_if_summary_constraint_message_is_valid(msg: whylogs.proto.SummaryConstraintMsg)¶
- static from_protobuf(msg: whylogs.proto.SummaryConstraintMsg) SummaryConstraint ¶
- to_protobuf(self) whylogs.proto.SummaryConstraintMsg ¶
- report(self)¶
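The one-of rule between value and second_field can be sketched with a plain dictionary standing in for the summary. The function check_summary and the field names in the example dictionary are assumptions made for illustration, not the whylogs API.

```python
import operator
from typing import Callable, Mapping, Optional, Union

Number = Union[int, float]


def check_summary(
    summary: Mapping[str, Number],
    first_field: str,
    compare: Callable[[Number, Number], bool],
    value: Optional[Number] = None,
    second_field: Optional[str] = None,
) -> bool:
    """Hypothetical check of one summary field against a value or another field."""
    if (value is None) == (second_field is None):
        raise ValueError("supply exactly one of value or second_field")
    rhs = value if value is not None else summary[second_field]
    return compare(summary[first_field], rhs)


summary = {"min": 2.0, "max": 9.0, "mean": 5.5, "stddev": 2.17}
min_below_6 = check_summary(summary, "min", operator.lt, value=6)
min_above_mean = check_summary(summary, "min", operator.gt, second_field="mean")
```

Unlike a value constraint, a check of this kind runs once against the finished summary and not once per incoming value.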
- class whylogs.core.statistics.constraints.ValueConstraints(constraints: Mapping[str, ValueConstraint] = None)¶
- static from_protobuf(msg: whylogs.proto.ValueConstraintMsgs) ValueConstraints ¶
- __getitem__(self, name: str) Optional[ValueConstraint] ¶
- to_protobuf(self) whylogs.proto.ValueConstraintMsgs ¶
- update(self, v)¶
- update_typed(self, v)¶
- merge(self, other) ValueConstraints ¶
- report(self) List[tuple] ¶
- class whylogs.core.statistics.constraints.SummaryConstraints(constraints: Mapping[str, SummaryConstraint] = None)¶
- static from_protobuf(msg: whylogs.proto.SummaryConstraintMsgs) SummaryConstraints ¶
- __getitem__(self, name: str) Optional[SummaryConstraint] ¶
- to_protobuf(self) whylogs.proto.SummaryConstraintMsgs ¶
- update(self, v)¶
- merge(self, other) SummaryConstraints ¶
- report(self) List[tuple] ¶
- class whylogs.core.statistics.constraints.MultiColumnValueConstraint(dependent_columns: Union[str, List[str], Tuple[str], numpy.ndarray], op: whylogs.proto.Op, reference_columns: Union[str, List[str], Tuple[str], numpy.ndarray] = None, internal_dependent_cols_op: whylogs.proto.Op = None, value=None, name: str = None, verbose: bool = False)¶
Bases: ValueConstraint
ValueConstraints express a binary boolean relationship between an implied numeric value and a literal. When associated with a ColumnProfile, the relation is evaluated for every incoming value that is processed by whylogs.
- Parameters
op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between static value and incoming stream. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.
value ((one-of)) – When value is provided, regex_pattern must be None. Static value to compare against incoming stream using operator specified in op.
regex_pattern ((one-of)) – When regex_pattern is provided, value must be None. Regex pattern to use when MATCH or NOMATCH operations are used.
apply_function – To be supplied only when using the APPLY_FUNC operation. If apply_function requires an argument, supply it in the value parameter.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- property name(self)¶
- update(self, column_values_dictionary)¶
- merge(self, other) MultiColumnValueConstraint ¶
- static from_protobuf(msg: whylogs.proto.MultiColumnValueConstraintMsg) MultiColumnValueConstraint ¶
- to_protobuf(self) whylogs.proto.MultiColumnValueConstraintMsg ¶
- class whylogs.core.statistics.constraints.MultiColumnValueConstraints(constraints: Mapping[str, MultiColumnValueConstraint] = None)¶
Bases: ValueConstraints
- static from_protobuf(msg: whylogs.proto.ValueConstraintMsgs) MultiColumnValueConstraints ¶
- to_protobuf(self) whylogs.proto.ValueConstraintMsgs ¶
- class whylogs.core.statistics.constraints.DatasetConstraints(props: whylogs.proto.DatasetProperties, value_constraints: Mapping[str, ValueConstraints] = None, summary_constraints: Mapping[str, SummaryConstraints] = None, table_shape_constraints: Mapping[str, SummaryConstraints] = None, multi_column_value_constraints: Optional[MultiColumnValueConstraints] = None)¶
- __getitem__(self, key)¶
- static from_protobuf(msg: whylogs.proto.DatasetConstraintMsg) DatasetConstraints ¶
- static from_json(data: str) DatasetConstraints ¶
- to_protobuf(self) whylogs.proto.DatasetConstraintMsg ¶
- to_json(self) str ¶
- report(self)¶
- whylogs.core.statistics.constraints._check_between_constraint_valid_initialization(lower_value, upper_value, lower_field, upper_field)¶
- whylogs.core.statistics.constraints._set_between_constraint_default_name(field, lower_value, upper_value, lower_field, upper_field)¶
- whylogs.core.statistics.constraints._format_set_values_for_display(reference_set)¶
- whylogs.core.statistics.constraints.stddevBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)¶
Defines a summary constraint on the standard deviation of a feature. The standard deviation can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (one-of)) – Lower limit of the interval for the standard deviation. If lower_value is supplied, upper_value must also be supplied, and neither lower_field nor upper_field may be provided.
upper_value (numeric (one-of)) – Upper limit of the interval for the standard deviation. If upper_value is supplied, lower_value must also be supplied, and neither lower_field nor upper_field may be provided.
lower_field (str (one-of)) – Name of a summary field (e.g. min, mean, max, stddev) whose value is used as the lower bound of the interval for the standard deviation. If lower_field is supplied, upper_field must also be supplied, and neither lower_value nor upper_value may be provided.
upper_field (str (one-of)) – Name of a summary field (e.g. min, mean, max, stddev) whose value is used as the upper bound of the interval for the standard deviation. If upper_field is supplied, lower_field must also be supplied, and neither lower_value nor upper_value may be provided.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint defining an interval of values for the standard deviation of a feature
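The argument rules and the closed interval shared by the between-style constraints can be sketched as below. The function between and the summary dictionary are hypothetical; how whylogs validates these arguments internally is not shown on this page.

```python
from typing import Mapping, Optional, Union

Number = Union[int, float]


def between(
    summary: Mapping[str, Number],
    field: str,
    lower_value: Optional[Number] = None,
    upper_value: Optional[Number] = None,
    lower_field: Optional[str] = None,
    upper_field: Optional[str] = None,
) -> bool:
    """Hypothetical closed-interval check with value/field one-of validation."""
    both_values = lower_value is not None and upper_value is not None
    both_fields = lower_field is not None and upper_field is not None
    any_value = lower_value is not None or upper_value is not None
    any_field = lower_field is not None or upper_field is not None
    if both_values and not any_field:
        lo, hi = lower_value, upper_value
    elif both_fields and not any_value:
        lo, hi = summary[lower_field], summary[upper_field]
    else:
        raise ValueError("supply either both values or both fields")
    # Closed interval: both limit points are included.
    return lo <= summary[field] <= hi


summary = {"min": 0, "max": 10, "mean": 4, "stddev": 2.5}
on_lower_bound = between(summary, "stddev", lower_value=2.5, upper_value=3)
within_fields = between(summary, "stddev", lower_field="mean", upper_field="max")
```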
- whylogs.core.statistics.constraints.meanBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)¶
Defines a summary constraint on the mean (average) of a feature. The mean can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (one-of)) – Lower limit of the interval for the mean. If lower_value is supplied, upper_value must also be supplied, and neither lower_field nor upper_field may be provided.
upper_value (numeric (one-of)) – Upper limit of the interval for the mean. If upper_value is supplied, lower_value must also be supplied, and neither lower_field nor upper_field may be provided.
lower_field (str (one-of)) – Name of a summary field (e.g. min, mean, max, stddev) whose value is used as the lower bound of the interval for the mean. If lower_field is supplied, upper_field must also be supplied, and neither lower_value nor upper_value may be provided.
upper_field (str (one-of)) – Name of a summary field (e.g. min, mean, max, stddev) whose value is used as the upper bound of the interval for the mean. If upper_field is supplied, lower_field must also be supplied, and neither lower_value nor upper_value may be provided.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint defining an interval of values for the mean of a feature
- whylogs.core.statistics.constraints.minBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)¶
Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (one-of)) – Lower limit of the interval for the minimum. If lower_value is supplied, upper_value must also be supplied, and neither lower_field nor upper_field may be provided.
upper_value (numeric (one-of)) – Upper limit of the interval for the minimum. If upper_value is supplied, lower_value must also be supplied, and neither lower_field nor upper_field may be provided.
lower_field (str (one-of)) – Name of a summary field (e.g. min, mean, max, stddev) whose value is used as the lower bound of the interval for the minimum. If lower_field is supplied, upper_field must also be supplied, and neither lower_value nor upper_value may be provided.
upper_field (str (one-of)) – Name of a summary field (e.g. min, mean, max, stddev) whose value is used as the upper bound of the interval for the minimum. If upper_field is supplied, lower_field must also be supplied, and neither lower_value nor upper_value may be provided.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint defining an interval of values for the minimum value of a feature
- whylogs.core.statistics.constraints.minGreaterThanEqualConstraint(value=None, field=None, name=None, verbose=False)¶
Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be greater than or equal to some value, or greater than or equal to the values of another summary field of the same feature, such as the mean (average).
- Parameters
value (numeric (one-of)) – Represents the value which should be compared to the minimum value of the specified feature, for checking the greater than or equal to constraint. Only one of value and field should be supplied.
field (str (one-of)) – The field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used for checking the greater than or equal to constraint. Only one of field and value should be supplied.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the minimum value to be greater than or equal to some value / summary field
- whylogs.core.statistics.constraints.maxBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)¶
Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (one-of)) – Lower limit of the interval for the maximum. If lower_value is supplied, upper_value must also be supplied, and neither lower_field nor upper_field may be provided.
upper_value (numeric (one-of)) – Upper limit of the interval for the maximum. If upper_value is supplied, lower_value must also be supplied, and neither lower_field nor upper_field may be provided.
lower_field (str (one-of)) – Name of a summary field (e.g. min, mean, max, stddev) whose value is used as the lower bound of the interval for the maximum. If lower_field is supplied, upper_field must also be supplied, and neither lower_value nor upper_value may be provided.
upper_field (str (one-of)) – Name of a summary field (e.g. min, mean, max, stddev) whose value is used as the upper bound of the interval for the maximum. If upper_field is supplied, lower_field must also be supplied, and neither lower_value nor upper_value may be provided.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint defining an interval of values for the maximum value of a feature
- whylogs.core.statistics.constraints.maxLessThanEqualConstraint(value=None, field=None, name=None, verbose=False)¶
Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be less than or equal to some value, or less than or equal to the values of another summary field of the same feature, such as the mean (average).
- Parameters
value (numeric (one-of)) – Represents the value which should be compared to the maximum value of the specified feature, for checking the less than or equal to constraint. Only one of value and field should be supplied.
field (str (one-of)) – The field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used for checking the less than or equal to constraint. Only one of field and value should be supplied.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the maximum value to be less than or equal to some value / summary field
- whylogs.core.statistics.constraints.distinctValuesInSetConstraint(reference_set: Set[Any], name=None, verbose=False)¶
Defines a summary constraint on the distinct values of a feature. All of the distinct values should belong to the user-provided set of reference values, reference_set. Useful for categorical features, for checking if the set of values present in a feature is contained in the set of expected categories.
- Parameters
reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If at least one of the distinct values of the feature is not in the user specified set reference_set, then the constraint will fail.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature to belong in a user supplied set of values
- whylogs.core.statistics.constraints.distinctValuesEqualSetConstraint(reference_set: Set[Any], name=None, verbose=False)¶
Defines a summary constraint on the distinct values of a feature. The set of the distinct values should be equal to the user-provided set of reference values, reference_set. Useful for categorical features, for checking if the set of values present in a feature is the same as the set of expected categories.
- Parameters
reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If the distinct values of the feature are not equal to the user specified set reference_set, then the constraint will fail.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature to be equal to a user supplied set of values
- whylogs.core.statistics.constraints.distinctValuesContainSetConstraint(reference_set: Set[Any], name=None, verbose=False)¶
Defines a summary constraint on the distinct values of a feature. The set of user-supplied reference values, reference_set, should be a subset of the set of distinct values for the current feature. Useful for categorical features, for checking if the set of values present in a feature is a superset of the set of expected categories.
- Parameters
reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If at least one of the values of the reference set, specified in reference_set, is not contained in the set of distinct values of the feature, then the constraint will fail.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature to be a superset of the user supplied set of values
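The three distinct-value constraints correspond to the subset, equality, and superset relations between sets. The sketch below expresses them with Python's built-in set type; the helper names are hypothetical, and this page does not show how whylogs actually evaluates these relations.

```python
from typing import Any, Set


def distinct_in_set(distinct: Set[Any], reference_set: Set[Any]) -> bool:
    # Every distinct value must appear in the reference set (subset).
    return distinct <= reference_set


def distinct_equal_set(distinct: Set[Any], reference_set: Set[Any]) -> bool:
    # The distinct values must be exactly the reference set.
    return distinct == reference_set


def distinct_contain_set(distinct: Set[Any], reference_set: Set[Any]) -> bool:
    # Every reference value must appear among the distinct values (superset).
    return distinct >= reference_set


distinct = {"cat", "dog", "bird"}
in_set = distinct_in_set(distinct, {"cat", "dog", "bird", "fish"})
equal_set = distinct_equal_set(distinct, {"cat", "dog"})
contain_set = distinct_contain_set(distinct, {"cat", "dog"})
```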
- whylogs.core.statistics.constraints.columnValuesInSetConstraint(value_set: Set[Any], name=None, verbose=False)¶
Defines a value constraint with set operations on the values of a single feature. The values of the feature should all be in the set of user-supplied values, specified in value_set. Useful for categorical features, for checking if the values in a feature belong in a predefined set.
- Parameters
value_set (Set[Any] (required)) – Represents the set of expected values for a feature. The provided values can be of any type. Each value in the feature is checked against the constraint. The total number of failures equals the number of values not in the provided set value_set.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
ValueConstraint - a value constraint specifying a constraint on the values of a feature to be drawn from a predefined set of values.
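Because this is a value constraint, membership is checked once per value, and the failure count is the number of values outside the set. A minimal sketch of that counting, with a hypothetical helper name and made-up sample data:

```python
from typing import Any, Iterable, Set


def count_values_not_in_set(values: Iterable[Any], value_set: Set[Any]) -> int:
    """Hypothetical helper: number of values that are not members of value_set."""
    return sum(1 for v in values if v not in value_set)


column = ["a", "b", "z", "a", "q"]
failures = count_values_not_in_set(column, {"a", "b", "c"})
```

Repeated values are counted each time they occur, which is what separates this check from the summary constraints on distinct values.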
- whylogs.core.statistics.constraints.containsEmailConstraint(regex_pattern: str = None, name=None, verbose=False)¶
Defines a value constraint with email regex matching operations on the values of a single feature. The constraint defines a default email regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing email addresses.
- Parameters
regex_pattern (str (optional)) – User-defined email regex pattern. If supplied, will override the default email regex pattern provided by whylogs.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for email regex matching of the values of a single feature
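The default-plus-override behaviour can be sketched with the standard re module. The pattern below is a deliberately simplified assumption and is not the default email regex shipped with whylogs, which this page does not reproduce; email_matches is likewise a hypothetical name.

```python
import re
from typing import Optional

# Simplified illustrative pattern; real-world email validation is more involved.
SIMPLE_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def email_matches(value: str, regex_pattern: Optional[str] = None) -> bool:
    """Hypothetical check: use the override pattern if given, else the default."""
    pattern = re.compile(regex_pattern) if regex_pattern else SIMPLE_EMAIL
    return pattern.match(str(value)) is not None


default_ok = email_matches("user@example.com")
default_bad = email_matches("not-an-email")
override_ok = email_matches("a@corp.test", regex_pattern=r"^.+@corp\.test$")
```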
- whylogs.core.statistics.constraints.containsCreditCardConstraint(regex_pattern: str = None, name=None, verbose=False)¶
Defines a value constraint with credit card number regex matching operations on the values of a single feature. The constraint defines a default credit card number regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing credit card numbers.
- Parameters
regex_pattern (str (optional)) – User-defined credit card number regex pattern. If supplied, will override the default credit card number regex pattern provided by whylogs.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for credit card number regex matching of the values of a single feature
- whylogs.core.statistics.constraints.dateUtilParseableConstraint(name=None, verbose=False)¶
Defines a value constraint which checks if the values of a single feature can be parsed by the dateutil parser. Useful for checking if the date time values of a feature are compatible with dateutil.
- Parameters
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s values are dateutil parseable
- whylogs.core.statistics.constraints.jsonParseableConstraint(name=None, verbose=False)¶
Defines a value constraint which checks if the values of a single feature are JSON parseable. Useful for checking if the values of a feature are valid JSON strings.
- Parameters
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s values are JSON parseable
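To make "JSON parseable" concrete, here is a minimal sketch of such a check using only the standard library. The helper `is_json_parseable` is a hypothetical name, and the sketch assumes that non-string values count as failures; whylogs may treat them differently.

```python
import json


def is_json_parseable(value):
    """Return True if value is a string that json.loads can decode."""
    if not isinstance(value, str):
        # Assumption of this sketch: non-string values are not parseable.
        return False
    try:
        json.loads(value)
    except json.JSONDecodeError:
        return False
    return True


values = ['{"a": 1}', "[1, 2, 3]", "{'a': 1}", "not json"]
results = [is_json_parseable(v) for v in values]
print(results)
```

Note that `{'a': 1}` fails because JSON requires double-quoted strings.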
- whylogs.core.statistics.constraints.matchesJsonSchemaConstraint(json_schema, name=None, verbose=False)¶
Defines a value constraint which checks if the values of a single feature match a user-provided JSON schema. Useful for checking if the values of a feature conform to a predefined JSON schema.
- Parameters
json_schema (Union[str, dict] (required)) – A string or dictionary of key-value pairs representing the expected JSON schema.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s values match a user-provided JSON schema
- whylogs.core.statistics.constraints.strftimeFormatConstraint(format, name=None, verbose=False)¶
Defines a value constraint which checks if the values of a single feature are strftime parseable, that is, whether they match the provided strftime format.
- Parameters
format (str (required)) – A string representing the expected strftime format for parsing the values.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s values are strftime parseable
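One plausible way to test whether a string follows a strftime format is to try parsing it with `datetime.strptime`. The sketch below does that; `matches_strftime_format` is a hypothetical helper, and this is not necessarily how whylogs implements the check.

```python
from datetime import datetime


def matches_strftime_format(value, fmt):
    """Return True if value is a string that parses under the given format."""
    if not isinstance(value, str):
        return False
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        # Raised when the string does not match the format
        # or encodes an impossible date such as month 13.
        return False
    return True


fmt = "%Y-%m-%d"
values = ["2021-03-15", "15/03/2021", "2021-13-01"]
results = [matches_strftime_format(v, fmt) for v in values]
print(results)
```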
- whylogs.core.statistics.constraints.containsSSNConstraint(regex_pattern: str = None, name=None, verbose=False)¶
Defines a value constraint with social security number (SSN) matching operations on the values of a single feature. The constraint defines a default SSN regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing SSNs.
- Parameters
regex_pattern (str (optional)) – User-defined SSN regex pattern. If supplied, will override the default SSN regex pattern provided by whylogs.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for SSN regex matching of the values of a single feature
- whylogs.core.statistics.constraints.containsURLConstraint(regex_pattern: str = None, name=None, verbose=False)¶
Defines a value constraint with URL regex matching operations on the values of a single feature. The constraint defines a default URL regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing URLs.
- Parameters
regex_pattern (str (optional)) – User-defined URL regex pattern. If supplied, will override the default URL regex pattern provided by whylogs.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for URL regex matching of the values of a single feature
- whylogs.core.statistics.constraints.stringLengthEqualConstraint(length: int, name=None, verbose=False)¶
Defines a value constraint which checks if the string values of a single feature have a predefined length.
- Parameters
length (int (required)) – A numeric value which represents the expected length of the string values in the specified feature.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s string values have a predefined length
- whylogs.core.statistics.constraints.stringLengthBetweenConstraint(lower_value: int, upper_value: int, name=None, verbose=False)¶
Defines a value constraint which checks if the length of the string values of a single feature lies in a predefined interval.
- Parameters
lower_value (int (required)) – A numeric value which represents the expected lower bound of the length of the string values in the specified feature.
upper_value (int (required)) – A numeric value which represents the expected upper bound of the length of the string values in the specified feature.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
ValueConstraint - a value constraint for checking if a feature’s string values’
length is in a predefined interval
- whylogs.core.statistics.constraints.quantileBetweenConstraint(quantile_value: Union[int, float], lower_value: Union[int, float], upper_value: Union[int, float], name=None, verbose: bool = False)¶
Defines a summary constraint on the n-th quantile value of a numeric feature. The n-th quantile can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
quantile_value (numeric (required)) – The n-th quantile for which the constraint will be executed
lower_value (numeric (required)) – Represents the lower value limit of the interval for the n-th quantile.
upper_value (numeric (required)) – Represents the upper value limit of the interval for the n-th quantile.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a closed interval of valid values
for the n-th quantile value of a specific feature
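The closed-interval semantics can be illustrated with an exact quantile computed in plain Python. The helpers below are hypothetical and use the nearest-rank method on raw values; a whylogs profile works from summary statistics, so its quantile values may be estimates and can differ slightly from this exact computation.

```python
import math


def nearest_rank_quantile(values, q):
    """Exact q-th quantile (0 <= q <= 1) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def quantile_between(values, q, lower_value, upper_value):
    """Closed-interval check: lower_value <= quantile <= upper_value."""
    return lower_value <= nearest_rank_quantile(values, q) <= upper_value


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
median = nearest_rank_quantile(data, 0.5)
print(median)
# The median sits exactly on the lower limit and still passes,
# because the interval includes both of its limit points.
print(quantile_between(data, 0.5, 5, 6))
```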
- whylogs.core.statistics.constraints.columnUniqueValueCountBetweenConstraint(lower_value: int, upper_value: int, name=None, verbose: bool = False)¶
Defines a summary constraint on the cardinality of a specific feature. The cardinality can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking the unique count of values for discrete features.
- Parameters
lower_value (numeric (required)) – Represents the lower value limit of the interval for the feature cardinality.
upper_value (numeric (required)) – Represents the upper value limit of the interval for the feature cardinality.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a closed interval
for the valid cardinality of a specific feature
- whylogs.core.statistics.constraints.columnUniqueValueProportionBetweenConstraint(lower_fraction: float, upper_fraction: float, name=None, verbose: bool = False)¶
Defines a summary constraint on the proportion of unique values of a specific feature. The proportion of unique values can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking the frequency of unique values for discrete features.
- Parameters
lower_fraction (fraction between 0 and 1 (required)) – Represents the lower fraction limit of the interval for the feature unique value proportion.
upper_fraction (fraction between 0 and 1 (required)) – Represents the upper fraction limit of the interval for the feature unique value proportion.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a closed interval
for the valid proportion of unique values of a specific feature
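A small sketch of the quantity being constrained, assuming that the proportion of unique values means the number of distinct values divided by the total number of values. The helper names are hypothetical, and whylogs may rely on an estimated distinct count rather than the exact one computed here.

```python
def unique_value_proportion(values):
    """Distinct count divided by total count (assumed definition)."""
    return len(set(values)) / len(values)


def proportion_between(proportion, lower_fraction, upper_fraction):
    """Closed-interval check on a proportion."""
    return lower_fraction <= proportion <= upper_fraction


values = ["a", "b", "a", "c"]
proportion = unique_value_proportion(values)
print(proportion)
print(proportion_between(proportion, 0.5, 0.75))
```

Here three of the four values are distinct, so the proportion is 0.75 and it passes a constraint with `upper_fraction=0.75` because the interval is closed.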
- whylogs.core.statistics.constraints.columnExistsConstraint(column: str, name=None, verbose=False)¶
Defines a constraint on the data set schema. Checks if the user-supplied column, identified by column, is present in the data set schema.
- Parameters
column (str (required)) – Represents the name of the column to be checked for existence in the data set.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint which checks the existence of a column
in the current data set.
- whylogs.core.statistics.constraints.numberOfRowsConstraint(n_rows: int, name=None, verbose=False)¶
Defines a constraint on the data set schema. Checks if the number of rows in the data set equals the user-supplied number of rows.
- Parameters
n_rows (int (required)) – Represents the user-supplied expected number of rows.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint which checks the number of rows in the data set
- whylogs.core.statistics.constraints.columnsMatchSetConstraint(reference_set: Set[str], name=None, verbose=False)¶
Defines a constraint on the data set schema. Checks if the set of columns in the data set is equal to the user-supplied set of expected columns.
- Parameters
reference_set (Set[str] (required)) – Represents the expected columns in the current data set.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint which checks if the column set
of the current data set matches the expected column set
- whylogs.core.statistics.constraints.columnMostCommonValueInSetConstraint(value_set: Set[Any], name=None, verbose=False)¶
Defines a summary constraint on the most common value of a feature. The most common value of the feature should be in the set of user-supplied values, value_set. Useful for categorical features, for checking if the most common value of a feature belongs in an expected set of common categories.
- Parameters
value_set (Set[Any] (required)) – Represents the set of expected values for a feature. The provided values can be of any type. If the most common value of the feature is not in the values of the user-specified value_set, the constraint will fail.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the most common value of a feature
to belong to a set of user-specified expected values
- whylogs.core.statistics.constraints.columnValuesNotNullConstraint(name=None, verbose=False)¶
Defines a non-null summary constraint on the value of a feature. Useful for features for which there is no tolerance for missing values. The constraint will fail if there is at least one missing value in the specified feature.
- Parameters
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining that no missing values
are allowed for the specified feature
- whylogs.core.statistics.constraints.missingValuesProportionBetweenConstraint(lower_fraction: float, upper_fraction: float, name: str = None, verbose: bool = False)¶
Defines a summary constraint on the proportion of missing values of a specific feature. The proportion of missing values can be defined to be between two frequency values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking features with expected amounts of missing values.
- Parameters
lower_fraction (fraction between 0 and 1 (required)) – Represents the lower fraction limit of the interval for the feature missing value proportion.
upper_fraction (fraction between 0 and 1 (required)) – Represents the upper fraction limit of the interval for the feature missing value proportion.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a closed interval
for the valid proportion of missing values of a specific feature
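The following sketch shows the kind of computation this constraint describes. It assumes that `None` and floating-point NaN both count as missing; that choice, and the helper names, are assumptions of the example rather than a statement of how whylogs counts missing values.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def missing_proportion(values):
    """Number of missing values divided by the total number of values."""
    return sum(is_missing(v) for v in values) / len(values)


values = [1.0, None, 3.5, float("nan"), 2.0]
proportion = missing_proportion(values)
print(proportion)
# Closed interval: a proportion equal to the upper limit still passes.
print(0.0 <= proportion <= 0.4)
```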
- whylogs.core.statistics.constraints.columnValuesTypeEqualsConstraint(expected_type: Union[whylogs.proto.InferredType, int], name=None, verbose: bool = False)¶
Defines a summary constraint on the type of the feature values. The type of values should be equal to the user-provided expected type.
- Parameters
expected_type (Union[InferredType, int]) –
whylogs.proto.InferredType.Type - Enumeration of allowed inferred data types. If supplied as an integer value, it should be one of:
UNKNOWN = 0, NULL = 1, FRACTIONAL = 2, INTEGRAL = 3, BOOLEAN = 4, STRING = 5
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining that the feature values type should be equal to a user-provided expected type
- whylogs.core.statistics.constraints.columnValuesTypeInSetConstraint(type_set: Set[int], name=None, verbose: bool = False)¶
Defines a summary constraint on the type of the feature values. The type of values should be in the set of user-provided expected types.
- Parameters
type_set (Set[int]) –
whylogs.proto.InferredType.Type - Enumeration of allowed inferred data types. If supplied as an integer value, it should be one of:
UNKNOWN = 0, NULL = 1, FRACTIONAL = 2, INTEGRAL = 3, BOOLEAN = 4, STRING = 5
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining that the feature values type should be in the set of user-provided expected types
- whylogs.core.statistics.constraints.approximateEntropyBetweenConstraint(lower_value: Union[int, float], upper_value: float, name=None, verbose=False)¶
Defines a summary constraint specifying the expected interval of the feature’s estimated entropy. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (required)) – Represents the lower value limit of the interval for the feature’s estimated entropy.
upper_value (numeric (required)) – Represents the upper value limit of the interval for the feature’s estimated entropy.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining the interval of valid values
of the feature’s estimated entropy
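For intuition about the quantity being bounded, the sketch below computes the exact Shannon entropy of a list of discrete values. The helper names are hypothetical, and measuring entropy in bits (logarithm base 2) is an assumption of this example; the estimate stored in a whylogs profile may use a different base or an approximation.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy in bits of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_between(values, lower_value, upper_value):
    """Closed-interval check on the entropy."""
    return lower_value <= shannon_entropy(values) <= upper_value


balanced = ["a", "a", "b", "b"]
constant = ["a", "a", "a", "a"]
print(shannon_entropy(balanced))
print(entropy_between(balanced, 0.5, 1.0))
print(entropy_between(constant, 0.5, 1.0))
```

Two equally likely categories give one bit of entropy, while a constant feature has zero entropy and falls below the lower limit.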
- whylogs.core.statistics.constraints.parametrizedKSTestPValueGreaterThanConstraint(reference_distribution: Union[List[float], numpy.ndarray], p_value=0.05, name=None, verbose=False)¶
Defines a summary constraint specifying the expected upper limit of the p-value for rejecting the null hypothesis of the KS test. Can be used only for continuous data.
- Parameters
reference_distribution (Array-like) – Represents the reference distribution for calculating the KS test p-value of the column. Should be an array-like object with floating point numbers. Only numeric distributions are accepted.
p_value (float) – Represents the reference p-value to compare with the p-value of the test. Should be between 0 and 1, inclusive.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint specifying the upper limit of the
KS test p-value for rejecting the null hypothesis
- whylogs.core.statistics.constraints.columnKLDivergenceLessThanConstraint(reference_distribution: Union[List[Any], numpy.ndarray], threshold: float = 0.5, name=None, verbose: bool = False)¶
Defines a summary constraint specifying the expected upper limit of the threshold for the KL divergence of the specified feature.
- Parameters
reference_distribution (Array-like) – Represents the reference distribution for calculating the KL divergence of the column. Should be an array-like object containing either floating point numbers, or integers, strings and booleans, but not both. Both numeric and categorical distributions are accepted.
threshold (float) – Represents the threshold value. If the KL divergence exceeds it, the constraint fails.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint specifying the upper threshold of the
feature’s KL divergence
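A minimal sketch of a KL divergence computation for categorical data is shown below. It computes the divergence of the observed values from the reference using natural logarithms and returns infinity when an observed item never appears in the reference. The direction of the divergence, the logarithm base, the zero handling, and the helper name `kl_divergence` are all assumptions of this example and may not match the whylogs implementation.

```python
import math
from collections import Counter


def kl_divergence(target, reference):
    """KL(target || reference) over empirical categorical distributions."""
    p_counts = Counter(target)
    q_counts = Counter(reference)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    divergence = 0.0
    for item, count in p_counts.items():
        p = count / p_total
        q = q_counts.get(item, 0) / q_total
        if q == 0:
            # Item observed in target but absent from reference.
            return math.inf
        divergence += p * math.log(p / q)
    return divergence


reference = ["a", "b"]
observed = ["a", "a", "a", "b"]
threshold = 0.5
divergence = kl_divergence(observed, reference)
print(divergence)
print(divergence < threshold)
```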
- whylogs.core.statistics.constraints.columnChiSquaredTestPValueGreaterThanConstraint(reference_distribution: Union[List[Any], numpy.ndarray, Mapping[str, int]], p_value: float = 0.05, name=None, verbose: bool = False)¶
Defines a summary constraint specifying the expected upper limit of the p-value for rejecting the null hypothesis of the Chi-Squared test. Can be used only for discrete data.
- Parameters
reference_distribution (Array-like) – Represents the reference distribution for calculating the Chi-Squared test. Should be an array-like object with integer, string or boolean values, or a mapping of type key: value where the keys are the items and the values are the per-item counts. Only categorical distributions are accepted.
p_value (float) – Represents the reference p-value to compare with the p-value of the test. Should be between 0 and 1, inclusive.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint specifying the upper limit of the
Chi-Squared test p-value for rejecting the null hypothesis
- whylogs.core.statistics.constraints.columnValuesAGreaterThanBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is greater than the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be greater than the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesAGreaterThanEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is greater than or equal to the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be greater than or equal to the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesALessThanBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is less than the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be less than the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesALessThanEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is less than or equal to the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be less than or equal to the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesAEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is equal to the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be equal to the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesANotEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is different from the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be different from the corresponding values of column B
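The six pairwise comparison constraints above differ only in the comparison applied to the two values of each row. The sketch below illustrates that idea with the standard `operator` module; the row data, column names, and the helper `failing_rows` are hypothetical and do not reflect how whylogs evaluates constraints internally.

```python
import operator

# Hypothetical rows with two numeric columns.
rows = [
    {"low": 1, "high": 5},
    {"low": 4, "high": 4},
    {"low": 7, "high": 3},
]


def failing_rows(rows, column_a, column_b, compare):
    """Return the indices of rows where compare(A, B) does not hold."""
    return [
        index
        for index, row in enumerate(rows)
        if not compare(row[column_a], row[column_b])
    ]


strict = failing_rows(rows, "low", "high", operator.lt)
non_strict = failing_rows(rows, "low", "high", operator.le)
print(strict)
print(non_strict)
```

The row with equal values fails the strict "less than" check but passes "less than or equal", which is the only difference between the corresponding pair of constraints.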
- whylogs.core.statistics.constraints.sumOfRowValuesOfMultipleColumnsEqualsConstraint(columns: Union[List[str], Set[str], numpy.array], value: Union[float, int, str], name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that the sum of the values in each row of the provided columns, specified in columns, should be equal to the user-predefined value, specified in value, or to the corresponding value of another column, which will be specified with a name in the value parameter.
- Parameters
columns (List[str]) – List of columns for which the sum of row values should equal the provided value
value (Union[float, int, str]) – Numeric value to compare with the sum of the column row values, or a string indicating a column name for which the row value will be compared with the sum
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
MultiColumnValueConstraint - specifying the expected value of the sum of the values in multiple columns
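The two ways of using the `value` parameter, as a literal number or as the name of another column, can be sketched as follows. The helper `row_sum_matches` and the sample row are hypothetical, and the sketch uses exact equality, which may be too strict for floating point data.

```python
def row_sum_matches(row, columns, value):
    """Check that the sum over columns equals a literal or another column."""
    # A string value is interpreted as the name of a column in the same row.
    expected = row[value] if isinstance(value, str) else value
    return sum(row[column] for column in columns) == expected


row = {"net": 80, "tax": 20, "gross": 100}
print(row_sum_matches(row, ["net", "tax"], "gross"))
print(row_sum_matches(row, ["net", "tax"], 100))
print(row_sum_matches(row, ["net", "tax"], 90))
```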
- whylogs.core.statistics.constraints.columnPairValuesInSetConstraint(column_A: str, column_B: str, value_set: Set[Tuple[Any, Any]], name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that the pair of values of columns A and B should be in a user-predefined set of expected pairs of values.
- Parameters
column_A (str) – The name of the first column
column_B (str) – The name of the second column
value_set (Set[Tuple[Any, Any]]) – A set of expected pairs of values for the columns A and B, in that order
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
MultiColumnValueConstraint - specifying the expected set of value pairs of two columns in the data set
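Because the pairs are ordered, `value_set` is naturally expressed as a set of tuples. The following sketch checks each row's (A, B) pair against such a set; the data and the helper `rows_with_unexpected_pairs` are hypothetical.

```python
# Hypothetical set of allowed (country, currency) pairs, in that order.
allowed_pairs = {("US", "USD"), ("DE", "EUR")}

rows = [
    {"country": "US", "currency": "USD"},
    {"country": "DE", "currency": "USD"},
    {"country": "DE", "currency": "EUR"},
]


def rows_with_unexpected_pairs(rows, column_a, column_b, value_set):
    """Return indices of rows whose (A, B) pair is not in value_set."""
    return [
        index
        for index, row in enumerate(rows)
        if (row[column_a], row[column_b]) not in value_set
    ]


unexpected = rows_with_unexpected_pairs(rows, "country", "currency", allowed_pairs)
print(unexpected)
```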
- whylogs.core.statistics.constraints.columnValuesUniqueWithinRow(column_A: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that the values of column A should be unique within each row of the data set.
- Parameters
column_A (str) – The name of the column for which it is expected that the values are unique within each row
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
MultiColumnValueConstraint - specifying that the provided column’s values are unique within each row