Creating Metric Constraints on Condition Count Metrics#

whylogs profiles contain summarized information about our data. Profiling is therefore a lossy process: once we have the profiles, we no longer have access to the complete set of data.

This makes some types of constraints impossible to create from standard metrics alone. For example, suppose you need to check every row of a column to ensure that no text matches a credit card number or an email address. Or maybe you’re interested in ensuring that there are no even numbers in a certain column. How do we do that if we don’t have access to the complete data?

The answer is that you need to define a Condition Count Metric to be tracked before logging your data. This metric will count the number of times the values of a given column meet a user-defined condition. When the profile is generated, you’ll have that information to check against the constraints you’ll create.
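To make the idea concrete, the sketch below mimics what a condition count keeps track of, in plain Python rather than whylogs. It walks over a column once and retains only two numbers: how many values were seen and how many satisfied the condition. The `count_condition` helper and the `is_odd` condition are hypothetical names used for illustration only; they are not part of the whylogs API.

```python
from typing import Any, Callable, Dict, Iterable


def count_condition(values: Iterable[Any], condition: Callable[[Any], bool]) -> Dict[str, int]:
    """Count how many values satisfy `condition`, keeping only the summary."""
    total = 0
    matched = 0
    for value in values:
        total += 1
        if condition(value):
            matched += 1
    return {"total": total, "matched": matched}


def is_odd(x: int) -> bool:
    """Condition for the 'no even numbers' example."""
    return x % 2 != 0


summary = count_condition([3, 7, 10, 5], is_odd)
print(summary)

# The column contains no even numbers only if every value matched.
no_even_numbers = summary["matched"] == summary["total"]
print(no_even_numbers)
```

Once the two counts exist, the original values are no longer needed to decide whether the "no even numbers" rule held.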

In this example, you’ll learn how to:

- Define additional Condition Count Metrics
- Define actions to be triggered whenever those conditions fail during the logging process
- Use the Condition Count Metrics to create constraints against said conditions

If you want more information on Condition Count Metrics, see this example and the documentation for Data Validation.

Installing whylogs#

[ ]:

# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs

Context#

Let’s assume we have a DataFrame for which we wish to log standard metrics through whylogs’ default logging process. But additionally, we want specific information on two columns:

- url: Regex pattern validation: the values in this column should always start with https://www.mydomain.com/profile
- subscription_date: Date format validation: the values in this column should be strings with a date format of %Y-%m-%d

In addition, we consider these cases to be critical, so we wish to take certain actions whenever a condition fails. In this example we will:

- Send an alert in Slack whenever subscription_date fails the condition
- Send an alert in Slack and pull a symbolic Andon Cord whenever url is not from the domain we expect

Let’s first create a simple DataFrame to demonstrate:

[2]:

import pandas as pd
data = {
        "name": ["Alice", "Bob", "Charles"],
        "age": [31,0,25],
        "url": ["https://www.mydomain.com/profile/123", "www.wrongdomain.com", "http://mydomain.com/unsecure"],
        "subscription_date": ["2021-12-28","2019-29-11","04/08/2021"],
    }

df = pd.DataFrame(data)

In this case, both url and subscription_date have 2 values out of 3 that are not what we expect.
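As a quick sanity check that does not depend on whylogs, the sketch below applies both rules to the sample values using only the standard library. The names `URL_PREFIX` and `is_valid_date` are introduced here for illustration and are not part of the tutorial's logging code.

```python
import re
from datetime import datetime

urls = ["https://www.mydomain.com/profile/123", "www.wrongdomain.com", "http://mydomain.com/unsecure"]
dates = ["2021-12-28", "2019-29-11", "04/08/2021"]

# Raw string, with the dots escaped so they only match literal dots.
URL_PREFIX = re.compile(r"^https://www\.mydomain\.com/profile")


def is_valid_date(value: str) -> bool:
    """Return True if value parses as %Y-%m-%d."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


url_ok = [bool(URL_PREFIX.match(u)) for u in urls]
date_ok = [is_valid_date(d) for d in dates]

print(url_ok)   # [True, False, False]
print(date_ok)  # [True, False, False]
```

Only the first value of each column passes, so each column has two failing values.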

Defining the Relations#

Let’s first define the relations that will actually check whether a value passes our constraint. For the date format validation, we’ll use the datetime module in a user-defined function. As for the regex pattern matching, we will use whylogs’ Predicates along with regular expressions, which allow us to build simple relations intuitively.

[3]:

import datetime
from typing import Any
from whylogs.core.relations import Predicate


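def date_format(x: Any) -> bool:
    """Return True if x is a string in %Y-%m-%d format."""
    fmt = "%Y-%m-%d"
    try:
        datetime.datetime.strptime(x, fmt)
        return True
    except (ValueError, TypeError):
        # ValueError: wrong format; TypeError: x is not a string
        return False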

# matches accepts a regular expression; a raw string avoids invalid escape sequences
matches_domain_url = Predicate().matches(r"^https://www\.mydomain\.com/profile")

Defining the Actions#

Next, we need to define the actions that will be triggered whenever the conditions fail.

We will define two placeholder functions that, in a real scenario, would execute the defined actions.

[4]:

from typing import Any

def pull_andon_cord(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Pulling andon cord....")
    # Do something here to respond to the constraint violation
    return

def send_slack_alert(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Sending slack alert....")
    # Do something here to respond to the constraint violation
    return
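In a real scenario an action would usually do something with the failure rather than print it. As one possible shape, the sketch below defines an action with the same three-argument signature that appends each failure to an in-memory list, which could later be forwarded to an alerting system. `FAILURES` and `record_failure` are hypothetical names for illustration only.

```python
from typing import Any, Dict, List

# Hypothetical in-memory buffer; a real action might post to a webhook instead.
FAILURES: List[Dict[str, Any]] = []


def record_failure(validator_name: str, condition_name: str, value: Any) -> None:
    """Action with the same signature as the placeholders: store the failure."""
    FAILURES.append(
        {
            "validator": validator_name,
            "condition": condition_name,
            "value": value,
        }
    )


# Simulate what happens when a condition fails for one value.
record_failure("condition_count", "url_matches_domain", "www.wrongdomain.com")
print(FAILURES)
```

Keeping the action small and free of return values makes it easy to swap the list for a real notification call later.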

Conditions = Relations + Actions#

Conditions are defined by the combination of a relation and a set of actions. Now that we have both relations and actions, we can create our sets of conditions: one for each of the two critical columns, plus a third one, without actions, that counts zeros in the age column. In this example, each set contains a single condition, but we could have multiple.

[5]:

from whylogs.core.metrics.condition_count_metric import Condition

has_date_format = {
    "Y-m-d format": Condition(date_format, actions=[send_slack_alert]),
}

regex_conditions = {"url_matches_domain": Condition(matches_domain_url, actions=[pull_andon_cord,send_slack_alert])}

ints_conditions = {
    "integer_zeros": Condition(Predicate().equals(0)),
}
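To illustrate how a relation and its actions work together, here is a plain-Python sketch of the idea. It is not how whylogs implements conditions: `SketchCondition`, `evaluate`, and `collect_alert` are hypothetical names, and the sketch only assumes that actions fire for values that fail the relation, as described in this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# An action receives (validator_name, condition_name, value), as in the functions above.
Action = Callable[[str, str, Any], None]


@dataclass
class SketchCondition:
    """Hypothetical stand-in for a condition: one relation plus its actions."""
    relation: Callable[[Any], bool]
    actions: List[Action] = field(default_factory=list)


def evaluate(conditions: Dict[str, SketchCondition], values: List[Any]) -> Dict[str, int]:
    """Count matches per condition and fire actions for every failing value."""
    counts = {name: 0 for name in conditions}
    total = 0
    for value in values:
        total += 1
        for name, condition in conditions.items():
            if condition.relation(value):
                counts[name] += 1
            else:
                for action in condition.actions:
                    action("sketch_validator", name, value)
    counts["total"] = total
    return counts


alerts: List[str] = []


def collect_alert(validator_name: str, condition_name: str, value: Any) -> None:
    alerts.append(f"{condition_name} failed for {value!r}")


sketch_conditions = {"integer_zeros": SketchCondition(lambda x: x == 0, actions=[collect_alert])}
counts = evaluate(sketch_conditions, [31, 0, 25])
print(counts)
print(alerts)
```

The result holds one counter per condition name plus a shared total, which is the same shape as the `condition_count/...` columns of a profile summary.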

Passing the conditions to the Logger#

Now, we need to make the logger aware of our Conditions. This can be done by creating a custom schema object that will be passed to why.log().

To create the schema object, we will use the Declarative Schema, which is an auxiliary class that will enable us to create a schema in a simple way.

In this case, we want our schema to start with the default behavior (standard metrics for the default datatypes). Then, we want to add condition count metrics based on the conditions we defined earlier, binding each set of conditions to the name of a column. We can do so by calling the schema’s add_resolver_spec method:

[6]:

from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.schema import DeclarativeSchema

schema = DeclarativeSchema(STANDARD_RESOLVER)

schema.add_resolver_spec(column_name="subscription_date", metrics=[ConditionCountMetricSpec(has_date_format)])
schema.add_resolver_spec(column_name="url", metrics=[ConditionCountMetricSpec(regex_conditions)])
schema.add_resolver_spec(column_name="age", metrics=[ConditionCountMetricSpec(ints_conditions)])

Now, let’s pass the schema to why.log() and start logging our data:

[7]:

import whylogs as why
profile_view = why.log(df, schema=schema).profile().view()

Validator: condition_count
    Condition name url_matches_domain failed for value www.wrongdomain.com
    Pulling andon cord....
Validator: condition_count
    Condition name url_matches_domain failed for value www.wrongdomain.com
    Sending slack alert....
Validator: condition_count
    Condition name url_matches_domain failed for value http://mydomain.com/unsecure
    Pulling andon cord....
Validator: condition_count
    Condition name url_matches_domain failed for value http://mydomain.com/unsecure
    Sending slack alert....
Validator: condition_count
    Condition name Y-m-d format failed for value 2019-29-11
    Sending slack alert....
Validator: condition_count
    Condition name Y-m-d format failed for value 04/08/2021
    Sending slack alert....

You can see that during the logging process, our actions were triggered whenever the condition failed. We can see the name of the failed condition and the specific value that triggered it.

We see the actions were triggered, but we also expect the Condition Count Metrics to be generated. Let’s see if this is the case:

[8]:

profile_view.to_pandas()

[8]:

	cardinality/est	cardinality/lower_1	cardinality/upper_1	condition_count/integer_zeros	condition_count/total	counts/inf	counts/n	counts/nan	counts/null	distribution/max	distribution/mean	distribution/median	distribution/min	distribution/n	distribution/q_01	distribution/q_05	distribution/q_10	distribution/q_25	distribution/q_75	distribution/q_90	distribution/q_95	distribution/q_99	distribution/stddev	frequent_items/frequent_strings	ints/max	ints/min	type	types/boolean	types/fractional	types/integral	types/object	types/string	types/tensor	condition_count/Y-m-d format	condition_count/url_matches_domain
column
age	3.0	3.0	3.00015	1.0	3.0	0	3	0	0	31.0	18.666667	25.0	0.0	3	0.0	0.0	0.0	0.0	31.0	31.0	31.0	31.0	16.441817	[FrequentItem(value='25', est=1, upper=1, lowe...	31.0	0.0	SummaryType.COLUMN	0	0	3	0	0	0	NaN	NaN
name	3.0	3.0	3.00015	NaN	NaN	0	3	0	0	NaN	0.000000	NaN	NaN	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	[FrequentItem(value='Alice', est=1, upper=1, l...	NaN	NaN	SummaryType.COLUMN	0	0	0	0	3	0	NaN	NaN
subscription_date	3.0	3.0	3.00015	NaN	3.0	0	3	0	0	NaN	0.000000	NaN	NaN	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	[FrequentItem(value='2019-29-11', est=1, upper...	NaN	NaN	SummaryType.COLUMN	0	0	0	0	3	0	1.0	NaN
url	3.0	3.0	3.00015	NaN	3.0	0	3	0	0	NaN	0.000000	NaN	NaN	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.000000	[FrequentItem(value='www.wrongdomain.com', est...	NaN	NaN	SummaryType.COLUMN	0	0	0	0	3	0	NaN	1.0

At the far right of our summary dataframe, you can find the Condition Count Metrics: the Y-m-d format condition was met only once out of a total of 3. The same happens for url_matches_domain. Note that for columns where the condition was not defined, a NaN is displayed.

Creating Metric Constraints based on Condition Count Metrics#

So far, we created Condition Count Metrics for each of the desired conditions. During the logging process, the set of actions defined for each condition was triggered whenever the condition failed to be met.

Now, we wish to create Metric Constraints on top of the Condition Count Metrics, so we can generate a Constraints Report. This can be done by using helper constraints such as condition_meets. You only need to specify the column name and the name of the condition you want to check:

[13]:

from whylogs.core.constraints.factories import condition_meets, condition_never_meets, condition_count_below
from whylogs.core.constraints import ConstraintsBuilder

builder = ConstraintsBuilder(dataset_profile_view=profile_view)

builder.add_constraint(condition_meets(column_name="subscription_date", condition_name="Y-m-d format"))
builder.add_constraint(condition_never_meets(column_name="url", condition_name="url_matches_domain"))
builder.add_constraint(condition_count_below(column_name="age", condition_name="integer_zeros", max_count=1))

constraints = builder.build()
constraints.generate_constraints_report()

[13]:

[ReportResult(name='subscription_date meets condition Y-m-d format', passed=0, failed=1, summary=None),
 ReportResult(name='url never meets condition url_matches_domain', passed=0, failed=1, summary=None),
 ReportResult(name='age.integer_zeros lower than or equal to 1', passed=1, failed=0, summary=None)]

The condition_meets constraint will fail if at least one value does not meet the condition. In other words, it fails if condition_count/condition_name is smaller than condition_count/total.

The condition_never_meets constraint will fail if the condition is met at least once. In other words, it fails if condition_count/condition_name is greater than 0.

The condition_count_below constraint will fail if the condition is met more than the specified number of times (max_count).
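The three rules can be restated as small predicates over the counts stored in the profile. The functions below are a hypothetical re-statement for illustration, not the whylogs implementation. They are applied to the counts shown in the summary dataframe, where each condition was met once out of three values.

```python
def meets(matched: int, total: int) -> bool:
    """Passes only if every value met the condition."""
    return matched == total


def never_meets(matched: int) -> bool:
    """Passes only if no value met the condition."""
    return matched == 0


def count_below(matched: int, max_count: int) -> bool:
    """Passes if the condition was met at most max_count times."""
    return matched <= max_count


# Counts read from the summary dataframe: each condition was met 1 time out of 3.
date_passes = meets(matched=1, total=3)          # subscription_date, "Y-m-d format"
url_passes = never_meets(matched=1)              # url, "url_matches_domain"
age_passes = count_below(matched=1, max_count=1) # age, "integer_zeros"

print(date_passes, url_passes, age_passes)  # False False True
```

These outcomes agree with the report: the first two constraints fail and the third passes.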

Visualizing the Report#

You can visualize the Constraints Report as usual by calling NotebookProfileVisualizer’s constraints_report:

[14]:

from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)

[14]:

By hovering over the status, you can view the number of times the condition failed and the total number of times the condition was checked.