Creating Metric Constraints on Condition Count Metrics#

whylogs profiles contain summarized information about our data. Profiling is therefore a lossy process: once we have the profiles, we no longer have access to the complete set of data.

This makes some types of constraints impossible to create from standard metrics alone. For example, suppose you need to check every row of a column to ensure that no text matches a credit card number or an email address. Or maybe you’re interested in ensuring that there are no even numbers in a certain column. How do we do that if we don’t have access to the complete data?

The answer is that you need to define a Condition Count Metric to be tracked before logging your data. This metric counts the number of times the values of a given column meet a user-defined condition. When the profile is generated, you’ll have that information to check against the constraints you create.
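As a rough illustration of the idea (plain Python, not whylogs’ actual implementation), a condition count boils down to two numbers per condition: how many values were checked and how many of them satisfied the condition. The names `count_condition` and `is_even` are hypothetical and exist only for this sketch.

```python
from typing import Any, Callable, Dict, Iterable


def count_condition(values: Iterable[Any], condition: Callable[[Any], bool]) -> Dict[str, int]:
    """Count how many values were checked and how many met the condition."""
    total = 0
    matches = 0
    for value in values:
        total += 1
        if condition(value):
            matches += 1
    return {"total": total, "matches": matches}


def is_even(x: int) -> bool:
    """Hypothetical condition: the value is an even number."""
    return x % 2 == 0


counts = count_condition([31, 0, 25], is_even)
print(counts)  # {'total': 3, 'matches': 1}
```

Only these counts need to survive into the summary, which is why the condition has to be declared before the data is logged.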

In this example, you’ll learn how to:

  • Define additional Condition Count Metrics

  • Define actions to be triggered whenever those conditions fail during the logging process

  • Use the Condition Count Metrics to create constraints against said conditions

For more information on Condition Count Metrics, see this example and the documentation for Data Validation.

Installing whylogs#

[ ]:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs

Context#

Let’s assume we have a DataFrame for which we wish to log standard metrics through whylogs’ default logging process. But additionally, we want specific information on two columns:

  • url: Regex pattern validation: the values in this column should always start with https://www.mydomain.com/profile

  • subscription_date: Date Format validation: the values in this column should be a string with a date format of %Y-%m-%d

In addition, we consider these cases to be critical, so we wish to take certain actions whenever a condition fails. In this example we will:

  • Send an alert in Slack whenever subscription_date fails the condition

  • Send an alert in Slack and pull a symbolic Andon Cord whenever url is not from the domain we expect

Let’s first create a simple DataFrame to demonstrate:

[2]:
import pandas as pd
data = {
        "name": ["Alice", "Bob", "Charles"],
        "age": [31,0,25],
        "url": ["https://www.mydomain.com/profile/123", "www.wrongdomain.com", "http://mydomain.com/unsecure"],
        "subscription_date": ["2021-12-28","2019-29-11","04/08/2021"],
    }

df = pd.DataFrame(data)

In this case, both url and subscription_date have 2 values out of 3 that are not what we expect.

Defining the Relations#

Let’s first define the relations that will actually check whether a value passes our constraint. For the date format validation, we’ll use the datetime module in a user-defined function. As for the Regex pattern matching, we will use whylogs’ Predicates along with regular expressions, which allow us to build simple relations intuitively.

[3]:
import datetime
from typing import Any
from whylogs.core.relations import Predicate


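def date_format(x: Any) -> bool:
    """Return True if x is a string in %Y-%m-%d format."""
    expected_format = "%Y-%m-%d"
    try:
        datetime.datetime.strptime(x, expected_format)
        return True
    except (TypeError, ValueError):
        # TypeError covers non-string values such as None or numbers
        return False

# matches accepts a regular expression; a raw string avoids invalid escape sequences
matches_domain_url = Predicate().matches(r"^https://www\.mydomain\.com/profile")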
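Before wiring the relations into whylogs, it can help to sanity-check them against the sample values. The sketch below uses only the standard library and assumes that `re.match` is a reasonable stand-in for `Predicate().matches`; `is_ymd` and `DOMAIN_PATTERN` are illustrative names, not part of whylogs.

```python
import datetime
import re
from typing import Any

# Sample values copied from the DataFrame defined earlier
urls = ["https://www.mydomain.com/profile/123", "www.wrongdomain.com", "http://mydomain.com/unsecure"]
dates = ["2021-12-28", "2019-29-11", "04/08/2021"]

# Assumption: re.match approximates the behavior of Predicate().matches
DOMAIN_PATTERN = re.compile(r"^https://www\.mydomain\.com/profile")


def is_ymd(x: Any) -> bool:
    """Return True if x is a string that parses as %Y-%m-%d."""
    try:
        datetime.datetime.strptime(x, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


url_results = [bool(DOMAIN_PATTERN.match(u)) for u in urls]
date_results = [is_ymd(d) for d in dates]

print(url_results)   # [True, False, False]
print(date_results)  # [True, False, False]
```

Only the first value of each column passes, which agrees with the statement that 2 out of 3 values are not what we expect.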

Defining the Actions#

Next, we need to define the actions that will be triggered whenever the conditions fail.

We will define two placeholder functions that, in a real scenario, would execute the defined actions.

[4]:
from typing import Any

def pull_andon_cord(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Pulling andon cord....")
    # Do something here to respond to the constraint violation
    return

def send_slack_alert(validator_name, condition_name: str, value: Any):
    print("Validator: {}\n    Condition name {} failed for value {}".format(validator_name, condition_name, value))
    print("    Sending slack alert....")
    # Do something here to respond to the constraint violation
    return

Conditions = Relations + Actions#

Conditions are defined by the combination of a relation and a set of actions. Now that we have both relations and actions, we can create our sets of conditions. In this example, each set contains a single condition, but we could have multiple. We also define a third set, ints_conditions, that counts zeros in an integer column and has no actions attached.

[5]:
from whylogs.core.metrics.condition_count_metric import Condition

has_date_format = {
    "Y-m-d format": Condition(date_format, actions=[send_slack_alert]),
}

regex_conditions = {"url_matches_domain": Condition(matches_domain_url, actions=[pull_andon_cord,send_slack_alert])}

ints_conditions = {
    "integer_zeros": Condition(Predicate().equals(0)),
}
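To make the relation-plus-actions idea concrete, here is a minimal pure-Python sketch of how such conditions might be evaluated: each value is checked against the relation, and when the check fails every action is called with the same `(validator_name, condition_name, value)` arguments that the placeholder actions above accept. `SimpleCondition`, `evaluate`, and `record_failure` are hypothetical names for illustration and do not reflect how whylogs implements this internally.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, List, Tuple

Action = Callable[[str, str, Any], None]


@dataclass
class SimpleCondition:
    """Hypothetical stand-in for a condition: a relation plus optional actions."""
    relation: Callable[[Any], bool]
    actions: List[Action] = field(default_factory=list)


def evaluate(conditions: Dict[str, SimpleCondition], values: Iterable[Any]) -> Dict[str, int]:
    """Count matches per condition and fire the actions on every failure."""
    counts = {name: 0 for name in conditions}
    counts["total"] = 0
    for value in values:
        counts["total"] += 1
        for name, cond in conditions.items():
            if cond.relation(value):
                counts[name] += 1
            else:
                for action in cond.actions:
                    action("condition_count", name, value)
    return counts


failed: List[Tuple[str, Any]] = []


def record_failure(validator_name: str, condition_name: str, value: Any) -> None:
    """Example action: remember which value failed which condition."""
    failed.append((condition_name, value))


conditions = {"integer_zeros": SimpleCondition(lambda x: x == 0, actions=[record_failure])}
counts = evaluate(conditions, [31, 0, 25])
print(counts)  # {'integer_zeros': 1, 'total': 3}
print(failed)  # [('integer_zeros', 31), ('integer_zeros', 25)]
```

Passing an empty action list, as ints_conditions does, simply counts matches without triggering anything.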

Passing the conditions to the Logger#

Now, we need to make the logger aware of our Conditions. This can be done by creating a custom schema object that will be passed to why.log().

To create the schema object, we will use the Declarative Schema, which is an auxiliary class that will enable us to create a schema in a simple way.

In this case, we want our schema to start with the default behavior (standard metrics for the default datatypes). Then, we want to add condition count metrics for the conditions we defined earlier, binding each set of conditions to the name of the column it applies to. We can do so by calling the schema’s add_resolver_spec method with a ConditionCountMetricSpec:

[6]:
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.schema import DeclarativeSchema

schema = DeclarativeSchema(STANDARD_RESOLVER)

schema.add_resolver_spec(column_name="subscription_date", metrics=[ConditionCountMetricSpec(has_date_format)])
schema.add_resolver_spec(column_name="url", metrics=[ConditionCountMetricSpec(regex_conditions)])
schema.add_resolver_spec(column_name="age", metrics=[ConditionCountMetricSpec(ints_conditions)])

Now, let’s pass the schema to why.log() and start logging our data:

[7]:
import whylogs as why
profile_view = why.log(df, schema=schema).profile().view()
Validator: condition_count
    Condition name url_matches_domain failed for value www.wrongdomain.com
    Pulling andon cord....
Validator: condition_count
    Condition name url_matches_domain failed for value www.wrongdomain.com
    Sending slack alert....
Validator: condition_count
    Condition name url_matches_domain failed for value http://mydomain.com/unsecure
    Pulling andon cord....
Validator: condition_count
    Condition name url_matches_domain failed for value http://mydomain.com/unsecure
    Sending slack alert....
Validator: condition_count
    Condition name Y-m-d format failed for value 2019-29-11
    Sending slack alert....
Validator: condition_count
    Condition name Y-m-d format failed for value 04/08/2021
    Sending slack alert....

You can see that during the logging process, our actions were triggered whenever the condition failed. We can see the name of the failed condition and the specific value that triggered it.

We see the actions were triggered, but we also expect the Condition Count Metrics to be generated. Let’s see if this is the case:

[8]:
profile_view.to_pandas()
[8]:
cardinality/est cardinality/lower_1 cardinality/upper_1 condition_count/integer_zeros condition_count/total counts/inf counts/n counts/nan counts/null distribution/max distribution/mean distribution/median distribution/min distribution/n distribution/q_01 distribution/q_05 distribution/q_10 distribution/q_25 distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 distribution/stddev frequent_items/frequent_strings ints/max ints/min type types/boolean types/fractional types/integral types/object types/string types/tensor condition_count/Y-m-d format condition_count/url_matches_domain
column
age 3.0 3.0 3.00015 1.0 3.0 0 3 0 0 31.0 18.666667 25.0 0.0 3 0.0 0.0 0.0 0.0 31.0 31.0 31.0 31.0 16.441817 [FrequentItem(value='25', est=1, upper=1, lowe... 31.0 0.0 SummaryType.COLUMN 0 0 3 0 0 0 NaN NaN
name 3.0 3.0 3.00015 NaN NaN 0 3 0 0 NaN 0.000000 NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 [FrequentItem(value='Alice', est=1, upper=1, l... NaN NaN SummaryType.COLUMN 0 0 0 0 3 0 NaN NaN
subscription_date 3.0 3.0 3.00015 NaN 3.0 0 3 0 0 NaN 0.000000 NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 [FrequentItem(value='2019-29-11', est=1, upper... NaN NaN SummaryType.COLUMN 0 0 0 0 3 0 1.0 NaN
url 3.0 3.0 3.00015 NaN 3.0 0 3 0 0 NaN 0.000000 NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 [FrequentItem(value='www.wrongdomain.com', est... NaN NaN SummaryType.COLUMN 0 0 0 0 3 0 NaN 1.0

At the far right of our summary dataframe, you can find the Condition Count Metrics: the Y-m-d format condition was met only once out of a total of 3. The same happens for url_matches_domain. Note that for columns where a condition was not defined, NaN is displayed.

Creating Metric Constraints based on Condition Count Metrics#

So far, we created Condition Count Metrics for each of the desired conditions. During the logging process, the set of actions defined for each condition was triggered whenever the condition failed to be met.

Now, we wish to create Metric Constraints on top of the Condition Count Metrics, so we can generate a Constraints Report. This can be done by using the condition_meets, condition_never_meets, and condition_count_below helper constraints. You only need to specify the column name and the name of the condition you want to check (plus max_count for condition_count_below):

[13]:
from whylogs.core.constraints.factories import condition_meets, condition_never_meets, condition_count_below
from whylogs.core.constraints import ConstraintsBuilder

builder = ConstraintsBuilder(dataset_profile_view=profile_view)

builder.add_constraint(condition_meets(column_name="subscription_date", condition_name="Y-m-d format"))
builder.add_constraint(condition_never_meets(column_name="url", condition_name="url_matches_domain"))
builder.add_constraint(condition_count_below(column_name="age", condition_name="integer_zeros", max_count=1))

constraints = builder.build()
constraints.generate_constraints_report()
[13]:
[ReportResult(name='subscription_date meets condition Y-m-d format', passed=0, failed=1, summary=None),
 ReportResult(name='url never meets condition url_matches_domain', passed=0, failed=1, summary=None),
 ReportResult(name='age.integer_zeros lower than or equal to 1', passed=1, failed=0, summary=None)]

The condition_meets constraint will fail if the condition is not met for at least one value. In other words, it fails if condition_count/condition_name is smaller than condition_count/total.

The condition_never_meets constraint will fail if the condition is met at least once. In other words, it fails if condition_count/condition_name is greater than 0.

The condition_count_below constraint will fail if the condition is met more than the specified number of times (max_count).
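The three rules can be written out as simple comparisons on the counts stored in the profile. This is a sketch of the semantics described here, not the whylogs source; the function names are hypothetical, and the counts are the ones shown in the profile summary above (each condition was met once out of 3 values).

```python
def condition_meets_passes(matches: int, total: int) -> bool:
    """Passes only if every checked value met the condition."""
    return matches == total


def condition_never_meets_passes(matches: int) -> bool:
    """Passes only if no value met the condition."""
    return matches == 0


def condition_count_below_passes(matches: int, max_count: int) -> bool:
    """Passes if the condition was met at most max_count times."""
    return matches <= max_count


results = {
    "subscription_date meets Y-m-d format": condition_meets_passes(matches=1, total=3),
    "url never meets url_matches_domain": condition_never_meets_passes(matches=1),
    "age.integer_zeros <= 1": condition_count_below_passes(matches=1, max_count=1),
}
for name, passed in results.items():
    print(name, "->", "passed" if passed else "failed")
```

The first two checks fail and the third passes, which matches the report generated above.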

Visualizing the Report#

You can visualize the Constraints Report as usual by calling NotebookProfileVisualizer’s constraints_report:

[14]:
from whylogs.viz import NotebookProfileVisualizer

visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)
[14]:

By hovering on the status, you can view the number of times the condition failed, and the total number of times the condition was checked.