🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

Condition Count Metrics#

Open in Colab

By default, whylogs tracks several metrics, such as type counts, distribution metrics, cardinality, and frequent items. Those general metrics cover many use cases, but often we need metrics tailored to our application.

Condition Count Metrics give you the flexibility to define your own customized metrics. The results are returned as counters: the number of times the condition was met for a given column. With them, you can define conditions such as regex matches for strings, equalities or inequalities for numerical features, and even your own function to check for any given condition.

In this example, we will cover:

  1. Create metrics for regex matching

    • Examples: contains email/credit card number (String features)

  2. Create metrics for (in)equalities

    • Examples: equals, less than, greater than, and related comparisons (Numerical features)

  3. Combining metrics with logic operators (and, or, not)

    • Examples: Between range, outside of range, not equal (Numerical features)

  4. Creating metrics with custom functions

    • Examples: is even number, is text preprocessed (Any type)

  5. Going Further: Combining this example with other whylogs’ features

  6. (APPENDIX) Complete code snippets - the full code for each example, gathered in one place to make it easier to copy and paste

Installing whylogs and importing modules#

[1]:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs

Let’s import all the dependencies for this example upfront:

[2]:
import pandas as pd
from typing import Any

import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.datatypes import Fractional, Integral
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Not, Predicate
from whylogs.core.schema import DeclarativeSchema

1. Regex Matching#

Suppose we have textual columns in our data in which we want to make sure certain elements are present or absent.

For example, for privacy and security reasons, we might be interested in tracking the number of times a credit card number appears in a given column, or whether sensitive email information appears in another.

With whylogs, we can define metrics that count the number of times a certain regex pattern is matched in a given column.

Creating sample dataframe#

Let’s create a simple dataframe.

In this scenario, the emails column should contain only a valid email, nothing else. As for the transcriptions column, we want to make sure any existing credit card numbers were properly masked or removed.

[3]:
data = {
    "emails": ["my email is my_email_1989@gmail.com","invalidEmail@xyz.toolong","this.is.ok@hotmail.com","not an email"],
    "transcriptions": ["Bob's credit card number is 4000000000000", "Alice's credit card is XXXXXXXXXXXXX", "Hi, my name is Bob", "Hi, I'm Alice"],
}
df = pd.DataFrame(data=data)

Conditions are defined through whylogs’ Condition object. There are several different ways of assembling a condition. In the following example, we will define two different regex patterns, one for each column. Since we can define multiple conditions for a single column, we’ll assemble the conditions into dictionaries, where the key is the condition name. Each dictionary will later be attached to the relevant column.

[4]:
emails_conditions = {
    "containsEmail": Condition(Predicate().fullmatch("[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}")),
}

transcriptions_conditions = {
    "containsCreditCard": Condition(Predicate().matches(".*4[0-9]{12}(?:[0-9]{3})?"))
}

whylogs must be aware of those conditions while profiling the data. We can do that by creating a schema with the standard resolver, and then simply adding the conditions to it with add_resolver_spec. That way, we can pass our enhanced schema when calling why.log() later.

[5]:
schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="emails", metrics=[ConditionCountMetricSpec(emails_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])

Note: The regex expressions are for demonstration purposes only. They are not general: some valid emails and credit card numbers will not match these patterns.

Now, we only need to pass our schema when logging our data. Let’s also take a look at the metrics to make sure everything was tracked correctly:

[6]:
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/containsEmail', 'condition_count/containsCreditCard', 'condition_count/total']]
[6]:
                condition_count/containsEmail  condition_count/containsCreditCard  condition_count/total
column
emails          1.0                            NaN                                 4
transcriptions  NaN                            1.0                                 4

Let’s check the numbers:

For the emails feature, only one occurrence was counted for containsEmail. That is expected, because the only valid row is the third one (“this.is.ok@hotmail.com”). The others either don’t contain an email, contain an invalid email, or surround the email with extra text (note we’re using fullmatch as the predicate for the email condition).

For the transcriptions column, we also have only one match. That is correct, since only the first row matches the given pattern; the others either don’t have a credit card number or have it properly masked. Note that in this case we want to find the pattern inside a broader text, so we prefix it with .* so the text doesn’t have to start with the pattern (whylogs’ Predicate.matches uses Python’s re.compile().match() under the hood).

The available relations for regex matching are the ones used in this example (a quick illustration of the difference between them follows the list):

  • matches

  • fullmatch
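
Since matches behaves like re.match (anchored at the start of the string) and fullmatch must consume the entire string, here is a quick illustration of the difference using Python’s re module directly (plain Python, not whylogs):

import re

pattern = "4[0-9]{12}(?:[0-9]{3})?"
text = "Bob's credit card number is 4000000000000"

# match() is anchored at the start of the string, so the bare pattern fails here
print(re.compile(pattern).match(text))           # None
# prefixing ".*" lets the pattern occur anywhere in the text
print(re.compile(".*" + pattern).match(text))    # <re.Match object; ...>
# fullmatch() must consume the whole string, which suits exact-format checks
print(re.fullmatch(r"[\w.]+@\w+[.]\w{2,3}", "this.is.ok@hotmail.com"))  # <re.Match object; ...>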

2. Numerical Equalities and Inequalities#

For this one, let’s create integer and float columns:

[7]:
data = {
    "ints_column": [1,12,42,4],
    "floats_column": [1.2, 12.3, 42.2, 4.8]

}
df = pd.DataFrame(data=data)

As before, we will create our set of conditions for each column and pass both to our schema:

[9]:
ints_conditions = {
    "equals42": Condition(Predicate().equals(42)),
    "lessthan5": Condition(Predicate().less_than(5)),
    "morethan40": Condition(Predicate().greater_than(40)),

}

floats_conditions = {
    "equals42.2": Condition(Predicate().equals(42.2)),
    "lessthan5": Condition(Predicate().less_than(5)),
    "morethan40": Condition(Predicate().greater_than(40)),
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_type=Integral, metrics=[ConditionCountMetricSpec(ints_conditions)])
schema.add_resolver_spec(column_type=Fractional, metrics=[ConditionCountMetricSpec(floats_conditions)])

Let’s log and check the metrics:

[10]:
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['types/fractional','types/integral','condition_count/lessthan5', 'condition_count/morethan40','condition_count/equals42','condition_count/equals42.2', 'condition_count/total']]

[10]:
               types/fractional  types/integral  condition_count/lessthan5  condition_count/morethan40  condition_count/equals42  condition_count/equals42.2  condition_count/total
column
floats_column  4                 0               2                          1                           NaN                       1.0                         4
ints_column    0                 4               2                          1                           1.0                       NaN                         4

We can simply check the original data to verify that the metrics are correct. We used equals, less_than, and greater_than in this example, but here’s the complete list of available relations (a short sketch exercising the remaining ones follows the list):

  • equals - equal to

  • less_than - less than

  • less_or_equals - less than or equal to

  • greater_than - greater than

  • greater_or_equals - greater than or equal to

  • not_equal - not equal to
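
As a minimal sketch reusing the df and imports from above, the remaining relations can be exercised in exactly the same way (expected counts for ints_column = [1, 12, 42, 4] are in the comments):

more_conditions = {
    "atMost12": Condition(Predicate().less_or_equals(12)),    # 1, 12, 4 -> count 3
    "atLeast4": Condition(Predicate().greater_or_equals(4)),  # 12, 42, 4 -> count 3
    "not42": Condition(Predicate().not_equal(42)),            # 1, 12, 4 -> count 3
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(more_conditions)])

prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/atMost12', 'condition_count/atLeast4', 'condition_count/not42']]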

3. Combining metrics with logical operators - AND, OR, NOT#

You can also combine relations with logical operators such as AND, OR and NOT.

Let’s stick with the numerical features to show how you can combine relations to assemble conditions such as:

  • Value is between a certain range

  • Value is outside a certain range

  • Value is NOT a certain number

[11]:
conditions = {
    "between10and50": Condition(Predicate().greater_than(10).and_(Predicate().less_than(50))),
    "outside10and50": Condition(Predicate().less_than(10).or_(Predicate().greater_than(50))),
    "not_42": Condition(Not(Predicate().equals(42))),  # could also use X.not_equal(42)  or  X.not_.equals(42)
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])

prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/between10and50', 'condition_count/outside10and50', 'condition_count/not_42', 'condition_count/total']]

[11]:
               condition_count/between10and50  condition_count/outside10and50  condition_count/not_42  condition_count/total
column
floats_column  2                               2                               4                       4
ints_column    2                               2                               3                       4

Available logical operators are:

  • and_

  • or_

  • not_

  • Not

Note that and_, or_, and not_ are methods called on a Predicate and passed another Predicate, while Not is a function that takes a single Predicate argument. Even though we showed these operators with numerical features, this also works with regex matching conditions shown previously.
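
As a sketch of mixing the two, the transcriptions data from section 1 could be checked for rows that mention a credit card while keeping the number masked. The patterns are illustrative only, and the and_/Not composition below simply follows the rules stated above:

masked_conditions = {
    "mentionsCardButMasked": Condition(
        Predicate().matches(".*credit card").and_(
            Not(Predicate().matches(".*4[0-9]{12}(?:[0-9]{3})?"))
        )
    ),
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(masked_conditions)])

# assuming `df` is the emails/transcriptions dataframe from section 1
prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/mentionsCardButMasked', 'condition_count/total']]
# Of the four sample rows, only "Alice's credit card is XXXXXXXXXXXXX" should
# count: it mentions a credit card, but the number itself is masked.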

4. Custom Condition with User-defined functions#

If none of the previous conditions suits your use case, you are free to define a custom function and build your own metric from it.

Let’s see a simple example: suppose we want to check whether a number is even.

We can define an even predicate function as simply as:

[12]:
def even(x: Any) -> bool:
    return x % 2 == 0

Then we proceed as usual, defining our condition and adding it to the schema. We only have to pass the function to a Condition object through Predicate().is_(), like below:

[13]:
conditions = {
    "isEven": Condition(Predicate().is_(even)),
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])

prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isEven', 'condition_count/total']]

[13]:
               condition_count/isEven  condition_count/total
column
floats_column  0                       4
ints_column    3                       4

NLP example#

For user-defined functions, the sky’s the limit for what you can do.

Let’s think of another simple scenario for NLP. Suppose our model assumes the text to be a certain way. Maybe it was trained on text with:

  • lowercased characters

  • no digits

  • no stopwords

Let’s check these conditions for the data below:

[14]:
data = {
    "transcriptions": ["I AM BOB AND I LIKE TO SCREAM","i am bob","am alice and am xx years old","am bob and am 42 years old"],
    "ints": [0,1,2,3],
}
df = pd.DataFrame(data=data)

Once again, let’s define our function:

[15]:
def preprocessed(x: Any) -> bool:
    stopwords = ["i", "me", "myself"]
    if not isinstance(x, str):
        return False

    # should have only lowercase letters and space (no digits)
    if not all(c.islower() or c.isspace() for c in x):
        return False
    # should not contain any words in our stopwords list
    if any(c in stopwords for c in x.split()):
        return False
    return True

Since this is an example, our stopwords list is only a placeholder for the real thing.
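
In a real pipeline you would likely source the list from an NLP library instead. For instance, assuming nltk is installed and its stopwords corpus has been downloaded, a drop-in replacement could look like this (note that the full English list includes words like “am”, so the counts below would differ):

import nltk
nltk.download("stopwords")  # one-time corpus download
from nltk.corpus import stopwords as nltk_stopwords

STOPWORDS = set(nltk_stopwords.words("english"))

def preprocessed(x: Any) -> bool:
    if not isinstance(x, str):
        return False
    # should have only lowercase letters and spaces (no digits)
    if not all(c.islower() or c.isspace() for c in x):
        return False
    # should not contain any words from the stopword list
    return not any(word in STOPWORDS for word in x.split())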

The rest is the same as before:

[16]:
conditions = {
    "isPreprocessed": Condition(Predicate().is_(preprocessed)),
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(conditions)])

prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isPreprocessed', 'condition_count/total']]
[16]:
                condition_count/isPreprocessed  condition_count/total
column
ints            NaN                             NaN
transcriptions  1.0                             4.0

For the transcriptions feature, we can see that only the third row is properly preprocessed (“am alice and am xx years old”). The first one contains uppercase characters, the second contains a stopword (“i”), and the last one contains digits. For the ints column, no condition count metric was registered, so its values show up as NaN.
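
If you prefer to read the counts programmatically rather than through to_pandas(), the metric’s summary can be pulled from the profile view. A minimal sketch (the key names mirror the condition_count/ columns above):

metric = prof_view.get_column("transcriptions").get_metric("condition_count")
print(metric.to_summary_dict())  # e.g. {'total': 4, 'isPreprocessed': 1}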

5. Going Further#

You can combine this example with other whylogs’ features to cover even more scenarios.

Here are some pointers to possible use cases: you can build metric constraints on top of condition count metrics to validate your data, or use condition validators to trigger actions as soon as a condition fails.

Appendix - Complete Code Snippets#

Here are the complete code snippets - just to make it easier to copy/paste!

Regex example#

[17]:
import pandas as pd

import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema

data = {
    "emails": ["my email is my_email_1989@gmail.com","invalidEmail@xyz.toolong","this.is.ok@hotmail.com","not an email"],
    "transcriptions": ["Bob's credit card number is 4000000000000", "Alice's credit card is XXXXXXXXXXXXX", "Hi, my name is Bob", "Hi, I'm Alice"],
}
df = pd.DataFrame(data=data)

emails_conditions = {
    "containsEmail": Condition(Predicate().fullmatch("[\w.]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}")),
}

transcriptions_conditions = {
    "containsCreditCard": Condition(Predicate().matches(".*4[0-9]{12}(?:[0-9]{3})?"))
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="emails", metrics=[ConditionCountMetricSpec(emails_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])

prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/containsEmail', 'condition_count/containsCreditCard', 'condition_count/total']]
[17]:
                condition_count/containsEmail  condition_count/containsCreditCard  condition_count/total
column
emails          1.0                            NaN                                 4
transcriptions  NaN                            1.0                                 4

Equalities Example#

[18]:
import pandas as pd

import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.datatypes import Fractional, Integral
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema

data = {
    "ints_column": [1,12,42,4],
    "floats_column": [1.2, 12.3, 42.2, 4.8]

}
df = pd.DataFrame(data=data)

ints_conditions = {
    "equals42": Condition(Predicate().equals(42)),
    "lessthan5": Condition(Predicate().less_than(5)),
    "morethan40": Condition(Predicate().greater_than(40)),

}

floats_conditions = {
    "equals42.2": Condition(Predicate().equals(42.2)),
    "lessthan5": Condition(Predicate().less_than(5)),
    "morethan40": Condition(Predicate().greater_than(40)),
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_type=Integral, metrics=[ConditionCountMetricSpec(ints_conditions)])
schema.add_resolver_spec(column_type=Fractional, metrics=[ConditionCountMetricSpec(floats_conditions)])

prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['types/fractional','types/integral','condition_count/lessthan5', 'condition_count/morethan40','condition_count/equals42','condition_count/equals42.2', 'condition_count/total']]

[18]:
               types/fractional  types/integral  condition_count/lessthan5  condition_count/morethan40  condition_count/equals42  condition_count/equals42.2  condition_count/total
column
floats_column  4                 0               2                          1                           NaN                       1.0                         4
ints_column    0                 4               2                          1                           1.0                       NaN                         4

Logical Operators Example#

[19]:
import pandas as pd

import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Not, Predicate
from whylogs.core.schema import DeclarativeSchema

data = {
    "ints_column": [1,12,42,4],
    "floats_column": [1.2, 12.3, 42.2, 4.8]

}
df = pd.DataFrame(data=data)

conditions = {
    "between10and50": Condition(Predicate().greater_than(10).and_(Predicate().less_than(50))),
    "outside10and50": Condition(Predicate().less_than(10).or_(Predicate().greater_than(50))),
    "not_42": Condition(Not(Predicate().equals(42))),  # could also use X.not_equal(42)  or  X.not_.equals(42)
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(conditions)])

prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/between10and50', 'condition_count/outside10and50', 'condition_count/not_42', 'condition_count/total']]
[19]:
               condition_count/between10and50  condition_count/outside10and50  condition_count/not_42  condition_count/total
column
floats_column  2                               2                               4                       4
ints_column    2                               2                               3                       4

User-defined function - even#

[20]:
import pandas as pd
from typing import Any

import whylogs as why
from whylogs.core.resolvers import STANDARD_RESOLVER
from whylogs.core.specialized_resolvers import ConditionCountMetricSpec
from whylogs.core.metrics.condition_count_metric import Condition
from whylogs.core.relations import Predicate
from whylogs.core.schema import DeclarativeSchema

def even(x: Any) -> bool:
    return x % 2 == 0

def preprocessed(x: Any) -> bool:
    stopwords = ["i", "me", "myself"]
    if not isinstance(x, str):
        return False

    # should have only lowercase letters and space (no digits)
    if not all(c.islower() or c.isspace() for c in x):
        return False
    # should not contain any words in our stopwords list
    if any(c in stopwords for c in x.split()):
        return False
    return True

data = {
    "transcriptions": ["I AM BOB AND I LIKE TO SCREAM","i am bob","am alice and am xx years old","am bob and am 42 years old"],
    "ints_column": [1,12,42,4],
    "floats_column": [1.2, 12.3, 42.2, 4.8]
}
df = pd.DataFrame(data=data)

transcriptions_conditions = {
    "isPreprocessed": Condition(Predicate().is_(preprocessed)),
}
numerical_conditions = {
    "isEven": Condition(Predicate().is_(even)),
}

schema = DeclarativeSchema(STANDARD_RESOLVER)
schema.add_resolver_spec(column_name="ints_column", metrics=[ConditionCountMetricSpec(numerical_conditions)])
schema.add_resolver_spec(column_name="floats_column", metrics=[ConditionCountMetricSpec(numerical_conditions)])
schema.add_resolver_spec(column_name="transcriptions", metrics=[ConditionCountMetricSpec(transcriptions_conditions)])

prof_view = why.log(df, schema=schema).profile().view()
prof_view.to_pandas()[['condition_count/isPreprocessed','condition_count/isEven', 'condition_count/total']]
[20]:
                condition_count/isPreprocessed  condition_count/isEven  condition_count/total
column
floats_column   NaN                             0.0                     4
ints_column     NaN                             3.0                     4
transcriptions  1.0                             NaN                     4