🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

Schema Configuration for Tracking Metrics#


When logging data, whylogs outputs certain metrics according to the column type. While whylogs provides a default behavior, you can configure it to track only the metrics that are important to you.

In this example, we’ll see how you can configure the dataset schema to control which metrics you want to calculate. We’ll see how to specify metrics:

  1. Per data type

  2. Per column name

But first, let’s talk briefly about whylogs’ data types and basic metrics.

Installing whylogs#

[1]:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
Installing collected packages: whylogs-sketching, types-urllib3, types-requests, whylabs-client, whylogs
Successfully installed types-requests-2.31.0.2 types-urllib3-1.26.25.14 whylabs-client-0.5.4 whylogs-1.3.0 whylogs-sketching-3.4.1.dev3

whylogs DataTypes#

whylogs maps different data types, like numpy arrays, lists, integers, etc. to specific whylogs data types. The three most important whylogs data types are:

  • Integral

  • Fractional

  • String

Anything that doesn’t end up matching the above types will have an AnyType type.

To check which type a certain Python type is mapped to in whylogs, you can use the StandardTypeMapper:

[2]:
from whylogs.core.datatypes import StandardTypeMapper

type_mapper = StandardTypeMapper()

type_mapper(list)
[2]:
<whylogs.core.datatypes.AnyType at 0x7dde641a70d0>

Basic Metrics#

The standard metrics available in whylogs are grouped in namespaces. They are:

  • counts: Counters, such as number of samples and null values

  • types: Inferred types, such as boolean, string or fractional

  • ints: Maximum and minimum values

  • distribution: Min, max, median, and quantile values

  • cardinality: Number of different values

  • frequent_items: Most common values

  • unicode_range: Count of characters used in string values

  • condition_count: Count how often values meet specified conditions

Configuring Metrics in the Dataset Schema#

Now, let’s see how we can control which metrics are tracked according to the column’s type or column name.

Metrics per Type#

Let’s assume you’re not interested in every metric listed above, and you have a performance-critical application, so you’d like to do as few calculations as possible.

For example, you might only be interested in:

  • Counts/Types metrics for every data type

  • Distribution metrics for Fractional

  • Frequent Items for Integral

Let’s see how we can configure our Schema to track only the above metrics for the related types.

Let’s create a sample dataframe to illustrate:

[ ]:
# Install pandas if you don't have it already
%pip install pandas

[4]:
import pandas as pd
d = {"col1": [1, 2, 3], "col2": [3.0, 4.0, 5.0], "col3": ["a", "b", "c"], "col4": [3.0, 4.0, 5.0]}
df = pd.DataFrame(data=d)

whylogs uses Resolvers in order to define how a column name or data type gets mapped to different metrics.

We will create a custom Resolver class in order to customize it.

[5]:
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType, Fractional, Integral
from typing import Dict, List
from whylogs.core.metrics import StandardMetric
from whylogs.core.metrics.metrics import Metric

class MyCustomResolver(Resolver):
    """Resolver that keeps distribution metrics for Fractional and frequent items for Integral, and counters and types metrics for all data types."""

    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:
        metrics: List[StandardMetric] = [StandardMetric.counts, StandardMetric.types]
        if isinstance(why_type, Fractional):
            metrics.append(StandardMetric.distribution)
        if isinstance(why_type, Integral):
            metrics.append(StandardMetric.frequent_items)


        result: Dict[str, Metric] = {}
        for m in metrics:
            result[m.name] = m.zero(column_schema.cfg)
        return result

In the case above, the name parameter is not used, since only the why_type is relevant for mapping the metrics, not the column name.

We initialize metrics with metrics of both the counts and types namespaces, regardless of the data type. Then, we check the whylogs data type in order to add the desired metric namespace (distribution for Fractional columns and frequent_items for Integral columns).

Now we can proceed with the normal process of logging a dataframe. Resolvers are passed to whylogs through a Dataset Schema, so we can pass a DatasetSchema object to log’s schema parameter as follows:

[6]:
import whylogs as why
from whylogs.core import DatasetSchema
result = why.log(df, schema=DatasetSchema(resolvers=MyCustomResolver()))
prof = result.profile()
prof_view = prof.view()
pd.set_option("display.max_columns", None)
prof_view.to_pandas()
WARNING:whylogs.api.whylabs.session.session_manager:No session found. Call whylogs.init() to initialize a session and authenticate. See https://docs.whylabs.ai/docs/whylabs-whylogs-init for more information.
[6]:
counts/inf counts/n counts/nan counts/null frequent_items/frequent_strings type types/boolean types/fractional types/integral types/object types/string types/tensor distribution/max distribution/mean distribution/median distribution/min distribution/n distribution/q_01 distribution/q_05 distribution/q_10 distribution/q_25 distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 distribution/stddev
column
col1 0 3 0 0 [FrequentItem(value='1', est=1, upper=1, lower... SummaryType.COLUMN 0 0 3 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
col2 0 3 0 0 NaN SummaryType.COLUMN 0 3 0 0 0 0 5.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 5.0 5.0 5.0 1.0
col3 0 3 0 0 NaN SummaryType.COLUMN 0 0 0 0 3 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
col4 0 3 0 0 NaN SummaryType.COLUMN 0 3 0 0 0 0 5.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 5.0 5.0 5.0 1.0

Notice we have counts and types metrics for every type, distribution metrics only for col2 and col4 (floats) and frequent_items only for col1 (ints).

That’s precisely what we wanted.

Metrics per Column#

Now, suppose we don’t want to specify the tracked metrics per data type, but rather per specific column name.

For example, we might want to track:

  • Count metrics for col1

  • Distribution Metrics for col2

  • Cardinality for col3

  • Distribution Metrics + Cardinality for col4

The process is similar to the previous case. We only need to change the if clauses to check for the name instead of why_type, like this:

[7]:
from whylogs.core.resolvers import Resolver
from whylogs.core.datatypes import DataType, Fractional, Integral
from typing import Dict, List
from whylogs.core.metrics import StandardMetric
from whylogs.core.metrics.metrics import Metric

class MyCustomResolver(Resolver):
    """Resolver that keeps count metrics for col1, distribution metrics for col2, cardinality for col3, and distribution metrics plus cardinality for col4."""

    def resolve(self, name: str, why_type: DataType, column_schema) -> Dict[str, Metric]:
        metrics = []
        if name=='col1':
            metrics.append(StandardMetric.counts)
        if name=='col2':
            metrics.append(StandardMetric.distribution)
        if name=='col3':
            metrics.append(StandardMetric.cardinality)
        if name=='col4':
            metrics.append(StandardMetric.distribution)
            metrics.append(StandardMetric.cardinality)



        result: Dict[str, Metric] = {}
        for m in metrics:
            result[m.name] = m.zero(column_schema.cfg)
        return result

Since there are no metrics common to all columns, we can initialize metrics as an empty list, and then append the relevant metrics for each column.

Now, we create a custom schema, just like before:

[8]:
import whylogs as why
from whylogs.core import DatasetSchema
df['col5'] = 0
result = why.log(df, schema=DatasetSchema(resolvers=MyCustomResolver()))
prof = result.profile()
prof_view = prof.view()
pd.set_option("display.max_columns", None)
prof_view.to_pandas()
[8]:
counts/inf counts/n counts/nan counts/null type distribution/max distribution/mean distribution/median distribution/min distribution/n distribution/q_01 distribution/q_05 distribution/q_10 distribution/q_25 distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 distribution/stddev cardinality/est cardinality/lower_1 cardinality/upper_1
column
col1 0.0 3.0 0.0 0.0 SummaryType.COLUMN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
col2 NaN NaN NaN NaN SummaryType.COLUMN 5.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 5.0 5.0 5.0 1.0 NaN NaN NaN
col3 NaN NaN NaN NaN SummaryType.COLUMN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3.0 3.0 3.00015
col4 NaN NaN NaN NaN SummaryType.COLUMN 5.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 5.0 5.0 5.0 1.0 3.0 3.0 3.00015
col5 NaN NaN NaN NaN SummaryType.COLUMN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Note that existing columns that are not specified in your custom resolver won’t have any metrics tracked. In the example above, we added a col5 column, but since we didn’t link any metrics to it, all of the metrics are NaNs.

Declarative Schema#

In the previous section, we created subclasses of Resolver and implemented its resolve() method using control flow. The DeclarativeSchema allows us to customize the metrics present in a column by simply listing the metrics we want by data type or column name without implementing a Resolver subclass.

Declarative Schema Specification#

A ResolverSpec specifies a list of metrics to use for columns that match it. We can match columns by name or by type; the column name takes precedence if both are given. Each ResolverSpec has a list of MetricSpecs that specify the Metrics (and optionally custom configurations) to apply to matching columns. For example:

[9]:
from whylogs.core.metrics.condition_count_metric import (
    Condition,
    ConditionCountConfig,
    ConditionCountMetric,
)
from whylogs.core.relations import Predicate
from whylogs.core.resolvers import COLUMN_METRICS, MetricSpec, ResolverSpec
from whylogs.core.schema import DeclarativeSchema
from whylogs.core.datatypes import AnyType, DataType, Fractional, Integral, String

X = Predicate()


schema = DeclarativeSchema(
    [
        ResolverSpec(
            column_name="col1",
            metrics=[
                MetricSpec(StandardMetric.distribution.value),
                MetricSpec(
                    ConditionCountMetric,
                    ConditionCountConfig(
                        conditions={
                            "below 42": Condition(lambda x: x < 42),
                            "above 42": Condition(lambda x: x > 42),
                        }
                    ),
                ),
            ],
        ),
        ResolverSpec(
            column_type=String,
            metrics=[
                MetricSpec(StandardMetric.frequent_items.value),
                MetricSpec(
                    ConditionCountMetric,
                    ConditionCountConfig(
                        conditions={
                            "alpha": Condition(X.matches("[a-zA-Z]+")),
                            "digit": Condition(X.matches("[0-9]+")),
                        }
                    ),
                ),
            ],
        ),
    ]
)

d = {"col1": [1, 2, 3], "col2": [3.0, 4.0, 5.0], "col3": ["a", "b", "c"], "col4": [3.0, 4.0, 5.0]}
df = pd.DataFrame(data=d)
result = why.log(df, schema=schema)
prof_view = result.profile().view()
prof_view.to_pandas()
[9]:
condition_count/above 42 condition_count/below 42 condition_count/total distribution/max distribution/mean distribution/median distribution/min distribution/n distribution/q_01 distribution/q_05 distribution/q_10 distribution/q_25 distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 distribution/stddev type condition_count/alpha condition_count/digit frequent_items/frequent_strings
column
col1 0.0 3.0 3.0 3.0 2.0 2.0 1.0 3.0 1.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0 1.0 SummaryType.COLUMN NaN NaN NaN
col2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN SummaryType.COLUMN NaN NaN NaN
col3 NaN NaN 3.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN SummaryType.COLUMN 3.0 0.0 [FrequentItem(value='c', est=1, upper=1, lower...
col4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN SummaryType.COLUMN NaN NaN NaN

We can now pass schema to why.log() to log data according to the schema. Note that we pass the Metric class to the MetricSpec constructor, not an instance. In this example, col1 will have a ConditionCountMetric that tracks how often the column entries are above or below 42. Any string column will track how many entries are alphabetic and how many are numeric.

whylogs.core.resolvers.COLUMN_METRICS is a list of MetricSpecs for the metrics WhyLabs expects in each column. There are also some predefined ResolverSpec lists to cover common use cases. For example, STANDARD_RESOLVER specifies the same metrics as the StandardResolver:

[10]:
STANDARD_RESOLVER = [
    ResolverSpec(
        column_type=Integral,
        metrics=COLUMN_METRICS
        + [
            MetricSpec(StandardMetric.distribution.value),
            MetricSpec(StandardMetric.ints.value),
            MetricSpec(StandardMetric.cardinality.value),
            MetricSpec(StandardMetric.frequent_items.value),
        ],
    ),
    ResolverSpec(
        column_type=Fractional,
        metrics=COLUMN_METRICS
        + [
            MetricSpec(StandardMetric.distribution.value),
            MetricSpec(StandardMetric.cardinality.value),
        ],
    ),
    ResolverSpec(
        column_type=String,
        metrics=COLUMN_METRICS
        + [
            MetricSpec(StandardMetric.unicode_range.value),
            MetricSpec(StandardMetric.distribution.value),
            MetricSpec(StandardMetric.cardinality.value),
            MetricSpec(StandardMetric.frequent_items.value),
        ],
    ),
    ResolverSpec(column_type=AnyType, metrics=COLUMN_METRICS),
]

There are also declarations for:

  • LIMITED_TRACKING_RESOLVER: tracks only the metrics required by WhyLabs, plus the distribution metric for numeric columns.

  • NO_FI_RESOLVER: the same as STANDARD_RESOLVER, but omits the frequent items metrics.

  • HISTOGRAM_COUNTING_TRACKING_RESOLVER: tracks only the distribution metric for each column.

These provide handy starting points if we just want to add one or two metrics to one of these standard schemas using the add_resolver() method:

[11]:
from whylogs.core.resolvers import STANDARD_RESOLVER

schema = DeclarativeSchema(STANDARD_RESOLVER)
extra_metric = ResolverSpec(
    column_name="col1",
    metrics=[
        MetricSpec(StandardMetric.distribution.value),
        MetricSpec(
            ConditionCountMetric,
            ConditionCountConfig(
                conditions={
                    "below 42": Condition(lambda x: x < 42),
                    "above 42": Condition(lambda x: x > 42),
                }
            ),
        ),
    ],
)
schema.add_resolver(extra_metric)

result = why.log(df, schema=schema)
prof_view = result.profile().view()
prof_view.to_pandas()
WARNING:whylogs.core.resolvers:Conflicting resolvers for distribution metric in column 'col1' of type int
[11]:
cardinality/est cardinality/lower_1 cardinality/upper_1 condition_count/above 42 condition_count/below 42 condition_count/total counts/inf counts/n counts/nan counts/null distribution/max distribution/mean distribution/median distribution/min distribution/n distribution/q_01 distribution/q_05 distribution/q_10 distribution/q_25 distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 distribution/stddev frequent_items/frequent_strings ints/max ints/min type types/boolean types/fractional types/integral types/object types/string types/tensor
column
col1 3.0 3.0 3.00015 0.0 3.0 3.0 0 3 0 0 3.0 2.0 2.0 1.0 3 1.0 1.0 1.0 1.0 3.0 3.0 3.0 3.0 1.0 [FrequentItem(value='1', est=1, upper=1, lower... 3.0 1.0 SummaryType.COLUMN 0 0 3 0 0 0
col2 3.0 3.0 3.00015 NaN NaN NaN 0 3 0 0 5.0 4.0 4.0 3.0 3 3.0 3.0 3.0 3.0 5.0 5.0 5.0 5.0 1.0 NaN NaN NaN SummaryType.COLUMN 0 3 0 0 0 0
col3 3.0 3.0 3.00015 NaN NaN NaN 0 3 0 0 NaN 0.0 NaN NaN 0 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 [FrequentItem(value='c', est=1, upper=1, lower... NaN NaN SummaryType.COLUMN 0 0 0 0 3 0
col4 3.0 3.0 3.00015 NaN NaN NaN 0 3 0 0 5.0 4.0 4.0 3.0 3 3.0 3.0 3.0 3.0 5.0 5.0 5.0 5.0 1.0 NaN NaN NaN SummaryType.COLUMN 0 3 0 0 0 0

This example adds a condition count metric to col1 in addition to the usual default metrics.

Default Resolver#

If you instantiate a DeclarativeResolver without passing it a list of ResolverSpecs, it will use the value of the variable whylogs.core.resolvers.DEFAULT_RESOLVER. Initially this has the value of STANDARD_RESOLVER, which matches whylogs’ default behavior. You can set it to one of the other predefined resolver lists, or to your own custom resolver list, to customize the default resolving behavior.

Similarly, there is a whylogs.experimental.core.metrics.udf_metric.DEFAULT_UDF_RESOLVER variable that specifies the default resolvers for the submetrics in a UdfMetric.

Excluding Metrics#

The ResolverSpec has an exclude field. If this is set to True, the metrics listed in the ResolverSpec are excluded from columns that match it. This can be handy for preventing sensitive information from “leaking” via a frequent items metric:

[15]:
from whylogs.core.resolvers import DEFAULT_RESOLVER

data = pd.DataFrame({"Sensitive": ["private", "secret"], "Boring": ["normal", "stuff"]})
schema = DeclarativeSchema(
    DEFAULT_RESOLVER + [ResolverSpec(
        column_name = "Sensitive",
        metrics = [MetricSpec(StandardMetric.frequent_items.value)],
        exclude = True
    )]
)
result = why.log(data, schema=schema)
result.profile().view().to_pandas()["frequent_items/frequent_strings"]
[15]:
column
Boring       [FrequentItem(value='normal', est=1, upper=1, ...
Sensitive                                                  NaN
Name: frequent_items/frequent_strings, dtype: object

The frequent items metric has been excluded from the Sensitive column without affecting the DEFAULT_RESOLVER’s treatment of other columns.