Data Validation with Metric Constraints

This example applies to whylogs versions 1.0.0 and above. If you're interested in constraints for versions before 1.0.0, please see these examples: Constraints Suite, Constraints-Distributional Measures, and Creating Customized Constraints.

[ ]:
# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[viz]'

Starting with basic pandas DataFrame logging, consider the following input. We will generate a whylogs profile view from it.

[2]:
import pandas as pd
import whylogs as why

data = {
    "animal": ["cat", "hawk", "snake", "cat", "mosquito"],
    "legs": [4, 2, 0, 4, 6],
    "weight": [4.3, 1.8, 1.3, 4.1, 5.5e-6],
}

results = why.log(pd.DataFrame(data))
profile_view = results.view()

The profile view can be displayed as a pandas DataFrame whose columns are metric/component paths:

[3]:
profile_view.to_pandas()
[3]:
cardinality/est cardinality/lower_1 cardinality/upper_1 counts/inf counts/n counts/nan counts/null distribution/max distribution/mean distribution/median ... distribution/stddev frequent_items/frequent_strings type types/boolean types/fractional types/integral types/object types/string ints/max ints/min
column
animal 4.0 4.0 4.00020 0 5 0 0 NaN 0.000000 NaN ... 0.000000 [FrequentItem(value='cat', est=2, upper=2, low... SummaryType.COLUMN 0 0 0 0 5 NaN NaN
legs 4.0 4.0 4.00020 0 5 0 0 6.0 3.200000 4.0 ... 2.280351 [FrequentItem(value='4', est=2, upper=2, lower... SummaryType.COLUMN 0 0 5 0 0 6.0 0.0
weight 5.0 5.0 5.00025 0 5 0 0 4.3 2.300001 1.8 ... 1.856069 NaN SummaryType.COLUMN 0 5 0 0 0 NaN NaN

3 rows × 30 columns

In the output above, notice that the "legs" column has metrics on the number of legs these animals have. Let's say we want to define some constraints on the number of legs we expect for animals.

[4]:
from whylogs.core.constraints import Constraints, ConstraintsBuilder, MetricsSelector, MetricConstraint
column_view = profile_view.get_column("legs")

# constraint session bound to profile_view
builder = ConstraintsBuilder(profile_view)

# A constraints builder lets you generate a set of constraints using the columns and metrics
# of the profile_view passed in. Let's explore which column profiles and metrics are available.

# We can specify a metric by selecting a (column_name, metric_name) pair.
# Let's look at the column names again:
column_names = profile_view.get_columns().keys()
print(f"columns: {column_names}")

# And here are the metric names on the "legs" column
metric_names = profile_view.get_column("legs").get_metric_names()
print(f"metric names: {metric_names}")

# If you want the full set of possibilities, you can ask the builder for all MetricsSelectors,
# which cover the unique combinations of (column_name, metric_name)
selectors = builder.get_metric_selectors()
i = 6
print(f"here is selector at index {i}: {selectors[i]} there are a total of {len(selectors)}")


columns: dict_keys(['animal', 'legs', 'weight'])
metric names: ['counts', 'types', 'distribution', 'ints', 'cardinality', 'frequent_items']
here is selector at index 6: MetricsSelector(metric_name='types', column_name='legs', metrics_resolver=None) there are a total of 15
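
The total of 15 can be cross-checked with a plain-Python sketch that needs no whylogs. The "legs" metric names are copied from the output above; the lists for "animal" and "weight" are assumptions inferred from which columns of the profile table are not NaN, and the column-by-column ordering is only one enumeration consistent with the printed selector at index 6.

```python
# Metric names per column. "legs" is copied from the output above; "animal" and
# "weight" are assumptions inferred from the non-NaN columns of the profile table.
metrics_by_column = {
    "animal": ["counts", "types", "distribution", "cardinality", "frequent_items"],
    "legs": ["counts", "types", "distribution", "ints", "cardinality", "frequent_items"],
    "weight": ["counts", "types", "distribution", "cardinality"],
}

# One (column_name, metric_name) pair per selector, enumerated column by column.
pairs = [
    (column, metric)
    for column, metrics in metrics_by_column.items()
    for metric in metrics
]

print(len(pairs))  # number of unique (column_name, metric_name) combinations
print(pairs[6])    # the pair at the same index as the selector printed above
```

Under these assumptions the count is 5 + 6 + 4, and index 6 lands on the ("legs", "types") pair, which agrees with the selector shown above.
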
[5]:
# Let's say we're interested in defining a constraint on the number of "legs". From the output above we see
# that column "legs" has the following metrics: [counts, types, distribution, ints, cardinality, frequent_items]
# Let's look at what the distribution metric contains:
distribution_values = profile_view.get_column("legs").get_metric("distribution").to_summary_dict()
distribution_values
[5]:
{'mean': 3.2,
 'stddev': 2.280350850198276,
 'n': 5,
 'max': 6.0,
 'min': 0.0,
 'q_01': 0.0,
 'q_05': 0.0,
 'q_10': 0.0,
 'q_25': 2.0,
 'median': 4.0,
 'q_75': 4.0,
 'q_90': 6.0,
 'q_95': 6.0,
 'q_99': 6.0}
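
As a sanity check on these numbers, the same statistics can be recomputed from the raw "legs" values with the standard library alone. This is only a sketch: whylogs builds its profile from streaming summaries, so on larger datasets the quantiles may be estimates rather than exact values. For these five rows the results should line up.

```python
import statistics

legs = [4, 2, 0, 4, 6]  # the "legs" column from the input data

mean = statistics.mean(legs)
stddev = statistics.stdev(legs)    # sample standard deviation (n - 1 denominator)
median = statistics.median(legs)

print(f"mean={mean} stddev={stddev} median={median} min={min(legs)} max={max(legs)}")
```

The reported stddev agrees with the sample standard deviation (dividing by n - 1), not the population one, which is worth knowing when choosing a threshold for a stddev constraint.
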

OK, let's come back to how to use the ConstraintsBuilder to add a couple of constraints.

[6]:
# The constraints builder's add_constraint() takes a MetricConstraint, which is defined by three things:
# 1. A metric selector: a way of selecting which metric, on which column, the constraint applies to.
#   Let's choose MetricsSelector(metric_name='distribution', column_name='legs', metrics_resolver=None)
# 2. An expression on the selected metric. For distribution, we have numeric properties such as max, min, stddev
#   and others we can reference. Here we'll require animal legs < 12 (sorry, centipedes)!
# 3. A name for this constraint. Let's go with "legs < 12"

distribution_legs = MetricsSelector(metric_name='distribution', column_name='legs')

# this lambda takes in a distribution metric, which has convenience properties on this metric for max/min,
# but we could also call to_summary_dict() and use any of the keys we saw in 'distribution_values' above
legs_under_12 = lambda x: x.max < 12

constraint_name = "legs < 12"

legs_constraint = MetricConstraint(
    name=constraint_name,
    condition=legs_under_12,
    metric_selector=distribution_legs
)
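
The comment above mentions that a condition could also work from the summary dictionary instead of the max/min properties. A minimal sketch of that idea follows, using a plain dict as a stand-in for the result of `to_summary_dict()` so that it runs without whylogs. The predicate names are hypothetical, and the values are copied from the distribution output shown earlier.

```python
# Stand-in for the dictionary returned by to_summary_dict() on the "legs"
# distribution metric; values are copied from the output shown earlier.
legs_summary = {"mean": 3.2, "max": 6.0, "min": 0.0, "median": 4.0, "q_95": 6.0}

# Hypothetical conditions written against summary keys. With whylogs, the
# condition receives the metric itself, so the lambda would first call
# x.to_summary_dict() and then index into the result.
median_at_most_4 = lambda summary: summary["median"] <= 4
q95_under_12 = lambda summary: summary["q_95"] < 12

print(median_at_most_4(legs_summary), q95_under_12(legs_summary))
```

Working from the dictionary gives access to the quantile keys (q_01 through q_99), which can be useful when a constraint should tolerate a few outliers instead of checking the strict max.
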
[7]:
# now that we have a legs_constraint defined we can add it to the builder:
builder.add_constraint(legs_constraint)

# we could add more constraints using this pattern to the builder, maybe we realize negative values are invalid
not_negative = lambda x: x.min >= 0
builder.add_constraint(MetricConstraint(
    name="legs >= 0",
    condition=not_negative,
    metric_selector=distribution_legs
))

# OK, let's build these constraints
constraints: Constraints = builder.build()

# A Constraints object contains a collection of constraints. Call validate() to get an overall pass/fail,
# or generate a report for display
constraints_valid = constraints.validate()
print(f"Constraints valid: {constraints_valid}")

# A simple report of [constraint name, pass, fail, summary] can be generated like this:
constraints_report = constraints.generate_constraints_report()
print(f"Constraints report [constraint name, pass, fail, summary]: {constraints_report}")

Constraints valid: True
Constraints report [constraint name, pass, fail, summary]: [ReportResult(name='legs < 12', passed=1, failed=0, summary=None), ReportResult(name='legs >= 0', passed=1, failed=0, summary=None)]
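
To make the pass/fail logic concrete, here is a plain-Python model of what validation and reporting amount to conceptually: each constraint is a named predicate applied to one selected metric. This is not the whylogs implementation. `SimpleConstraint`, `report`, and `validate` are hypothetical names introduced only for illustration, and the metric values are copied from the output above.

```python
class SimpleConstraint:
    """Hypothetical stand-in for MetricConstraint: a named predicate on one metric."""

    def __init__(self, name, condition, selector):
        self.name = name
        self.condition = condition  # callable: metric summary -> bool
        self.selector = selector    # (column_name, metric_name)


# Metric summaries keyed by selector; values copied from the "legs" output above.
metrics = {("legs", "distribution"): {"max": 6.0, "min": 0.0}}

checks = [
    SimpleConstraint("legs < 12", lambda m: m["max"] < 12, ("legs", "distribution")),
    SimpleConstraint("legs >= 0", lambda m: m["min"] >= 0, ("legs", "distribution")),
]


def report(constraints):
    """Return one (name, passed, failed) tuple per constraint."""
    rows = []
    for c in constraints:
        ok = c.condition(metrics[c.selector])
        rows.append((c.name, int(ok), int(not ok)))
    return rows


def validate(constraints):
    """True only if no constraint failed."""
    return all(failed == 0 for _, _, failed in report(constraints))


print(validate(checks))
print(report(checks))
```

In this model a single failing predicate is enough to make `validate` return False, while the report still lists every constraint individually.
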

OK, let's add a few more and rebuild the constraints.

[8]:
stddev_below_3 = lambda x: x.stddev < 3.0
builder.add_constraint(MetricConstraint(
    name="legs stddev < 3.0",
    condition=stddev_below_3,
    metric_selector=distribution_legs
))

distribution_weight = MetricsSelector(metric_name='distribution', column_name='weight')
builder.add_constraint(MetricConstraint(
    name="weight >= 0",
    condition=not_negative,
    metric_selector=distribution_weight
))

reasonable_constraints = builder.build()


builder.add_constraint(MetricConstraint(
    name="animal count >= 1000",
    condition=lambda x: x.n.value >= 1000,  # >= so the condition matches the constraint name
    metric_selector=MetricsSelector(metric_name='counts', column_name='animal')
))

reasonable_constraints_over_1000_rows = builder.build()
[9]:
from whylogs.viz import NotebookProfileVisualizer

# You can also pass the constraints to the NotebookProfileVisualizer and generate a report
visualization = NotebookProfileVisualizer()
visualization.constraints_report(constraints, cell_height=300)


[9]:

If you hover over the Passed/Failed icons, you can see the summary of the metric that was used to evaluate each constraint. In this case, legs < 12 passed because the max metric component is 6, which is below 12.

Similarly, legs >= 0 passed because min is 0, which is greater than or equal to 0.

[10]:
# a slightly more interesting report
visualization.constraints_report(reasonable_constraints, cell_height=400)
[10]:
[11]:
# a failing report (because we don't have enough animals!)
visualization.constraints_report(reasonable_constraints_over_1000_rows, cell_height=400)
[11]: