🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

Inspecting Profiles#


In this notebook, we’ll show how you can use whylogs’ Profile Viewer (profile.view()) to find useful statistics in a dataset.

This includes:

  • Counters, such as number of samples and null values

  • Inferred types, such as integral, fractional, boolean, and strings

  • Estimated cardinality

  • Frequent items

  • Distribution metrics: min, max, mean, median, standard deviation, and quantile values

Setup#

We’ll need the whylogs and pandas libraries for this example.

We’ll also populate a dataframe with some data to inspect.

[1]:
# install whylogs & pandas if needed
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs
%pip install pandas
[2]:
# import whylogs and pandas
import whylogs as why
import pandas as pd

# Set to show all columns in dataframe
pd.set_option("display.max_columns", None)
[3]:
# Create a simple test dataset
data = {
    "animal": ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
               "jellyfish", "jellyfish", "fish"],
    "legs": [4, 0, 4, 4.0, None, 2, None, None, "fins"],
    "weight": [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2],
}

# Create dataframe with test dataset
df = pd.DataFrame(data)

Log data with whylogs, create a profile, and view statistics:#

[4]:
# Log data with whylogs & create profile
results = why.log(pandas=df)
profile = results.profile()

# Create profile view dataframe
prof_view = profile.view()
prof_df = prof_view.to_pandas()
[4]:
# View Profile dataframe for dataset statistics
prof_df
[4]:
counts/n counts/null types/integral types/fractional types/boolean types/string types/object cardinality/est cardinality/upper_1 cardinality/lower_1 frequent_items/frequent_strings type distribution/mean distribution/stddev distribution/n distribution/max distribution/min distribution/q_10 distribution/q_25 distribution/median distribution/q_75 distribution/q_90
column
legs 9 3 4 1 0 1 0 4.0 4.00020 4.0 [FrequentItem(value='4.000000', est=3, upper=3... SummaryType.COLUMN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
weight 9 0 0 9 0 0 0 9.0 9.00045 9.0 NaN SummaryType.COLUMN 20.955556 38.29749 9.0 120.0 1.2 1.2 2.2 4.3 14.3 120.0
animal 9 0 0 0 0 9 0 7.0 7.00035 7.0 [FrequentItem(value='jellyfish', est=3, upper=... SummaryType.COLUMN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

The profile dataframe has one row for each column of the logged data. Each column of the statistics dataframe contains a specific dimension of a given Metric.

Taking a quick look at the generated statistics:

animal#

The animal row shows there are 9 entries (counts/n). All the data types are strings. Cardinality estimates that 7 different animal types are in the dataset. Frequent items show jellyfish appearing the most.

weight#

Our weight data contains 9 entries. All of them are fractional values. Cardinality shows that all 9 values are estimated to be unique. Since all entries are numerical, distribution statistics are generated.

legs#

We can see that there are 9 entries for leg values, but they span several different data types: 3 nulls, 4 integrals, 1 fractional, and 1 string. Cardinality estimates 4 unique values. The most frequent number of legs in the dataset is 4.

Selecting a single value#

A single cell can be selected to see full results if needed.

[64]:
# Select a single statistic by column (feature) and metric dimension
prof_df.loc['animal', 'frequent_items/frequent_strings']
[64]:
[FrequentItem(value='jellyfish', est=3, upper=3, lower=3),
 FrequentItem(value='cat', est=1, upper=1, lower=1),
 FrequentItem(value='lion', est=1, upper=1, lower=1),
 FrequentItem(value='fish', est=1, upper=1, lower=1),
 FrequentItem(value='shark', est=1, upper=1, lower=1),
 FrequentItem(value='kangaroo', est=1, upper=1, lower=1),
 FrequentItem(value='bear', est=1, upper=1, lower=1)]

Understanding The whylogs Profile Statistics#

By default whylogs will automatically generate these metrics based on data types.

The standard metrics available in whylogs are grouped in namespaces. They are:

Counts and inferred data types track how many entries exist and what type of data they contain.

  • counts/n - the total number of entries in a feature

  • counts/null - the number of null values

  • types/integral - the number of values consisting of an integral (whole number)

  • types/fractional - the number of values consisting of a fractional value (float)

  • types/boolean - the number of values consisting of a boolean

  • types/string - the number of values consisting of a string

  • types/object - the number of values consisting of an object. If the data does not match any of the previous types, it is counted as an object
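As a sanity check, the count metrics above can be reproduced exactly with plain pandas on the example dataframe (whylogs itself computes them while streaming over the data):

```python
import pandas as pd

df = pd.DataFrame({
    "animal": ["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
               "jellyfish", "jellyfish", "fish"],
    "legs": [4, 0, 4, 4.0, None, 2, None, None, "fins"],
    "weight": [14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2],
})

n_rows = len(df)             # counts/n: every column has 9 entries
null_counts = df.isna().sum()

print(n_rows)                # 9
print(null_counts)           # counts/null: animal 0, legs 3, weight 0
```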

Cardinality tracks the approximate number of unique values for each feature

  • cardinality/est - the estimated unique values for each feature

  • cardinality/upper_1 - upper bound of the cardinality estimate. The actual cardinality is always at or below this number.

  • cardinality/lower_1 - lower bound of the cardinality estimate. The actual cardinality is always at or above this number.

Frequent items track which items show up the most.

  • frequent_items/frequent_strings - the most frequent items
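whylogs estimates cardinality and frequent items from sketches, which is why they come with bounds; on a dataset this small, the exact answers are easy to compute with pandas for comparison:

```python
import pandas as pd

animal = pd.Series(["lion", "shark", "cat", "bear", "jellyfish", "kangaroo",
                    "jellyfish", "jellyfish", "fish"])

distinct = animal.nunique()   # exact distinct count: 7, matching cardinality/est
freq = animal.value_counts()  # exact frequencies; jellyfish is the most frequent

print(distinct)       # 7
print(freq.head(3))   # jellyfish appears 3 times, every other animal once
```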

Distribution statistics are generated when a feature contains numerical data.

  • distribution/mean - the calculated mean of the feature data

  • distribution/stddev - the calculated standard deviation of the feature data

  • distribution/n - the number of rows belonging to the feature

  • distribution/max - the highest (max) value in the feature

  • distribution/min - the smallest (min) value in the feature

  • distribution/median - the median value of the feature data

  • distribution/q_xx - the xx-th quantile value of the data’s distribution
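For the weight column, the mean, standard deviation, and extremes in the profile above can be reproduced with pandas (like whylogs, pandas uses the sample standard deviation, ddof=1, by default). Note that whylogs estimates quantiles with a sketch over observed data points, so its q_xx values may differ slightly from pandas’ interpolated quantiles:

```python
import pandas as pd

weight = pd.Series([14.3, 11.8, 4.3, 30.1, 2.0, 120.0, 2.7, 2.2, 1.2])

print(weight.mean())               # ~20.9556, matching distribution/mean
print(weight.std())                # ~38.2975 (sample stddev), matching distribution/stddev
print(weight.min(), weight.max())  # 1.2 and 120.0

# Pandas interpolates between data points; whylogs' sketch-based
# quantiles come from the observed values themselves
print(weight.quantile([0.10, 0.25, 0.50, 0.75, 0.90]))
```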

Data Types and Metrics#

whylogs maps different data types, like numpy arrays, lists, integers, etc., to specific whylogs data types. The three most important whylogs data types are:

  • Integral

  • Fractional

  • String

By default, whylogs will track the following metrics according to the column’s inferred data type:

  • Integral:

    • counts

    • types

    • distribution

    • ints

    • cardinality

    • frequent_items

  • Fractional:

    • counts

    • types

    • cardinality

    • distribution

  • String:

    • counts

    • types

    • cardinality

    • frequent_items

If you want to know how to customize this configuration, selecting metrics according to data type or column name, please refer to the Schema Configuration example.

That’s it! If you want to know more about whylogs, check our documentation.