🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

MLflow Logging

MLflow is an open-source platform for tracking and managing models and deploying them to production through a consistent API. whylogs users can log profiles to their MLflow environment through the whylogs API. Let’s see how.

Setup

For this tutorial we will keep things simple by using MLflow’s local client. One of MLflow’s advantages is that it uses the exact same API locally and in the cloud, so with minor setup changes the code shown here can be extended to MLflow running in Kubernetes or in Databricks, for example. To get started, make sure you have both mlflow and whylogs installed in your environment by running the following cell:

[1]:

# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[mlflow]'

We are also installing pandas, scikit-learn and matplotlib so we can build a very simple training example and show how to start profiling your training data with whylogs. If you haven’t installed them yet, also run the following cell:

[2]:

%pip install -q scikit-learn matplotlib pandas mlflow-skinny
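Before moving on, it can help to confirm that the packages are importable from the current kernel. The following is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical and is not part of whylogs or MLflow.

```python
from importlib.util import find_spec


def check_packages(names):
    """Return a dict mapping each top-level module name to True if it is importable."""
    return {name: find_spec(name) is not None for name in names}


# Module names used in this tutorial (scikit-learn is imported as "sklearn").
status = check_packages(["whylogs", "mlflow", "pandas", "sklearn"])
for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

If any entry is reported as missing, rerunning the install cells and restarting the kernel is usually the first thing to try.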

Get the data

Now let us get an example dataset from the scikit-learn library and create a function that returns it as a single dataframe holding both the features and the target. We will use this same function later on!

[1]:

import pandas as pd
from sklearn.datasets import load_iris

def get_data() -> pd.DataFrame:
    # Load the iris dataset and put the four features in a dataframe.
    iris_data = load_iris()
    dataframe = pd.DataFrame(iris_data.data, columns=iris_data.feature_names)
    # Add the class labels (0, 1, 2) as a "target" column.
    dataframe["target"] = iris_data.target
    return dataframe

[2]:

df = get_data()

[3]:

df.head()

[3]:

	sepal length (cm)	sepal width (cm)	petal length (cm)	petal width (cm)	target
0	5.1	3.5	1.4	0.2	0
1	4.9	3.0	1.4	0.2	0
2	4.7	3.2	1.3	0.2	0
3	4.6	3.1	1.5	0.2	0
4	5.0	3.6	1.4	0.2	0

Train a model

Let’s define the simplest possible model to train with scikit-learn. We aren’t interested in model performance or deep ML concepts here. We only need a baseline model being trained, so you can get the overall idea of how to use whylogs with your existing training pipeline.

[4]:

from sklearn.tree import DecisionTreeClassifier


def train(dataframe: pd.DataFrame) -> None:
    model = DecisionTreeClassifier(max_depth=2)
    model.fit(dataframe.drop("target", axis=1), y=dataframe["target"])

We could serialize the model ourselves, but we will take a shortcut here and let MLflow’s autolog feature do it for us.

[ ]:

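import mlflow

with mlflow.start_run() as run:
    # Autologging records the fitted model and its parameters for this run.
    mlflow.sklearn.autolog()

    df = get_data()
    train(dataframe=df)

    run_id = run.info.run_id
    # No explicit mlflow.end_run() is needed: the with block ends the run on exit.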

And now we should see that an mlruns/ directory was created and that our trained model is already in there!

[6]:

import os
os.listdir(f"mlruns/0/{run_id}/artifacts/model")

[6]:

['python_env.yaml', 'requirements.txt', 'MLmodel', 'model.pkl', 'conda.yaml']
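To see everything a run stored rather than a single folder, the artifacts directory can be walked recursively. This is a minimal sketch using only the standard library. The helper name `list_run_artifacts` is hypothetical, and it assumes the local layout used in this tutorial (`mlruns/<experiment_id>/<run_id>/artifacts`), which may differ with other tracking setups or MLflow versions.

```python
from pathlib import Path


def list_run_artifacts(run_id, root="mlruns", experiment_id="0"):
    """Return the relative paths of all files under a run's artifacts folder."""
    # Assumed layout: <root>/<experiment_id>/<run_id>/artifacts
    artifacts_dir = Path(root) / experiment_id / run_id / "artifacts"
    if not artifacts_dir.is_dir():
        return []
    return sorted(
        path.relative_to(artifacts_dir).as_posix()
        for path in artifacts_dir.rglob("*")
        if path.is_file()
    )
```

Calling `list_run_artifacts(run_id)` after the run above should list the same model files, each prefixed with `model/`.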

Profile the training data with `whylogs`

To profile your training data with whylogs, you only need the logger API, which is as simple as:

[7]:

import whylogs as why

profile_result = why.log(df)
profile_view = profile_result.view()

[8]:

profile_view.to_pandas()

[8]:

	counts/n	counts/null	types/integral	types/fractional	types/boolean	types/string	types/object	distribution/mean	distribution/stddev	distribution/n	...	distribution/q_90	distribution/q_95	distribution/q_99	ints/max	ints/min	cardinality/est	cardinality/upper_1	cardinality/lower_1	frequent_items/frequent_strings	type
column
target	150	0	150	0	0	0	0	1.000000	0.819232	150	...	2.0	2.0	2.0	2.0	0.0	3.000000	3.000150	3.0	[FrequentItem(value='0.000000', est=50, upper=...	SummaryType.COLUMN
petal width (cm)	150	0	0	150	0	0	0	1.199333	0.762238	150	...	2.2	2.3	2.5	NaN	NaN	22.000001	22.001100	22.0	NaN	SummaryType.COLUMN
sepal width (cm)	150	0	0	150	0	0	0	3.057333	0.435866	150	...	3.7	3.8	4.2	NaN	NaN	23.000001	23.001150	23.0	NaN	SummaryType.COLUMN
petal length (cm)	150	0	0	150	0	0	0	3.758000	1.765298	150	...	5.8	6.1	6.7	NaN	NaN	43.000004	43.002151	43.0	NaN	SummaryType.COLUMN
sepal length (cm)	150	0	0	150	0	0	0	5.843333	0.828066	150	...	6.9	7.3	7.7	NaN	NaN	35.000003	35.001750	35.0	NaN	SummaryType.COLUMN

5 rows × 28 columns
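Because the summary is a regular pandas dataframe, individual metrics can be derived from it with ordinary column operations. The sketch below computes the fraction of null values per profiled column. The helper name `null_ratios` is hypothetical, and the small dataframe is a hand-written stand-in for `profile_view.to_pandas()` that reuses two rows from the output above.

```python
import pandas as pd


def null_ratios(summary: pd.DataFrame) -> pd.Series:
    """Fraction of null values per profiled column.

    Assumes the dataframe has the "counts/n" and "counts/null" columns
    shown in the summary output.
    """
    return (summary["counts/null"] / summary["counts/n"]).rename("null_ratio")


# Hand-written stand-in for profile_view.to_pandas(), with values from the output above.
example_summary = pd.DataFrame(
    {"counts/n": [150, 150], "counts/null": [0, 0]},
    index=pd.Index(["target", "petal width (cm)"], name="column"),
)
print(null_ratios(example_summary))
```

In the notebook itself, passing `profile_view.to_pandas()` instead of `example_summary` should give one ratio per profiled column.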

Writing your profile to `mlflow`

Even more interesting than writing this profile locally is the ability to use MLflow’s API together with whylogs to store the training data profile and analyze the results of your experiments over time. For that, we need to define a function that will:

1. Profile our training data
2. Log the profile as an MLflow artifact

Let’s see how this function can be written:

[10]:

def log_profile(dataframe: pd.DataFrame) -> None:
    profile_result = why.log(dataframe)
    profile_result.writer("mlflow").write()

We can then call that function inside our MLflow run, like this:

[11]:

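with mlflow.start_run() as run:
    mlflow.sklearn.autolog()

    df = get_data()
    train(dataframe=df)

    # Profile the training data and store the profile as an artifact of this run.
    log_profile(dataframe=df)

    run_id = run.info.run_id
    # No explicit mlflow.end_run() is needed: the with block ends the run on exit.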

If we inspect the recently created experiment folder, we will see that a whylogs directory was created there with our profile.

[12]:

os.listdir(f"mlruns/0/{run_id}/artifacts/whylogs")

[12]:

['whylogs_profile_4724587f9aa146b6a19be2f4268c5005.bin']

We can even use MLflow’s API to fetch the profile and read it back:

[13]:

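from mlflow.tracking import MlflowClient

client = MlflowClient()

# Create the download directory if it does not exist yet.
local_dir = "/tmp/artifact_downloads"
os.makedirs(local_dir, exist_ok=True)
# Download the run's "whylogs" artifact folder and keep its local path.
local_path = client.download_artifacts(run_id, "whylogs", local_dir)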

[14]:

os.listdir(local_path)

[14]:

['whylogs_profile_4724587f9aa146b6a19be2f4268c5005.bin']

[ ]:

profile_name = os.listdir(local_path)[0]
result = why.read(path=f"{local_path}/{profile_name}")
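Taking the first entry of `os.listdir` works here because the folder holds a single file, but it can pick the wrong file if the download directory is reused across runs. A slightly more defensive sketch, using only the standard library, selects the most recently modified `.bin` file. The helper name `find_profile` is hypothetical, and the `.bin` suffix is assumed from the file name shown above.

```python
from pathlib import Path


def find_profile(directory):
    """Return the most recently modified whylogs profile (*.bin) in a directory."""
    # Sort candidates by modification time, oldest first.
    candidates = sorted(Path(directory).glob("*.bin"), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f"no .bin profile found in {directory}")
    return candidates[-1]
```

The profile could then be read with `why.read(path=str(find_profile(local_path)))`.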

[16]:

result.view().to_pandas()

[16]:

	counts/n	counts/null	types/integral	types/fractional	types/boolean	types/string	types/object	cardinality/est	cardinality/upper_1	cardinality/lower_1	...	distribution/q_25	distribution/median	distribution/q_75	distribution/q_90	distribution/q_95	distribution/q_99	type	ints/max	ints/min	frequent_items/frequent_strings
column
petal length (cm)	150	0	0	150	0	0	0	43.000004	43.002151	43.0	...	1.6	4.4	5.1	5.8	6.1	6.7	SummaryType.COLUMN	NaN	NaN	NaN
petal width (cm)	150	0	0	150	0	0	0	22.000001	22.001100	22.0	...	0.3	1.3	1.8	2.2	2.3	2.5	SummaryType.COLUMN	NaN	NaN	NaN
sepal length (cm)	150	0	0	150	0	0	0	35.000003	35.001750	35.0	...	5.1	5.8	6.4	6.9	7.3	7.7	SummaryType.COLUMN	NaN	NaN	NaN
sepal width (cm)	150	0	0	150	0	0	0	23.000001	23.001150	23.0	...	2.8	3.0	3.3	3.7	3.8	4.2	SummaryType.COLUMN	NaN	NaN	NaN
target	150	0	150	0	0	0	0	3.000000	3.000150	3.0	...	0.0	1.0	2.0	2.0	2.0	2.0	SummaryType.COLUMN	2.0	0.0	[FrequentItem(value='0.000000', est=50, upper=...

5 rows × 28 columns

With those few lines we have successfully fetched the profile artifact from our experiment. Over time, stored profiles like this one let us track how our data behaves and understand why our model produces the results it does, moving us towards more robust and responsible AI.
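As one illustration of tracking data over time, the summaries of two runs can be compared column by column. The sketch below is only an assumption about how such a comparison might look: the helper name `mean_shift` is hypothetical, and the two small dataframes are made-up stand-ins for the `result.view().to_pandas()` output of two different runs.

```python
import pandas as pd


def mean_shift(reference: pd.DataFrame, current: pd.DataFrame) -> pd.Series:
    """Per-column change in "distribution/mean" between two summary dataframes."""
    # Subtraction aligns on the index, which holds the profiled column names.
    diff = current["distribution/mean"] - reference["distribution/mean"]
    return diff.rename("mean_shift")


# Hypothetical summaries standing in for two runs' result.view().to_pandas().
reference = pd.DataFrame({"distribution/mean": [1.0, 2.0]}, index=["a", "b"])
current = pd.DataFrame({"distribution/mean": [1.5, 2.0]}, index=["a", "b"])
print(mean_shift(reference, current))
```

A large shift in a column's mean between runs is a simple signal that the training data may have changed and is worth a closer look.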

We hope this tutorial helps you get started with whylogs. Stay tuned to our GitHub repo and our community Slack to get the latest from whylogs.

See you soon!