
PySpark Integration#

Hi! Perhaps you’re already feeling confident with our library, but you really wish there was an easy way to plug our profiling into your existing PySpark jobs. Well, glad you’ve made it here, because this is what we are going to cover in this example notebook 😃

If you wish to have other insights on how to use whylogs, feel free to check our other existing examples, as they might be extremely useful!

Installing the extra dependency#

Because we want users to install only what they need from whylogs, the PySpark integration ships as an extra dependency. To make it available, simply run the following cell:

[2]:

# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[spark]'

Initializing a SparkSession#

Here we will initialize a SparkSession. I’m also enabling the Arrow execution config, because it makes our methods even more performant.

IMPORTANT: Make sure you have Spark 3.0+ available in your environment, as our implementation relies on it for a smoother integration.

[3]:

from pyspark.sql import SparkSession

[ ]:

spark = SparkSession.builder.appName('whylogs-testing').getOrCreate()
arrow_config_key = "spark.sql.execution.arrow.pyspark.enabled"
spark.conf.set(arrow_config_key, "true")
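
If you want to guard a job against running on an older Spark, a small version check can help. The following is a minimal sketch, not part of whylogs: the helper name `is_supported_spark_version` is hypothetical, and it only looks at the leading major and minor numbers of the version string.

```python
import re


def is_supported_spark_version(version: str, minimum=(3, 0)) -> bool:
    """Return True if a version string such as '3.3.1' meets the minimum (major, minor)."""
    # Only the leading "major.minor" part is parsed; suffixes such as ".dev0" are ignored.
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized Spark version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print(is_supported_spark_version("3.3.1"))
print(is_supported_spark_version("2.4.8"))
```

In the notebook you would pass `spark.version` from the session created above instead of a literal string.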

Reading the data#

For the sake of simplicity (and to keep the computational effort low, so you can run this notebook on your local machine), we will read the Wine Quality dataset, available at this URL: “http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv”.

[5]:

from pyspark import SparkFiles

data_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
spark.sparkContext.addFile(data_url)

[6]:

spark_dataframe = spark.read.option("delimiter", ";").option("inferSchema", "true").csv(SparkFiles.get("winequality-red.csv"), header=True)

[7]:

spark_dataframe.show(n=1, vertical=True)

-RECORD 0----------------------
 fixed acidity        | 7.4
 volatile acidity     | 0.7
 citric acid          | 0.0
 residual sugar       | 1.9
 chlorides            | 0.076
 free sulfur dioxide  | 11.0
 total sulfur dioxide | 34.0
 density              | 0.9978
 pH                   | 3.51
 sulphates            | 0.56
 alcohol              | 9.4
 quality              | 5
only showing top 1 row

[8]:

spark_dataframe.printSchema()

root
 |-- fixed acidity: double (nullable = true)
 |-- volatile acidity: double (nullable = true)
 |-- citric acid: double (nullable = true)
 |-- residual sugar: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- free sulfur dioxide: double (nullable = true)
 |-- total sulfur dioxide: double (nullable = true)
 |-- density: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- quality: integer (nullable = true)

Profiling the data with whylogs#

Now that we have a Spark DataFrame in place, let’s see how easy it is to profile our data with whylogs.

[9]:

from whylogs.api.pyspark.experimental import collect_column_profile_views

column_views_dict = collect_column_profile_views(spark_dataframe)

Yep. It’s done. It is that easy.

But what do we get with a column_views_dict?

[10]:

print(column_views_dict)

{'alcohol': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d240e20>, 'chlorides': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d29ec70>, 'citric acid': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2a2d00>, 'density': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2a6d90>, 'fixed acidity': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2a3e20>, 'free sulfur dioxide': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2aaeb0>, 'pH': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2adf40>, 'quality': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2b7100>, 'residual sugar': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2b7d60>, 'sulphates': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2bddf0>, 'total sulfur dioxide': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2c0d90>, 'volatile acidity': <whylogs.core.view.column_profile_view.ColumnProfileView object at 0x11d2c4e20>}

It is a dictionary with one ColumnProfileView object per column in your dataset. We can inspect some of the metrics on each one of them, such as the counts for a given column:

[11]:

column_views_dict["density"].get_metric("counts").n.value, spark_dataframe.count()

[11]:

(1599, 1599)

Or their mean value:

[12]:

column_views_dict["density"].get_metric("distribution").mean.value

[12]:

0.9967466791744841
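
The two lookups above can be generalized to every column at once. This is a minimal sketch that relies only on the `get_metric("counts").n.value` and `get_metric("distribution").mean.value` accessors shown in the cells above. The helper name `summarize_column_views` and the `_fake_view` stand-in are hypothetical, and the stand-in exists only so the sketch runs without Spark.

```python
from types import SimpleNamespace


def summarize_column_views(column_views):
    """Map each column name to its row count and mean, when those metrics exist."""
    summary = {}
    for name, view in sorted(column_views.items()):
        counts = view.get_metric("counts")
        distribution = view.get_metric("distribution")
        summary[name] = {
            # A missing metric is reported as None instead of raising.
            "n": counts.n.value if counts is not None else None,
            "mean": distribution.mean.value if distribution is not None else None,
        }
    return summary


def _fake_view(n, mean):
    """Stand-in that mimics only the accessors used above (not a whylogs object)."""
    metrics = {
        "counts": SimpleNamespace(n=SimpleNamespace(value=n)),
        "distribution": SimpleNamespace(mean=SimpleNamespace(value=mean)),
    }
    return SimpleNamespace(get_metric=metrics.get)


demo = {"density": _fake_view(1599, 0.9967466791744841)}
print(summarize_column_views(demo))
```

In the notebook you would call `summarize_column_views(column_views_dict)` instead of using the stand-in.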

And now let’s check how accurately whylogs stored the mean of the density column.

[13]:

from pyspark.sql.functions import mean
spark_dataframe.select(mean("density")).show()

+------------------+
|      avg(density)|
+------------------+
|0.9967466791744831|
+------------------+

It is not exactly the same value, but it gets really close, right? That is because we are not extracting the exact information, but we are also not sampling the data. whylogs looks at every data point and statistically decides whether or not that data point is relevant to the final calculation.
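
To build intuition for how summaries computed on separate partitions can be combined into one result that matches the exact mean only up to floating-point rounding, here is a toy sketch. It is not whylogs's actual implementation: the `RunningMean` class and `summarize` helper are hypothetical, and the values are illustrative numbers in the range of the density column.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Hypothetical mergeable summary: a count and a running mean."""
    n: int = 0
    mean: float = 0.0

    def update(self, x: float) -> None:
        # Incremental update: the raw values are never stored.
        self.n += 1
        self.mean += (x - self.mean) / self.n

    def merge(self, other: "RunningMean") -> "RunningMean":
        # Combine two summaries without revisiting the underlying data.
        n = self.n + other.n
        if n == 0:
            return RunningMean()
        mean = self.mean + (other.mean - self.mean) * other.n / n
        return RunningMean(n, mean)


def summarize(values):
    """Build a RunningMean by feeding it one value at a time."""
    summary = RunningMean()
    for value in values:
        summary.update(value)
    return summary


# Illustrative values split across two pretend partitions.
partition_a = [0.9978, 0.9968, 0.9970, 0.9980]
partition_b = [0.99007, 0.99451, 0.9956, 0.99675, 0.99786, 0.99914]

merged = summarize(partition_a).merge(summarize(partition_b))
all_values = partition_a + partition_b
exact = sum(all_values) / len(all_values)

# The two results agree up to floating-point rounding, not necessarily bit for bit.
print(merged.n, math.isclose(merged.mean, exact, rel_tol=1e-12))
```

The point of the design is that each partition only ships a small summary, and the merge step never needs the raw rows again.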

Is it just me, or is this extremely powerful? Yes, it is.

“Cool! But what can I do with a bunch of ColumnProfileViews from my dataset? I want to see everything together.”

Well, you’ve come to the right place, because we will inspect the next method that does just that!

[14]:

from whylogs.api.pyspark.experimental import collect_dataset_profile_view

dataset_profile_view = collect_dataset_profile_view(input_df=spark_dataframe)

Yes, that easy. You now have a DatasetProfileView. As you might have seen from other example notebooks in our repo, you can turn this lightweight object into a pandas DataFrame, and visualize all the important metrics that we’ve profiled, like this:

[15]:

import pandas as pd

dataset_profile_view.to_pandas().head()

[15]:

	counts/n	counts/null	types/integral	types/fractional	types/boolean	types/string	types/object	cardinality/est	cardinality/upper_1	cardinality/lower_1	...	distribution/min	distribution/q_10	distribution/q_25	distribution/median	distribution/q_75	distribution/q_90	type	ints/max	ints/min	frequent_items/frequent_strings
column
alcohol	1599	0	0	1599	0	0	0	65.000010	65.003256	65.000000	...	8.40000	9.30000	9.5000	10.20000	11.10000	12.00000	SummaryType.COLUMN	NaN	NaN	NaN
chlorides	1599	0	0	1599	0	0	0	153.000058	153.007697	153.000000	...	0.01200	0.06000	0.0700	0.07900	0.09100	0.10900	SummaryType.COLUMN	NaN	NaN	NaN
citric acid	1599	0	0	1599	0	0	0	80.000016	80.004010	80.000000	...	0.00000	0.01000	0.0900	0.26000	0.43000	0.53000	SummaryType.COLUMN	NaN	NaN	NaN
density	1599	0	0	1599	0	0	0	439.557368	445.310933	433.943761	...	0.99007	0.99451	0.9956	0.99675	0.99786	0.99914	SummaryType.COLUMN	NaN	NaN	NaN
fixed acidity	1599	0	0	1599	0	0	0	96.000023	96.004816	96.000000	...	4.60000	6.60000	7.1000	7.90000	9.20000	10.70000	SummaryType.COLUMN	NaN	NaN	NaN

5 rows × 24 columns

Persisting as a file#

After collecting profiles, it is good practice to store them as files. This will allow you to read them back later, merge them with future profiles, and track how your data behaves along the way.

[17]:

dataset_profile_view.write(path="my_super_awesome_profile.bin")

And that’s it, you have just written a profile generated with Spark to your local environment! If you wish to upload it to other locations, such as S3 or WhyLabs, please make sure to check out our other examples page.

Hopefully this tutorial will help you get started profiling and observing your data’s behaviour in your Spark jobs with almost no friction :)

Important note#

As you might have seen from the imports, this PySpark implementation is currently in the experimental phase. We ran some benchmarks ourselves and, as an example, a 90 GB dataset with 80M rows could be profiled in under 3 minutes! Cool, right? But we still want more users to try it on their own, see where it can be improved, and give us feedback before we make it the official Spark module. Please feel free to reach out on our community Slack and interact with us there. We would love to hear from you :)