đźš© Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with theWhyLabs Observability Platform? Sign up for afree WhyLabs accountto leverage the power of whylogs and WhyLabs together!

Merging Profiles#

Open in Colab

Sometimes we may want to profile a dataset in chunks. For example, we may have our dataset distributed across multiple files or nodes, or perhaps our dataset is too large to fit in memory. Maybe we already profiled our dataset for several different date ranges and we want to see a holistic view of our data across the entire range.

In any case, merging profiles is a solution!

Installing whylogs#

whylogs is made available as a Python package. You can get the latest version from PyPI with pip install whylogs:

[1]:
# Note: you may need to restart the kernel to use updated packages.
%pip install whylogs

Loading a Pandas DataFrame#

Before profiling data, lets create a Pandas DataFrame from a public dataset.

[2]:
import pandas as pd

df_full= pd.read_csv("https://whylabs-public.s3.us-west-2.amazonaws.com/datasets/tour/current.csv")

print('row count: {}'.format(df_full.shape[0]))
df_full.sample(10)
row count: 945
[2]:
Transaction ID Customer ID Quantity Item Price Total Tax Total Amount Store Type Product Category Product Subcategory Gender Transaction Type Age
682 T74278458640 C267835 2 63.2 13.2720 139.6720 TeleShop Books DIY M Purchase 33.0
256 T54377774372 C270496 5 75.5 39.6375 417.1375 MBR Electronics Audio and video M Purchase 23.0
67 T64030190529 C269524 5 52.5 27.5625 290.0625 e-Shop Bags Mens F Purchase 24.0
762 T18970114223 C272730 2 80.6 16.9260 178.1260 e-Shop Home and kitchen Kitchen M Purchase 39.0
94 T94404065446 C271648 5 48.0 25.2000 265.2000 e-Shop Clothing Kids F Purchase 34.0
197 T30540748600 C269603 1 127.4 13.3770 140.7770 TeleShop Footwear Women M Purchase 41.0
161 T78998671169 C270907 1 104.9 11.0145 115.9145 MBR Books DIY F Purchase 41.0
574 T19424023275 C270462 2 127.5 26.7750 281.7750 TeleShop Electronics Cameras F Purchase 38.0
583 T7986658313 C269047 3 25.9 8.1585 85.8585 e-Shop Clothing Women M Purchase 25.0
805 T36786634925 C267437 3 50.0 15.7500 165.7500 TeleShop Books Children M Purchase 35.0

This dataset contains 945 rows and contains a mix of numeric and categorical features. Lets split this DataFrame into 3 chunks of different sizes.

Splitting the DataFrame#

[3]:
df_subset1= df_full[0:100]
df_subset2= df_full[100:400]
df_subset3= df_full[400:]

print('Row Counts:')
print('Subset 1: {}'.format(df_subset1.shape[0]))
print('Subset 2: {}'.format(df_subset2.shape[0]))
print('Subset 3: {}'.format(df_subset3.shape[0]))
Row Counts:
Subset 1: 100
Subset 2: 300
Subset 3: 545

Profiling a Single Dataset#

Lets profile the first subset.

[4]:
import whylogs as why

results = why.log(df_subset1)
profile = results.profile()

The code above generates a ProfileResultSet instance and assigns it to the results variable. We then call the profile method on this object to generate a DatasetProfile instance which we assign to the profile variable.

We can inspect our profile by generating a pandas DataFrame from it. Lets view the first few rows.

[5]:
subset1_profile_df = profile.view().to_pandas()
subset1_profile_df.head()
[5]:
counts/n counts/null types/integral types/fractional types/boolean types/string types/object cardinality/est cardinality/upper_1 cardinality/lower_1 ... distribution/q_05 distribution/q_10 distribution/q_25 distribution/median distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 ints/max ints/min
column
Gender 100 0 0 0 0 100 0 2.000000 2.000100 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Total Amount 100 0 0 100 0 0 0 99.000024 99.004967 99.0 ... -153.816 8.619 66.521 216.359 321.555 580.788 642.5575 795.6 NaN NaN
Customer ID 100 0 0 0 0 100 0 98.000024 98.004917 98.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Item Price 100 0 0 100 0 0 0 97.000023 97.004866 97.0 ... 10.000 25.700 40.700 76.800 111.100 135.200 139.8000 148.9 NaN NaN
Transaction ID 100 0 0 0 0 100 0 99.000024 99.004967 99.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows Ă— 28 columns

From the counts/n column, we can see that our subset of data contained 100 rows, as expected. Before we start merging new profiles, lets grab the mean of the “Item Price” column for another point of reference.

[6]:
"Mean Item Price for Subset 1: {}".format(subset1_profile_df['distribution/mean'].loc['Item Price'])
[6]:
'Mean Item Price for Subset 1: 74.189'

Merging Profiles#

We can call the track method on our profile to profile a new dataset and merge this with our existing profile in one step. This can be done successively for multiple subsets of data.

[7]:
profile.track(df_subset2)
profile.track(df_subset3)

Lets now inspect the merged profile as a Pandas DataFrame

[8]:
full_profile_df = profile.view().to_pandas()
full_profile_df.head()
[8]:
counts/n counts/null types/integral types/fractional types/boolean types/string types/object cardinality/est cardinality/upper_1 cardinality/lower_1 ... distribution/q_05 distribution/q_10 distribution/q_25 distribution/median distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 ints/max ints/min
column
Gender 945 0 0 0 0 945 0 2.000000 2.000100 2.000000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Total Amount 945 0 0 945 0 0 0 844.069184 855.117588 833.289540 ... -233.376 14.365 79.0075 179.452 356.915 580.788 654.16 804.44 NaN NaN
Customer ID 945 0 0 0 0 945 0 869.683985 881.067672 858.577213 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Item Price 945 0 0 945 0 0 0 705.028228 714.256661 696.024282 ... 13.800 22.300 45.0000 80.600 116.600 138.200 145.10 149.00 NaN NaN
Transaction ID 945 0 0 0 0 945 0 935.275741 947.517988 923.331294 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows Ă— 28 columns

We now see that each column has a count of 945 which we expect. Lets revisit the mean of the Items Price column.

[9]:
"Mean Item Price from merged profile: {}".format(full_profile_df['distribution/mean'].loc['Item Price'])
[9]:
'Mean Item Price from merged profile: 79.84814814814818'

Lets compare this with the mean we get using the the mean method from Pandas.

[10]:
df_full['Item Price'].mean()
[10]:
79.84814814814808

Its nearly an exact match! Note that in this example, we profiled 3 datasets of unequal sizes independently and merged together 3 profiles. This merged profile captured telemetry describing our entire dataset.

This property of mergeability makes whylogs particularly powerful. It allows us to profile datasets which live in distributed pipeline even if our data is never together in one place at any time.

Mergeability also makes it a trivial matter to roll up from hourly to daily, weekly, or monthly level views of your data.

Merging Profile Views#

Another option is to merge Profile Views.

A ProfileView object can be generated from a DatasetProfile object which allows for inspection of individiaul profiles, as well as the ability to visualize profiles using the our visualization module.

This is a good option if users wish to inspect profiles of their entire dataset while maintaining the ability to inspect individual profiles of the subsets of data.

[11]:
results = why.log(df_subset1)
profile_view1 = results.profile().view()

results = why.log(df_subset2)
profile_view2 = results.profile().view()

results = why.log(df_subset3)
profile_view3 = results.profile().view()

Similar to the previous example, we find that the first profile view counted 100 rows in the subset of data it profiled.

[12]:
profile_view1.to_pandas().head()
[12]:
counts/n counts/null types/integral types/fractional types/boolean types/string types/object cardinality/est cardinality/upper_1 cardinality/lower_1 ... distribution/q_05 distribution/q_10 distribution/q_25 distribution/median distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 ints/max ints/min
column
Gender 100 0 0 0 0 100 0 2.000000 2.000100 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Total Amount 100 0 0 100 0 0 0 99.000024 99.004967 99.0 ... -153.816 8.619 66.521 216.359 321.555 580.788 642.5575 795.6 NaN NaN
Customer ID 100 0 0 0 0 100 0 98.000024 98.004917 98.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Item Price 100 0 0 100 0 0 0 97.000023 97.004866 97.0 ... 10.000 25.700 40.700 76.800 111.100 135.200 139.8000 148.9 NaN NaN
Transaction ID 100 0 0 0 0 100 0 99.000024 99.004967 99.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows Ă— 28 columns

We can merge these ProfileView objects using the merge method. We assign the result to a new variable and view a few rows of the profile’s DataFrame.

[13]:
merged_profile_view = profile_view1.merge(profile_view2).merge(profile_view3)

merged_profile_view.to_pandas().head()
[13]:
counts/n counts/null types/integral types/fractional types/boolean types/string types/object frequent_items/frequent_strings cardinality/est cardinality/upper_1 ... distribution/q_05 distribution/q_10 distribution/q_25 distribution/median distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 ints/max ints/min
column
Gender 945 0 0 0 0 945 0 [FrequentItem(value='M', est=489, upper=489, l... 2.000000 2.000100 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Total Amount 945 0 0 945 0 0 0 NaN 849.185065 860.300432 ... -233.376 14.365 78.676 178.126 357.0255 580.346 657.475 804.44 NaN NaN
Customer ID 945 0 0 0 0 945 0 [FrequentItem(value='C273096', est=3, upper=2,... 858.998625 873.131713 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Item Price 945 0 0 945 0 0 0 NaN 701.002487 710.178225 ... 15.000 22.700 45.200 81.200 116.7000 138.200 145.600 149.00 NaN NaN
Transaction ID 945 0 0 0 0 945 0 [FrequentItem(value='T79960195196', est=3, upp... 942.466233 957.972612 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows Ă— 28 columns

As expected, we see 945 rows. Unlike the track method, the merge method doesn’t update the original objects directly. In other words, we can still inspect the individual profiles views from our subsets of data.

Keep in mind that the track method only works on DatasetProfile objects, while the merge method only operates on DatasetProfileView objects.

[14]:
profile_view1.to_pandas().head()
[14]:
counts/n counts/null types/integral types/fractional types/boolean types/string types/object cardinality/est cardinality/upper_1 cardinality/lower_1 ... distribution/q_05 distribution/q_10 distribution/q_25 distribution/median distribution/q_75 distribution/q_90 distribution/q_95 distribution/q_99 ints/max ints/min
column
Gender 100 0 0 0 0 100 0 2.000000 2.000100 2.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Total Amount 100 0 0 100 0 0 0 99.000024 99.004967 99.0 ... -153.816 8.619 66.521 216.359 321.555 580.788 642.5575 795.6 NaN NaN
Customer ID 100 0 0 0 0 100 0 98.000024 98.004917 98.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Item Price 100 0 0 100 0 0 0 97.000023 97.004866 97.0 ... 10.000 25.700 40.700 76.800 111.100 135.200 139.8000 148.9 NaN NaN
Transaction ID 100 0 0 0 0 100 0 99.000024 99.004967 99.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows Ă— 28 columns

Mergeability in WhyLabs#

In WhyLabs, profile merging is done automatically. If you have a WhyLabs dataset with a daily batch frequency of 1 day, then any profiles uploaded during that day will automatically merged for a day-level view of your data.

What’s Next?#

There’s a lot you can do with the profiles you just created. You can take a look at our other examples at https://whylogs.readthedocs.io/en/latest/examples !