🚩 Create a free WhyLabs account to get more value out of whylogs!

Did you know you can store, visualize, and monitor whylogs profiles with the WhyLabs Observability Platform? Sign up for a free WhyLabs account to leverage the power of whylogs and WhyLabs together!

Weather Forecast Dataset - Usage Example#

Open in Colab

This is an example demonstrating the usage of the Weather Forecast Dataset.

For more information about the dataset itself, check the documentation at: https://whylogs.readthedocs.io/en/latest/datasets/weather.html

Installing the datasets module#

Uncomment the cell below if you don’t have the datasets module installed:

[1]:
# Note: you may need to restart the kernel to use updated packages.
%pip install 'whylogs[datasets]'

Loading the Dataset#

You can load the dataset of your choice by calling it from the datasets module:

[2]:
from whylogs.datasets import Weather

dataset = Weather(version="in_domain")

This will create a folder named whylogs_data in the current directory containing the csv files for the Weather Dataset. If the files already exist, the module will not download them again.

Notice we’re specifying the version of the dataset. A dataset can have multiple versions for different purposes. In this case, the “in_domain” version has data from the same domain in both the baseline and inference subsets (data from the same set of regions - tropical, dry, polar, etc.).

If we’re interested in assessing drift issues, the “out_domain” version could be used, in which the inference subset contains out-of-domain data when compared to the baseline.

Similarly, datasets could have other versions for other purposes, such as assessing data quality or outlier detection strategies.

Discovering Information#

To see the available versions for a given dataset, you can call:

[3]:
Weather.describe_versions()
[3]:
('in_domain', 'out_domain')

To get an overall description of the dataset:

[4]:
print(Weather.describe()[:1000])
Weather Forecast Dataset
========================

The Weather Forecast Dataset contains meteorological features at a particular place (defined by latitude and longitude features) and time. This dataset can present data distribution shifts over both time and space.

The original data was sourced from the `Weather Prediction Dataset <https://github.com/Shifts-Project/shifts>`_. From the source data additional transformations were made, such as: feature renaming, feature selection and subsampling.
The original dataset is described in `Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks <https://arxiv.org/pdf/2107.07455.pdf>`_, by **Malinin, Andrey, et al.**

Usage
-----

You can follow this guide to see how to use the weather dataset:

.. toctree::
    :maxdepth: 1

    ../examples/datasets/weather


Versions and Data Partitions
----------------------------

Currently the dataset contains two versions: **in_domain** and **out_domain**. The task is the same fo

Note: the output was truncated to the first 1000 characters, as describe() prints a rather lengthy description.

Getting Baseline Data#

You can access data from two different partitions: the baseline dataset and inference dataset.

The baseline can be accessed as a whole, whereas the inference dataset can be accessed in periodic batches, defined by the user.

To get a baseline object, just call dataset.get_baseline():

[5]:
from whylogs.datasets import Weather

dataset = Weather(version="out_domain")

baseline = dataset.get_baseline()

baseline contains several attributes - one timestamp and five dataframes:

  • timestamp: the batch’s timestamp (at the start)

  • data: the complete dataframe

  • features: input features

  • target: output feature(s)

  • prediction: output prediction and, possibly, features such as uncertainty, confidence, probability

  • extra: metadata features that don’t fall into any of the previous categories, but still contain relevant information about the data.
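As a conceptual illustration of how these groups relate, the sketch below partitions a batch’s column names into the attribute groups described above. The column names come from the dataset shown later in this notebook, but the splitting rules are an assumption made here for illustration - the datasets module defines the actual splits internally.

```python
# Hypothetical sketch: partition a batch's columns into the attribute
# groups described above. The prefix rules are illustrative assumptions,
# not the datasets module's actual implementation.
def split_columns(columns):
    groups = {"features": [], "target": [], "prediction": [], "extra": []}
    for col in columns:
        if col.startswith("prediction_") or col == "uncertainty":
            groups["prediction"].append(col)   # model output + uncertainty
        elif col.startswith("meta_"):
            groups["extra"].append(col)        # metadata (location, climate)
        elif col == "temperature":
            groups["target"].append(col)       # ground-truth output feature
        else:
            groups["features"].append(col)     # everything else is an input
    return groups

cols = ["pressure", "dew_point_temperature", "meta_latitude",
        "meta_climate", "prediction_temperature", "uncertainty", "temperature"]
groups = split_columns(cols)
```

Under this sketch, batch.data would hold all columns, while the other dataframes hold the corresponding subset of columns.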

[6]:
baseline.timestamp
[6]:
datetime.datetime(2022, 9, 12, 0, 0, tzinfo=datetime.timezone.utc)
[7]:
baseline.extra.head()
[7]:
meta_latitude meta_longitude meta_climate
date
2022-09-12 00:00:00+00:00 28.702900 -105.964996 dry
2022-09-12 00:00:00+00:00 -35.165298 147.466003 mild temperate
2022-09-12 00:00:00+00:00 29.607300 -95.158798 mild temperate
2022-09-12 00:00:00+00:00 39.077999 -77.557503 mild temperate
2022-09-12 00:00:00+00:00 26.152599 -81.775299 mild temperate

Setting Parameters#

With set_parameters, you can specify the timestamps for both baseline and inference datasets, as well as the inference interval.

By default, the timestamps are set as:

  • Current date for the baseline dataset

  • Tomorrow’s date for the inference dataset

The user can set these timestamps to any given day, including the dataset’s original date.

The inference_interval defines the interval for each batch: ‘1d’ means that we will have daily batches, while ‘7d’ would mean weekly batches.
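The “Xd” interval format described above can be sketched as a small validator. This is an illustrative stand-in written here from the format description, not the datasets module’s actual parser:

```python
import re

def parse_interval(interval):
    """Parse an interval string like '1d' or '7d' into a number of days.

    Illustrative sketch of the "Xd" format, where X is an integer
    between 1 and 30; not the datasets module's actual parser.
    """
    match = re.fullmatch(r"(\d+)d", interval)
    if not match or not 1 <= int(match.group(1)) <= 30:
        raise ValueError(f"expected 'Xd' with X between 1 and 30, got {interval!r}")
    return int(match.group(1))

parse_interval("1d")   # daily batches
parse_interval("7d")   # weekly batches
```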

To set the timestamps to the original dataset’s date, set original to True, like below:

[8]:
# Currently, the inference interval takes a str in the format "Xd", where X is an integer between 1-30
dataset.set_parameters(inference_interval="1d", original=True)
[9]:
baseline = dataset.get_baseline()
baseline.timestamp
[9]:
datetime.datetime(2018, 9, 1, 0, 0, tzinfo=datetime.timezone.utc)

You can set the timestamps by using baseline_timestamp and inference_start_timestamp, along with the inference interval, like below:

[10]:
from datetime import datetime, timezone
now = datetime.now(timezone.utc)
dataset.set_parameters(baseline_timestamp=now, inference_start_timestamp=now, inference_interval="1d")

Note that we are passing a datetime converted to the UTC timezone. If a naive datetime is passed (with no timezone information), the local time zone is assumed and then converted to the proper datetime in UTC. Passing a naive datetime will trigger a warning letting you know of this behavior.
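If you want to avoid the warning, you can make a naive datetime timezone-aware yourself before passing it in. A minimal stdlib sketch:

```python
from datetime import datetime, timezone

naive = datetime(2022, 9, 12, 9, 30)     # no tzinfo: would be treated as local time
aware = naive.astimezone()               # attach the local timezone explicitly
as_utc = aware.astimezone(timezone.utc)  # convert the same instant to UTC
```

Both aware and as_utc represent the same instant; converting to UTC yourself just makes the behavior described above explicit.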

Note that if both original and a timestamp (baseline or inference) are passed simultaneously, the defined timestamp will be overwritten by the original dataset timestamp.

Getting Inference Data #1 - By Date#

You can get inference data in two different ways. The first is to specify the exact date you want, which will return a single batch:

[11]:
batch = dataset.get_inference_data(target_date=now)

You can access the attributes just as shown before:

[12]:
batch.timestamp
[12]:
datetime.datetime(2022, 9, 12, 0, 0, tzinfo=datetime.timezone.utc)
[13]:
batch.data
[13]:
height_sea_level sun_elevation pressure cmc_temperature_grad cmc_temperature dew_point_temperature absolute_humidity snow_depth rain_accumulated snow_accumulated ... snow_accumulated_grad ice_rain_grad iced_graupel_grad cloud_coverage_grad meta_latitude meta_longitude meta_climate prediction_temperature temperature uncertainty
date
2022-09-12 00:00:00+00:00 166.0 24.134473 749.287193 -0.670923 289.282080 285.220886 0.0090 0.000000 2.641950 0.00000 ... 0.0 0.0 0.0 -2.0 46.516667 29.483333 snow 17.459501 19.0 5.046475
2022-09-12 00:00:00+00:00 180.0 36.168942 738.731879 -3.726770 290.226721 284.868256 0.0095 0.000000 0.149825 0.00000 ... 0.0 0.0 0.0 -23.0 46.521900 26.910299 snow 15.650873 15.0 10.590467
2022-09-12 00:00:00+00:00 25.0 4.931765 753.034922 7.741565 280.471216 279.016144 0.0054 0.000000 3.536025 0.00000 ... 0.0 0.0 0.0 0.0 39.033333 125.783333 snow 9.651232 11.0 5.512176
2022-09-12 00:00:00+00:00 -11.0 22.337882 754.533835 2.323230 277.726721 274.868256 0.0043 0.181638 1.822488 0.00000 ... 0.0 0.0 0.0 -70.0 55.281898 -77.765297 snow 7.948395 8.0 3.395677
2022-09-12 00:00:00+00:00 119.0 58.290232 767.426533 0.235266 290.554565 283.905655 0.0116 0.000000 2.715500 0.00000 ... 0.0 0.0 0.0 -1.0 38.519901 -28.715900 polar 18.248093 18.0 2.023753
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2022-09-12 00:00:00+00:00 260.0 15.023227 741.406826 8.416382 279.342383 274.999222 0.0048 0.000000 14.099825 0.00000 ... 0.0 0.0 0.0 0.0 39.843498 -85.897102 snow 8.233539 7.0 5.549324
2022-09-12 00:00:00+00:00 48.0 30.655498 758.121661 -2.092969 303.854272 293.944244 0.0194 0.000000 1.182225 0.00000 ... 0.0 0.0 0.0 -9.0 -5.911420 -35.247700 polar 30.618527 30.0 2.319395
2022-09-12 00:00:00+00:00 99.0 19.245194 752.505533 -2.072693 290.389832 282.104599 0.0067 0.000000 0.088375 0.00000 ... 0.0 0.0 0.0 30.0 58.100000 38.683333 snow 16.601422 17.0 4.060273
2022-09-12 00:00:00+00:00 296.0 38.102269 734.381076 2.616138 268.365942 267.126648 0.0024 1.002448 0.000000 0.00366 ... 0.0 0.0 0.0 -1.0 66.580000 -61.620000 polar 1.004967 -1.0 3.510967
2022-09-12 00:00:00+00:00 48.0 -10.442588 755.390211 -0.808435 281.321216 277.766144 0.0060 0.000000 0.158425 0.00000 ... 0.0 0.0 0.0 -26.0 60.289167 5.226389 snow 7.546170 7.0 2.232020

100 rows × 54 columns

[14]:
batch.prediction.head()
[14]:
prediction_temperature uncertainty
date
2022-09-12 00:00:00+00:00 17.459501 5.046475
2022-09-12 00:00:00+00:00 15.650873 10.590467
2022-09-12 00:00:00+00:00 9.651232 5.512176
2022-09-12 00:00:00+00:00 7.948395 3.395677
2022-09-12 00:00:00+00:00 18.248093 2.023753

Getting Inference Data #2 - By Number of Batches#

The second way is to specify the number of batches you want, along with the date of the first batch.

You can then iterate over the returned object to get the batches and use each one any way you want. Here’s an example that retrieves daily batches over a period of 5 days and logs each one with whylogs, saving the binary profiles to disk:

[15]:
import whylogs as why
batches = dataset.get_inference_data(number_batches=5)

for batch in batches:
    print("logging batch of size {} for {}".format(len(batch.data), batch.timestamp))
    profile = why.log(batch.data).profile()
    profile.set_dataset_timestamp(batch.timestamp)
    profile.view().write("batch_{}".format(batch.timestamp))
logging batch of size 100 for 2022-09-12 00:00:00+00:00
logging batch of size 227 for 2022-09-13 00:00:00+00:00
logging batch of size 186 for 2022-09-14 00:00:00+00:00
logging batch of size 197 for 2022-09-15 00:00:00+00:00
logging batch of size 194 for 2022-09-16 00:00:00+00:00
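The five batch timestamps printed above follow directly from the parameters: each batch begins one inference_interval after the previous one, starting from inference_start_timestamp. A quick stdlib sketch of that schedule (illustrative only - the datasets module computes this internally):

```python
from datetime import datetime, timedelta, timezone

start = datetime(2022, 9, 12, tzinfo=timezone.utc)  # inference start timestamp
interval_days = 1                                    # "1d" interval
number_batches = 5

# One timestamp per batch, spaced interval_days apart
timestamps = [start + timedelta(days=interval_days * i) for i in range(number_batches)]
for ts in timestamps:
    print(ts.date())
```

This yields the dates 2022-09-12 through 2022-09-16, matching the batches logged above.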