whylogs.core.datasetprofile

Defines the primary interface class for tracking dataset statistics.

Module Contents

Classes

DatasetProfile

Statistics tracking for a dataset.

Functions

columns_chunk_iterator(iterator, marker: str)

Create an iterator to return column messages in batches

flatten_summary(dataset_summary: whylogs.proto.DatasetSummary) → dict

Flatten a DatasetSummary

_quantile_strings(quantiles: list)

flatten_dataset_quantiles(dataset_summary: whylogs.proto.DatasetSummary)

Flatten quantiles from a dataset summary

flatten_dataset_histograms(dataset_summary: whylogs.proto.DatasetSummary)

Flatten histograms from a dataset summary

flatten_dataset_frequent_numbers(dataset_summary: whylogs.proto.DatasetSummary)

Flatten frequent number counts from a dataset summary

flatten_dataset_frequent_strings(dataset_summary: whylogs.proto.DatasetSummary)

Flatten frequent strings summaries from a dataset summary

get_dataset_frame(dataset_summary: whylogs.proto.DatasetSummary, mapping: dict = None)

Get a dataframe from scalar values flattened from a dataset summary

dataframe_profile(df: pandas.DataFrame, name: str = None, timestamp: datetime.datetime = None)

Generate a dataset profile for a dataframe

array_profile(x: numpy.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)

Generate a dataset profile for an array

Attributes

logger

cudfDataFrame

COLUMN_CHUNK_MAX_LEN_IN_BYTES

TYPENUM_COLUMN_NAMES

SCALAR_NAME_MAPPING

whylogs.core.datasetprofile.logger
whylogs.core.datasetprofile.cudfDataFrame
whylogs.core.datasetprofile.COLUMN_CHUNK_MAX_LEN_IN_BYTES
whylogs.core.datasetprofile.TYPENUM_COLUMN_NAMES
whylogs.core.datasetprofile.SCALAR_NAME_MAPPING
class whylogs.core.datasetprofile.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)

Statistics tracking for a dataset.

A dataset refers to a collection of columns.

Parameters
  • name (str) – A human-readable name for the dataset profile. Could be a model name. This is stored under the “name” tag.

  • dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.

  • session_timestamp (datetime.datetime) – Timestamp of the profiling session. Optional.

  • columns (dict) – Dictionary lookup of `ColumnProfile`s

  • tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.

  • metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.

  • session_id (str) – The unique ID of the session run. Should be a UUID.

  • constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.

property name(self)
property tags(self)
property metadata(self)
property session_timestamp(self)
property session_timestamp_ms(self)

Return the session timestamp value in epoch milliseconds.
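The conversion this property describes can be sketched in plain Python; to_epoch_ms is an illustrative helper, not the whylogs implementation:

```python
from datetime import datetime, timezone

def to_epoch_ms(dt: datetime) -> int:
    # Convert an aware datetime to epoch milliseconds,
    # matching the "epoch milliseconds" contract described above.
    return int(dt.timestamp() * 1000)

ts = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(to_epoch_ms(ts))  # 1609459200000
```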

add_output_field(self, field: Union[str, List[str]])
track_metrics(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)

Function to track metrics based on validation data.

The user may also pass the attribute names associated with the target, prediction, and/or score.

Parameters
  • targets (List[Union[str, bool, float, int]]) – actual validated values

  • predictions (List[Union[str, bool, float, int]]) – inferred/predicted values

  • scores (List[float], optional) – associated scores for each prediction; all values are set to 1 if not passed

  • target_field (str, optional) – Name of the attribute to use for targets.

  • prediction_field (str, optional) – Name of the attribute to use for predictions.

  • score_field (str, optional) – Name of the attribute to use for scores.

  • model_type (ModelType, optional) – Default is the classification model type.

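The documented default for scores (all values set to 1 when not passed) can be sketched as follows; default_scores is a hypothetical helper, not part of the whylogs API:

```python
def default_scores(predictions, scores=None):
    # When no confidence scores are supplied, every prediction is
    # treated as fully confident, mirroring the documented default.
    if scores is None:
        scores = [1.0] * len(predictions)
    return scores

print(default_scores(["cat", "dog"]))        # [1.0, 1.0]
print(default_scores(["cat"], [0.25]))       # [0.25]
```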

track(self, columns, data=None)

Add value(s) to tracking statistics for column(s).

Parameters
  • columns (str, dict) – Either the name of a column, or a dictionary mapping column names to the data (value) for each column. If a string, data must be supplied; otherwise, data is ignored.

  • data (object, None) – Value to track. Specify if columns is a string.
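The str-versus-dict dispatch described above can be illustrated with a dependency-free sketch (not the whylogs implementation, which updates per-column statistics rather than a plain dict):

```python
def track(columns, data=None):
    # Accept either a single column name plus a value, or a
    # dict of name -> value (in which case data is ignored).
    tracked = {}
    if isinstance(columns, str):
        tracked[columns] = data
    elif isinstance(columns, dict):
        tracked.update(columns)  # data is ignored here
    return tracked

print(track("age", 42))                 # {'age': 42}
print(track({"age": 42, "name": "x"}))  # {'age': 42, 'name': 'x'}
```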

track_datum(self, column_name, data)
track_array(self, x: numpy.ndarray, columns=None)

Track statistics for a numpy array

Parameters
  • x (np.ndarray) – 2D array to track.

  • columns (list) – Optional column labels
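The columnar split this method performs can be sketched with plain lists of rows instead of a numpy array; defaulting the labels to stringified column indices is an assumption for illustration:

```python
def track_array(x, columns=None):
    # Treat a 2D array (list of rows) as named columns; when labels
    # are omitted, fall back to "0", "1", ... (assumed default).
    n_cols = len(x[0])
    columns = columns or [str(i) for i in range(n_cols)]
    tracked = {c: [] for c in columns}
    for row in x:
        for c, v in zip(columns, row):
            tracked[c].append(v)
    return tracked

print(track_array([[1, "a"], [2, "b"]], ["num", "str"]))
```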

track_dataframe(self, df: pandas.DataFrame)

Track statistics for a dataframe

Parameters

df (pandas.DataFrame) – DataFrame to track

to_properties(self)

Return dataset profile related metadata

Returns

properties – The metadata as a protobuf object.

Return type

DatasetProperties

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

DatasetSummary

generate_constraints(self) → whylogs.core.statistics.constraints.DatasetConstraints

Assemble a sparse dict of constraints for all features.

Returns

summary – Protobuf constraints message.

Return type

DatasetConstraints

flat_summary(self)

Generate and flatten a summary of the statistics.

See flatten_summary() for a description

_column_message_iterator(self)
chunk_iterator(self)

Generate an iterator to iterate over chunks of data

validate(self)

Sanity check for this object. Raises an AssertionError if invalid

merge(self, other)

Merge this profile with another dataset profile object.

We will use metadata and timestamps from the current DatasetProfile in the result.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) –

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile
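The documented merge rules (tags must match, metadata and timestamps come from the current profile, the other profile's metadata is dropped) can be sketched with plain dicts; the profile shape here is assumed for illustration only:

```python
def merge_profiles(current, other):
    # Tags must match when merging, per the DatasetProfile contract.
    if current["tags"] != other["tags"]:
        raise ValueError("tags must match when merging profiles")
    # Combine per-column statistics (simple counts in this sketch).
    counts = dict(current["counts"])
    for col, n in other["counts"].items():
        counts[col] = counts.get(col, 0) + n
    # Metadata is taken from the current profile; 'other' metadata is dropped.
    return {"tags": current["tags"], "metadata": current["metadata"], "counts": counts}

a = {"tags": {"env": "prod"}, "metadata": {"v": "1"}, "counts": {"age": 3}}
b = {"tags": {"env": "prod"}, "metadata": {"v": "2"}, "counts": {"age": 2, "name": 1}}
merged = merge_profiles(a, b)
print(merged["counts"])  # {'age': 5, 'name': 1}
```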

_do_merge(self, other)
merge_strict(self, other)

Merge this profile with another dataset profile object. This raises an exception if the session_id, timestamps, or tags don’t match.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) –

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile

serialize_delimited(self) → bytes

Write out in delimited format (data is prefixed with the length of the datastream).

This is useful when you are streaming multiple dataset profile objects

Returns

data – A sequence of bytes

Return type

bytes

to_protobuf(self) → whylogs.proto.DatasetProfileMessage

Return the object serialized as a protobuf message

Returns

message

Return type

DatasetProfileMessage

write_protobuf(self, protobuf_path: str, delimited_file: bool = True)

Write the dataset profile to disk in binary format

Parameters
  • protobuf_path – the local path for storage. The parent directory must already exist

  • delimited_file – whether to prefix the data with the length of output or not. Default is True

static read_protobuf(protobuf_path: str, delimited_file: bool = True)

Parse a protobuf file and return a DatasetProfile object

Parameters
  • protobuf_path – the path of the protobuf data

  • delimited_file – whether the data is delimited or not. Default is True

Returns

a DatasetProfile object if successful

Return type

whylogs.DatasetProfile

static from_protobuf(message: whylogs.proto.DatasetProfileMessage)

Load from a protobuf message

Parameters

message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()

Returns

dataset_profile

Return type

DatasetProfile

static from_protobuf_string(data: bytes)

Deserialize a serialized DatasetProfileMessage

Parameters

data (bytes) – The serialized message

Returns

profile – The deserialized dataset profile

Return type

DatasetProfile

static _parse_delimited_generator(data: bytes)
static parse_delimited_single(data: bytes, pos=0)

Parse a single delimited entry from a byte stream

Parameters
  • data (bytes) – The bytestream

  • pos (int) – The starting position. Default is zero

Returns

  • pos (int) – Current position in the stream after parsing

  • profile (DatasetProfile) – A dataset profile

static parse_delimited(data: bytes)

Parse delimited data (i.e. data prefixed with the message length).

Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.

Parameters

data (bytes) – The input byte stream

Returns

profiles – List of all Dataset profile objects

Return type

list
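Delimited framing prefixes each message with its byte length encoded as a protobuf varint. The framing and the parse loop can be sketched without any protobuf dependency; split_delimited is an illustrative stand-in for the real parser, which decodes each payload into a DatasetProfile:

```python
def encode_varint(n: int) -> bytes:
    # Protobuf base-128 varint: 7 data bits per byte,
    # high bit set on all but the last byte.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def decode_varint(data: bytes, pos: int = 0):
    # Return (value, new position) after reading one varint.
    shift = result = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7

def split_delimited(data: bytes):
    # Walk the stream: read a length prefix, slice out the payload, repeat.
    pos, payloads = 0, []
    while pos < len(data):
        length, pos = decode_varint(data, pos)
        payloads.append(data[pos:pos + length])
        pos += length
    return payloads

stream = encode_varint(3) + b"abc" + encode_varint(2) + b"de"
print(split_delimited(stream))  # [b'abc', b'de']
```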

apply_summary_constraints(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)
whylogs.core.datasetprofile.columns_chunk_iterator(iterator, marker: str)

Create an iterator to return column messages in batches

Parameters
  • iterator – An iterator which returns protobuf column messages

  • marker – Value used to mark a group of column messages
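Batching messages under a byte budget can be sketched as a generator; chunk_messages is illustrative, with a small cap standing in for COLUMN_CHUNK_MAX_LEN_IN_BYTES:

```python
def chunk_messages(iterator, marker, max_len=10):
    # Group messages into chunks whose total size stays under
    # max_len bytes, tagging each chunk with the marker.
    chunk, size = [], 0
    for msg in iterator:
        if chunk and size + len(msg) > max_len:
            yield marker, chunk
            chunk, size = [], 0
        chunk.append(msg)
        size += len(msg)
    if chunk:
        yield marker, chunk

msgs = [b"aaaa", b"bbbb", b"cccc"]
print(list(chunk_messages(msgs, "m1")))
# [('m1', [b'aaaa', b'bbbb']), ('m1', [b'cccc'])]
```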

whylogs.core.datasetprofile.flatten_summary(dataset_summary: whylogs.proto.DatasetSummary) → dict

Flatten a DatasetSummary

Parameters

dataset_summary (DatasetSummary) – Summary to flatten

Returns

data

A dictionary with the following keys:

summary : pandas.DataFrame

Per-column summary statistics

hist : pandas.Series

Series of histogram Series with (column name, histogram) key, value pairs. Histograms are formatted as a pandas.Series

frequent_strings : pandas.Series

Series of frequent string counts with (column name, counts) key, value pairs. Counts are a pandas.Series.

Return type

dict

Notes

Some relevant info on the summary mapping:

>>> from whylogs.core.datasetprofile import SCALAR_NAME_MAPPING
>>> import json
>>> print(json.dumps(SCALAR_NAME_MAPPING, indent=2))
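The general shape of the flattening (nested per-column statistics collapsed into dotted scalar keys) can be sketched with plain dicts; this is an illustration of the idea, not the whylogs implementation:

```python
def flatten(summary, prefix=""):
    # Recursively collapse a nested summary dict into
    # "column.stat" style scalar keys.
    flat = {}
    for key, value in summary.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

summary = {"age": {"count": 100, "mean": 31.5}}
print(flatten(summary))  # {'age.count': 100, 'age.mean': 31.5}
```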
whylogs.core.datasetprofile._quantile_strings(quantiles: list)
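A helper like this plausibly renders quantile fractions as stable column labels; the exact label format is an assumption here (the function is private and undocumented):

```python
def quantile_strings(quantiles):
    # Render each quantile fraction as a fixed-width label,
    # e.g. 0.05 -> "quantile_0.0500" (assumed format).
    return ["quantile_{:.4f}".format(q) for q in quantiles]

print(quantile_strings([0.05, 0.5, 0.95]))
# ['quantile_0.0500', 'quantile_0.5000', 'quantile_0.9500']
```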
whylogs.core.datasetprofile.flatten_dataset_quantiles(dataset_summary: whylogs.proto.DatasetSummary)

Flatten quantiles from a dataset summary

whylogs.core.datasetprofile.flatten_dataset_histograms(dataset_summary: whylogs.proto.DatasetSummary)

Flatten histograms from a dataset summary

whylogs.core.datasetprofile.flatten_dataset_frequent_numbers(dataset_summary: whylogs.proto.DatasetSummary)

Flatten frequent number counts from a dataset summary

whylogs.core.datasetprofile.flatten_dataset_frequent_strings(dataset_summary: whylogs.proto.DatasetSummary)

Flatten frequent strings summaries from a dataset summary

whylogs.core.datasetprofile.get_dataset_frame(dataset_summary: whylogs.proto.DatasetSummary, mapping: dict = None)

Get a dataframe from scalar values flattened from a dataset summary

Parameters
  • dataset_summary (DatasetSummary) – The dataset summary.

  • mapping (dict, optional) – Override the default variable mapping.

Returns

summary – Scalar values, flattened and re-named according to mapping

Return type

pd.DataFrame
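The mapping-driven renaming can be sketched with plain dicts; rename_scalars is an illustrative stand-in for the SCALAR_NAME_MAPPING-driven renaming, and the key names below are examples only:

```python
def rename_scalars(flat, mapping=None):
    # Rename flattened scalar keys per the mapping,
    # keeping any unmapped key unchanged.
    mapping = mapping or {}
    return {mapping.get(k, k): v for k, v in flat.items()}

flat = {"number_summary.mean": 3.2, "counters.count": 10}
print(rename_scalars(flat, {"number_summary.mean": "mean"}))
# {'mean': 3.2, 'counters.count': 10}
```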

whylogs.core.datasetprofile.dataframe_profile(df: pandas.DataFrame, name: str = None, timestamp: datetime.datetime = None)

Generate a dataset profile for a dataframe

Parameters
  • df (pandas.DataFrame) – Dataframe to track, treated as a complete dataset.

  • name (str) – Name of the dataset

  • timestamp (datetime.datetime, float) – Timestamp of the dataset. Defaults to current UTC time. Can be a datetime or UTC epoch seconds.

Returns

prof

Return type

DatasetProfile
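The documented timestamp handling (a datetime, UTC epoch seconds, or a default of the current UTC time) can be sketched as follows; normalize_timestamp is a hypothetical helper, not part of the whylogs API:

```python
from datetime import datetime, timezone

def normalize_timestamp(ts=None):
    # Accept a datetime or UTC epoch seconds; default to now (UTC).
    if ts is None:
        return datetime.now(timezone.utc)
    if isinstance(ts, (int, float)):
        return datetime.fromtimestamp(ts, tz=timezone.utc)
    return ts

print(normalize_timestamp(1609459200))  # 2021-01-01 00:00:00+00:00
```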

whylogs.core.datasetprofile.array_profile(x: numpy.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)

Generate a dataset profile for an array

Parameters
  • x (np.ndarray) – Array-like object to track. Will be treated as a full dataset.

  • name (str) – Name of the dataset

  • timestamp (datetime.datetime) – Timestamp of the dataset. Defaults to current UTC time

  • columns (list) – Optional column labels

Returns

prof

Return type

DatasetProfile