whylogs.core.datasetprofile¶
Defines the primary interface class for tracking dataset statistics.
Module Contents¶
Classes¶
DatasetProfile | Statistics tracking for a dataset.
Functions¶
columns_chunk_iterator | Create an iterator to return column messages in batches
flatten_summary | Flatten a DatasetSummary
_quantile_strings |
flatten_dataset_quantiles | Flatten quantiles from a dataset summary
flatten_dataset_histograms | Flatten histograms from a dataset summary
flatten_dataset_frequent_numbers | Flatten frequent number counts from a dataset summary
flatten_dataset_frequent_strings | Flatten frequent strings summaries from a dataset summary
get_dataset_frame | Get a dataframe from scalar values flattened from a dataset summary
dataframe_profile | Generate a dataset profile for a dataframe
array_profile | Generate a dataset profile for an array
Attributes¶
-
whylogs.core.datasetprofile.
logger
¶
-
whylogs.core.datasetprofile.
cudfDataFrame
¶
-
whylogs.core.datasetprofile.
COLUMN_CHUNK_MAX_LEN_IN_BYTES
¶
-
whylogs.core.datasetprofile.
TYPENUM_COLUMN_NAMES
¶
-
whylogs.core.datasetprofile.
SCALAR_NAME_MAPPING
¶
-
class
whylogs.core.datasetprofile.
DatasetProfile
(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)¶ Statistics tracking for a dataset.
A dataset refers to a collection of columns.
- Parameters
name (str) – A human-readable name for the dataset profile, for example a model name. This is stored under the “name” tag.
dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.
session_timestamp (datetime.datetime) – Timestamp of the session in which the profile was created
columns (dict) – Dictionary lookup of `ColumnProfile`s
tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.
metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.
session_id (str) – The unique ID of the session run. Should be a UUID.
constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.
-
property
name
(self)¶
-
property
metadata
(self)¶
-
property
session_timestamp
(self)¶
-
property
session_timestamp_ms
(self)¶ Return the session timestamp value in epoch milliseconds.
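As an illustration of what "epoch milliseconds" means here, the following is a minimal standalone sketch. The helper `to_epoch_ms` is hypothetical and is not part of whylogs; treating naive datetimes as UTC is an assumption made for this example only.

```python
from datetime import datetime, timezone


def to_epoch_ms(ts: datetime) -> int:
    """Convert a datetime to integer milliseconds since the Unix epoch."""
    if ts.tzinfo is None:
        # Assumption for this sketch: naive datetimes are interpreted as UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return int(round(ts.timestamp() * 1000))


session_ts = datetime(2021, 1, 1, tzinfo=timezone.utc)
print(to_epoch_ms(session_ts))
```

Passing timezone-aware datetimes avoids any ambiguity about which local time zone a naive value refers to.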
-
add_output_field
(self, field: Union[str, List[str]])¶
-
track_metrics
(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)¶ Function to track metrics based on validation data.
The user may also pass the attribute names associated with the target, prediction, and/or score.
- Parameters
targets (List[Union[str, bool, float, int]]) – actual validated values
predictions (List[Union[str, bool, float, int]]) – inferred/predicted values
scores (List[float], optional) – associated scores for each inferred value; all values are set to 1 if not passed
target_field (str, optional) – name of the attribute associated with the targets
prediction_field (str, optional) – name of the attribute associated with the predictions
score_field (str, optional) – name of the attribute associated with the scores
model_type (ModelType, optional) – Default is Classification type.
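The default for scores described above can be sketched without whylogs. The function `pair_counts` below is a hypothetical helper, not the library's implementation: it fills in a score of 1 for every prediction when no scores are passed and tallies (target, prediction) pairs, which is the raw material for a classification confusion matrix.

```python
from collections import Counter
from typing import Counter as CounterType, List, Optional, Tuple, Union

Label = Union[str, bool, float, int]


def pair_counts(
    targets: List[Label],
    predictions: List[Label],
    scores: Optional[List[float]] = None,
) -> Tuple[CounterType[Tuple[Label, Label]], List[float]]:
    """Tally (target, prediction) pairs and return them with the scores used."""
    if scores is None:
        # Mirrors the documented default: every score is 1 if none are passed.
        scores = [1.0] * len(predictions)
    if not (len(targets) == len(predictions) == len(scores)):
        raise ValueError("targets, predictions and scores must have equal length")
    return Counter(zip(targets, predictions)), scores


counts, used_scores = pair_counts(["cat", "dog", "cat"], ["cat", "cat", "cat"])
print(counts, used_scores)
```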
-
track
(self, columns, data=None)¶ Add value(s) to tracking statistics for column(s).
- Parameters
columns (str, dict) – Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.
data (object, None) – Value to track. Specify if columns is a string.
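The two calling conventions of `track` can be pictured with a small stand-in class. `TinyTracker` below is hypothetical and only stores raw values in lists; it is a sketch of the string-versus-dictionary dispatch described above, not of how whylogs computes statistics.

```python
from collections import defaultdict


class TinyTracker:
    """Toy stand-in that records raw values per column."""

    def __init__(self):
        self.values = defaultdict(list)

    def track_datum(self, column_name, data):
        self.values[column_name].append(data)

    def track(self, columns, data=None):
        if isinstance(columns, str):
            # Single column: the value comes from `data`.
            self.track_datum(columns, data)
        elif isinstance(columns, dict):
            # Mapping of column name -> value: `data` is ignored.
            for name, value in columns.items():
                self.track_datum(name, value)
        else:
            raise TypeError("columns must be a str or a dict")


tracker = TinyTracker()
tracker.track("age", 42)
tracker.track({"age": 37, "city": "Lisbon"})
print(dict(tracker.values))
```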
-
track_datum
(self, column_name, data)¶
-
track_array
(self, x: numpy.ndarray, columns=None)¶ Track statistics for a numpy array
- Parameters
x (np.ndarray) – 2D array to track.
columns (list) – Optional column labels
-
track_dataframe
(self, df: pandas.DataFrame)¶ Track statistics for a dataframe
- Parameters
df (pandas.DataFrame) – DataFrame to track
-
to_properties
(self)¶ Return dataset profile related metadata
- Returns
properties – The metadata as a protobuf object.
- Return type
DatasetProperties
-
to_summary
(self)¶ Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
DatasetSummary
-
generate_constraints
(self) → whylogs.core.statistics.constraints.DatasetConstraints¶ Assemble a sparse dict of constraints for all features.
- Returns
summary – Protobuf constraints message.
- Return type
DatasetConstraints
-
flat_summary
(self)¶ Generate and flatten a summary of the statistics.
See flatten_summary() for a description.
-
_column_message_iterator
(self)¶
-
chunk_iterator
(self)¶ Generate an iterator to iterate over chunks of data
-
validate
(self)¶ Sanity check for this object. Raises an AssertionError if invalid
-
merge
(self, other)¶ Merge this profile with another dataset profile object.
We will use metadata and timestamps from the current DatasetProfile in the result.
This operation will drop the metadata from the ‘other’ profile object.
- Parameters
other (DatasetProfile) – The profile to merge with this one
- Returns
merged – New, merged DatasetProfile
- Return type
DatasetProfile
-
_do_merge
(self, other)¶
-
merge_strict
(self, other)¶ Merge this profile with another dataset profile object. This throws an exception if session_id, timestamps, and tags don’t match.
This operation will drop the metadata from the ‘other’ profile object.
- Parameters
other (DatasetProfile) – The profile to merge with this one
- Returns
merged – New, merged DatasetProfile
- Return type
DatasetProfile
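The difference between `merge` and `merge_strict` can be sketched with a toy class. `ToyProfile`, its `counts` field, and the use of `ValueError` are all assumptions for illustration; the real class tracks far richer statistics and the exception type it raises is not stated above.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyProfile:
    """Hypothetical stand-in for a dataset profile (per-column counts only)."""

    name: str
    session_id: str
    dataset_timestamp: int
    tags: Dict[str, str] = field(default_factory=dict)
    metadata: Dict[str, str] = field(default_factory=dict)
    counts: Dict[str, int] = field(default_factory=dict)

    def _do_merge(self, other: "ToyProfile") -> "ToyProfile":
        merged = dict(self.counts)
        for column, n in other.counts.items():
            merged[column] = merged.get(column, 0) + n
        # Metadata and timestamps come from `self`; `other.metadata` is dropped.
        return ToyProfile(
            name=self.name,
            session_id=self.session_id,
            dataset_timestamp=self.dataset_timestamp,
            tags=dict(self.tags),
            metadata=dict(self.metadata),
            counts=merged,
        )

    def merge(self, other: "ToyProfile") -> "ToyProfile":
        return self._do_merge(other)

    def merge_strict(self, other: "ToyProfile") -> "ToyProfile":
        mine = (self.session_id, self.dataset_timestamp, self.tags)
        theirs = (other.session_id, other.dataset_timestamp, other.tags)
        if mine != theirs:
            raise ValueError("session_id, timestamps and tags must match")
        return self._do_merge(other)


a = ToyProfile("a", "s1", 1, {"env": "prod"}, {"owner": "x"}, {"col": 2})
b = ToyProfile("b", "s2", 2, {"env": "prod"}, {"owner": "y"}, {"col": 3, "z": 1})
merged = a.merge(b)
print(merged.metadata, merged.counts)
```

In this sketch both methods return a new object and leave the inputs untouched, matching the "New, merged DatasetProfile" wording in the Returns entries.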
-
serialize_delimited
(self) → bytes¶ Write out in delimited format (data is prefixed with the length of the datastream).
This is useful when you are streaming multiple dataset profile objects.
- Returns
data – A sequence of bytes
- Return type
bytes
-
to_protobuf
(self) → whylogs.proto.DatasetProfileMessage¶ Return the object serialized as a protobuf message
- Returns
message
- Return type
DatasetProfileMessage
-
write_protobuf
(self, protobuf_path: str, delimited_file: bool = True)¶ Write the dataset profile to disk in binary format
- Parameters
protobuf_path – the local path for storage. The parent directory must already exist
delimited_file – whether to prefix the data with the length of output or not. Default is True
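Because the parent directory must already exist, a caller may want to create it before writing. The helper `ensure_parent_dir` below is hypothetical and uses only the standard library; the file name `profile.bin` is likewise just an example.

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path: str) -> Path:
    """Create the parent directory of `path` if needed and return the path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(f"{tmp}/profiles/2021/profile.bin")
    parent_created = target.parent.is_dir()
    # A call such as profile.write_protobuf(str(target)) could follow here.
    print(parent_created)
```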
-
static
read_protobuf
(protobuf_path: str, delimited_file: bool = True)¶ Parse a protobuf file and return a DatasetProfile object
- Parameters
protobuf_path – the path of the protobuf data
delimited_file – whether the data is delimited or not. Default is True
- Returns
a DatasetProfile object if successful
- Return type
DatasetProfile
-
static
from_protobuf
(message: whylogs.proto.DatasetProfileMessage)¶ Load from a protobuf message
- Parameters
message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()
- Returns
dataset_profile
- Return type
DatasetProfile
-
static
from_protobuf_string
(data: bytes)¶ Deserialize a serialized DatasetProfileMessage
- Parameters
data (bytes) – The serialized message
- Returns
profile – The deserialized dataset profile
- Return type
DatasetProfile
-
static
_parse_delimited_generator
(data: bytes)¶
-
static
parse_delimited_single
(data: bytes, pos=0)¶ Parse a single delimited entry from a byte stream
- Parameters
data (bytes) – The bytestream
pos (int) – The starting position. Default is zero
- Returns
pos (int) – Current position in the stream after parsing
profile (DatasetProfile) – A dataset profile
-
static
parse_delimited
(data: bytes)¶ Parse delimited data (i.e. data prefixed with the message length).
Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.
- Parameters
data (bytes) – The input byte stream
- Returns
profiles – List of all Dataset profile objects
- Return type
list
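To make the delimited format concrete, here is a self-contained sketch of length-prefixed framing using base-128 varints, the length encoding protobuf uses for delimited messages. It operates on raw byte payloads rather than `DatasetProfile` objects, and the names `frame`, `parse_one`, and `parse_all` are hypothetical; they only mirror the shape of `serialize_delimited`, `parse_delimited_single`, and `parse_delimited`.

```python
from typing import List, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("varint length must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position after the varint)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def frame(payload: bytes) -> bytes:
    """Prefix a payload with its length."""
    return encode_varint(len(payload)) + payload


def parse_one(data: bytes, pos: int = 0) -> Tuple[int, bytes]:
    """Parse one framed payload; return (new position, payload)."""
    size, pos = decode_varint(data, pos)
    end = pos + size
    if end > len(data):
        raise ValueError("truncated stream")
    return end, data[pos:end]


def parse_all(data: bytes) -> List[bytes]:
    """Parse every framed payload in the stream."""
    payloads, pos = [], 0
    while pos < len(data):
        pos, payload = parse_one(data, pos)
        payloads.append(payload)
    return payloads


stream = frame(b"a" * 300) + frame(b"xyz")
print([len(p) for p in parse_all(stream)])
```

Returning the updated position from `parse_one` is what lets a caller walk a stream of concatenated messages one entry at a time.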
-
apply_summary_constraints
(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)¶
-
whylogs.core.datasetprofile.
columns_chunk_iterator
(iterator, marker: str)¶ Create an iterator to return column messages in batches
- Parameters
iterator – An iterator which returns protobuf column messages
marker – Value used to mark a group of column messages
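One way to picture batching with a marker is the generator below. It is a sketch, not the library's implementation: `chunk_by_size` and its `max_len` parameter are hypothetical, messages are plain `bytes` rather than protobuf column messages, and bounding batches by total byte length is an assumption suggested by the `COLUMN_CHUNK_MAX_LEN_IN_BYTES` attribute name.

```python
from typing import Iterable, Iterator, List, Tuple


def chunk_by_size(
    messages: Iterable[bytes], marker: str, max_len: int
) -> Iterator[Tuple[str, List[bytes]]]:
    """Yield (marker, batch) pairs whose total payload size stays within max_len."""
    batch: List[bytes] = []
    size = 0
    for msg in messages:
        # Start a new batch when adding this message would exceed the limit.
        if batch and size + len(msg) > max_len:
            yield marker, batch
            batch, size = [], 0
        batch.append(msg)
        size += len(msg)
    if batch:
        yield marker, batch


batches = list(chunk_by_size([b"aa", b"bbb", b"c", b"dddd"], "m", max_len=5))
print(batches)
```

A message larger than `max_len` still gets emitted, alone in its own batch, so nothing is dropped.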
-
whylogs.core.datasetprofile.
flatten_summary
(dataset_summary: whylogs.proto.DatasetSummary) → dict¶ Flatten a DatasetSummary
- Parameters
dataset_summary (DatasetSummary) – Summary to flatten
- Returns
data –
A dictionary with the following keys:
- summary (pandas.DataFrame)
Per-column summary statistics
- hist (pandas.Series)
Series of histogram Series with (column name, histogram) key, value pairs. Histograms are formatted as a pandas.Series
- frequent_strings (pandas.Series)
Series of frequent string counts with (column name, counts) key, value pairs. Counts are a pandas.Series.
- Return type
dict
Notes
Some relevant info on the summary mapping:
>>> from whylogs.core.datasetprofile import SCALAR_NAME_MAPPING
>>> import json
>>> print(json.dumps(SCALAR_NAME_MAPPING, indent=2))
-
whylogs.core.datasetprofile.
_quantile_strings
(quantiles: list)¶
-
whylogs.core.datasetprofile.
flatten_dataset_quantiles
(dataset_summary: whylogs.proto.DatasetSummary)¶ Flatten quantiles from a dataset summary
-
whylogs.core.datasetprofile.
flatten_dataset_histograms
(dataset_summary: whylogs.proto.DatasetSummary)¶ Flatten histograms from a dataset summary
-
whylogs.core.datasetprofile.
flatten_dataset_frequent_numbers
(dataset_summary: whylogs.proto.DatasetSummary)¶ Flatten frequent number counts from a dataset summary
-
whylogs.core.datasetprofile.
flatten_dataset_frequent_strings
(dataset_summary: whylogs.proto.DatasetSummary)¶ Flatten frequent strings summaries from a dataset summary
-
whylogs.core.datasetprofile.
get_dataset_frame
(dataset_summary: whylogs.proto.DatasetSummary, mapping: dict = None)¶ Get a dataframe from scalar values flattened from a dataset summary
- Parameters
dataset_summary (DatasetSummary) – The dataset summary.
mapping (dict, optional) – Override the default variable mapping.
- Returns
summary – Scalar values, flattened and re-named according to mapping
- Return type
pd.DataFrame
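The idea of flattening scalar values and renaming them through a mapping can be sketched with plain dictionaries. Everything below is hypothetical: `TOY_MAPPING` is not the contents of `SCALAR_NAME_MAPPING`, the nested summary is a stand-in for a `DatasetSummary`, and the result is a dict of rows rather than a pandas DataFrame.

```python
from typing import Any, Dict, Optional

# Hypothetical mapping for illustration only.
TOY_MAPPING = {
    "counters/count": "count",
    "number_summary/min": "min",
    "number_summary/max": "max",
}


def flatten(nested: Dict[str, Any], prefix: str = "", sep: str = "/") -> Dict[str, Any]:
    """Flatten nested dicts into a single dict keyed by joined paths."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        path = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


def scalar_rows(
    summary: Dict[str, Dict[str, Any]], mapping: Optional[Dict[str, str]] = None
) -> Dict[str, Dict[str, Any]]:
    """Build one flat, renamed row per column; unmapped scalars are skipped."""
    mapping = TOY_MAPPING if mapping is None else mapping
    rows = {}
    for column, stats in summary.items():
        flat = flatten(stats)
        rows[column] = {mapping[k]: v for k, v in flat.items() if k in mapping}
    return rows


toy_summary = {
    "age": {
        "counters": {"count": 3},
        "number_summary": {"min": 1.0, "max": 9.0, "mean": 4.0},
    }
}
print(scalar_rows(toy_summary))
```

Passing a custom `mapping` plays the role of the optional override: only the keys it names are kept, under the names it assigns.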
-
whylogs.core.datasetprofile.
dataframe_profile
(df: pandas.DataFrame, name: str = None, timestamp: datetime.datetime = None)¶ Generate a dataset profile for a dataframe
- Parameters
df (pandas.DataFrame) – Dataframe to track, treated as a complete dataset.
name (str) – Name of the dataset
timestamp (datetime.datetime, float) – Timestamp of the dataset. Defaults to current UTC time. Can be a datetime or UTC epoch seconds.
- Returns
prof
- Return type
DatasetProfile
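The timestamp handling described for `dataframe_profile` (a datetime, UTC epoch seconds, or a default of the current UTC time) can be sketched as a small normalizer. `normalize_timestamp` is a hypothetical helper written for this example; how whylogs performs the conversion internally is not shown above.

```python
from datetime import datetime, timezone
from typing import Optional, Union


def normalize_timestamp(ts: Optional[Union[datetime, int, float]] = None) -> datetime:
    """Return a datetime from a datetime, UTC epoch seconds, or None (now, UTC)."""
    if ts is None:
        return datetime.now(timezone.utc)
    if isinstance(ts, datetime):
        return ts
    if isinstance(ts, (int, float)):
        # Epoch seconds are interpreted as UTC.
        return datetime.fromtimestamp(ts, tz=timezone.utc)
    raise TypeError("timestamp must be a datetime, a number, or None")


print(normalize_timestamp(1609459200))
```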
-
whylogs.core.datasetprofile.
array_profile
(x: numpy.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)¶ Generate a dataset profile for an array
- Parameters
x (np.ndarray) – Array-like object to track. Will be treated as a full dataset
name (str) – Name of the dataset
timestamp (datetime.datetime) – Timestamp of the dataset. Defaults to current UTC time
columns (list) – Optional column labels
- Returns
prof
- Return type
DatasetProfile