`whylogs`¶

Subpackages¶

Submodules¶

whylogs._version

Package Contents¶

Classes¶

`SessionConfig`	Config for a whylogs session.
`WriterConfig`	Config for whylogs writers.
`ColumnProfile`	Statistics tracking for a column (i.e. a feature)
`DatasetProfile`	Statistics tracking for a dataset.

Functions¶

`get_or_create_session`(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)	Retrieve the current active global session.
`reset_default_session`()	Reset and deactivate the global whylogs logging session.
`start_whylabs_session`(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)
`enable_mlflow`(session=None) → bool	Enable whylogs in `mlflow` module via `mlflow.whylogs`.

Attributes¶

__version__

whylogs.__version__ = 0.7.8¶

class whylogs.SessionConfig(project: str, pipeline: str, writers: List[WriterConfig], metadata: Optional[MetadataConfig] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = 1, report_progress: bool = False)¶

Config for a whylogs session.

See also:

WriterConfigSchema
whylogs.app.writers.Writer
whylogs.app.writers.writer_from_config()

Parameters

type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’
formats (list) – All output formats. See ALL_SUPPORTED_FORMATS
output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’
path_template (str, optional) – Templatized output path using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_PATH_TEMPLATE
filename_template (str, optional) – Templatized output filename using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE
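
The `$identifier` / `${identifier}` syntax described above is that of Python's standard `string.Template`. The following is a minimal sketch of how such a template expands. The template strings and the variable names `name` and `session_id` are hypothetical placeholders chosen for illustration; the variables actually available are the ones listed by whylogs.app.writers.Writer.template_params().

```python
from string import Template

# Hypothetical template and values, for illustration only; the real
# variable names come from Writer.template_params().
path_template = Template("$name/${session_id}")
params = {"name": "my-dataset", "session_id": "abc123"}

# substitute() raises KeyError if a referenced variable is missing.
rendered_path = path_template.substitute(params)
print(rendered_path)  # my-dataset/abc123

# safe_substitute() leaves unknown placeholders untouched instead.
partial = Template("$name/$unknown").safe_substitute(params)
print(partial)  # my-dataset/$unknown
```

The braced form `${identifier}` is useful when the placeholder is immediately followed by characters that could otherwise be read as part of the identifier.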

to_yaml(self, stream=None)¶

Serialize this config to YAML

Parameters: stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream, **kwargs)¶

Load config from yaml

Parameters

stream (str, file-obj) – String or file-like object to load yaml from
kwargs – ignored

Returns

config – Generated config

Return type

WriterConfig

whylogs.get_or_create_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)¶

Retrieve the current active global session.

If no active session exists, attempt to load config and create a new session.

If an active session exists, return the session without loading new config.

Returns: The global active session
Return type: Session

whylogs.reset_default_session()¶: Reset and deactivate the global whylogs logging session.
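
The get-or-create and reset behaviour described above can be modelled with a small toy registry. This is only a sketch of the documented semantics, not the whylogs implementation: `ToySession` and `ToySessionRegistry` are hypothetical stand-ins, and no config file is actually read.

```python
from typing import Optional


class ToySession:
    """Hypothetical stand-in for a session; only remembers its config path."""

    def __init__(self, config_path: Optional[str]):
        self.config_path = config_path


class ToySessionRegistry:
    """Toy model of a single global active session."""

    _active: Optional[ToySession] = None

    @classmethod
    def get_or_create(cls, path_to_config: Optional[str] = None) -> ToySession:
        if cls._active is None:
            # No active session: the only point where config is consulted.
            cls._active = ToySession(path_to_config)
        # An active session is returned as-is; path_to_config is ignored.
        return cls._active

    @classmethod
    def reset(cls) -> None:
        # Deactivate the session so the next call creates a fresh one.
        cls._active = None


first = ToySessionRegistry.get_or_create("a.yaml")
second = ToySessionRegistry.get_or_create("b.yaml")  # "b.yaml" is not loaded
ToySessionRegistry.reset()
third = ToySessionRegistry.get_or_create("b.yaml")   # new session, new config
```

In this model, passing a different config path to a second call has no effect until the session is reset, which mirrors the statement that an active session is returned without loading new config.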

whylogs.start_whylabs_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)¶

class whylogs.ColumnProfile(name: str, number_tracker: whylogs.core.statistics.NumberTracker = None, string_tracker: whylogs.core.statistics.StringTracker = None, schema_tracker: whylogs.core.statistics.SchemaTracker = None, counters: whylogs.core.statistics.CountersTracker = None, frequent_items: whylogs.util.dsketch.FrequentItemsSketch = None, cardinality_tracker: whylogs.core.statistics.hllsketch.HllSketch = None, constraints: whylogs.core.statistics.constraints.ValueConstraints = None)¶

Statistics tracking for a column (i.e. a feature)

The primary method for

Parameters

name (str (required)) – Name of the column profile
number_tracker (NumberTracker) – Implements numeric data statistics tracking
string_tracker (StringTracker) – Implements string data-type statistics tracking
schema_tracker (SchemaTracker) – Implements tracking of schema-related information
counters (CountersTracker) – Keep count of various things
frequent_items (FrequentItemsSketch) – Keep track of all frequent items, even for mixed datatype features
cardinality_tracker (HllSketch) – Track feature cardinality (even for mixed data types)
constraints (ValueConstraints) – Static assertions to be applied to numeric data tracked in this column
TODO –
- Proper TypedDataConverter type checking
- Multi-threading/parallelism

track(self, value, character_list=None, token_method=None)¶: Add value to tracking statistics.

_unique_count_summary(self) → whylogs.proto.UniqueCountSummary¶

to_summary(self)¶

Generate a summary of the statistics

Returns: summary – Protobuf summary message.
Return type: ColumnSummary

generate_constraints(self) → whylogs.core.statistics.constraints.SummaryConstraints¶

merge(self, other)¶

Merge this column profile with another.

Parameters: other (ColumnProfile) –
Returns: merged – A new, merged column profile.
Return type: ColumnProfile

to_protobuf(self)¶

Return the object serialized as a protobuf message

Returns: message
Return type: ColumnMessage

static from_protobuf(message)¶

Load from a protobuf message

Returns: column_profile
Return type: ColumnProfile

class whylogs.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, multi_columns: whylogs.core.MultiColumnProfile = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)¶

Statistics tracking for a dataset.

A dataset refers to a collection of columns.

Parameters

name (str) – A human-readable name for the dataset profile. Could be the model name. This is stored under the “name” tag.
dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.
session_timestamp (datetime.datetime) – Timestamp of the dataset
columns (dict) – Dictionary lookup of `ColumnProfile`s
tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.
metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.
session_id (str) – The unique ID of the session run. Should be a UUID.
constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.

__getstate__(self)¶

__setstate__(self, serialized_profile)¶

property name(self)¶

property tags(self)¶

property metadata(self)¶

property session_timestamp(self)¶

property session_timestamp_ms(self)¶: Return the session timestamp value in epoch milliseconds.
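
As an illustration of what "epoch milliseconds" means here, the sketch below converts a timezone-aware `datetime` into an integer number of milliseconds since 1970-01-01 UTC. The helper `to_epoch_ms` is a hypothetical name used for this example; the conversion whylogs performs internally may differ in detail (for instance in how naive datetimes are handled).

```python
from datetime import datetime, timedelta, timezone


def to_epoch_ms(ts: datetime) -> int:
    """Convert a timezone-aware datetime to integer epoch milliseconds."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Floor division by a 1 ms timedelta stays in integer arithmetic,
    # avoiding float rounding.
    return (ts - epoch) // timedelta(milliseconds=1)


session_ts = datetime(2021, 1, 1, tzinfo=timezone.utc)
session_ts_ms = to_epoch_ms(session_ts)
print(session_ts_ms)  # 1609459200000
```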

property total_row_number(self)¶

add_output_field(self, field: Union[str, List[str]])¶

track_metrics(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)¶

Function to track metrics based on validation data.

The user may also pass the attribute names associated with the target, prediction, and/or score.

Parameters

targets (List[Union[str, bool, float, int]]) – actual validated values
predictions (List[Union[str, bool, float, int]]) – inferred/predicted values
scores (List[float], optional) – associated scores for each inferred value; all values are set to 1 if not passed
target_field (str, optional) – attribute name associated with the targets
prediction_field (str, optional) – attribute name associated with the predictions
score_field (str, optional) – attribute name associated with the scores
model_type (ModelType, optional) – Default is Classification type.

track(self, columns, data=None, character_list=None, token_method=None)¶

Add value(s) to tracking statistics for column(s).

Parameters

columns (str, dict) – Either the name of a column, or a dictionary mapping column names to the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.
data (object, None) – Value to track. Specify if columns is a string.

track_datum(self, column_name, data, character_list=None, token_method=None)¶

track_multi_column(self, columns)¶

track_array(self, x: numpy.ndarray, columns=None)¶

Track statistics for a numpy array

Parameters

x (np.ndarray) – 2D array to track.
columns (list) – Optional column labels

track_dataframe(self, df: pandas.DataFrame, character_list=None, token_method=None)¶

Track statistics for a dataframe

Parameters: df (pandas.DataFrame) – DataFrame to track

to_properties(self)¶

Return dataset profile related metadata

Returns: properties – The metadata as a protobuf object.
Return type: DatasetProperties

to_summary(self)¶

Generate a summary of the statistics

Returns: summary – Protobuf summary message.
Return type: DatasetSummary

generate_constraints(self) → whylogs.core.statistics.constraints.DatasetConstraints¶

Assemble a sparse dict of constraints for all features.

Returns: summary – Protobuf constraints message.
Return type: DatasetConstraints

flat_summary(self)¶

Generate and flatten a summary of the statistics.

See flatten_summary() for a description

_column_message_iterator(self)¶

chunk_iterator(self)¶: Generate an iterator to iterate over chunks of data

validate(self)¶: Sanity check for this object. Raises an AssertionError if invalid

merge(self, other)¶

Merge this profile with another dataset profile object.

We will use metadata and timestamps from the current DatasetProfile in the result.

This operation will drop the metadata from the ‘other’ profile object.

Parameters: other (DatasetProfile) –
Returns: merged – New, merged DatasetProfile
Return type: DatasetProfile

_do_merge(self, other)¶

merge_strict(self, other)¶

Merge this profile with another dataset profile object. Raises an exception if session_id, timestamps, and tags don’t match.

This operation will drop the metadata from the ‘other’ profile object.

Parameters: other (DatasetProfile) –
Returns: merged – New, merged DatasetProfile
Return type: DatasetProfile

serialize_delimited(self) → bytes¶

Write out in delimited format (data is prefixed with the length of the datastream).

This is useful when you are streaming multiple dataset profile objects

Returns: data – A sequence of bytes
Return type: bytes

to_protobuf(self) → whylogs.proto.DatasetProfileMessage¶

Return the object serialized as a protobuf message

Returns: message
Return type: DatasetProfileMessage

write_protobuf(self, protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) → None¶

Write the dataset profile to disk in binary format

Parameters

protobuf_path (str) – local path or any path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how. The parent directory must already exist.
delimited_file (bool, optional) – whether to prefix the data with the length of output or not. Default is True

static read_protobuf(protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) → DatasetProfile¶

Parse a protobuf file and return a DatasetProfile object

Parameters

protobuf_path (str) – the path of the protobuf data, can be local or any other path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how
delimited_file (bool, optional) – whether the data is delimited or not. Default is True

Returns

whylogs.DatasetProfile object from the protobuf

Return type

DatasetProfile

static from_protobuf(message: whylogs.proto.DatasetProfileMessage) → DatasetProfile¶

Load from a protobuf message

Parameters: message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()
Returns: dataset_profile
Return type: DatasetProfile

static from_protobuf_string(data: bytes) → DatasetProfile¶

Deserialize a serialized DatasetProfileMessage

Parameters: data (bytes) – The serialized message
Returns: profile – The deserialized dataset profile
Return type: DatasetProfile

static _parse_delimited_generator(data: bytes)¶

static parse_delimited_single(data: bytes, pos=0)¶

Parse a single delimited entry from a byte stream.

Parameters

data (bytes) – The byte stream
pos (int) – The starting position. Default is zero

Returns

pos (int) – Current position in the stream after parsing
profile (DatasetProfile) – A dataset profile

static parse_delimited(data: bytes)¶

Parse delimited data (i.e. data prefixed with the message length).

Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.

Parameters: data (bytes) – The input byte stream
Returns: profiles – List of all Dataset profile objects
Return type: list
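
The length-prefixed layout described above can be sketched with plain bytes. This is a minimal illustration, assuming the prefix is a base-128 varint as written by Java protobuf's `writeDelimitedTo`. The `toy_*` functions are hypothetical helpers, not part of whylogs, and the payloads are opaque byte strings rather than real serialized dataset profiles.

```python
from typing import List, Tuple


def toy_encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (low 7 bits first)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def toy_decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (value, position just after the varint)."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def toy_write_delimited(payload: bytes) -> bytes:
    """Prefix the payload with its length."""
    return toy_encode_varint(len(payload)) + payload


def toy_parse_delimited_single(data: bytes, pos: int = 0) -> Tuple[int, bytes]:
    """Read one entry; return (position after the entry, payload)."""
    size, pos = toy_decode_varint(data, pos)
    end = pos + size
    return end, data[pos:end]


def toy_parse_delimited(data: bytes) -> List[bytes]:
    """Read entries back to back until the stream is exhausted."""
    entries, pos = [], 0
    while pos < len(data):
        pos, payload = toy_parse_delimited_single(data, pos)
        entries.append(payload)
    return entries


payloads = [b"first", b"x" * 300, b"last"]
stream = b"".join(toy_write_delimited(p) for p in payloads)
decoded = toy_parse_delimited(stream)
```

Because each entry carries its own length, several messages can be concatenated into one stream and split apart again without any separator bytes.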

apply_summary_constraints(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)¶

apply_table_shape_constraints(self, table_shape_constraints: Optional[whylogs.core.statistics.constraints.SummaryConstraints] = None)¶

whylogs.enable_mlflow(session=None) → bool¶

Enable whylogs in mlflow module via mlflow.whylogs.

Returns: True if MLFlow has been patched. False otherwise.

Example of whylogs and MLFlow¶

import mlflow
import whylogs

whylogs.enable_mlflow()

import pandas as pd
pdf = pd.DataFrame(
    data=[[1, 2, 3, 4, True, "x", bytes([1])]],
    columns=["b", "d", "a", "c", "e", "g", "f"],
    dtype=object,  # np.object was removed in NumPy 1.24; the builtin object is equivalent
)

active_run = mlflow.start_run()

# log a Pandas dataframe under default name
mlflow.whylogs.log_pandas(pdf)

# log a Pandas dataframe with custom name
mlflow.whylogs.log_pandas(pdf, "another dataset")

# Finish the MLFlow run
mlflow.end_run()
