whylogs

Submodules

Package Contents

Classes

SessionConfig

Config for a WhyLogs session.

WriterConfig

Config for WhyLogs writers

ColumnProfile

Statistics tracking for a column (i.e. a feature)

DatasetProfile

Statistics tracking for a dataset.

Functions

get_or_create_session()

Retrieve the current active global session.

reset_default_session()

Reset and deactivate the global WhyLogs logging session.

whylogs.__version__ = 0.0.2b22
class whylogs.SessionConfig(project: str, pipeline: str, writers: List[WriterConfig], verbose: bool = False)

Config for a WhyLogs session.

See also SessionConfigSchema

Parameters
  • project (str) – Project associated with this WhyLogs session

  • pipeline (str) – Name of the associated data pipeline

  • writers (list) – A list of WriterConfig objects defining writer outputs

  • verbose (bool, default=False) – Output verbosity

to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream)

Load config from yaml

Parameters

stream (str, file-obj) – String or file-like object to load yaml from

Returns

config – Generated config

Return type

SessionConfig

class whylogs.WriterConfig(type: str, formats: List[str], output_path: str, path_template: typing.Optional[str] = None, filename_template: typing.Optional[str] = None)

Config for WhyLogs writers

See also:

Parameters
  • type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’

  • formats (list) – All output formats. See ALL_SUPPORTED_FORMATS

  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_PATH_TEMPLATE

  • filename_template (str, optional) – Templatized output filename using standard python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE
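
The $identifier and ${identifier} syntax described above is that of Python's standard string.Template class. The sketch below shows how such templates expand. The template strings and the identifier names (name, session_id, timestamp) are assumptions made for illustration; the actual identifiers are whatever Writer.template_params() provides.

```python
from string import Template

# Hypothetical templates and identifiers, for illustration only. The real
# defaults live in whylogs.app.writers and may use different names.
path_template = Template("$name/${session_id}")
filename_template = Template("profile_${timestamp}")

params = {"name": "demo", "session_id": "abc123", "timestamp": "1600000000000"}

# substitute() raises KeyError if a referenced identifier is missing.
path = path_template.substitute(params)
filename = filename_template.substitute(params)

print(path)      # demo/abc123
print(filename)  # profile_1600000000000
```

The braced form ${identifier} is needed only when the identifier is immediately followed by characters that could be read as part of its name.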

to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream, **kwargs)

Load config from yaml

Parameters
  • stream (str, file-obj) – String or file-like object to load yaml from

  • kwargs – ignored

Returns

config – Generated config

Return type

WriterConfig

class whylogs.ColumnProfile(name: str, number_tracker: NumberTracker = None, string_tracker: StringTracker = None, schema_tracker: SchemaTracker = None, counters: CountersTracker = None, frequent_items: FrequentItemsSketch = None, cardinality_tracker: HllSketch = None)

Statistics tracking for a column (i.e. a feature)

The primary method for

Parameters
  • name (str (required)) – Name of the column profile

  • number_tracker (NumberTracker) – Implements numeric data statistics tracking

  • string_tracker (StringTracker) – Implements string data-type statistics tracking

  • schema_tracker (SchemaTracker) – Implements tracking of schema-related information

  • counters (CountersTracker) – Keep count of various things

  • frequent_items (FrequentItemsSketch) – Keep track of all frequent items, even for mixed datatype features

  • cardinality_tracker (HllSketch) – Track feature cardinality (even for mixed data types)

  • TODO

    • Proper TypedDataConverter type checking

    • Multi-threading/parallelism

track(self, value)

Add value to tracking statistics.

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

ColumnSummary

merge(self, other)

Merge this column profile with another.

Parameters

other (ColumnProfile) – The column profile to merge with

Returns

merged – A new, merged column profile.

Return type

ColumnProfile

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

ColumnMessage

static from_protobuf(message)

Load from a protobuf message

Returns

column_profile

Return type

ColumnProfile

class whylogs.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, tags: typing.Dict[str, str] = None, metadata: typing.Dict[str, str] = None, session_id: str = None)

Statistics tracking for a dataset.

A dataset refers to a collection of columns.

Parameters
  • name (str) – A human-readable name for the dataset profile, for example a model name. It is stored under the “name” tag

  • dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.

  • session_timestamp (datetime.datetime) – Timestamp of the logging session

  • columns (dict) – Dictionary lookup of `ColumnProfile`s

  • tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.

  • metadata (dict) – Metadata that can store an arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.

  • session_id (str) – The unique session ID run. Should be a UUID.

property name(self)
property tags(self)
property metadata(self)
property session_timestamp(self)
property session_timestamp_ms(self)

Return the session timestamp value in epoch milliseconds
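
As a point of reference, a datetime can be converted to epoch milliseconds as in the sketch below. This is not the library's implementation; the helper name to_epoch_ms is hypothetical, and the sketch assumes a timezone-aware timestamp.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def to_epoch_ms(ts: datetime.datetime) -> int:
    """Convert a timezone-aware datetime to whole epoch milliseconds.

    Hypothetical helper; uses integer timedelta division to avoid
    floating-point rounding.
    """
    return (ts - EPOCH) // datetime.timedelta(milliseconds=1)


session_ts = datetime.datetime(2020, 1, 1, tzinfo=datetime.timezone.utc)
session_ts_ms = to_epoch_ms(session_ts)
print(session_ts_ms)  # 1577836800000
```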

track(self, columns, data=None)

Add value(s) to tracking statistics for column(s)

Parameters
  • columns (str, dict) – Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.

  • data (object, None) – Value to track. Specify if columns is a string.
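
The two calling conventions described above can be sketched with a toy class. ToyProfile is a hypothetical stand-in that only records raw values; it is not DatasetProfile and does not compute any statistics. It illustrates only the string-versus-dictionary dispatch.

```python
from collections import defaultdict


class ToyProfile:
    """Hypothetical stand-in for a dataset profile; stores raw values only."""

    def __init__(self):
        self.values = defaultdict(list)

    def track_datum(self, column_name, data):
        # Record a single value for a single column.
        self.values[column_name].append(data)

    def track(self, columns, data=None):
        if isinstance(columns, str):
            # A column name: `data` holds the value to track.
            self.track_datum(columns, data)
        else:
            # A mapping of column name -> value: `data` is ignored.
            for name, value in columns.items():
                self.track_datum(name, value)


profile = ToyProfile()
profile.track("age", 42)                     # string form
profile.track({"age": 37, "city": "Oslo"})   # dictionary form
```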

track_datum(self, column_name, data)
track_array(self, x: np.ndarray, columns=None)

Track statistics for a numpy array

Parameters
  • x (np.ndarray) – 2D array to track.

  • columns (list) – Optional column labels

track_dataframe(self, df: pd.DataFrame)

Track statistics for a dataframe

Parameters

df (pandas.DataFrame) – DataFrame to track

to_properties(self)

Return dataset profile related metadata

Returns

properties – The metadata as a protobuf object.

Return type

DatasetProperties

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

DatasetSummary

flat_summary(self)

Generate and flatten a summary of the statistics.

See flatten_summary() for a description

_column_message_iterator(self)
chunk_iterator(self)

Generate an iterator to iterate over chunks of data

validate(self)

Sanity check for this object. Raises an AssertionError if invalid

merge(self, other)

Merge this profile with another dataset profile object.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) – The dataset profile to merge with

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile
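
The merge rules stated in this reference (tags must match, metadata of the other profile is dropped, and a new object is returned) can be illustrated with a toy model. ToyProfile and its per-column counts are hypothetical; in particular, raising ValueError on mismatched tags is an assumption of this sketch, not documented library behaviour.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyProfile:
    """Hypothetical stand-in that tracks only a value count per column."""

    tags: Dict[str, str]
    metadata: Dict[str, str] = field(default_factory=dict)
    counts: Dict[str, int] = field(default_factory=dict)

    def merge(self, other: "ToyProfile") -> "ToyProfile":
        if self.tags != other.tags:
            # Assumption: the error type is chosen for this sketch only.
            raise ValueError("tags must match to merge")
        merged_counts = dict(self.counts)
        for column, n in other.counts.items():
            merged_counts[column] = merged_counts.get(column, 0) + n
        # Keep this profile's metadata; the other profile's is dropped.
        return ToyProfile(dict(self.tags), dict(self.metadata), merged_counts)


a = ToyProfile({"name": "demo"}, {"host": "a"}, {"x": 2})
b = ToyProfile({"name": "demo"}, {"host": "b"}, {"x": 3, "y": 1})
merged = a.merge(b)
```

Neither input is modified, which mirrors the "new, merged" wording of the return description.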

serialize_delimited(self) → bytes

Write out in delimited format (data is prefixed with the length of the datastream).

This is useful when you are streaming multiple dataset profile objects

Returns

data – A sequence of bytes

Return type

bytes

to_protobuf(self) → DatasetProfileMessage

Return the object serialized as a protobuf message

Returns

message

Return type

DatasetProfileMessage

write_protobuf(self, protobuf_path: str, delimited_file: bool = True)

Write the dataset profile to disk in binary format

Parameters
  • protobuf_path – the local path for storage. The parent directory must already exist

  • delimited_file – whether to prefix the data with the length of output or not. Default is True

static read_protobuf(protobuf_path: str, delimited_file: bool = True)

Parse a protobuf file and return a DatasetProfile object

Parameters
  • protobuf_path – the path of the protobuf data

  • delimited_file – whether the data is delimited or not. Default is True

Returns

a DatasetProfile object if successful

Return type

whylogs.DatasetProfile

static from_protobuf(message: DatasetProfileMessage)

Load from a protobuf message

Parameters

message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()

Returns

dataset_profile

Return type

DatasetProfile

static from_protobuf_string(data: bytes)

Deserialize a serialized DatasetProfileMessage

Parameters

data (bytes) – The serialized message

Returns

profile – The deserialized dataset profile

Return type

DatasetProfile

static _parse_delimited_generator(data: bytes)
static parse_delimited_single(data: bytes, pos=0)

Parse a single delimited entry from a byte stream

Parameters
  • data (bytes) – The bytestream

  • pos (int) – The starting position. Default is zero

Returns

  • pos (int) – Current position in the stream after parsing

  • profile (DatasetProfile) – A dataset profile

static parse_delimited(data: bytes)

Parse delimited data (i.e. data prefixed with the message length).

Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.

Parameters

data (bytes) – The input byte stream

Returns

profiles – List of all Dataset profile objects

Return type

list
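
Length-delimited framing can be sketched in pure Python. Java protobuf's delimited format prefixes each message with its length encoded as a base-128 varint; the sketch below assumes that same framing and treats each message as opaque bytes. The function names are hypothetical and this is not the library's parser.

```python
from typing import List, Tuple


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode a varint at `pos`; return (value, position after it)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def write_delimited(payload: bytes) -> bytes:
    """Prefix a payload with its varint-encoded length."""
    return encode_varint(len(payload)) + payload


def split_single(data: bytes, pos: int = 0) -> Tuple[int, bytes]:
    """Read one entry at `pos`; return (new position, payload)."""
    length, pos = decode_varint(data, pos)
    end = pos + length
    return end, data[pos:end]


def split_all(data: bytes) -> List[bytes]:
    """Read every length-prefixed entry in the stream."""
    pos = 0
    payloads = []
    while pos < len(data):
        pos, payload = split_single(data, pos)
        payloads.append(payload)
    return payloads


stream = write_delimited(b"first") + write_delimited(b"x" * 300)
payloads = split_all(stream)
```

Because each entry carries its own length, several serialized profiles can be concatenated into one stream and recovered without any separator bytes.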

whylogs.get_or_create_session()

Retrieve the current active global session.

If no active session exists, attempt to load config and create a new session.

If an active session exists, return the session without loading new config.

Returns

session – The global active session

Return type

Session

whylogs.reset_default_session()

Reset and deactivate the global WhyLogs logging session.
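
The get-or-create and reset behaviour described for the global session follows a common lazy-singleton pattern, sketched below. ToySession, load_config, and the module-level _state dictionary are hypothetical; the sketch says nothing about how whylogs locates or loads its configuration.

```python
class ToySession:
    """Hypothetical stand-in for a logging session."""

    def __init__(self, config):
        self.config = config


# Module-level holder for the single active session.
_state = {"session": None}


def load_config():
    # Assumption: a real implementation would read a config file here.
    return {"project": "demo", "pipeline": "demo-pipeline"}


def get_or_create_session():
    """Return the active session, creating it from config on first use."""
    if _state["session"] is None:
        _state["session"] = ToySession(load_config())
    return _state["session"]


def reset_default_session():
    """Drop the active session so the next call creates a fresh one."""
    _state["session"] = None


first = get_or_create_session()
second = get_or_create_session()   # same object; config is not reloaded
reset_default_session()
third = get_or_create_session()    # a new session after the reset
```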