Visit docs.whylabs.ai for up-to-date documentation

whylogs API reference

Profile and monitor your ML data pipeline end-to-end

whylogs is a library for building insights into your data and minimizing data monitoring issues in order to maintain quality and improve communication between teams. To learn more about generating, validating, documenting, and profiling your data, read our intro and our Getting Started guide.

Attention

This site is a work in progress. If you have questions, ask them in our Slack channel!

Overview

Introduction

whylogs is an open source data quality library that uses advanced data science statistics to log and monitor data for your AI/ML application. whylogs is designed to scale with your MLOps workflow, from local development to production terabyte-size datasets.

Whether you are running an experimentation or production pipeline, understanding the properties of the data that flows through your application is critical to the success of your ML project. whylogs enables advanced statistical collection using lightweight techniques, such as building sketches for data, that enable complex monitoring and data quality checks for your pipeline.

Key Features

  • Data Insight: whylogs provides complex statistics across different stages of your ML/AI pipelines and applications.

  • Scalability: whylogs scales with your system, from local development mode to live production systems in multi-node clusters, and works well with batch and streaming architectures.

  • Lightweight: whylogs produces small mergeable lightweight outputs in a variety of formats, using sketching algorithms and summarizing statistics.

  • Unified data instrumentation: To enable data engineering pipelines and ML pipelines to share a common framework for tracking data quality and drifts, the whylogs library supports multiple languages and integrations.

  • Observability: In addition to supporting traditional monitoring approaches, whylogs data can support advanced ML-focused analytics, error analysis, and data quality and data drift detection.

Getting Started

The whylogs library comes with a quickstart CLI to help you initialize the configuration. You can also use the API directly, without going through the CLI.

Quick Start

Install the Library

Install our library in a Python 3.6+ environment.

pip install whylogs

Configuration

To get started, you can generate a simple configuration file with the whylogs CLI:

whylogs init

A whylogs config file contains the following parameters:

  • project sets the name of the project.

  • pipeline specifies the pipeline to be used.

  • verbose sets output verbosity. Its default value is false.

  • writers specifies how and where output is stored, using path and filename templates that take the following variables:

    • project

    • pipeline

    • dataset_name

    • dataset_timestamp

    • session_timestamp

An example config file is sketched below.
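For illustration, a minimal .whylogs.yaml might look like the following. The exact keys emitted by whylogs init can vary by version, and all values here are placeholders:

project: example-project
pipeline: example-pipeline
verbose: false
writers:
- type: local
  formats:
  - protobuf
  output_path: whylogs-output
  path_template: $name/$session_id
  filename_template: dataset_profile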

whylogs.app.config.load_config() loads your config file. It attempts to load files at the following paths, in order:

  1. The path set in the WHYLOGS_CONFIG environment variable

  2. The current directory’s .whylogs.yaml file

  3. ~/.whylogs.yaml (in the home directory)

  4. /opt/whylogs/.whylogs.yaml

Using whylogs API

Initialize a Logging Session

A minimal example of creating a logging session is sketched below.
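As a minimal sketch, a session can be created with get_or_create_session (documented under whylogs.app.session below), which loads configuration from the standard search paths:

from whylogs.app.session import get_or_create_session

# Loads .whylogs.yaml from the documented search paths,
# or creates a default session if no config file is found
session = get_or_create_session()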

Create a Logger

Loggers log statistical information about your data. They have the following parameters:

  • dataset_name sets the name of the dataset, to be used in DatasetProfile metadata and generated filenames.

  • dataset_timestamp sets a timestamp for the data.

  • session_timestamp sets a timestamp for the creation of the session.

  • writers provides a list of writers that will be used to create the DatasetProfile.

  • verbose sets the verbosity of the output.

For more information, see the documentation for the logger class.

The example below uses logger options to control the output location.
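A sketch of that pattern, assuming the session object from the previous step; the dataset name is a placeholder:

# Writers configured for the session determine where output lands;
# dataset_name feeds the DatasetProfile metadata and generated filenames
with session.logger(dataset_name="example-dataset") as logger:
    logger.log({"feature_a": 1.0, "feature_b": "red"})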

Configure a Writer

Writers write the statistics gathered by the logger into an output file. They use the following parameters to create output file paths:

  • output_path sets the location where output files will be stored. Use a directory path if your writer type = 'local', or a key prefix for type = 's3'.

  • formats lists all supported output formats.

  • path_template optionally sets an output path using Python string templates.

  • filename_template optionally sets output filenames using Python string templates.

  • dataset_timestamp sets a timestamp for the data.

  • session_timestamp sets a timestamp for the creation of the session.

For more information, see the documentation for the writer class.
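As a sketch, an S3 writer entry in the session config might look like the following (bucket and prefix are placeholders):

writers:
- type: s3
  output_path: example-bucket/whylogs-profiles
  formats:
  - protobuf
  path_template: $name/$session_id
  filename_template: dataset_profile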

Output whylogs data

whylogs supports the following output formats:

  • Protobuf is a lightweight binary format that maps one-to-one with the memory representation of a whylogs object. Use this format if you plan to apply advanced transformations to whylogs output.

  • JSON displays the protobuf data in JSON format.

  • Flat outputs multiple files with both CSV and JSON content to represent different views of the data, including histograms, upperbound, lowerbound, and frequent values.

WhyLabs Platform Sandbox

Check out WhyLabs Platform Sandbox to see how whylogs can be used for large-scale data monitoring and visualization in enterprise settings.

Concepts

  • A batch is a collection of datapoints, often grouped by time.

  • In batch mode, whylogs processes a dataset in batches.

  • A dataset is a collection of related data that will be analyzed together. whylogs accepts tabular data: each column of the table represents a particular variable, and each row represents a record of the dataset. When used alongside a statistical model, the dataset often represents features as columns, with additional columns for the output. More complex data formats will be supported in the future.

  • A DatasetProfile is a collection of summary statistics and related metadata for a dataset that whylogs has processed.

  • Data Sketches are a class of algorithms that efficiently extract information from large or streaming datasets in a single pass. This term is sometimes used to refer specifically to the Apache DataSketches project.

  • A logger represents the whylogs tracking object for a given dataset (in batch mode) or a collection of data points (in streaming mode). A logger is always associated with a timestamp for its creation and a timestamp for the dataset. Different loggers may write to different storage systems using different output formats.

  • Metadata is data that describes either a dataset or information from whylogs’ processing of the dataset.

  • The output formats whylogs supports are protobuf, JSON, and flat. Protobuf is a lightweight binary format that maps one-to-one with the memory representation of a whylogs object. JSON displays the protobuf data in JSON format. Flat outputs multiple files with both CSV and JSON content to represent different views of the data, including histograms, upperbound, lowerbound, and frequent values. To apply advanced transformation on whylogs, we recommend using Protobuf.

  • A pipeline consists of the components data moves through, as well as any infrastructure associated with those components. A project may have multiple ML pipelines, but it’s common to have one pipeline for a multi-stage project.

  • Project refers to the project name. A whylogs project is usually associated with one or more ML models. When logging a dataset without a specified name, the system defaults to the project name.

  • A record is an observation of data. whylogs represents this as a map of keys (string data - feature names) to values (numerical/textual data).

  • A session represents your configuration for how your application interacts with whylogs, including logger configuration, input and output formats. Using a single session for your application is recommended.

  • Storage systems: whylogs supports output to local storage and AWS S3.

  • In streaming mode, whylogs processes individual data points.

  • Summary statistics are metrics that describe, or summarize, a set of observations.

License

Apache License

Version 2.0, January 2004

http://www.apache.org/licenses/

TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

  1. Definitions.

    “License” shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.

    “Licensor” shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.

    “Legal Entity” shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, “control” means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.

    “You” (or “Your”) shall mean an individual or Legal Entity exercising permissions granted by this License.

    “Source” form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.

    “Object” form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.

    “Work” shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).

    “Derivative Works” shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.

    “Contribution” shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as “Not a Contribution.”

    “Contributor” shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.

  2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.

  3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.

  4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:

    1. You must give any other recipients of the Work or Derivative Works a copy of this License; and

    2. You must cause any modified files to carry prominent notices stating that You changed the files; and

    3. You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and

    4. If the Work includes a “NOTICE” text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.

    You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

  5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.

  6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.

  7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.

  8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.

  9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets “[]” replaced with your own identifying information. (Don’t include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same “printed page” as the copyright notice for easier identification within third-party archives.

Copyright [yyyy] [name of copyright owner]

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

API Reference

This page contains auto-generated API reference documentation.

whylogs

Subpackages

whylogs.app

The whylogs client application API

Submodules
whylogs.app.config

Classes/functions for configuring the whylogs app

Module Contents
Classes

WriterType

Generic enumeration.

TransportParameterConfig

TransportParameterConfigSchema

Marshmallow schema for WriterConfig class.

WriterConfig

Config for whylogs writers

MetadataConfig

Config for whylogs metadata

SessionConfig

Config for a whylogs session.

WriterConfigSchema

Marshmallow schema for WriterConfig class.

MetadataConfigSchema

Marshmallow schema for MetadataConfig class.

SessionConfigSchema

Marshmallow schema for SessionConfig class.

Functions

load_config(path_to_config: str = None)

Load logging configuration, from disk and from the environment.

Attributes

SUPPORTED_WRITERS

WHYLOGS_YML

ALL_SUPPORTED_FORMATS

Supported output formats for whylogs writer configuration

SegmentTag

SegmentTags

class whylogs.app.config.WriterType

Bases: enum.Enum

Generic enumeration.

Derive from this class to define new enumerations.

local
s3
whylabs
mlflow
whylogs.app.config.SUPPORTED_WRITERS
whylogs.app.config.WHYLOGS_YML = .whylogs.yaml
whylogs.app.config.ALL_SUPPORTED_FORMATS

Supported output formats for whylogs writer configuration

whylogs.app.config.SegmentTag
whylogs.app.config.SegmentTags
class whylogs.app.config.TransportParameterConfig(endpoint_url: str, aws_access_key_id: str, aws_secret_access_key: str, region_name: str, verify: str)
class whylogs.app.config.TransportParameterConfigSchema

Bases: marshmallow.Schema

Marshmallow schema for WriterConfig class.

endpoint_url
aws_access_key_id
aws_secret_access_key
region_name
verify
make_writer(self, data, **kwargs)
class whylogs.app.config.WriterConfig(type: str, formats: Optional[List[str]] = None, output_path: Optional[str] = None, path_template: Optional[str] = None, filename_template: Optional[str] = None, data_collection_consent: Optional[bool] = None, transport_parameters: Optional[TransportParameterConfig] = None)

Config for whylogs writers

See also WriterConfigSchema

Parameters
  • type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’

  • formats (list) – All output formats. See whylogs.app.config.ALL_SUPPORTED_FORMATS

  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • path_template (str, optional) – Templatized path output using standard python string templates. Default = DEFAULT_PATH_TEMPLATE

  • filename_template (str, optional) – Templatized output filename using standard python string templates. Default = DEFAULT_FILENAME_TEMPLATE
to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream, **kwargs)

Load config from yaml

Parameters
  • stream (str, file-obj) – String or file-like object to load yaml from

  • kwargs – ignored

Returns

config – Generated config

Return type

WriterConfig
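As a sketch, a WriterConfig can be round-tripped through YAML with the methods above (values are placeholders):

from whylogs.app.config import WriterConfig

# Build a local writer config and serialize it
writer_config = WriterConfig(type="local", formats=["protobuf"], output_path="whylogs-output")
yaml_text = writer_config.to_yaml()  # returns a string because stream is None

# Parse the YAML back into an equivalent WriterConfig
restored = WriterConfig.from_yaml(yaml_text)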

class whylogs.app.config.MetadataConfig(type: str, output_path: str, input_path: Optional[str] = '', path_template: Optional[str] = None)

Config for whylogs metadata

See also MetadataConfigSchema

Parameters
  • type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’

  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • input_path (str) – Path to search for pre-calculated segment files. Paths separated by ‘:’.

  • path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifers. Default = whylogs.app.metadata_writer.DEFAULT_PATH_TEMPLATE

to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream, **kwargs)

Load config from yaml

Parameters
  • stream (str, file-obj) – String or file-like object to load yaml from

  • kwargs – ignored

Returns

config – Generated config

Return type

WriterConfig

class whylogs.app.config.SessionConfig(project: str, pipeline: str, writers: List[WriterConfig], metadata: Optional[MetadataConfig] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = 1, report_progress: bool = False)

Config for a whylogs session.

See also SessionConfigSchema

Parameters
  • project (str) – Project associated with this whylogs session

  • pipeline (str) – Name of the associated data pipeline

  • writers (list) – A list of WriterConfig objects defining writer outputs

  • metadata (MetadataConfig) – A MetadataConfiguration object. If none, will replace with default.

  • verbose (bool, default=False) – Output verbosity

  • with_rotation_time (str, optional, default=None) – Rotation interval for profiles: “s” for seconds, “m” for minutes, “h” for hours, “d” for days

  • cache_size (int, default=1) – How many dataset profiles to cache in the logger during rotation

to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream)

Load config from yaml

Parameters

stream (str, file-obj) – String or file-like object to load yaml from

Returns

config – Generated config

Return type

SessionConfig
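A minimal sketch of building a SessionConfig in code rather than from a YAML file (names are placeholders):

from whylogs.app.config import SessionConfig, WriterConfig

session_config = SessionConfig(
    project="example-project",
    pipeline="example-pipeline",
    writers=[WriterConfig(type="local", formats=["protobuf"], output_path="whylogs-output")],
)
print(session_config.to_yaml())  # serialize for inspection or reuse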

class whylogs.app.config.WriterConfigSchema

Bases: marshmallow.Schema

Marshmallow schema for WriterConfig class.

class Meta
unknown
type
formats
output_path
path_template
filename_template
transport_parameters
make_writer(self, data, **kwargs)
class whylogs.app.config.MetadataConfigSchema

Bases: marshmallow.Schema

Marshmallow schema for MetadataConfig class.

type
output_path
input_path
path_template
make_metadata(self, data, **kwargs)
class whylogs.app.config.SessionConfigSchema

Bases: marshmallow.Schema

Marshmallow schema for SessionConfig class.

project
pipeline
with_rotation_time
cache
verbose
writers
metadata
make_session(self, data, **kwargs)
whylogs.app.config.load_config(path_to_config: str = None)

Load logging configuration, from disk and from the environment.

Config is loaded by attempting to load files in the following order. The first valid file will be used

  1. Path set in WHYLOGS_CONFIG environment variable

  2. Current directory’s .whylogs.yaml file

  3. ~/.whylogs.yaml (home directory)

  4. /opt/whylogs/.whylogs.yaml path

Returns

config – Config for the logger, if a valid config file is found, else returns None.

Return type

SessionConfig, None
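For example, a caller can fall back to defaults when no config file is found (a sketch):

from whylogs.app.config import load_config

config = load_config()  # walks the four documented locations in order
if config is None:
    print("No valid .whylogs.yaml found; a default session config will be needed")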

whylogs.app.logger

Class and functions for whylogs logging

Module Contents
Classes

Logger

Class for logging whylogs statistics.

Functions

hash_segment(seg: List[Dict]) → str

Attributes

SegmentTag

Segment

_TAG_PREFIX

_TAG_KEY

_TAG_VALUE

logger

whylogs.app.logger.SegmentTag
whylogs.app.logger.Segment
whylogs.app.logger._TAG_PREFIX = whylogs.tag.
whylogs.app.logger._TAG_KEY = key
whylogs.app.logger._TAG_VALUE = value
whylogs.app.logger.logger
class whylogs.app.logger.Logger(session_id: str, dataset_name: str, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Optional[Dict[str, str]] = None, metadata: Optional[Dict[str, str]] = None, writers: Optional[List[whylogs.app.writers.Writer]] = None, metadata_writer: Optional[whylogs.app.metadata_writer.MetadataWriter] = None, verbose: bool = False, with_rotation_time: Optional[str] = None, interval: int = 1, cache_size: int = 1, segments: Optional[Union[List[Segment], List[str], str]] = None, profile_full_dataset: bool = False, constraints: Optional[whylogs.core.statistics.constraints.DatasetConstraints] = None)

Class for logging whylogs statistics.

Parameters
  • session_id – The session ID value. Should be set by the Session object

  • dataset_name – The name of the dataset. Gets included in the DatasetProfile metadata and can be used in generated filenames.

  • dataset_timestamp – Optional. The timestamp that the logger represents

  • session_timestamp – Optional. The time the session was created

  • tags – Optional. Dictionary of key, value for aggregating data upstream

  • metadata – Optional. Dictionary of key, value. Useful for debugging (associated with every single dataset profile)

  • writers – Optional. List of Writer objects used to write out the data

  • metadata_writer – Optional. MetadataWriter object used to write non-profile information

  • with_rotation_time – Optional. Log rotation interval, consisting of digits with a unit specification, e.g. 30s, 2h, d. Units are seconds (“s”), minutes (“m”), hours (“h”), or days (“d”). Output filenames will have a suffix reflecting the rotation interval.

  • interval – Deprecated: Interval multiplier for with_rotation_time, defaults to 1.

  • verbose – enable debug logging

  • cache_size – dataprofiles to cache

  • segments

    Can be either:
    • Autosegmentation source, one of [“auto”, “local”]

    • List of tag key value pairs for tracking data segments

    • List of tag keys for which we will track every value

    • None, no segments will be used

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset.

  • constraints – static assertions to be applied to streams and summaries.

__enter__(self)
__exit__(self, exc_type, exc_val, exc_tb)
property profile(self) → whylogs.core.DatasetProfile
Returns

the last backing dataset profile

Return type

DatasetProfile

tracking_checks(self)
property segmented_profiles(self) → Dict[str, whylogs.core.DatasetProfile]
Returns

the last backing dataset profile

Return type

Dict[str, DatasetProfile]

get_segment(self, segment: Segment) → Optional[whylogs.core.DatasetProfile]
set_segments(self, segments: Union[List[Segment], List[str], str]) → None
_retrieve_local_segments(self) → Union[List[Segment], List[str], str]

Retrieves local segments

_intialize_profiles(self, dataset_timestamp: Optional[datetime.datetime] = datetime.datetime.now(datetime.timezone.utc)) → None
_set_rotation(self, with_rotation_time: str = None)
rotate_when(self, time)
should_rotate(self)
_rotate_time(self)

rotate with time add a suffix

flush(self, rotation_suffix: Optional[str] = None)

Synchronously perform all remaining write tasks

full_profile_check(self) → bool

returns a bool to determine if unsegmented dataset should be profiled.

close(self) → Optional[whylogs.core.DatasetProfile]

Flush and close out the logger, outputs the last profile

Returns

the result dataset profile. None if the logger is closed

log(self, features: Optional[Dict[str, any]] = None, feature_name: Optional[str] = None, value: any = None, character_list: Optional[str] = None, token_method: Optional[Callable] = None)

Logs a collection of features or a single feature (must specify one or the other).

Parameters
  • features – a map of key value feature for model input

  • feature_name – name of a single feature. Cannot be specified if ‘features’ is specified

  • value – value of a single feature. Cannot be specified if ‘features’ is specified
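A short sketch of both calling styles, assuming an active logger (feature names and values are placeholders):

# Log several features at once from a dict of name -> value
logger.log({"age": 32, "income": 52000.0})

# Or log a single feature by name and value
logger.log(feature_name="age", value=32)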

log_segment_datum(self, feature_name, value, character_list: str = None, token_method: Optional[Callable] = None)
log_metrics(self, targets, predictions, scores=None, model_type: whylogs.proto.ModelType = None, target_field=None, prediction_field=None, score_field=None)
log_image(self, image, feature_transforms: Optional[List[Callable]] = None, metadata_attributes: Optional[List[str]] = METADATA_DEFAULT_ATTRIBUTES, feature_name: str = '')

API to track an image, either in PIL format or as an input path

Parameters
  • feature_name – name of the feature

  • metadata_attributes – metadata attributes to extract for the images

  • feature_transforms – a list of callables to transform the input into metrics

log_local_dataset(self, root_dir, folder_feature_name='folder_feature', image_feature_transforms=None, show_progress=False)

Log a local folder dataset. It will log data from the files, along with file structure data like metadata and magic numbers. If the folder has a single layer of child folders, folder names will be picked up as a segmented feature.

Parameters
  • show_progress – showing the progress bar

  • image_feature_transforms – image transform that you would like to use with the image log

  • root_dir (str) – directory where dataset is located.

  • folder_feature_name (str, optional) – Name for the subfolder features, i.e. class, store etc.

log_annotation(self, annotation_data)

Log structured annotation data, i.e. JSON-like structures

Parameters

annotation_data (Dict or List) – Description

log_csv(self, filepath_or_buffer: Union[str, pathlib.Path, IO[AnyStr]], segments: Optional[Union[List[Segment], List[str]]] = None, profile_full_dataset: bool = False, **kwargs)

Log a CSV file. This supports the same parameters as the pandas.read_csv function.

Parameters
  • filepath_or_buffer – the path to the CSV or a CSV buffer

  • segments – define either a list of segment keys or a list of segments tags: [ {“key”:<featurename>,”value”: <featurevalue>},… ]

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset

  • **kwargs – passed through to pandas.read_csv

log_dataframe(self, df, segments: Optional[Union[List[Segment], List[str]]] = None, profile_full_dataset: bool = False)

Generate and log a whylogs DatasetProfile from a pandas dataframe.

Parameters
  • df – the pandas dataframe to log

  • segments – specify the tag key value pairs for segments

  • profile_full_dataset – when segmenting the dataset, an option to keep the full unsegmented profile of the dataset

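A minimal sketch of profiling a pandas dataframe with an active logger (the dataframe is a placeholder):

import pandas as pd

df = pd.DataFrame({"age": [32, 45], "income": [52000.0, 61000.0]})
logger.log_dataframe(df)  # tracks summary statistics for each column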
log_segments(self, data)
log_segments_keys(self, data)
log_fixed_segments(self, data)
log_df_segment(self, df, segment: Segment)
is_active(self)

Return the boolean state of the logger

static _prefix_segment_tags(segment_key_values)
whylogs.app.logger.hash_segment(seg: List[Dict]) → str
whylogs.app.metadata_writer
Module Contents
Classes

MetadataWriter

Class for writing metadata to disk

Functions

metadata_from_config(config: whylogs.app.config.MetadataConfig)

Construct a whylogs MetadataWriter from a MetadataConfig

Attributes

DEFAULT_PATH_TEMPLATE

logger

whylogs.app.metadata_writer.DEFAULT_PATH_TEMPLATE = $name/metadata
whylogs.app.metadata_writer.logger
class whylogs.app.metadata_writer.MetadataWriter(output_path: str, input_path: Optional[str] = '', path_template: Optional[str] = None, writer_type: Optional[str] = 'local')

Class for writing metadata to disk

Parameters
  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See MetadataWriter.template_params() for a list of available identifers. Default = DEFAULT_PATH_TEMPLATE

path_suffix(self, name) → str

Generate a path string for an output path from the given arguments by applying the path templating defined in self.path_template

autosegmentation_write(self, name: str, segments: Union[List[Dict], List[str]]) → None
autosegmentation_read(self)
whylogs.app.metadata_writer.metadata_from_config(config: whylogs.app.config.MetadataConfig)

Construct a whylogs MetadataWriter from a MetadataConfig

Returns

metadata_writer – whylogs metadata writer

Return type

MetadataWriter

whylogs.app.output_formats

Define available output formats

Module Contents
Classes

OutputFormat

List of output formats that we support.

Attributes

SUPPORTED_OUTPUT_FORMATS

class whylogs.app.output_formats.OutputFormat

Bases: enum.Enum

List of output formats that we support.

json

output as a JSON object. This is a deeply nested structure

csv

output as “flat” files. This will generate multiple output files

protobuf

output as a binary protobuf file. This is the most compact format

json
flat
protobuf
whylogs.app.output_formats.SUPPORTED_OUTPUT_FORMATS
whylogs.app.session

whylogs logging session

Module Contents
Classes

_LoggerKey

Create a new logger or return an existing one for a given dataset name.

Session

param project

The project name. We will default to the project name when logging

Functions

session_from_config(config: whylogs.app.config.SessionConfig = None, config_path: Optional[str] = '') → Session

Construct a whylogs session from a SessionConfig or from a config_path

reset_default_session()

Reset and deactivate the global whylogs logging session.

start_whylabs_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)

get_or_create_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)

Retrieve the current active global session.

get_session()

Retrieve the logging session without altering or activating it.

get_logger()

Retrieve the global session logger

Attributes

defaultLoggerArgs

_use_whylabs_client

_session

class whylogs.app.session._LoggerKey

Create a new logger or return an existing one for a given dataset name. If no dataset_name is specified, we default to project name

Parameters
  • metadata

  • dataset_name – str Name of the dataset. Default is the project name

  • dataset_timestamp – datetime.datetime, optional The timestamp associated with the dataset. Could be the timestamp for the batch, or the timestamp for the window that you are tracking

  • tags – dict Tag the data with groupable information. For example, you might want to tag your data with the stage information (development, testing, production etc…)

  • metadata – dict Useful for debugging the data source. You can associate non-groupable information in this field, such as the hostname

  • session_timestamp – datetime.datetime, optional Override the timestamp associated with the session. Normally you shouldn’t need to override this value

  • segments – Can be either: a list of tag key value pairs for tracking data segments, or a list of tag keys for whylogs to split up the data in the backend

dataset_name :Optional[str]
dataset_timestamp :Optional[datetime.datetime]
session_timestamp :Optional[datetime.datetime]
tags :Dict[str, str]
metadata :Dict[str, str]
segments :Optional[Union[List[Dict], List[str]]]
profile_full_dataset :bool = False
with_rotation_time :str
cache_size :int = 1
constraints :whylogs.core.statistics.constraints.DatasetConstraints
whylogs.app.session.defaultLoggerArgs
class whylogs.app.session.Session(project: Optional[str] = None, pipeline: Optional[str] = None, writers: Optional[List[whylogs.app.writers.Writer]] = None, metadata_writer: Optional[whylogs.app.metadata_writer.MetadataWriter] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = None, report_progress: bool = False)
Parameters
  • project (str) – The project name. We will default to the project name when logging a dataset if the dataset name is not specified

  • pipeline (str) – Name of the pipeline associated with this session

  • writers (list) – configuration for the output writers. This is where the log data will go

  • verbose (bool) – whether to enable verbose logging. Default is False

__enter__(self)
__exit__(self, tpe, value, traceback)
__repr__(self)

Return repr(self).

get_config(self)
is_active(self)
logger(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, segments: Optional[Union[List[Dict], List[str], str]] = None, profile_full_dataset: bool = False, with_rotation_time: str = None, cache_size: int = 1, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None) → whylogs.app.logger.Logger

Create a new logger or return an existing one for a given dataset name. If no dataset_name is specified, we default to project name

Parameters
  • dataset_name – name of the dataset

  • dataset_timestamp – timestamp of the dataset. Default to now

  • session_timestamp – timestamp of the session. Inherits from the session

  • tags – metadata associated with the profile

  • metadata – same as tags. Will be deprecated

  • segments – slice of data that the profile belongs to

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset

  • with_rotation_time – rotation time in minutes or hours (“1m”, “1h”)

  • cache_size – size of the segment cache

  • constraints – whylogs constraints to monitor against

get_logger(self, dataset_name: str = None)
log_dataframe(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, segments: Optional[Union[List[Dict], List[str], str]] = None, profile_full_dataset: bool = False, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None) → Optional[whylogs.core.DatasetProfile]

Perform statistics calculations and log a pandas dataframe

Parameters
  • df – the dataframe to profile

  • dataset_name – name of the dataset

  • dataset_timestamp – the timestamp for the dataset

  • session_timestamp – the timestamp for the session. Override the default one

  • tags – the tags for the profile. Useful when merging

  • metadata – information about this current profile. Can be discarded when merging

  • segments – Can be either: - Autosegmentation source, one of [“auto”, “local”] - List of tag key value pairs for tracking data segments - List of tag keys for which we will track every value - None, no segments will be used

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset

Returns

a dataset profile if the session is active

profile_dataframe(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None) → Optional[whylogs.core.DatasetProfile]

Profile a Pandas dataframe without actually writing data to disk. This is useful when you just want to quickly capture and explore a dataset profile.

Parameters
  • df – the dataframe to profile

  • dataset_name – name of the dataset

  • dataset_timestamp – the timestamp for the dataset

  • session_timestamp – the timestamp for the session. Override the default one

  • tags – the tags for the profile. Useful when merging

  • metadata – information about this current profile. Can be discarded when merging

Returns

a dataset profile if the session is active
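A sketch of an in-memory profile, assuming an active session and an existing dataframe df (the dataset name is a placeholder):

# Capture a profile without invoking any writers
profile = session.profile_dataframe(df, dataset_name="quick-look")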

new_profile(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None) → Optional[whylogs.core.DatasetProfile]

Create an empty dataset profile with the metadata from the session.

Parameters
  • dataset_name – name of the dataset

  • dataset_timestamp – the timestamp for the dataset

  • session_timestamp – the timestamp for the session. Override the default one

  • tags – the tags for the profile. Useful when merging

  • metadata – information about this current profile. Can be discarded when merging

Returns

a dataset profile if the session is active

estimate_segments(self, df: pandas.DataFrame, name: str, target_field: str = None, max_segments: int = 30, dry_run: bool = False) → Optional[Union[List[Dict], List[str]]]

Estimates the most important features and values on which to segment data profiling using entropy-based methods.

Parameters
  • df – the dataframe of data to profile

  • name – name for discovery in the logger, automatically applied to loggers with the same dataset_name

  • target_field – target field (optional)

  • max_segments – upper threshold for total combinations of segments, default 30

  • dry_run – run the calculation but do not write results to metadata

Returns

a list of segmentation feature names

close(self)

Deactivate this session and flush all associated loggers

remove_logger(self, dataset_name: str)

Remove a logger from the session. This is called by the logger when it’s being closed.

Parameters

dataset_name – the name of the dataset, used to identify the logger

Returns

None

whylogs.app.session._use_whylabs_client = False
whylogs.app.session.session_from_config(config: whylogs.app.config.SessionConfig = None, config_path: Optional[str] = '') → Session

Construct a whylogs session from a SessionConfig or from a config_path
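A sketch tying config loading and session construction together, assuming a valid config file is present:

from whylogs.app.config import load_config
from whylogs.app.session import session_from_config

config = load_config()  # may be None if no config file is found
session = session_from_config(config)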

whylogs.app.session._session
whylogs.app.session.reset_default_session()

Reset and deactivate the global whylogs logging session.

whylogs.app.session.start_whylabs_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)
whylogs.app.session.get_or_create_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)

Retrieve the current active global session.

If no active session exists, attempt to load config and create a new session.

If an active session exists, return the session without loading new config.

Returns

The global active session

Return type

Session

whylogs.app.session.get_session()

Retrieve the logging session without altering or activating it.

Returns

session – The global session

Return type

Session

whylogs.app.session.get_logger()

Retrieve the global session logger

Returns

ylog – The global session logger

Return type

whylogs.app.logger.Logger

whylogs.app.utils
Module Contents
Functions

timer_wrap(func, interval, *args, **kwargs)

_do_wrap(func)

async_wrap(func, *args, **kwargs)

param func

the coroutine to run in an asyncio loop

_wait_for_children()

Wait for the child process to complete. This is to ensure that we write out the log files before the parent process finishes.

Attributes

_NO_ASYNC

_logger

_threads

_timer_threads

whylogs.app.utils._NO_ASYNC = WHYLOGS_NO_ASYNC
whylogs.app.utils._logger
whylogs.app.utils._threads :List[threading.Thread] = []
whylogs.app.utils._timer_threads :List[threading.Thread] = []
whylogs.app.utils.timer_wrap(func, interval, *args, **kwargs)
whylogs.app.utils._do_wrap(func)
whylogs.app.utils.async_wrap(func, *args, **kwargs)
Parameters

func – the coroutine to run in an asyncio loop

Returns

a thread for the coroutine

Return type

threading.Thread

whylogs.app.utils._wait_for_children()

Wait for the child process to complete. This is to ensure that we write out the log files before the parent process finishes

whylogs.app.writers

Classes for writing whylogs output

Module Contents
Classes

Writer

Class for writing to disk

LocalWriter

whylogs Writer class that can write to disk.

S3Writer

whylogs Writer class that can write to S3.

MlFlowWriter

whylogs Writer class that can write to MLFlow.

WhyLabsWriter

whylogs Writer class that can write to WhyLabs.

Functions

writer_from_config(config: whylogs.app.config.WriterConfig)

Construct a whylogs Writer from a WriterConfig

Attributes

DEFAULT_PATH_TEMPLATE

DEFAULT_FILENAME_TEMPLATE

logger

whylogs.app.writers.DEFAULT_PATH_TEMPLATE = $name/$session_id
whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE = dataset_profile
whylogs.app.writers.logger
class whylogs.app.writers.Writer(output_path: str, formats: List[str], path_template: Optional[str] = None, filename_template: Optional[str] = None, transport_params: Optional[whylogs.app.config.TransportParameterConfig] = None)

Bases: abc.ABC

Class for writing to disk

Parameters
  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • formats (list) – All output formats. See whylogs.app.config.ALL_SUPPORTED_FORMATS

  • path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifers. Default = DEFAULT_PATH_TEMPLATE

  • filename_template (str, optional) – Templatized output filename using standardized python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifers. Default = DEFAULT_FILENAME_TEMPLATE

close(self)
abstract write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: str = None)

Abstract method to write a dataset profile to disk. Must be implemented

path_suffix(self, profile: whylogs.core.DatasetProfile)

Generate a path string for an output path from a dataset profile by applying the path templating defined in self.path_template

file_name(self, profile: whylogs.core.DatasetProfile, file_extension: str, rotation_suffix: Optional[str] = None)

For a given DatasetProfile, generate an output filename based on the templating defined in self.filename_template

static template_params(profile: whylogs.core.DatasetProfile) → dict

Return a dictionary of dataset profile metadata which can be used for generating templatized variables or paths.

Parameters

profile (DatasetProfile) – The dataset profile

Returns

params – Variables which can be substituted into a template string.

Return type

dict

Notes

Template params:

  • name: name of the dataset

  • session_timestamp: session time in UTC epoch milliseconds

  • dataset_timestamp: timestamp for the data in UTC epoch ms

  • session_id: Unique identifier for the session

class whylogs.app.writers.LocalWriter(output_path: str, formats: List[str], path_template: str, filename_template: str)

Bases: Writer

whylogs Writer class that can write to disk.

See Writer for a description of arguments

write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None)

Write a dataset profile to disk

_do_write(self, profile, rotation_suffix: Optional[str] = None, **kwargs)
_write_json(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None)

Write a JSON summary of the dataset profile to disk

_write_flat(self, profile: whylogs.core.DatasetProfile, indent: int = 4, rotation_suffix: Optional[str] = None)

Write output data for flat format

Parameters
  • profile (DatasetProfile) – the dataset profile to output

  • indent (int) – The JSON indentation to use. Default is 4

_write_protobuf(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None)

Write a protobuf serialization of the DatasetProfile to disk

ensure_path(self, suffix: str, addition_part: Optional[str] = None) → str

Ensure that a path exists, creating it if not

class whylogs.app.writers.S3Writer(output_path: str, formats: List[str], path_template: str = None, filename_template: str = None)

Bases: Writer

whylogs Writer class that can write to S3.

See Writer for a description of arguments

write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: str = None)

Write a dataset profile to S3

_do_write(self, profile, rotation_suffix: str = None, **kwargs)
_write_json(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None, transport_params: Optional[dict] = None)

Write a dataset profile JSON summary to disk

_write_flat(self, profile: whylogs.core.DatasetProfile, indent: int = 4, rotation_suffix: Optional[str] = None, transport_params: Optional[dict] = None)

Write output data for flat format

Parameters
  • profile (DatasetProfile) – the dataset profile to output

  • indent (int) – The JSON indentation to use. Default is 4

_write_protobuf(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None, transport_params: Optional[dict] = None)

Write a datasetprofile protobuf serialization to S3

class whylogs.app.writers.MlFlowWriter(output_path: str, formats: List[str], path_template: str = None, filename_template: str = None)

Bases: Writer

whylogs Writer class that can write to MLFlow.

Parameters
  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • formats (list) – All output formats. See whylogs.app.config.ALL_SUPPORTED_FORMATS

  • path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifers. Default = DEFAULT_PATH_TEMPLATE

  • filename_template (str, optional) – Templatized output filename using standardized python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifers. Default = DEFAULT_FILENAME_TEMPLATE

write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: str = None)

Write a dataset profile to MLFlow path

static _write_protobuf(profile: whylogs.core.DatasetProfile, rotation_suffix: str = None, **kwargs)

Write a protobuf serialization of the dataset profile in binary format to MLFlow

class whylogs.app.writers.WhyLabsWriter(output_path='', formats=None)

Bases: Writer

whylogs Writer class that can write to WhyLabs.

Parameters
  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • formats (list) – All output formats. See whylogs.app.config.ALL_SUPPORTED_FORMATS

  • path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifers. Default = DEFAULT_PATH_TEMPLATE

  • filename_template (str, optional) – Templatized output filename using standardized python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifers. Default = DEFAULT_FILENAME_TEMPLATE

write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: str = None)

Write a dataset profile to WhyLabs

static _write_protobuf(profile: whylogs.core.DatasetProfile)

Write a protobuf profile to WhyLabs

whylogs.app.writers.writer_from_config(config: whylogs.app.config.WriterConfig)

Construct a whylogs Writer from a WriterConfig

Returns

writer – whylogs writer

Return type

Writer

Package Contents
Classes

Logger

Class for logging whylogs statistics.

Session

param project

The project name. We will default to the project name when logging

SessionConfig

Config for a whylogs session.

WriterConfig

Config for whylogs writers

Functions

load_config(path_to_config: str = None)

Load logging configuration, from disk and from the environment.

Attributes

__ALL__

whylogs.app.load_config(path_to_config: str = None)

Load logging configuration, from disk and from the environment.

Config is loaded by attempting to load files in the following order. The first valid file will be used

  1. Path set in WHYLOGS_CONFIG environment variable

  2. Current directory’s .whylogs.yaml file

  3. ~/.whylogs.yaml (home directory)

  4. /opt/whylogs/.whylogs.yaml path

Returns

config – Config for the logger, if a valid config file is found, else returns None.

Return type

SessionConfig, None

class whylogs.app.Logger(session_id: str, dataset_name: str, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Optional[Dict[str, str]] = None, metadata: Optional[Dict[str, str]] = None, writers: Optional[List[whylogs.app.writers.Writer]] = None, metadata_writer: Optional[whylogs.app.metadata_writer.MetadataWriter] = None, verbose: bool = False, with_rotation_time: Optional[str] = None, interval: int = 1, cache_size: int = 1, segments: Optional[Union[List[Segment], List[str], str]] = None, profile_full_dataset: bool = False, constraints: Optional[whylogs.core.statistics.constraints.DatasetConstraints] = None)

Class for logging whylogs statistics.

Parameters
  • session_id – The session ID value. Should be set by the Session object

  • dataset_name – The name of the dataset. Gets included in the DatasetProfile metadata and can be used in generated filenames.

  • dataset_timestamp – Optional. The timestamp that the logger represents

  • session_timestamp – Optional. The time the session was created

  • tags – Optional. Dictionary of key, value for aggregating data upstream

  • metadata – Optional. Dictionary of key, value. Useful for debugging (associated with every single dataset profile)

  • writers – Optional. List of Writer objects used to write out the data

  • metadata_writer – Optional. MetadataWriter object used to write non-profile information

  • with_rotation_time – Optional. Log rotation interval, consisting of digits with a unit specification, e.g. 30s, 2h, d. Units are seconds (“s”), minutes (“m”), hours (“h”), or days (“d”). Output filenames will have a suffix reflecting the rotation interval.

  • interval – Deprecated: Interval multiplier for with_rotation_time, defaults to 1.

  • verbose – enable debug logging

  • cache_size – dataprofiles to cache

  • segments

    Can be either:
    • Autosegmentation source, one of [“auto”, “local”]

    • List of tag key value pairs for tracking data segments

    • List of tag keys for which we will track every value

    • None, no segments will be used

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset.

  • constraints – static assertions to be applied to streams and summaries.

__enter__(self)
__exit__(self, exc_type, exc_val, exc_tb)
property profile(self) → whylogs.core.DatasetProfile
Returns

the last backing dataset profile

Return type

DatasetProfile

tracking_checks(self)
property segmented_profiles(self) → Dict[str, whylogs.core.DatasetProfile]
Returns

the last backing dataset profile

Return type

Dict[str, DatasetProfile]

get_segment(self, segment: Segment) → Optional[whylogs.core.DatasetProfile]
set_segments(self, segments: Union[List[Segment], List[str], str]) → None
_retrieve_local_segments(self) → Union[List[Segment], List[str], str]

Retrieves local segments

_intialize_profiles(self, dataset_timestamp: Optional[datetime.datetime] = datetime.datetime.now(datetime.timezone.utc)) → None
_set_rotation(self, with_rotation_time: str = None)
rotate_when(self, time)
should_rotate(self)
_rotate_time(self)

rotate with time add a suffix

flush(self, rotation_suffix: Optional[str] = None)

Synchronously perform all remaining write tasks

full_profile_check(self) → bool

returns a bool to determine if unsegmented dataset should be profiled.

close(self) → Optional[whylogs.core.DatasetProfile]

Flush and close out the logger, outputs the last profile

Returns

the result dataset profile. None if the logger is closed

log(self, features: Optional[Dict[str, any]] = None, feature_name: Optional[str] = None, value: any = None, character_list: Optional[str] = None, token_method: Optional[Callable] = None)

Logs a collection of features or a single feature (must specify one or the other).

Parameters
  • features – a map of feature names to values for the model input

  • feature_name – name of a single feature. Cannot be specified if ‘features’ is specified

  • value – value of a single feature. Cannot be specified if ‘features’ is specified
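As a sketch, the two mutually exclusive call forms on an active logger (feature names are placeholders):

# log a collection of features at once
logger.log(features={"age": 34, "income": 52000.0})

# or log a single named feature value; 'features' must then be omitted
logger.log(feature_name="age", value=34)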

log_segment_datum(self, feature_name, value, character_list: str = None, token_method: Optional[Callable] = None)
log_metrics(self, targets, predictions, scores=None, model_type: whylogs.proto.ModelType = None, target_field=None, prediction_field=None, score_field=None)
log_image(self, image, feature_transforms: Optional[List[Callable]] = None, metadata_attributes: Optional[List[str]] = METADATA_DEFAULT_ATTRIBUTES, feature_name: str = '')

API to track an image, either in PIL format or as an input path

Parameters
  • feature_name – name of the feature

  • metadata_attributes – metadata attributes to extract for the images

  • feature_transforms – a list of callables to transform the input into metrics
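A sketch of both input forms, assuming Pillow is installed and “cat.png” stands in for a local image file:

from PIL import Image

# pass a PIL image directly...
logger.log_image(Image.open("cat.png"), feature_name="cat_image")

# ...or pass an input path and let whylogs load the image
logger.log_image("cat.png", feature_name="cat_image")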

log_local_dataset(self, root_dir, folder_feature_name='folder_feature', image_feature_transforms=None, show_progress=False)

Log a local folder dataset. It will log data from the files, along with structured file data such as metadata and magic numbers. If the folder has a single layer of child folders, their names will be picked up as a segmented feature.

Parameters
  • show_progress – showing the progress bar

  • image_feature_transforms – image transform that you would like to use with the image log

  • root_dir (str) – directory where dataset is located.

  • folder_feature_name (str, optional) – Name for the subfolder features, i.e. class, store etc.

log_annotation(self, annotation_data)

Log structured annotation data, i.e. JSON-like structures

Parameters

annotation_data (Dict or List) – the annotation data to log

log_csv(self, filepath_or_buffer: Union[str, pathlib.Path, IO[AnyStr]], segments: Optional[Union[List[Segment], List[str]]] = None, profile_full_dataset: bool = False, **kwargs)

Log a CSV file. This supports the same parameters as the pandas.read_csv function.

Parameters
  • filepath_or_buffer – the path to the CSV or a CSV buffer

  • segments – define either a list of segment keys or a list of segment tags: [ {“key”: <featurename>, “value”: <featurevalue>}, … ]

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset

  • **kwargs – passed through to pandas.read_csv
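For example (the file name and segment key are placeholders):

# log a CSV by path; extra keyword arguments are passed to pandas.read_csv
logger.log_csv("data.csv", sep=",")

# segment on a tag key so every value of "region" is tracked separately
logger.log_csv("data.csv", segments=["region"])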

log_dataframe(self, df, segments: Optional[Union[List[Segment], List[str]]] = None, profile_full_dataset: bool = False)

Generate and log a whylogs DatasetProfile from a pandas dataframe.

Parameters
  • df – the Pandas dataframe to log

  • segments – specify the tag key value pairs for segments

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset
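A short sketch with a toy dataframe, segmenting on a hypothetical “region” column:

import pandas as pd

df = pd.DataFrame({"region": ["us", "eu", "us"], "value": [1.0, 2.5, 3.2]})

# track every value of "region" as its own segment, and also keep
# the full unsegmented profile of the dataframe
logger.log_dataframe(df, segments=["region"], profile_full_dataset=True)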

log_segments(self, data)
log_segments_keys(self, data)
log_fixed_segments(self, data)
log_df_segment(self, df, segment: Segment)
is_active(self)

Return the boolean state of the logger

static _prefix_segment_tags(segment_key_values)
class whylogs.app.Session(project: Optional[str] = None, pipeline: Optional[str] = None, writers: Optional[List[whylogs.app.writers.Writer]] = None, metadata_writer: Optional[whylogs.app.metadata_writer.MetadataWriter] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = None, report_progress: bool = False)
Parameters
  • project (str) – The project name. We will default to the project name when logging a dataset if the dataset name is not specified

  • pipeline (str) – Name of the pipeline associated with this session

  • writers (list) – configuration for the output writers. This is where the log data will go

  • verbose (bool) – enable verbose logging or not. Default is False

__enter__(self)
__exit__(self, tpe, value, traceback)
__repr__(self)

Return repr(self).

get_config(self)
is_active(self)
logger(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, segments: Optional[Union[List[Dict], List[str], str]] = None, profile_full_dataset: bool = False, with_rotation_time: str = None, cache_size: int = 1, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None) whylogs.app.logger.Logger

Create a new logger or return an existing one for a given dataset name. If no dataset_name is specified, we default to the project name.

Parameters
  • dataset_name – name of the dataset

  • dataset_timestamp – timestamp of the dataset. Default to now

  • session_timestamp – timestamp of the session. Inherits from the session

  • tags – metadata associated with the profile

  • metadata – same as tags. Will be deprecated

  • segments – slice of data that the profile belongs to

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset

  • with_rotation_time – rotation time in minutes or hours (“1m”, “1h”)

  • cache_size – size of the segment cache

  • constraints – whylogs constraints to monitor against
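For example, a sketch that creates a session and uses a logger as a context manager (project and dataset names are placeholders; writers=[] keeps the sketch from writing output):

from whylogs.app import Session

session = Session(project="my-project", pipeline="my-pipeline", writers=[])

# logger() creates (or returns) a Logger for the given dataset name
with session.logger(dataset_name="my-dataset") as logger:
    logger.log(features={"feature_a": 1.0})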

get_logger(self, dataset_name: str = None)
log_dataframe(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, segments: Optional[Union[List[Dict], List[str], str]] = None, profile_full_dataset: bool = False, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None) Optional[whylogs.core.DatasetProfile]

Perform statistics calculations and log a pandas dataframe

Parameters
  • df – the dataframe to profile

  • dataset_name – name of the dataset

  • dataset_timestamp – the timestamp for the dataset

  • session_timestamp – the timestamp for the session. Override the default one

  • tags – the tags for the profile. Useful when merging

  • metadata – information about this current profile. Can be discarded when merging

  • segments –

    Can be either:
    • Autosegmentation source, one of [“auto”, “local”]

    • List of tag key value pairs for tracking data segments

    • List of tag keys for which we will track every value

    • None, no segments will be used

  • profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset

Returns

a dataset profile if the session is active

profile_dataframe(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None) Optional[whylogs.core.DatasetProfile]

Profile a Pandas dataframe without actually writing data to disk. This is useful when you just want to quickly capture and explore a dataset profile.

Parameters
  • df – the dataframe to profile

  • dataset_name – name of the dataset

  • dataset_timestamp – the timestamp for the dataset

  • session_timestamp – the timestamp for the session. Override the default one

  • tags – the tags for the profile. Useful when merging

  • metadata – information about this current profile. Can be discarded when merging

Returns

a dataset profile if the session is active
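A sketch of quick, in-memory profiling, reusing the session and toy dataframe from the earlier sketches:

import pandas as pd

df = pd.DataFrame({"value": [1.0, 2.5, 3.2]})

# profile without writing anything to disk; returns None if the
# session is no longer active
profile = session.profile_dataframe(df, dataset_name="exploration")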

new_profile(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None) Optional[whylogs.core.DatasetProfile]

Create an empty dataset profile with the metadata from the session.

Parameters
  • dataset_name – name of the dataset

  • dataset_timestamp – the timestamp for the dataset

  • session_timestamp – the timestamp for the session. Override the default one

  • tags – the tags for the profile. Useful when merging

  • metadata – information about this current profile. Can be discarded when merging

Returns

a dataset profile if the session is active

estimate_segments(self, df: pandas.DataFrame, name: str, target_field: str = None, max_segments: int = 30, dry_run: bool = False) Optional[Union[List[Dict], List[str]]]

Estimates the most important features and values on which to segment data profiling using entropy-based methods.

Parameters
  • df – the dataframe of data to profile

  • name – name for discovery in the logger, automatically applied to loggers with the same dataset_name

  • target_field – target field (optional)

  • max_segments – upper threshold for total combinations of segments, default 30

  • dry_run – run the calculation but do not write results to metadata

Returns

a list of segmentation feature names
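A sketch of discovering segments and feeding them back into logging, reusing the session and dataframe from the earlier sketches (the dataset name is a placeholder):

# estimate up to 10 segment candidates without persisting them
segments = session.estimate_segments(df, name="my-dataset", max_segments=10, dry_run=True)

# the result can be passed back into a logging call
session.log_dataframe(df, dataset_name="my-dataset", segments=segments)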

close(self)

Deactivate this session and flush all associated loggers

remove_logger(self, dataset_name: str)

Remove a logger from the session. This is called by the logger when it’s being closed

Parameters

dataset_name – the name of the dataset, used to identify the logger

Returns

None

class whylogs.app.SessionConfig(project: str, pipeline: str, writers: List[WriterConfig], metadata: Optional[MetadataConfig] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = 1, report_progress: bool = False)

Config for a whylogs session.

See also SessionConfigSchema

Parameters
  • project (str) – Project associated with this whylogs session

  • pipeline (str) – Name of the associated data pipeline

  • writers (list) – A list of WriterConfig objects defining writer outputs

  • metadata (MetadataConfig) – A MetadataConfiguration object. If None, the default is used.

  • verbose (bool, default=False) – Output verbosity

  • with_rotation_time (str, default=None) – Rotate profiles over time, taking an overall rotation interval with unit: “s” for seconds, “m” for minutes, “h” for hours, “d” for days

  • cache_size (int, default=1) – Sets how many dataset profiles to cache in the logger during rotation

to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream)

Load config from yaml

Parameters

stream (str, file-obj) – String or file-like object to load yaml from

Returns

config – Generated config

Return type

SessionConfig

class whylogs.app.WriterConfig(type: str, formats: Optional[List[str]] = None, output_path: Optional[str] = None, path_template: Optional[str] = None, filename_template: Optional[str] = None, data_collection_consent: Optional[bool] = None, transport_parameters: Optional[TransportParameterConfig] = None)

Config for whylogs writers

See also WriterConfigSchema

Parameters
  • type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’

  • formats (list) – All output formats. See ALL_SUPPORTED_FORMATS

  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_PATH_TEMPLATE

  • filename_template (str, optional) – Templatized output filename using standard python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE

to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream, **kwargs)

Load config from yaml

Parameters
  • stream (str, file-obj) – String or file-like object to load yaml from

  • kwargs – ignored

Returns

config – Generated config

Return type

WriterConfig
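As a sketch, building a config in code and round-tripping it through YAML (names and paths are placeholders):

from whylogs.app import SessionConfig, WriterConfig

writer = WriterConfig(type="local", formats=["json"], output_path="output")
config = SessionConfig(project="my-project", pipeline="my-pipeline", writers=[writer])

# serialize to a YAML string, then load it back
yaml_text = config.to_yaml()
restored = SessionConfig.from_yaml(yaml_text)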

whylogs.app.__ALL__
whylogs.cli
Submodules
whylogs.cli.cli
Module Contents
Functions

_set_up_logger()

cli(verbose)

Welcome to whylogs CLI!

main()

whylogs.cli.cli._set_up_logger()
whylogs.cli.cli.cli(verbose)

Welcome to whylogs CLI!

Supported basic commands:

  • whylogs init : create a new whylogs project configuration

whylogs.cli.cli.main()
whylogs.cli.cli_text
Module Contents
whylogs.cli.cli_text.INTRO_MESSAGE = Multiline-String
Show Value
 1██╗    ██╗██╗  ██╗██╗   ██╗██╗      ██████╗  ██████╗ ███████╗
 2██║    ██║██║  ██║╚██╗ ██╔╝██║     ██╔═══██╗██╔════╝ ██╔════╝
 3██║ █╗ ██║███████║ ╚████╔╝ ██║     ██║   ██║██║  ███╗███████╗
 4██║███╗██║██╔══██║  ╚██╔╝  ██║     ██║   ██║██║   ██║╚════██║
 5╚███╔███╔╝██║  ██║   ██║   ███████╗╚██████╔╝╚██████╔╝███████║
 6 ╚══╝╚══╝ ╚═╝  ╚═╝   ╚═╝   ╚══════╝ ╚═════╝  ╚═════╝ ╚══════╝
 7                       / \__
 8                      (    @\___
 9                      /         O
10                     /   (_____/
11                    /_____/   U
12
13Welcome to whylogs!
14
15Join us our community slack at  http://join.slack.whylabs.ai/
16
17This CLI will guide you through initializing a basic whylogs configurations.
whylogs.cli.cli_text.DOING_NOTHING_ABORTING = Doing nothing. Aborting
whylogs.cli.cli_text.OVERRIDE_CONFIRM = Would you like to proceed with the above path?
whylogs.cli.cli_text.EMPTY_PATH_WARNING = WARNING: we will override the content in the non-empty path
whylogs.cli.cli_text.BEGIN_WORKFLOW = Multiline-String
Show Value
1Great. We will now generate the default configuration for whylogs'
2We'll need a few details from you before we can proceed
whylogs.cli.cli_text.PIPELINE_DESCRIPTION = "Pipeline" is a series of one or multiple datasets to build a single model or application. A...
whylogs.cli.cli_text.PROJECT_NAME_PROMPT = Project name (alphanumeric, dash, and underscore characters only)
whylogs.cli.cli_text.PROJECT_DESCRIPTION = "Project" is a collection of related datasets that are used for multiple models or applications.
whylogs.cli.cli_text.DATETIME_EXPLANATION = Multiline-String
Show Value
1whylogs can break down the data by time for you
2This will enable users to run time-based analysis
whylogs.cli.cli_text.DATETIME_COLUMN_PROMPT = What is the name of the datetime feature (leave blank to skip)?
whylogs.cli.cli_text.SKIP_DATETIME = Skip grouping by datetime
whylogs.cli.cli_text.DATETIME_FORMAT_PROMPT = What is the format of the column? Leave blank to use datetimeutil to parse
whylogs.cli.cli_text.INITIAL_PROFILING_CONFIRM = Would you like to run an initial profiling job?
whylogs.cli.cli_text.DATA_SOURCE_MESSAGE = Select data source:
whylogs.cli.cli_text.PROFILE_OVERRIDE_CONFIRM = Profile path already exists. This will override existing data
whylogs.cli.cli_text.DATA_WILL_BE_OVERRIDDEN = Previous profile data will be overridden
whylogs.cli.cli_text.OBSERVATORY_EXPLANATION = Multiline-String
Show Value
1WhyLabs Platform can visualize your statistics. This will require the CLI to upload
2your statistics to WhyLabs endpoint. Your original data (CSV file) will remain locally.
whylogs.cli.cli_text.RUN_PROFILING = Run whylogs profiling...
whylogs.cli.cli_text.GENERATE_NOTEBOOKS = Generate Jupyter notebooks
whylogs.cli.cli_text.DONE = Done
whylogs.cli.demo_cli
Module Contents
Classes

NameParamType

Functions

_set_up_logger()

init(project_dir)

Initialize and configure a new whylogs project.

profile_csv(session_config: whylogs.app.SessionConfig, project_dir: str) → str

cli(verbose)

Welcome to whylogs Demo CLI!

main()

Attributes

_LENDING_CLUB_CSV

_EXAMPLE_REPO

NAME_FORMAT

whylogs.cli.demo_cli._LENDING_CLUB_CSV = lending_club_1000.csv
whylogs.cli.demo_cli._EXAMPLE_REPO = https://github.com/whylabs/whylogs-examples.git
whylogs.cli.demo_cli._set_up_logger()
whylogs.cli.demo_cli.NAME_FORMAT
class whylogs.cli.demo_cli.NameParamType

Bases: click.ParamType

convert(self, value, param, ctx)
whylogs.cli.demo_cli.init(project_dir)

Initialize and configure a new whylogs project.

This guided input walks the user through setting up a new project and also on-boards a new developer in an existing project.

It scaffolds directories, sets up notebooks, creates a project file, and appends to a .gitignore file.

whylogs.cli.demo_cli.profile_csv(session_config: whylogs.app.SessionConfig, project_dir: str) str
whylogs.cli.demo_cli.cli(verbose)

Welcome to whylogs Demo CLI!

Supported commands:

  • whylogs-demo init : create a demo whylogs project with example data and notebooks

whylogs.cli.demo_cli.main()
whylogs.cli.init
Module Contents
Classes

NameParamType

Functions

init(project_dir)

Initialize and configure a new whylogs project.

Attributes

LENDING_CLUB_CSV

NAME_FORMAT

whylogs.cli.init.LENDING_CLUB_CSV = lending_club_1000.csv
whylogs.cli.init.NAME_FORMAT
class whylogs.cli.init.NameParamType

Bases: click.ParamType

convert(self, value, param, ctx)
whylogs.cli.init.init(project_dir)

Initialize and configure a new whylogs project.

This guided input walks the user through setting up a new project and also onboards a new developer in an existing project.

It scaffolds directories, sets up notebooks, creates a project file, and appends to a .gitignore file.

whylogs.cli.utils
Module Contents
Functions

echo(message: Union[str, list], **styles)

whylogs.cli.utils.echo(message: Union[str, list], **styles)
Package Contents
Functions

cli(verbose)

Welcome to whylogs CLI!

main()

demo_main()

Attributes

__ALL__

whylogs.cli.cli(verbose)

Welcome to whylogs CLI!

Supported basic commands:

  • whylogs init : create a new whylogs project configuration

whylogs.cli.main()
whylogs.cli.demo_main()
whylogs.cli.__ALL__
whylogs.core
Subpackages
whylogs.core.metrics
Submodules
whylogs.core.metrics.confusion_matrix
Module Contents
Classes

ConfusionMatrix

Confusion Matrix Class to hold labels and matrix data.

Functions

_merge_CM(old_conf_matrix: ConfusionMatrix, new_conf_matrix: ConfusionMatrix)

Merges two confusion matrices with distinct or overlapping labels

Attributes

SUPPORTED_TYPES

MODEL_METRICS_MAX_LABELS

MODEL_METRICS_LABEL_SIZE_WARNING_THRESHOLD

_logger

whylogs.core.metrics.confusion_matrix.SUPPORTED_TYPES = ['binary', 'multiclass']
whylogs.core.metrics.confusion_matrix.MODEL_METRICS_MAX_LABELS = 256
whylogs.core.metrics.confusion_matrix.MODEL_METRICS_LABEL_SIZE_WARNING_THRESHOLD = 64
whylogs.core.metrics.confusion_matrix._logger
class whylogs.core.metrics.confusion_matrix.ConfusionMatrix(labels: List[str] = None, prediction_field: str = None, target_field: str = None, score_field: str = None)

Confusion Matrix Class to hold labels and matrix data.

labels

list of labels in a sorted order

prediction_field

name of the prediction field

target_field

name of the target field

score_field

name of the score field

confusion_matrix

Confusion Matrix kept as matrix of NumberTrackers

Type

nd.array

labels

list of labels for the confusion_matrix axes

Type

List[str]

add(self, predictions: List[Union[str, int, bool]], targets: List[Union[str, int, bool]], scores: List[float])

Adds predictions and targets, with scores, to the confusion matrix.

Parameters
  • predictions (List[Union[str, int, bool]]) –

  • targets (List[Union[str, int, bool]]) –

  • scores (List[float]) –

Raises
  • NotImplementedError – if the targets do not fall into binary or multiclass support

  • ValueError – if validation data or predictions are missing
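For example, a minimal sketch with toy labels and scores:

from whylogs.core.metrics.confusion_matrix import ConfusionMatrix

cm = ConfusionMatrix(labels=["cat", "dog"])
cm.add(
    predictions=["cat", "dog", "dog"],
    targets=["cat", "cat", "dog"],
    scores=[0.9, 0.6, 0.8],
)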

merge(self, other_cm)

Merge two separate confusion matrices, whose labels may or may not overlap.

Parameters

other_cm (Optional[ConfusionMatrix]) – confusion_matrix to merge with self

Returns

merged confusion_matrix

Return type

ConfusionMatrix

to_protobuf(self)

Convert to protobuf

Returns

the protobuf message

Return type

ScoreMatrixMessage

classmethod from_protobuf(cls, message: whylogs.proto.ScoreMatrixMessage)
whylogs.core.metrics.confusion_matrix._merge_CM(old_conf_matrix: ConfusionMatrix, new_conf_matrix: ConfusionMatrix)

Merges two confusion matrices with distinct or overlapping labels.

Parameters
  • old_conf_matrix (ConfusionMatrix) – the existing confusion matrix

  • new_conf_matrix (ConfusionMatrix) – the confusion matrix to merge in
whylogs.core.metrics.model_metrics
Module Contents
Classes

ModelMetrics

Container class for various model-related metrics

class whylogs.core.metrics.model_metrics.ModelMetrics(confusion_matrix: whylogs.core.metrics.confusion_matrix.ConfusionMatrix = None, regression_metrics: whylogs.core.metrics.regression_metrics.RegressionMetrics = None, nlp_metrics: whylogs.core.metrics.nlp_metrics.NLPMetrics = None, model_type: whylogs.proto.ModelType = ModelType.UNKNOWN)

Container class for various model-related metrics

confusion_matrix

ConfusionMatrix which keeps track of counts with NumberTracker

Type

ConfusionMatrix

regression_metrics

RegressionMetrics keeps track of common regression metrics in case the targets are continuous.

Type

RegressionMetrics

to_protobuf(self) whylogs.proto.ModelMetricsMessage
classmethod from_protobuf(cls, message)
init_or_get_model_type(self, scores) whylogs.proto.ModelType
compute_confusion_matrix(self, predictions: List[Union[str, int, bool, float]], targets: List[Union[str, int, bool, float]], scores: List[float] = None, target_field: str = None, prediction_field: str = None, score_field: str = None)

Computes the confusion matrix; if one is already present, merges the new one into it.

Parameters
  • predictions (List[Union[str, int, bool]]) –

  • targets (List[Union[str, int, bool]]) –

  • scores (List[float], optional) –

  • target_field (str, optional) –

  • prediction_field (str, optional) –

  • score_field (str, optional) –
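A sketch of accumulating and merging classification metrics (toy values):

from whylogs.core.metrics.model_metrics import ModelMetrics

mm = ModelMetrics()
mm.compute_confusion_matrix(
    predictions=["cat", "dog", "dog"],
    targets=["cat", "cat", "dog"],
    scores=[0.9, 0.6, 0.8],
)

# metrics from another batch can be merged in
other = ModelMetrics()
other.compute_confusion_matrix(predictions=["dog"], targets=["dog"], scores=[0.7])
merged = mm.merge(other)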

compute_regression_metrics(self, predictions: List[Union[float, int]], targets: List[Union[float, int]], target_field: str = None, prediction_field: str = None)
merge(self, other)
whylogs.core.metrics.nlp_metrics
Module Contents
Classes

NLPMetrics

Attributes

logger

whylogs.core.metrics.nlp_metrics.logger
class whylogs.core.metrics.nlp_metrics.NLPMetrics(prediction_field: str = None, target_field: str = None)
update(self, predictions: Union[List[str], str], targets: Union[List[str]], transform=None) None

Adds predictions and targets for computation of NLP metrics.

Parameters
  • predictions (Union[str,List[str]]) –

  • targets (Union[List[str],str]) –

merge(self, other: NLPMetrics) NLPMetrics

Merge two separate NLP metrics

Parameters

other – nlp metrics to merge with self

Returns

merged nlp metrics

Return type

NLPMetrics

to_protobuf(self) whylogs.proto.NLPMetricsMessage

Convert to protobuf

Returns

Protobuf Message

Return type

NLPMetricsMessage

classmethod from_protobuf(cls: NLPMetrics, message: whylogs.proto.NLPMetricsMessage)
whylogs.core.metrics.regression_metrics
Module Contents
Classes

RegressionMetrics

Attributes

SUPPORTED_TYPES

whylogs.core.metrics.regression_metrics.SUPPORTED_TYPES = regression
class whylogs.core.metrics.regression_metrics.RegressionMetrics(prediction_field: str = None, target_field: str = None)
add(self, predictions: List[float], targets: List[float])

Adds predictions and targets for computation of regression metrics.

Parameters
  • predictions (List[float]) –

  • targets (List[float]) –

mean_absolute_error(self)
mean_squared_error(self)
root_mean_squared_error(self)
merge(self, other)

Merge two separate regression metrics objects.

Parameters

other – regression metrics to merge with self

Returns

merged regression metrics

Return type

RegressionMetrics

to_protobuf(self)

Convert to protobuf

Returns

Protobuf Message

Return type

RegressionMetricsMessage

classmethod from_protobuf(cls, message: whylogs.proto.RegressionMetricsMessage)
whylogs.core.statistics

Define classes for tracking statistics

Subpackages
whylogs.core.statistics.datatypes

Define classes for tracking statistics for various data types

Submodules
whylogs.core.statistics.datatypes.floattracker
Module Contents
Classes

FloatTracker

Track statistics for floating point numbers

class whylogs.core.statistics.datatypes.floattracker.FloatTracker(min: float = None, max: float = None, sum: float = None, count: int = None)

Track statistics for floating point numbers

Parameters
  • min (float) – Current min value

  • max (float) – Current max value

  • sum (float) – Sum of the numbers

  • count (int) – Total count of numbers

update(self, value: float)

Add a number to the tracking statistics

add_integers(self, tracker)

Copy data from an IntTracker into this object, overwriting the current values.

Parameters

tracker (IntTracker) –

mean(self)

Calculate the current mean

merge(self, other)

Merge this tracker with another.

Parameters

other (FloatTracker) – The other float tracker

Returns

merged – A new float tracker

Return type

FloatTracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

DoublesMessage

static from_protobuf(message)

Load from a protobuf message

Returns

number_tracker

Return type

FloatTracker
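For example, a small sketch of tracking and merging:

from whylogs.core.statistics.datatypes import FloatTracker

a = FloatTracker()
for v in [1.5, 2.0, 3.5]:
    a.update(v)

b = FloatTracker()
b.update(10.0)

merged = a.merge(b)   # a new FloatTracker covering both streams
print(merged.mean())  # (1.5 + 2.0 + 3.5 + 10.0) / 4 = 4.25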

whylogs.core.statistics.datatypes.integertracker
Module Contents
Classes

IntTracker

Track statistics for integers

class whylogs.core.statistics.datatypes.integertracker.IntTracker(min: int = None, max: int = None, sum: int = None, count: int = None)

Track statistics for integers

Parameters
  • min – Current min value

  • max – Current max value

  • sum – Sum of the numbers

  • count – Total count of numbers

DEFAULTS
set_defaults(self)

Set attribute values to defaults

mean(self)

Calculate the current mean. Returns None if self.count = 0

update(self, value)

Add a number to the tracking statistics

merge(self, other)

Merge values of another IntTracker with this one.

Parameters

other (IntTracker) – Other tracker

Returns

new – New, merged tracker

Return type

IntTracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

LongsMessage

static from_protobuf(message)

Load from a protobuf message

Returns

number_tracker

Return type

IntTracker

whylogs.core.statistics.datatypes.variancetracker
Module Contents
Classes

VarianceTracker

Class that implements variance estimates for streaming data and for

class whylogs.core.statistics.datatypes.variancetracker.VarianceTracker(count=0, sum=0.0, mean=0.0)

Class that implements variance estimates for streaming data and for batched data.

Parameters
  • count – Number tracked elements

  • sum – Sum of all numbers

  • mean – Current estimate of the mean

update(self, new_value)

Add a number to tracking estimates

Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

Parameters

new_value (int, float) –

stddev(self)

Return an estimate of the sample standard deviation

variance(self)

Return an estimate of the sample variance

merge(self, other: VarianceTracker)

Merge statistics from another VarianceTracker into this one

See: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

Parameters

other (VarianceTracker) – Other variance tracker

Returns

merged – A new variance tracker from the merged statistics

Return type

VarianceTracker

copy(self)

Return a copy of this tracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

VarianceMessage

static from_protobuf(message)

Load from a protobuf message

Returns

variance_tracker

Return type

VarianceTracker
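A short sketch of the streaming (Welford) usage:

from whylogs.core.statistics.datatypes import VarianceTracker

vt = VarianceTracker()
for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    vt.update(v)  # Welford's online update, one value at a time

print(vt.variance())  # sample variance estimate (~4.57 for these values)
print(vt.stddev())    # sample standard deviation estimate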

Package Contents
Classes

FloatTracker

Track statistics for floating point numbers

IntTracker

Track statistics for integers

VarianceTracker

Class that implements variance estimates for streaming data and for

Attributes

__ALL__

class whylogs.core.statistics.datatypes.FloatTracker(min: float = None, max: float = None, sum: float = None, count: int = None)

Track statistics for floating point numbers

Parameters
  • min (float) – Current min value

  • max (float) – Current max value

  • sum (float) – Sum of the numbers

  • count (int) – Total count of numbers

update(self, value: float)

Add a number to the tracking statistics

add_integers(self, tracker)

Copy data from an IntTracker into this object, overwriting the current values.

Parameters

tracker (IntTracker) –

mean(self)

Calculate the current mean

merge(self, other)

Merge this tracker with another.

Parameters

other (FloatTracker) – The other float tracker

Returns

merged – A new float tracker

Return type

FloatTracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

DoublesMessage

static from_protobuf(message)

Load from a protobuf message

Returns

number_tracker

Return type

FloatTracker

class whylogs.core.statistics.datatypes.IntTracker(min: int = None, max: int = None, sum: int = None, count: int = None)

Track statistics for integers

Parameters
  • min – Current min value

  • max – Current max value

  • sum – Sum of the numbers

  • count – Total count of numbers

DEFAULTS
set_defaults(self)

Set attribute values to defaults

mean(self)

Calculate the current mean. Returns None if self.count = 0

update(self, value)

Add a number to the tracking statistics

merge(self, other)

Merge values of another IntTracker with this one.

Parameters

other (IntTracker) – Other tracker

Returns

new – New, merged tracker

Return type

IntTracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

LongsMessage

static from_protobuf(message)

Load from a protobuf message

Returns

number_tracker

Return type

IntTracker

class whylogs.core.statistics.datatypes.VarianceTracker(count=0, sum=0.0, mean=0.0)

Class that implements variance estimates for streaming data and for batched data.

Parameters
  • count – Number tracked elements

  • sum – Sum of all numbers

  • mean – Current estimate of the mean

update(self, new_value)

Add a number to tracking estimates

Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm

Parameters

new_value (int, float) –

stddev(self)

Return an estimate of the sample standard deviation

variance(self)

Return an estimate of the sample variance

merge(self, other: VarianceTracker)

Merge statistics from another VarianceTracker into this one

See: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm

Parameters

other (VarianceTracker) – Other variance tracker

Returns

merged – A new variance tracker from the merged statistics

Return type

VarianceTracker

copy(self)

Return a copy of this tracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

VarianceMessage

static from_protobuf(message)

Load from a protobuf message

Returns

variance_tracker

Return type

VarianceTracker

whylogs.core.statistics.datatypes.__ALL__
Submodules
whylogs.core.statistics.constraints
Module Contents
Classes

ValueConstraint

ValueConstraints express a binary boolean relationship between an implied numeric value and a literal.

SummaryConstraint

Summary constraints specify a relationship between a summary field and a static value,

ValueConstraints

SummaryConstraints

MultiColumnValueConstraint

ValueConstraints express a binary boolean relationship between an implied numeric value and a literal.

MultiColumnValueConstraints

DatasetConstraints

Functions

_try_parse_strftime_format(strftime_val: str, format: str) → Optional[datetime.datetime]

Return whether the string is in a strftime format.

_try_parse_dateutil(dateutil_val: str, ref_val=None) → Optional[datetime.datetime]

Return whether the string can be interpreted as a date.

_try_parse_json(json_string: str, ref_val=None) → Optional[dict]

Return whether the string can be interpreted as json.

_matches_json_schema(json_data: Union[str, dict], json_schema: Union[str, dict]) → bool

Return whether the provided json matches the provided schema.

_check_between_constraint_valid_initialization(lower_value, upper_value, lower_field, upper_field)

_set_between_constraint_default_name(field, lower_value, upper_value, lower_field, upper_field)

_format_set_values_for_display(reference_set)

stddevBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)

Defines a summary constraint on the standard deviation of a feature. The standard deviation can be defined to be

meanBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)

Defines a summary constraint on the mean (average) of a feature. The mean can be defined to be

minBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)

Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be

minGreaterThanEqualConstraint(value=None, field=None, name=None, verbose=False)

Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be

maxBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)

Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be

maxLessThanEqualConstraint(value=None, field=None, name=None, verbose=False)

Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be

distinctValuesInSetConstraint(reference_set: Set[Any], name=None, verbose=False)

Defines a summary constraint on the distinct values of a feature. All of the distinct values should

distinctValuesEqualSetConstraint(reference_set: Set[Any], name=None, verbose=False)

Defines a summary constraint on the distinct values of a feature. The set of the distinct values should

distinctValuesContainSetConstraint(reference_set: Set[Any], name=None, verbose=False)

Defines a summary constraint on the distinct values of a feature. The set of user-supplied reference values,

columnValuesInSetConstraint(value_set: Set[Any], name=None, verbose=False)

Defines a value constraint with set operations on the values of a single feature.

containsEmailConstraint(regex_pattern: str = None, name=None, verbose=False)

Defines a value constraint with email regex matching operations on the values of a single feature.

containsCreditCardConstraint(regex_pattern: str = None, name=None, verbose=False)

Defines a value constraint with credit card number regex matching operations on the values of a single feature.

dateUtilParseableConstraint(name=None, verbose=False)

Defines a value constraint which checks if the values of a single feature

jsonParseableConstraint(name=None, verbose=False)

Defines a value constraint which checks if the values of a single feature

matchesJsonSchemaConstraint(json_schema, name=None, verbose=False)

Defines a value constraint which checks if the values of a single feature

strftimeFormatConstraint(format, name=None, verbose=False)

Defines a value constraint which checks if the values of a single feature

containsSSNConstraint(regex_pattern: str = None, name=None, verbose=False)

Defines a value constraint with social security number (SSN) matching operations

containsURLConstraint(regex_pattern: str = None, name=None, verbose=False)

Defines a value constraint with URL regex matching operations on the values of a single feature.

stringLengthEqualConstraint(length: int, name=None, verbose=False)

Defines a value constraint which checks if the string values of a single feature

stringLengthBetweenConstraint(lower_value: int, upper_value: int, name=None, verbose=False)

Defines a value constraint which checks if the string values' length of a single feature

quantileBetweenConstraint(quantile_value: Union[int, float], lower_value: Union[int, float], upper_value: Union[int, float], name=None, verbose: bool = False)

Defines a summary constraint on the n-th quantile value of a numeric feature.

columnUniqueValueCountBetweenConstraint(lower_value: int, upper_value: int, name=None, verbose: bool = False)

Defines a summary constraint on the cardinality of a specific feature.

columnUniqueValueProportionBetweenConstraint(lower_fraction: float, upper_fraction: float, name=None, verbose: bool = False)

Defines a summary constraint on the proportion of unique values of a specific feature.

columnExistsConstraint(column: str, name=None, verbose=False)

Defines a constraint on the data set schema.

numberOfRowsConstraint(n_rows: int, name=None, verbose=False)

Defines a constraint on the data set schema.

columnsMatchSetConstraint(reference_set: Set[str], name=None, verbose=False)

Defines a constraint on the data set schema.

columnMostCommonValueInSetConstraint(value_set: Set[Any], name=None, verbose=False)

Defines a summary constraint on the most common value of a feature.

columnValuesNotNullConstraint(name=None, verbose=False)

Defines a non-null summary constraint on the value of a feature.

missingValuesProportionBetweenConstraint(lower_fraction: float, upper_fraction: float, name: str = None, verbose: bool = False)

Defines a summary constraint on the proportion of missing values of a specific feature.

columnValuesTypeEqualsConstraint(expected_type: Union[whylogs.proto.InferredType, int], name=None, verbose: bool = False)

Defines a summary constraint on the type of the feature values.

columnValuesTypeInSetConstraint(type_set: Set[int], name=None, verbose: bool = False)

Defines a summary constraint on the type of the feature values.

approximateEntropyBetweenConstraint(lower_value: Union[int, float], upper_value: float, name=None, verbose=False)

Defines a summary constraint specifying the expected interval of the features estimated entropy.

parametrizedKSTestPValueGreaterThanConstraint(reference_distribution: Union[List[float], numpy.ndarray], p_value=0.05, name=None, verbose=False)

Defines a summary constraint specifying the expected

columnKLDivergenceLessThanConstraint(reference_distribution: Union[List[Any], numpy.ndarray], threshold: float = 0.5, name=None, verbose: bool = False)

Defines a summary constraint specifying the expected

columnChiSquaredTestPValueGreaterThanConstraint(reference_distribution: Union[List[Any], numpy.ndarray, Mapping[str, int]], p_value: float = 0.05, name=None, verbose: bool = False)

Defines a summary constraint specifying the expected

columnValuesAGreaterThanBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A,

columnValuesAGreaterThanEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A,

columnValuesALessThanBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A,

columnValuesALessThanEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A,

columnValuesAEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A,

columnValuesANotEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A,

sumOfRowValuesOfMultipleColumnsEqualsConstraint(columns: Union[List[str], Set[str], numpy.array], value: Union[float, int, str], name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that the sum of the values in each row

columnPairValuesInSetConstraint(column_A: str, column_B: str, value_set: Set[Tuple[Any, Any]], name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that the pair of values of columns A and B,

columnValuesUniqueWithinRow(column_A: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that the values of column A

Attributes

TYPES

logger

MAX_SET_DISPLAY_MESSAGE_LENGTH

Dict indexed by constraint operator.

_value_funcs

_summary_funcs1

_summary_funcs2

_multi_column_value_funcs

whylogs.core.statistics.constraints.TYPES
whylogs.core.statistics.constraints.logger
whylogs.core.statistics.constraints._try_parse_strftime_format(strftime_val: str, format: str) Optional[datetime.datetime]

Return whether the string is in a strftime format.

Parameters
  • strftime_val (str) – string to check for date

  • format – format to check if strftime_val can be parsed

Returns

None if not parseable, otherwise the parsed datetime.datetime object

whylogs.core.statistics.constraints._try_parse_dateutil(dateutil_val: str, ref_val=None) Optional[datetime.datetime]

Return whether the string can be interpreted as a date.

Parameters
  • dateutil_val (str) – string to check for date

  • ref_val (any) – not used, interface design requirement

Returns

None if not parseable, otherwise the parsed datetime.datetime object

whylogs.core.statistics.constraints._try_parse_json(json_string: str, ref_val=None) Optional[dict]

Return whether the string can be interpreted as json.

Parameters
  • json_string (str) – string to check for json

  • ref_val (any) – not used, interface design requirement

Returns

None if not parseable, otherwise the parsed json object

whylogs.core.statistics.constraints._matches_json_schema(json_data: Union[str, dict], json_schema: Union[str, dict]) bool

Return whether the provided json matches the provided schema.

Parameters
  • json_data – json object to check

  • json_schema – schema to check if the json object matches it

Returns

True if the json data matches the schema, False otherwise

whylogs.core.statistics.constraints.MAX_SET_DISPLAY_MESSAGE_LENGTH = 20

Dict indexed by constraint operator.

These help translate from constraint schema to language-specific functions that are faster to evaluate. This is just a form of currying, and I chose to bind the boolean comparison operator first.

whylogs.core.statistics.constraints._value_funcs
whylogs.core.statistics.constraints._summary_funcs1
whylogs.core.statistics.constraints._summary_funcs2
whylogs.core.statistics.constraints._multi_column_value_funcs
class whylogs.core.statistics.constraints.ValueConstraint(op: whylogs.proto.Op, value=None, regex_pattern: str = None, apply_function=None, name: str = None, verbose=False)

ValueConstraints express a binary boolean relationship between an implied numeric value and a literal. When associated with a ColumnProfile, the relation is evaluated for every incoming value that is processed by whylogs.

Parameters
  • op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between static value and incoming stream. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.

  • value ((one-of)) – When value is provided, regex_pattern must be None. Static value to compare against incoming stream using operator specified in op.

  • regex_pattern ((one-of)) – When regex_pattern is provided, value must be None. Regex pattern to use when MATCH or NOMATCH operations are used.

  • apply_function – To be supplied only when using the APPLY_FUNC operation. If the apply_function requires an argument, supply it in the value param.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

property name(self)
update(self, v) bool
apply_func_validate(self, value) str
merge(self, other) ValueConstraint
static from_protobuf(msg: whylogs.proto.ValueConstraintMsg) ValueConstraint
to_protobuf(self) whylogs.proto.ValueConstraintMsg
report(self)
class whylogs.core.statistics.constraints.SummaryConstraint(first_field: str, op: whylogs.proto.Op, value=None, upper_value=None, quantile_value: Union[int, float] = None, second_field: str = None, third_field: str = None, reference_set: Union[List[Any], Set[Any], datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage] = None, name: str = None, verbose=False)

Summary constraints specify a relationship between a summary field and a static value, or between two summary fields, e.g.:

  • ‘min’ < 6

  • ‘std_dev’ < 2.17

  • ‘min’ > ‘avg’

Parameters
  • first_field (str) – Name of field in NumberSummary that will be compared against either a second field or a static value.

  • op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between summary values. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.

  • value ((one-of)) – Static value to be compared against summary field specified in first_field. Only one of value or second_field should be supplied.

  • upper_value ((one-of)) – Only to be supplied when using Op.BTWN. Static upper boundary value to be compared against summary field specified in first_field. Only one of upper_value or third_field should be supplied.

  • second_field ((one-of)) – Name of second field in NumberSummary to be compared against summary field specified in first_field. Only one of value or second_field should be supplied.

  • third_field ((one-of)) – Only to be supplied when op == Op.BTWN. Name of third field in NumberSummary, used as an upper boundary, to be compared against the summary field specified in first_field. Only one of upper_value or third_field should be supplied.

  • reference_set ((one-of)) – Only to be supplied when using set operations or distributional measures. For set operations, used as a reference set to be compared with the column’s distinct values. For constraints on distributional measures, such as the KS test, KL divergence, and Chi-Squared test, must be an instance of datasketches.kll_floats_sketch or ReferenceDistributionDiscreteMessage.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

property name(self)
_get_field_name(self)
_get_value_or_field(self)
_get_constraint_type(self)
_check_and_init_table_shape_constraint(self, reference_set)
_check_and_init_valid_set_constraint(self, reference_set)
_check_and_init_distributional_measure_constraint(self, reference_set)
_check_and_init_between_constraint(self)
_get_str_from_ref_set(self) str
_try_cast_set(self) Set[Any]
_get_string_and_numbers_sets(self)
_create_theta_sketch(self, ref_set: set = None)
update(self, update_summary: object) bool
merge(self, other) SummaryConstraint
_check_if_summary_constraint_message_is_valid(msg: whylogs.proto.SummaryConstraintMsg)
static from_protobuf(msg: whylogs.proto.SummaryConstraintMsg) SummaryConstraint
to_protobuf(self) whylogs.proto.SummaryConstraintMsg
report(self)
class whylogs.core.statistics.constraints.ValueConstraints(constraints: Mapping[str, ValueConstraint] = None)
static from_protobuf(msg: whylogs.proto.ValueConstraintMsgs) ValueConstraints
__getitem__(self, name: str) Optional[ValueConstraint]
to_protobuf(self) whylogs.proto.ValueConstraintMsgs
update(self, v)
update_typed(self, v)
merge(self, other) ValueConstraints
report(self) List[tuple]
class whylogs.core.statistics.constraints.SummaryConstraints(constraints: Mapping[str, SummaryConstraint] = None)
static from_protobuf(msg: whylogs.proto.SummaryConstraintMsgs) SummaryConstraints
__getitem__(self, name: str) Optional[SummaryConstraint]
to_protobuf(self) whylogs.proto.SummaryConstraintMsgs
update(self, v)
merge(self, other) SummaryConstraints
report(self) List[tuple]
class whylogs.core.statistics.constraints.MultiColumnValueConstraint(dependent_columns: Union[str, List[str], Tuple[str], numpy.ndarray], op: whylogs.proto.Op, reference_columns: Union[str, List[str], Tuple[str], numpy.ndarray] = None, internal_dependent_cols_op: whylogs.proto.Op = None, value=None, name: str = None, verbose: bool = False)

Bases: ValueConstraint

ValueConstraints express a binary boolean relationship between an implied numeric value and a literal. When associated with a ColumnProfile, the relation is evaluated for every incoming value that is processed by whylogs.

Parameters
  • op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between static value and incoming stream. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.

  • value ((one-of)) – When value is provided, regex_pattern must be None. Static value to compare against incoming stream using operator specified in op.

  • regex_pattern ((one-of)) – When regex_pattern is provided, value must be None. Regex pattern to use when MATCH or NOMATCH operations are used.

  • apply_function – To be supplied only when using the APPLY_FUNC operation. If the apply_function requires an argument, supply it in the value param.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

property name(self)
update(self, column_values_dictionary)
merge(self, other) MultiColumnValueConstraint
static from_protobuf(msg: whylogs.proto.MultiColumnValueConstraintMsg) MultiColumnValueConstraint
to_protobuf(self) whylogs.proto.MultiColumnValueConstraintMsg
class whylogs.core.statistics.constraints.MultiColumnValueConstraints(constraints: Mapping[str, MultiColumnValueConstraint] = None)

Bases: ValueConstraints

static from_protobuf(msg: whylogs.proto.ValueConstraintMsgs) MultiColumnValueConstraints
to_protobuf(self) whylogs.proto.ValueConstraintMsgs
class whylogs.core.statistics.constraints.DatasetConstraints(props: whylogs.proto.DatasetProperties, value_constraints: Mapping[str, ValueConstraints] = None, summary_constraints: Mapping[str, SummaryConstraints] = None, table_shape_constraints: Mapping[str, SummaryConstraints] = None, multi_column_value_constraints: Optional[MultiColumnValueConstraints] = None)
__getitem__(self, key)
static from_protobuf(msg: whylogs.proto.DatasetConstraintMsg) DatasetConstraints
static from_json(data: str) DatasetConstraints
to_protobuf(self) whylogs.proto.DatasetConstraintMsg
to_json(self) str
report(self)
whylogs.core.statistics.constraints._check_between_constraint_valid_initialization(lower_value, upper_value, lower_field, upper_field)
whylogs.core.statistics.constraints._set_between_constraint_default_name(field, lower_value, upper_value, lower_field, upper_field)
whylogs.core.statistics.constraints._format_set_values_for_display(reference_set)
whylogs.core.statistics.constraints.stddevBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)

Defines a summary constraint on the standard deviation of a feature. The standard deviation can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.

Parameters
  • lower_value (numeric (one-of)) – Represents the lower value limit of the interval for the standard deviation. If lower_value is supplied, then upper_value must also be supplied, and none of lower_field and upper_field should be provided.

  • upper_value (numeric (one-of)) – Represents the upper value limit of the interval for the standard deviation. If upper_value is supplied, then lower_value must also be supplied, and none of lower_field and upper_field should be provided.

  • lower_field (str (one-of)) – Represents the lower field limit of the interval for the standard deviation. The lower field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as a lower bound. If lower_field is supplied, then upper_field must also be supplied, and none of lower_value and upper_value should be provided.

  • upper_field (str (one-of)) – Represents the upper field limit of the interval for the standard deviation. The upper field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as an upper bound. If upper_field is supplied, then lower_field must also be supplied, and none of lower_value and upper_value should be provided.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

SummaryConstraint - a summary constraint defining an interval of values for the standard deviation of a feature
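As a sketch, wiring this constraint into a logging session, reusing the session and dataframe from the earlier Session sketches (the feature name “value” is a placeholder; following the pattern in the whylogs examples, per-feature constraint lists are passed to DatasetConstraints):

from whylogs.core.statistics.constraints import (
    DatasetConstraints,
    stddevBetweenConstraint,
)

# require the standard deviation of "value" to fall in [1.0, 5.0]
dc = DatasetConstraints(
    None,  # no dataset properties for this sketch
    summary_constraints={"value": [stddevBetweenConstraint(lower_value=1.0, upper_value=5.0)]},
)

session.log_dataframe(df, dataset_name="constrained", constraints=dc)
print(dc.report())  # per-constraint results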

whylogs.core.statistics.constraints.meanBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)

Defines a summary constraint on the mean (average) of a feature. The mean can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.

Parameters
  • lower_value (numeric (one-of)) – Represents the lower value limit of the interval for the mean. If lower_value is supplied, then upper_value must also be supplied, and none of lower_field and upper_field should be provided.

  • upper_value (numeric (one-of)) – Represents the upper value limit of the interval for the mean. If upper_value is supplied, then lower_value must also be supplied, and none of lower_field and upper_field should be provided.

  • lower_field (str (one-of)) – Represents the lower field limit of the interval for the mean. The lower field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as a lower bound. If lower_field is supplied, then upper_field must also be supplied, and none of lower_value and upper_value should be provided.

  • upper_field (str (one-of)) – Represents the upper field limit of the interval for the mean. The upper field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as an upper bound. If upper_field is supplied, then lower_field must also be supplied, and none of lower_value and upper_value should be provided.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

SummaryConstraint - a summary constraint defining an interval of values for the mean of a feature

whylogs.core.statistics.constraints.minBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)

Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.

Parameters
  • lower_value (numeric (one-of)) – Represents the lower value limit of the interval for the minimum. If lower_value is supplied, then upper_value must also be supplied, and none of lower_field and upper_field should be provided.

  • upper_value (numeric (one-of)) – Represents the upper value limit of the interval for the minimum. If upper_value is supplied, then lower_value must also be supplied, and none of lower_field and upper_field should be provided.

  • lower_field (str (one-of)) – Represents the lower field limit of the interval for the minimum. The lower field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as a lower bound. If lower_field is supplied, then upper_field must also be supplied, and none of lower_value and upper_value should be provided.

  • upper_field (str (one-of)) – Represents the upper field limit of the interval for the minimum. The upper field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as an upper bound. If upper_field is supplied, then lower_field must also be supplied, and none of lower_value and upper_value should be provided.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

SummaryConstraint - a summary constraint defining an interval of values for the minimum value of a feature

whylogs.core.statistics.constraints.minGreaterThanEqualConstraint(value=None, field=None, name=None, verbose=False)

Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be greater than or equal to some value, or greater than or equal to the values of another summary field of the same feature, such as the mean (average).

Parameters
  • value (numeric (one-of)) – Represents the value which should be compared to the minimum value of the specified feature, for checking the greater than or equal to constraint. Only one of value and field should be supplied.

  • field (str (one-of)) – The field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used for checking the greater than or equal to constraint. Only one of field and value should be supplied.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a constraint on the minimum value to be greater than or equal to some value / summary field

whylogs.core.statistics.constraints.maxBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)

Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.

Parameters
  • lower_value (numeric (one-of)) – Represents the lower value limit of the interval for the maximum. If lower_value is supplied, then upper_value must also be supplied, and none of lower_field and upper_field should be provided.

  • upper_value (numeric (one-of)) – Represents the upper value limit of the interval for the maximum. If upper_value is supplied, then lower_value must also be supplied, and none of lower_field and upper_field should be provided.

  • lower_field (str (one-of)) – Represents the lower field limit of the interval for the maximum. The lower field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as a lower bound. If lower_field is supplied, then upper_field must also be supplied, and none of lower_value and upper_value should be provided.

  • upper_field (str (one-of)) – Represents the upper field limit of the interval for the maximum. The upper field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as an upper bound. If upper_field is supplied, then lower_field must also be supplied, and none of lower_value and upper_value should be provided.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

SummaryConstraint - a summary constraint defining an interval of values for the maximum value of a feature

whylogs.core.statistics.constraints.maxLessThanEqualConstraint(value=None, field=None, name=None, verbose=False)

Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be less than or equal to some value, or less than or equal to the values of another summary field of the same feature, such as the mean (average).

Parameters
  • value (numeric (one-of)) – Represents the value which should be compared to the maximum value of the specified feature, for checking the less than or equal to constraint. Only one of value and field should be supplied.

  • field (str (one-of)) – The field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used for checking the less than or equal to constraint. Only one of field and value should be supplied.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a constraint on the maximum value to be less than or equal to some value / summary field

whylogs.core.statistics.constraints.distinctValuesInSetConstraint(reference_set: Set[Any], name=None, verbose=False)

Defines a summary constraint on the distinct values of a feature. All of the distinct values should belong in the user-provided set or reference values reference_set. Useful for categorical features, for checking if the set of values present in a feature is contained in the set of expected categories.

Parameters
  • reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If at least one of the distinct values of the feature is not in the user specified set reference_set, then the constraint will fail.

  • name (str) – The name of the constraint.

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature to belong in a user-supplied set of values

whylogs.core.statistics.constraints.distinctValuesEqualSetConstraint(reference_set: Set[Any], name=None, verbose=False)

Defines a summary constraint on the distinct values of a feature. The set of the distinct values should be equal to the user-provided set or reference values, reference_set. Useful for categorical features, for checking if the set of values present in a feature is the same as the set of expected categories.

Parameters
  • reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If the distinct values of the feature are not equal to the user specified set reference_set, then the constraint will fail.

  • name (str) – The name of the constraint.

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature to be equal to a user-supplied set of values

whylogs.core.statistics.constraints.distinctValuesContainSetConstraint(reference_set: Set[Any], name=None, verbose=False)

Defines a summary constraint on the distinct values of a feature. The set of user-supplied reference values, reference_set should be a subset of the set of distinct values for the current feature. Useful for categorical features, for checking if the set of values present in a feature is a superset of the set of expected categories.

Parameters
  • reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If at least one of the values of the reference set, specified in reference_set, is not contained in the set of distinct values of the feature, then the constraint will fail.

  • name (str) – The name of the constraint.

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature to be a superset of the user-supplied set of values
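
To make the set semantics of the three distinct-values constraints concrete, here is a sketch on a hypothetical categorical color feature; the three variants differ only in the direction of the subset test:

    from whylogs.core.statistics.constraints import (
        distinctValuesContainSetConstraint,
        distinctValuesEqualSetConstraint,
        distinctValuesInSetConstraint,
    )

    expected = {"red", "green", "blue"}

    # Passes when the feature's distinct values are a subset of `expected`
    in_set = distinctValuesInSetConstraint(reference_set=expected)
    # Passes when the feature's distinct values equal `expected` exactly
    eq_set = distinctValuesEqualSetConstraint(reference_set=expected)
    # Passes when the feature's distinct values are a superset of `expected`
    contains = distinctValuesContainSetConstraint(reference_set=expected)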

whylogs.core.statistics.constraints.columnValuesInSetConstraint(value_set: Set[Any], name=None, verbose=False)

Defines a value constraint with set operations on the values of a single feature. The values of the feature should all be in the set of user-supplied values, specified in value_set. Useful for categorical features, for checking if the values in a feature belong in a predefined set.

Parameters
  • value_set (Set[Any] (required)) – Represents the set of expected values for a feature. The provided values can be of any type. Each value in the feature is checked against the constraint. The total number of failures equals the number of values not in the provided set value_set.

  • name (str) – The name of the constraint.

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

ValueConstraint - a value constraint specifying a constraint on the values of a feature to be drawn from a predefined set of values.
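
Unlike the summary constraints above, this constraint is checked against every logged value; a short sketch with an illustrative status feature:

    from whylogs.core.statistics.constraints import columnValuesInSetConstraint

    # Every logged value of the feature is checked against this set;
    # each value outside the set counts as one failure.
    status_in_set = columnValuesInSetConstraint(
        value_set={"PENDING", "SHIPPED", "DELIVERED"},
        name="order_status_in_set",
    )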

whylogs.core.statistics.constraints.containsEmailConstraint(regex_pattern: str = None, name=None, verbose=False)

Defines a value constraint with email regex matching operations on the values of a single feature. The constraint defines a default email regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing email addresses.

Parameters
  • regex_pattern (str (optional)) – User-defined email regex pattern. If supplied, will override the default email regex pattern provided by whylogs.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for email regex matching of the values of a single feature

whylogs.core.statistics.constraints.containsCreditCardConstraint(regex_pattern: str = None, name=None, verbose=False)

Defines a value constraint with credit card number regex matching operations on the values of a single feature. The constraint defines a default credit card number regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing credit card numbers.

Parameters
  • regex_pattern (str (optional)) – User-defined credit card number regex pattern. If supplied, will override the default credit card number regex pattern provided by whylogs.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for credit card number regex matching of the values of a single feature
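
Both regex constraints ship with a default pattern; the sketch below shows the default as well as a user-supplied override (the override pattern is simplified and purely illustrative):

    from whylogs.core.statistics.constraints import (
        containsCreditCardConstraint,
        containsEmailConstraint,
    )

    # Use whylogs' built-in email pattern
    email_default = containsEmailConstraint(name="email_format")
    # Or override it with a (simplified, illustrative) pattern
    email_custom = containsEmailConstraint(regex_pattern=r"^[\w.+-]+@[\w-]+\.[\w.]+$")

    cc_format = containsCreditCardConstraint(name="cc_format")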

whylogs.core.statistics.constraints.dateUtilParseableConstraint(name=None, verbose=False)

Defines a value constraint which checks if the values of a single feature can be parsed by the dateutil parser. Useful for checking if the date time values of a feature are compatible with dateutil.

Parameters
  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for checking if a feature’s values are dateutil parseable

whylogs.core.statistics.constraints.jsonParseableConstraint(name=None, verbose=False)

Defines a value constraint which checks if the values of a single feature are JSON parseable. Useful for checking if the values of a feature can be serialized to JSON.

Parameters
  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for checking if a feature’s values are JSON parseable

whylogs.core.statistics.constraints.matchesJsonSchemaConstraint(json_schema, name=None, verbose=False)

Defines a value constraint which checks if the values of a single feature match a user-provided JSON schema. Useful for checking if the values of a feature can be serialized to match a predefined JSON schema.

Parameters
  • json_schema (Union[str, dict] (required)) – A string or dictionary of key-value pairs representing the expected JSON schema.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for checking if a feature’s values match a user-provided JSON schema

whylogs.core.statistics.constraints.strftimeFormatConstraint(format, name=None, verbose=False)

Defines a value constraint which checks if the values of a single feature are strftime parsable.

Parameters
  • format (str (required)) – A string representing the expected strftime format for parsing the values.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for checking if a feature’s values are strftime parseable
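
The parseability constraints above take no arguments beyond the expected format; a combined sketch (the JSON schema and strftime format are illustrative):

    from whylogs.core.statistics.constraints import (
        dateUtilParseableConstraint,
        jsonParseableConstraint,
        matchesJsonSchemaConstraint,
        strftimeFormatConstraint,
    )

    dates_ok = dateUtilParseableConstraint(name="dateutil_parseable")
    json_ok = jsonParseableConstraint(name="json_parseable")
    schema_ok = matchesJsonSchemaConstraint(
        json_schema={"type": "object", "properties": {"id": {"type": "integer"}}}
    )
    timestamps_ok = strftimeFormatConstraint(format="%Y-%m-%d %H:%M:%S")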

whylogs.core.statistics.constraints.containsSSNConstraint(regex_pattern: str = None, name=None, verbose=False)

Defines a value constraint with social security number (SSN) matching operations on the values of a single feature. The constraint defines a default SSN regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing social security numbers.

Parameters
  • regex_pattern (str (optional)) – User-defined SSN regex pattern. If supplied, will override the default SSN regex pattern provided by whylogs.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for SSN regex matching of the values of a single feature

whylogs.core.statistics.constraints.containsURLConstraint(regex_pattern: str = None, name=None, verbose=False)

Defines a value constraint with URL regex matching operations on the values of a single feature. The constraint defines a default URL regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing URL addresses.

Parameters
  • regex_pattern (str (optional)) – User-defined URL regex pattern. If supplied, will override the default URL regex pattern provided by whylogs.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for URL regex matching of the values of a single feature

whylogs.core.statistics.constraints.stringLengthEqualConstraint(length: int, name=None, verbose=False)

Defines a value constraint which checks if the string values of a single feature have a predefined length.

Parameters
  • length (int (required)) – A numeric value which represents the expected length of the string values in the specified feature.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

ValueConstraint - a value constraint for checking if a feature’s string values have a predefined length

whylogs.core.statistics.constraints.stringLengthBetweenConstraint(lower_value: int, upper_value: int, name=None, verbose=False)

Defines a value constraint which checks if the length of the string values of a single feature is in some predefined interval.

Parameters
  • lower_value (int (required)) – A numeric value which represents the expected lower bound of the length of the string values in the specified feature.

  • upper_value (int (required)) – A numeric value which represents the expected upper bound of the length of the string values in the specified feature.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

ValueConstraint - a value constraint for checking if a feature’s string values’ length is in a predefined interval
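
A sketch of both string-length constraints on hypothetical features:

    from whylogs.core.statistics.constraints import (
        stringLengthBetweenConstraint,
        stringLengthEqualConstraint,
    )

    # e.g. two-letter country codes
    country_code_len = stringLengthEqualConstraint(length=2, name="country_code_len")
    # e.g. usernames between 3 and 16 characters, inclusive
    username_len = stringLengthBetweenConstraint(lower_value=3, upper_value=16)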

whylogs.core.statistics.constraints.quantileBetweenConstraint(quantile_value: Union[int, float], lower_value: Union[int, float], upper_value: Union[int, float], name=None, verbose: bool = False)

Defines a summary constraint on the n-th quantile value of a numeric feature. The n-th quantile can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points.

Parameters
  • quantile_value (numeric (required)) – The n-th quantile for which the constraint will be executed

  • lower_value (numeric (required)) – Represents the lower value limit of the interval for the n-th quantile.

  • upper_value (numeric (required)) – Represents the upper value limit of the interval for the n-th quantile.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a closed interval of valid values for the n-th quantile value of a specific feature

whylogs.core.statistics.constraints.columnUniqueValueCountBetweenConstraint(lower_value: int, upper_value: int, name=None, verbose: bool = False)

Defines a summary constraint on the cardinality of a specific feature. The cardinality can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking the unique count of values for discrete features.

Parameters
  • lower_value (numeric (required)) – Represents the lower value limit of the interval for the feature cardinality.

  • upper_value (numeric (required)) – Represents the upper value limit of the interval for the feature cardinality.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a closed interval for the valid cardinality of a specific feature

whylogs.core.statistics.constraints.columnUniqueValueProportionBetweenConstraint(lower_fraction: float, upper_fraction: float, name=None, verbose: bool = False)

Defines a summary constraint on the proportion of unique values of a specific feature. The proportion of unique values can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking the frequency of unique values for discrete features.

Parameters
  • lower_fraction (fraction between 0 and 1 (required)) – Represents the lower fraction limit of the interval for the feature unique value proportion.

  • upper_fraction (fraction between 0 and 1 (required)) – Represents the upper fraction limit of the interval for the feature unique value proportion.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a closed interval for the valid proportion of unique values of a specific feature
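
A sketch combining the three interval-style summary constraints above on a single numeric feature (the bounds are illustrative):

    from whylogs.core.statistics.constraints import (
        columnUniqueValueCountBetweenConstraint,
        columnUniqueValueProportionBetweenConstraint,
        quantileBetweenConstraint,
    )

    # The median (0.5 quantile) is expected within [10, 20]
    median_c = quantileBetweenConstraint(
        quantile_value=0.5, lower_value=10, upper_value=20
    )
    # Between 5 and 50 distinct values
    cardinality_c = columnUniqueValueCountBetweenConstraint(lower_value=5, upper_value=50)
    # Unique values make up between 1% and 10% of all values
    proportion_c = columnUniqueValueProportionBetweenConstraint(
        lower_fraction=0.01, upper_fraction=0.1
    )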

whylogs.core.statistics.constraints.columnExistsConstraint(column: str, name=None, verbose=False)

Defines a constraint on the data set schema. Checks if the user-supplied column, identified by column, is present in the data set schema.

Parameters
  • column (str (required)) – Represents the name of the column to be checked for existence in the data set.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint which checks the existence of a column in the current data set.

whylogs.core.statistics.constraints.numberOfRowsConstraint(n_rows: int, name=None, verbose=False)

Defines a constraint on the data set schema. Checks if the number of rows in the data set equals the user-supplied number of rows.

Parameters
  • n_rows (int (required)) – Represents the user-supplied expected number of rows.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

SummaryConstraint - a summary constraint which checks the number of rows in the data set

whylogs.core.statistics.constraints.columnsMatchSetConstraint(reference_set: Set[str], name=None, verbose=False)

Defines a constraint on the data set schema. Checks if the set of columns in the data set is equal to the user-supplied set of expected columns.

Parameters
  • reference_set (Set[str] (required)) – Represents the expected columns in the current data set.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint which checks if the column set of the current data set matches the expected column set
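
The three schema-level (table shape) constraints above compose naturally; a sketch with illustrative column names:

    from whylogs.core.statistics.constraints import (
        columnExistsConstraint,
        columnsMatchSetConstraint,
        numberOfRowsConstraint,
    )

    has_label = columnExistsConstraint(column="label")
    exact_rows = numberOfRowsConstraint(n_rows=10000)
    schema_match = columnsMatchSetConstraint(reference_set={"id", "label", "score"})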

whylogs.core.statistics.constraints.columnMostCommonValueInSetConstraint(value_set: Set[Any], name=None, verbose=False)

Defines a summary constraint on the most common value of a feature. The most common value of the feature should be in the set of user-supplied values, value_set. Useful for categorical features, for checking if the most common value of a feature belongs in an expected set of common categories.

Parameters
  • value_set (Set[Any] (required)) – Represents the set of expected values for a feature. The provided values can be of any type. If the most common value of the feature is not in the values of the user-specified value_set, the constraint will fail.

  • name (str) – The name of the constraint.

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a constraint on the most common value of a feature to belong to a set of user-specified expected values

whylogs.core.statistics.constraints.columnValuesNotNullConstraint(name=None, verbose=False)

Defines a non-null summary constraint on the value of a feature. Useful for features for which there is no tolerance for missing values. The constraint will fail if there is at least one missing value in the specified feature.

Parameters
  • name (str) – The name of the constraint.

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining that no missing values are allowed for the specified feature

whylogs.core.statistics.constraints.missingValuesProportionBetweenConstraint(lower_fraction: float, upper_fraction: float, name: str = None, verbose: bool = False)

Defines a summary constraint on the proportion of missing values of a specific feature. The proportion of missing values can be defined to be between two frequency values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking features with expected amounts of missing values.

Parameters
  • lower_fraction (fraction between 0 and 1 (required)) – Represents the lower fraction limit of the interval for the feature missing value proportion.

  • upper_fraction (fraction between 0 and 1 (required)) – Represents the upper fraction limit of the interval for the feature missing value proportion.

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining a closed interval for the valid proportion of missing values of a specific feature

whylogs.core.statistics.constraints.columnValuesTypeEqualsConstraint(expected_type: Union[whylogs.proto.InferredType, int], name=None, verbose: bool = False)

Defines a summary constraint on the type of the feature values. The type of values should be equal to the user-provided expected type.

Parameters
  • expected_type (Union[InferredType, int]) – whylogs.proto.InferredType.Type - enumeration of allowed inferred data types. If supplied as an integer value, it should be one of: UNKNOWN = 0, NULL = 1, FRACTIONAL = 2, INTEGRAL = 3, BOOLEAN = 4, STRING = 5

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

SummaryConstraint - a summary constraint defining that the feature values type should be equal to a user-provided expected type

whylogs.core.statistics.constraints.columnValuesTypeInSetConstraint(type_set: Set[int], name=None, verbose: bool = False)

Defines a summary constraint on the type of the feature values. The type of values should be in the set of user-provided expected types.

Parameters
  • type_set (Set[int]) – whylogs.proto.InferredType.Type - enumeration of allowed inferred data types. If supplied as integer values, each should be one of: UNKNOWN = 0, NULL = 1, FRACTIONAL = 2, INTEGRAL = 3, BOOLEAN = 4, STRING = 5

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

SummaryConstraint - a summary constraint defining that the feature values type should be in the set of user-provided expected types
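
A sketch of both type constraints using the integer codes from the enumeration above; passing whylogs.proto.InferredType values instead should be equivalent:

    from whylogs.core.statistics.constraints import (
        columnValuesTypeEqualsConstraint,
        columnValuesTypeInSetConstraint,
    )

    # Integer codes from the InferredType enumeration above
    FRACTIONAL, INTEGRAL = 2, 3

    ints_only = columnValuesTypeEqualsConstraint(expected_type=INTEGRAL)
    numeric_only = columnValuesTypeInSetConstraint(type_set={FRACTIONAL, INTEGRAL})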

whylogs.core.statistics.constraints.approximateEntropyBetweenConstraint(lower_value: Union[int, float], upper_value: float, name=None, verbose=False)

Defines a summary constraint specifying the expected interval of the feature’s estimated entropy. The defined interval is a closed interval, which includes both of its limit points.

Parameters
  • lower_value (numeric (required)) – Represents the lower value limit of the interval for the feature’s estimated entropy.

  • upper_value (numeric (required)) – Represents the upper value limit of the interval for the feature’s estimated entropy.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint defining the interval of valid values of the feature’s estimated entropy

whylogs.core.statistics.constraints.parametrizedKSTestPValueGreaterThanConstraint(reference_distribution: Union[List[float], numpy.ndarray], p_value=0.05, name=None, verbose=False)

Defines a summary constraint specifying the expected upper limit of the p-value for rejecting the null hypothesis of the KS test. Can be used only for continuous data.

Parameters
  • reference_distribution (Array-like) – Represents the reference distribution for calculating the KS test p_value of the column. Should be an array-like object of floating point numbers; only numeric distributions are accepted.

  • p_value (float) – Represents the reference p_value to compare with the p_value of the test. Should be between 0 and 1, inclusive.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint specifying the upper limit of the KS test p-value for rejecting the null hypothesis

whylogs.core.statistics.constraints.columnKLDivergenceLessThanConstraint(reference_distribution: Union[List[Any], numpy.ndarray], threshold: float = 0.5, name=None, verbose: bool = False)

Defines a summary constraint specifying an upper threshold for the KL divergence of the specified feature.

Parameters
  • reference_distribution (Array-like) – Represents the reference distribution for calculating the KL divergence of the column. Should be an array-like object of floating point numbers, or of integers, strings, and booleans, but not a mixture of numeric and categorical values; both numeric and categorical distributions are accepted.

  • threshold (float) – Represents the threshold value; if the computed KL divergence exceeds it, the constraint fails.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint specifying the upper threshold of the feature’s KL divergence

whylogs.core.statistics.constraints.columnChiSquaredTestPValueGreaterThanConstraint(reference_distribution: Union[List[Any], numpy.ndarray, Mapping[str, int]], p_value: float = 0.05, name=None, verbose: bool = False)

Defines a summary constraint specifying the expected upper limit of the p-value for rejecting the null hypothesis of the Chi-Squared test. Can be used only for discrete data.

Parameters
  • reference_distribution (Array-like) – Represents the reference distribution for calculating the Chi-Squared test. Should be an array-like object of integer, string, or boolean values, or a mapping of key: value pairs where the keys are the items and the values are the per-item counts; only categorical distributions are accepted.

  • p_value (float) – Represents the reference p_value to compare with the p_value of the test. Should be between 0 and 1, inclusive.

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

SummaryConstraint - a summary constraint specifying the upper limit of the Chi-Squared test p-value for rejecting the null hypothesis
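
A sketch of the three distribution-oriented constraints; the reference distributions are illustrative (a numeric sample for KS and KL, per-item counts for Chi-Squared):

    import numpy as np

    from whylogs.core.statistics.constraints import (
        columnChiSquaredTestPValueGreaterThanConstraint,
        columnKLDivergenceLessThanConstraint,
        parametrizedKSTestPValueGreaterThanConstraint,
    )

    numeric_reference = np.random.normal(0.0, 1.0, 1000)

    ks_c = parametrizedKSTestPValueGreaterThanConstraint(
        reference_distribution=numeric_reference, p_value=0.05
    )
    kl_c = columnKLDivergenceLessThanConstraint(
        reference_distribution=numeric_reference, threshold=0.5
    )
    chi2_c = columnChiSquaredTestPValueGreaterThanConstraint(
        reference_distribution={"red": 50, "green": 30, "blue": 20}, p_value=0.05
    )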

whylogs.core.statistics.constraints.columnValuesAGreaterThanBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is greater than the corresponding value of column B, specified in column_B in the same row.

Parameters
  • column_A (str) – The name of column A

  • column_B (str) – The name of column B

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

MultiColumnValueConstraint - multi-column value constraint specifying that values from column A should always be greater than the corresponding values of column B

whylogs.core.statistics.constraints.columnValuesAGreaterThanEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is greater than or equal to the corresponding value of column B, specified in column_B in the same row.

Parameters
  • column_A (str) – The name of column A

  • column_B (str) – The name of column B

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

MultiColumnValueConstraint - multi-column value constraint specifying that values from column A should always be greater than or equal to the corresponding values of column B

whylogs.core.statistics.constraints.columnValuesALessThanBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is less than the corresponding value of column B, specified in column_B in the same row.

Parameters
  • column_A (str) – The name of column A

  • column_B (str) – The name of column B

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

MultiColumnValueConstraint - multi-column value constraint specifying that values from column A should always be less than the corresponding values of column B

whylogs.core.statistics.constraints.columnValuesALessThanEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is less than or equal to the corresponding value of column B, specified in column_B in the same row.

Parameters
  • column_A (str) – The name of column A

  • column_B (str) – The name of column B

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

MultiColumnValueConstraint - multi-column value constraint specifying that values from column A should always be less than or equal to the corresponding values of column B

whylogs.core.statistics.constraints.columnValuesAEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is equal to the corresponding value of column B, specified in column_B in the same row.

Parameters
  • column_A (str) – The name of column A

  • column_B (str) – The name of column B

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

MultiColumnValueConstraint - multi-column value constraint specifying that values from column A should always be equal to the corresponding values of column B

whylogs.core.statistics.constraints.columnValuesANotEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is different from the corresponding value of column B, specified in column_B in the same row.

Parameters
  • column_A (str) – The name of column A

  • column_B (str) – The name of column B

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Returns

MultiColumnValueConstraint - multi-column value constraint specifying that values from column A should always be different from the corresponding values of column B

whylogs.core.statistics.constraints.sumOfRowValuesOfMultipleColumnsEqualsConstraint(columns: Union[List[str], Set[str], numpy.array], value: Union[float, int, str], name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that the sum of the values in each row of the provided columns, specified in columns, should be equal to the user-predefined value, specified in value, or to the corresponding value of another column, which will be specified with a name in the value parameter.

Parameters
  • columns (List[str]) – List of columns for which the sum of row values should equal the provided value

  • value (Union[float, int, str]) – Numeric value to compare with the sum of the column row values, or a string indicating a column name for which the row value will be compared with the sum

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

MultiColumnValueConstraint - specifying the expected value of the sum of the values in multiple columns

whylogs.core.statistics.constraints.columnPairValuesInSetConstraint(column_A: str, column_B: str, value_set: Set[Tuple[Any, Any]], name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that the pair of values of columns A and B should be in a user-predefined set of expected pairs of values.

Parameters
  • column_A (str) – The name of the first column

  • column_B (str) – The name of the second column

  • value_set (Set[Tuple[Any, Any]]) – A set of expected pairs of values for the columns A and B, in that order

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

MultiColumnValueConstraint - specifying the expected set of value pairs of two columns in the data set

whylogs.core.statistics.constraints.columnValuesUniqueWithinRow(column_A: str, name=None, verbose: bool = False)

Defines a multi-column value constraint which specifies that the values of column A should be unique within each row of the data set.

Parameters
  • column_A (str) – The name of the column for which it is expected that the values are unique within each row

  • name (str) – Name of the constraint used for reporting

  • verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.

Return type

MultiColumnValueConstraint - specifying that the provided column’s values are unique within each row
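
A sketch of several multi-column constraints on hypothetical order data; all comparisons are row-wise:

    from whylogs.core.statistics.constraints import (
        columnPairValuesInSetConstraint,
        columnValuesAGreaterThanEqualBConstraint,
        sumOfRowValuesOfMultipleColumnsEqualsConstraint,
    )

    # total >= subtotal in every row
    total_ge_subtotal = columnValuesAGreaterThanEqualBConstraint(
        column_A="total", column_B="subtotal"
    )
    # subtotal + tax + shipping must equal the "total" column in every row
    sum_equals_total = sumOfRowValuesOfMultipleColumnsEqualsConstraint(
        columns=["subtotal", "tax", "shipping"], value="total"
    )
    # only these (country, currency) pairs are valid
    valid_pairs = columnPairValuesInSetConstraint(
        column_A="country", column_B="currency",
        value_set={("US", "USD"), ("DE", "EUR")},
    )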

whylogs.core.statistics.counterstracker
Module Contents
Classes

CountersTracker

Class to keep track of the counts of various data types

class whylogs.core.statistics.counterstracker.CountersTracker(count=0, true_count=0)

Class to keep track of the counts of various data types

Parameters
  • count (int, optional) – Current number of objects

  • true_count (int, optional) – Number of boolean values

  • null_count (int, optional) – Number of nulls encountered

increment_count(self)

Add one to the count of total objects

increment_bool(self)

Add one to the boolean count

increment_null(self)

Add one to the null count

merge(self, other)

Merge another counter tracker with this one

Returns

new_tracker – The merged tracker

Return type

CountersTracker

to_protobuf(self, null_count=0)

Return the object serialized as a protobuf message

static from_protobuf(message: whylogs.proto.Counters)

Load from a protobuf message

Returns

counters

Return type

CountersTracker
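
A minimal sketch exercising the documented methods; the no-argument constructor is assumed to start all counts at zero:

    from whylogs.core.statistics import CountersTracker

    counters = CountersTracker()
    counters.increment_count()  # one more object seen
    counters.increment_bool()   # it was a boolean value
    counters.increment_null()   # a null was encountered

    merged = counters.merge(CountersTracker(count=5))
    message = merged.to_protobuf()
    restored = CountersTracker.from_protobuf(message)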

whylogs.core.statistics.hllsketch
Module Contents
Classes

HllSketch

Attributes

DEFAULT_LG_K

whylogs.core.statistics.hllsketch.DEFAULT_LG_K = 12
class whylogs.core.statistics.hllsketch.HllSketch(lg_k=None, sketch=None)
update(self, value)
merge(self, other)
get_estimate(self)
get_lower_bound(self, num_std_devs: int = 1)
get_upper_bound(self, num_std_devs: int = 1)
to_protobuf(self)
_serialize_item(self, x)
is_empty(self)
static from_protobuf(message: whylogs.proto.HllSketchMessage)
to_summary(self, num_std_devs=1)
whylogs.core.statistics.numbertracker
Module Contents
Classes

NumberTracker

Class to track statistics for numeric data.

Attributes

DEFAULT_HIST_K

logger

whylogs.core.statistics.numbertracker.DEFAULT_HIST_K = 256
whylogs.core.statistics.numbertracker.logger
class whylogs.core.statistics.numbertracker.NumberTracker(variance: whylogs.core.statistics.datatypes.VarianceTracker = None, floats: whylogs.core.statistics.datatypes.FloatTracker = None, ints: whylogs.core.statistics.datatypes.IntTracker = None, theta_sketch: whylogs.core.statistics.thetasketch.ThetaSketch = None, histogram: datasketches.kll_floats_sketch = None)

Class to track statistics for numeric data.

Parameters
  • variance – Tracker to follow the variance

  • floats – Float tracker for tracking all floats

  • ints – Integer tracker

variance

See above

floats

See above

ints

See above

theta_sketch

Sketch which tracks approximate cardinality

Type

whylogs.core.statistics.thetasketch.ThetaSketch

property count(self)
track(self, number)

Add a number to statistics tracking

Parameters

number (int, float) – A numeric value

merge(self, other)
to_protobuf(self)

Return the object serialized as a protobuf message

static from_protobuf(message: whylogs.proto.NumbersMessage)

Load from a protobuf message

Returns

number_tracker

Return type

NumberTracker

to_summary(self)

Construct a NumberSummary message

Returns

summary – Summary of the tracker statistics

Return type

NumberSummary
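
A minimal sketch using only the methods documented above; the no-argument constructor is assumed to initialize default trackers and sketches:

    from whylogs.core.statistics import NumberTracker

    tracker = NumberTracker()
    for x in [1, 2.5, 3, 4.75]:
        tracker.track(x)

    summary = tracker.to_summary()   # NumberSummary protobuf message
    message = tracker.to_protobuf()  # serialized tracker state
    restored = NumberTracker.from_protobuf(message)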

whylogs.core.statistics.schematracker
Module Contents
Classes

SchemaTracker

Track information about a column's schema and present datatypes

Attributes

Type

whylogs.core.statistics.schematracker.Type
class whylogs.core.statistics.schematracker.SchemaTracker(type_counts: dict = None, legacy_null_count=0)

Track information about a column’s schema and present datatypes

type_counts (dict)

If specified, a dictionary containing information about the counts of all data types.

UNKNOWN_TYPE
NULL_TYPE
CANDIDATE_MIN_FRAC = 0.7
_non_null_type_counts(self)
track(self, item_type)

Track an item type

get_count(self, item_type)

Return the count of a given item type

infer_type(self)

Generate a guess at what type the tracked values are.

Returns

type_guess – The type guess. See InferredType.Type for candidates

Return type

object

merge(self, other)

Merge another schema tracker with this and return a new one. Does not alter this object.

Parameters

other (SchemaTracker) –

Returns

merged – Merged tracker

Return type

SchemaTracker

copy(self)

Return a copy of this tracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

SchemaMessage

static from_protobuf(message, legacy_null_count=0)

Load from a protobuf message

Returns

schema_tracker

Return type

SchemaTracker

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

SchemaSummary
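
A minimal sketch of tracking inferred types and asking for a type guess; accessing the enum as InferredType.Type is an assumption about the generated protobuf bindings (the raw integer codes work as well):

    from whylogs.core.statistics import SchemaTracker
    from whylogs.proto import InferredType

    schema = SchemaTracker()
    for _ in range(8):
        schema.track(InferredType.Type.INTEGRAL)
    schema.track(InferredType.Type.NULL)

    print(schema.get_count(InferredType.Type.INTEGRAL))  # 8
    print(schema.infer_type())  # best guess at the column's type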

whylogs.core.statistics.stringtracker
Module Contents
Classes

CharPosTracker

Track statistics for character positions within a string

StringTracker

Track statistics for strings

Attributes

MAX_ITEMS_SIZE

MAX_SUMMARY_ITEMS

logger

whylogs.core.statistics.stringtracker.MAX_ITEMS_SIZE = 128
whylogs.core.statistics.stringtracker.MAX_SUMMARY_ITEMS = 100
whylogs.core.statistics.stringtracker.logger
class whylogs.core.statistics.stringtracker.CharPosTracker(character_list: str = None)

Track statistics for character positions within a string

Parameters

character_list (str) – string containing all characters to be tracked; this list can include specific unicode characters to track.

update(self, value: str, character_list: str = None) → None

update

Parameters
  • value (str) – utf-16 string

  • character_list (str, optional) – use a specific character_list for the tracked string. Note that modifying it from a previously saved choice will reset the character position map, since the "NITL" (not in the list) category no longer has the same context.

merge(self, other: CharPosTracker) → CharPosTracker

Merges two Char Pos Frequency Maps

Parameters

other (CharPosTracker) – to be merged

to_protobuf(self)

Return the object serialized as a protobuf message

static from_protobuf(message: whylogs.proto.CharPosMessage)

Load from a CharPosMessage protobuf message

Return type

CharPosTracker

to_summary(self)
class whylogs.core.statistics.stringtracker.StringTracker(count: int = None, items: datasketches.frequent_strings_sketch = None, theta_sketch: whylogs.core.statistics.thetasketch.ThetaSketch = None, length: whylogs.core.statistics.numbertracker.NumberTracker = None, token_length: whylogs.core.statistics.numbertracker.NumberTracker = None, char_pos_tracker: CharPosTracker = None, token_method: Callable[[], List[str]] = None)

Track statistics for strings

Parameters
  • count (int) – Total number of processed values

  • items (frequent_strings_sketch) – Sketch for tracking string counts

  • theta_sketch (ThetaSketch) – Sketch for approximate cardinality tracking

  • length (NumberTracker) – tracks the distribution of length of strings

  • token_length (NumberTracker) – counts tokens per sentence

  • token_method (function) – method used to turn a string into tokens

  • char_pos_tracker (CharPosTracker) –

update(self, value: str, character_list=None, token_method=None)

Add a string to the tracking statistics.

If value is None, nothing will be done

merge(self, other)

Merge the values of this string tracker with another

Parameters

other (StringTracker) – The other StringTracker

Returns

new – Merged values

Return type

StringTracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

StringsMessage

static from_protobuf(message: whylogs.proto.StringsMessage)

Load from a protobuf message

Returns

string_tracker

Return type

StringTracker

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

StringsSummary
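
A minimal sketch; per the note above, None values are ignored by update:

    from whylogs.core.statistics import StringTracker

    strings = StringTracker()
    for s in ["alpha", "beta", "alpha", None]:
        strings.update(s)

    summary = strings.to_summary()  # StringsSummary protobuf message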

whylogs.core.statistics.thetasketch
Module Contents
Classes

ThetaSketch

A sketch for approximate cardinality tracking.

Functions

_copy_union(union)

numbers_summary(sketch: ThetaSketch, num_std_devs=1)

Generate a summary protobuf message from a thetasketch based on numeric

whylogs.core.statistics.thetasketch._copy_union(union)
class whylogs.core.statistics.thetasketch.ThetaSketch(theta_sketch=None, union=None, compact_theta=None)

A sketch for approximate cardinality tracking.

A wrapper class for datasketches.update_theta_sketch which implements merging for updatable theta sketches.

Currently, datasketches only implements merging for compact (read-only) theta sketches.

update(self, value)

Update the statistics tracking

Parameters

value (object) – Value to follow

merge(self, other)

Merge another ThetaSketch with this one, returning a new object

Parameters

other (ThetaSketch) – Other theta sketch

Returns

new – New theta sketch with merged statistics

Return type

ThetaSketch

get_result(self)

Generate a theta sketch

Returns

compact_sketch – Read-only compact theta sketch with full statistics.

Return type

datasketches.compact_theta_sketch

serialize(self)

Serialize this object.

Note that serialization only preserves the object approximately.

Returns

msg – Serialized to bytes

Return type

bytes

static deserialize(msg: bytes)

Deserialize from a serialized message.


Parameters

msg (bytes) –

Serialized object. Can be a serialized version of:
  • ThetaSketch

  • datasketches.update_theta_sketch,

  • datasketches.compact_theta_sketch

Returns

sketch – ThetaSketch object

Return type

ThetaSketch

to_summary(self, num_std_devs=1)

Generate a summary protobuf message

Parameters

num_std_devs (float) – For estimating bounds

Returns

summary – Summary protobuf message

Return type

UniqueCountSummary

whylogs.core.statistics.thetasketch.numbers_summary(sketch: ThetaSketch, num_std_devs=1)

Generate a summary protobuf message from a thetasketch based on numeric values

Parameters
  • sketch

  • num_std_devs (float) – For estimating bounds

Returns

summary – Summary protobuf message

Return type

UniqueCountSummary
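
A minimal sketch of updating, merging, and summarizing; merge returns a new ThetaSketch as documented above:

    from whylogs.core.statistics import ThetaSketch

    a, b = ThetaSketch(), ThetaSketch()
    for value in ("x", "y", "z"):
        a.update(value)
    b.update("z")

    merged = a.merge(b)  # new ThetaSketch; neither input is altered
    print(merged.to_summary(num_std_devs=1))  # UniqueCountSummary with bounds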

Package Contents
Classes

CountersTracker

Class to keep track of the counts of various data types

NumberTracker

Class to track statistics for numeric data.

SchemaTracker

Track information about a column's schema and present datatypes

StringTracker

Track statistics for strings

ThetaSketch

A sketch for approximate cardinality tracking.

Attributes

__ALL__

class whylogs.core.statistics.CountersTracker(count=0, true_count=0)

Class to keep track of the counts of various data types

Parameters
  • count (int, optional) – Current number of objects

  • true_count (int, optional) – Number of boolean values

  • null_count (int, optional) – Number of nulls encountered

increment_count(self)

Add one to the count of total objects

increment_bool(self)

Add one to the boolean count

increment_null(self)

Add one to the null count

merge(self, other)

Merge another counter tracker with this one

Returns

new_tracker – The merged tracker

Return type

CountersTracker

to_protobuf(self, null_count=0)

Return the object serialized as a protobuf message

static from_protobuf(message: whylogs.proto.Counters)

Load from a protobuf message

Returns

counters

Return type

CountersTracker

class whylogs.core.statistics.NumberTracker(variance: whylogs.core.statistics.datatypes.VarianceTracker = None, floats: whylogs.core.statistics.datatypes.FloatTracker = None, ints: whylogs.core.statistics.datatypes.IntTracker = None, theta_sketch: whylogs.core.statistics.thetasketch.ThetaSketch = None, histogram: datasketches.kll_floats_sketch = None)

Class to track statistics for numeric data.

Parameters
  • variance – Tracker to follow the variance

  • floats – Float tracker for tracking all floats

  • ints – Integer tracker

variance

See above

floats

See above

ints

See above

theta_sketch

Sketch which tracks approximate cardinality

Type

whylogs.core.statistics.thetasketch.ThetaSketch

property count(self)
track(self, number)

Add a number to statistics tracking

Parameters

number (int, float) – A numeric value

merge(self, other)
to_protobuf(self)

Return the object serialized as a protobuf message

static from_protobuf(message: whylogs.proto.NumbersMessage)

Load from a protobuf message

Returns

number_tracker

Return type

NumberTracker

to_summary(self)

Construct a NumberSummary message

Returns

summary – Summary of the tracker statistics

Return type

NumberSummary

class whylogs.core.statistics.SchemaTracker(type_counts: dict = None, legacy_null_count=0)

Track information about a column’s schema and present datatypes

type_counts (dict)

If specified, a dictionary containing information about the counts of all data types.

UNKNOWN_TYPE
NULL_TYPE
CANDIDATE_MIN_FRAC = 0.7
_non_null_type_counts(self)
track(self, item_type)

Track an item type

get_count(self, item_type)

Return the count of a given item type

infer_type(self)

Generate a guess at what type the tracked values are.

Returns

type_guess – The type guess. See InferredType.Type for candidates

Return type

object

merge(self, other)

Merge another schema tracker with this and return a new one. Does not alter this object.

Parameters

other (SchemaTracker) –

Returns

merged – Merged tracker

Return type

SchemaTracker

copy(self)

Return a copy of this tracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

SchemaMessage

static from_protobuf(message, legacy_null_count=0)

Load from a protobuf message

Returns

schema_tracker

Return type

SchemaTracker

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

SchemaSummary

class whylogs.core.statistics.StringTracker(count: int = None, items: datasketches.frequent_strings_sketch = None, theta_sketch: whylogs.core.statistics.thetasketch.ThetaSketch = None, length: whylogs.core.statistics.numbertracker.NumberTracker = None, token_length: whylogs.core.statistics.numbertracker.NumberTracker = None, char_pos_tracker: CharPosTracker = None, token_method: Callable[[], List[str]] = None)

Track statistics for strings

Parameters
  • count (int) – Total number of processed values

  • items (frequent_strings_sketch) – Sketch for tracking string counts

  • theta_sketch (ThetaSketch) – Sketch for approximate cardinality tracking

  • length (NumberTracker) – tracks the distribution of length of strings

  • token_length (NumberTracker) – counts tokens per sentence

  • token_method (function) – method used to turn a string into tokens

  • char_pos_tracker (CharPosTracker) –

update(self, value: str, character_list=None, token_method=None)

Add a string to the tracking statistics.

If value is None, nothing will be done

merge(self, other)

Merge the values of this string tracker with another

Parameters

other (StringTracker) – The other StringTracker

Returns

new – Merged values

Return type

StringTracker

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

StringsMessage

static from_protobuf(message: whylogs.proto.StringsMessage)

Load from a protobuf message

Returns

string_tracker

Return type

StringTracker

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

StringsSummary

class whylogs.core.statistics.ThetaSketch(theta_sketch=None, union=None, compact_theta=None)

A sketch for approximate cardinality tracking.

A wrapper class for datasketches.update_theta_sketch which implements merging for updatable theta sketches.

Currently, datasketches only implements merging for compact (read-only) theta sketches.

update(self, value)

Update the statistics tracking

Parameters

value (object) – Value to follow

merge(self, other)

Merge another ThetaSketch with this one, returning a new object

Parameters

other (ThetaSketch) – Other theta sketch

Returns

new – New theta sketch with merged statistics

Return type

ThetaSketch

get_result(self)

Generate a theta sketch

Returns

compact_sketch – Read-only compact theta sketch with full statistics.

Return type

datasketches.compact_theta_sketch

serialize(self)

Serialize this object.

Note that serialization only preserves the object approximately.

Returns

msg – Serialized to bytes

Return type

bytes

static deserialize(msg: bytes)

Deserialize from a serialized message.


Parameters

msg (bytes) –

Serialized object. Can be a serialized version of:
  • ThetaSketch

  • datasketches.update_theta_sketch,

  • datasketches.compact_theta_sketch

Returns

sketch – ThetaSketch object

Return type

ThetaSketch

to_summary(self, num_std_devs=1)

Generate a summary protobuf message

Parameters

num_std_devs (float) – For estimating bounds

Returns

summary – Summary protobuf message

Return type

UniqueCountSummary

whylogs.core.statistics.__ALL__
whylogs.core.types
Submodules
whylogs.core.types.typeddataconverter

TODO: implement this using something other than yaml

Module Contents
Classes

TypedDataConverter

A class to coerce types on data.

Attributes

TYPES

TYPENUM_TO_NAME

INTEGRAL_TYPES

FLOAT_TYPES

whylogs.core.types.typeddataconverter.TYPES
whylogs.core.types.typeddataconverter.TYPENUM_TO_NAME
whylogs.core.types.typeddataconverter.INTEGRAL_TYPES
whylogs.core.types.typeddataconverter.FLOAT_TYPES
class whylogs.core.types.typeddataconverter.TypedDataConverter

A class to coerce types on data.

To see available types:

>>> from whylogs.core.types.typeddataconverter import TYPES
>>> print("\n".join(sorted(TYPES.keys())))
static convert(data)

Convert data to a typed value

If data is a string, parse it with yaml; otherwise, return data unchanged

Note: this method is very slow, since it relies on the complex and python-based implementation of yaml.

static _is_array_like(value)
static _are_nulls(value)
static get_type(typed_data)

Extract the data type of a value. See typeddataconverter.TYPES for available types.

Parameters

typed_data – Data processed by TypedDataConverter.convert

Returns

dtype

Return type

TYPES
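
A minimal sketch of the conversion round trip; the concrete results noted in the comments are assumptions about the yaml parsing:

    from whylogs.core.types import TypedDataConverter

    value = TypedDataConverter.convert("3.14")  # yaml-parsed, presumably the float 3.14
    dtype = TypedDataConverter.get_type(value)  # one of typeddataconverter.TYPES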

Package Contents
Classes

TypedDataConverter

A class to coerce types on data.

Attributes

__ALL__

class whylogs.core.types.TypedDataConverter

A class to coerce types on data.

To see available types:

>>> from whylogs.core.types.typeddataconverter import TYPES
>>> print("\n".join(sorted(TYPES.keys())))
static convert(data)

Convert data to a typed value

If data is a string, parse it with yaml; otherwise, return data unchanged.

Note: this method is very slow, since it relies on the complex, Python-based implementation of yaml.

static _is_array_like(value)
static _are_nulls(value)
static get_type(typed_data)

Extract the data type of a value. See typeddataconverter.TYPES for available types.

Parameters

typed_data – Data processed by TypedDataConverter.convert

Returns

dtype

Return type

TYPES

whylogs.core.types.__ALL__
Submodules
whylogs.core.annotation_profiling
Module Contents
Classes

Rectangle

Helper class to compute minimal bounding box intersections and/or IoU

TrackBB

Attributes

BB_ATTRIBUTES

class whylogs.core.annotation_profiling.Rectangle(boundingBox, confidence=None, labels=None)

Helper class to compute minimal bounding box intersections and/or IoU, plus minimal stats properties of a bounding box

area

Description

Type

float

aspect_ratio

Description

Type

TYPE

boundingBox

Description

Type

TYPE

centroid

Description

Type

TYPE

confidence

Description

Type

TYPE

height

Description

Type

TYPE

labels

Description

Type

TYPE

width

Description

Type

TYPE

property x1(self)
property x2(self)
property y1(self)
property y2(self)
intersection(self, Rectangle_2)
iou(self, Rectangle_2)
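
For illustration, a minimal sketch of comparing two rectangles. The corner-point format [(x1, y1), (x2, y2)] for boundingBox is an assumption for this example only, not confirmed by this reference:

from whylogs.core.annotation_profiling import Rectangle

# Assumed boundingBox format: two corner points [(x1, y1), (x2, y2)].
r1 = Rectangle([(0, 0), (10, 10)], confidence=0.9, labels=["cat"])
r2 = Rectangle([(5, 5), (15, 15)])

overlap = r1.intersection(r2)  # intersection of the two boxes
score = r1.iou(r2)             # intersection over union, in [0, 1]
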
whylogs.core.annotation_profiling.BB_ATTRIBUTES = ['annotation_count', 'annotation_density', 'area_coverage', 'bb_width', 'bb_height', 'bb_area',...
class whylogs.core.annotation_profiling.TrackBB(filepath: str = None, obj: Dict = None, feature_transforms: Optional[List[Callable]] = None, feature_names: str = '')
calculate_metrics(self)
__call__(self, profiles)
whylogs.core.columnprofile

Defines the ColumnProfile class for tracking per-column statistics

Module Contents
Classes

ColumnProfile

Statistics tracking for a column (i.e. a feature)

MultiColumnProfile

Statistics tracking for multiple columns (i.e. features)

Attributes

_TYPES

_NUMERIC_TYPES

_UNIQUE_COUNT_BOUNDS_STD

whylogs.core.columnprofile._TYPES
whylogs.core.columnprofile._NUMERIC_TYPES
whylogs.core.columnprofile._UNIQUE_COUNT_BOUNDS_STD = 1
class whylogs.core.columnprofile.ColumnProfile(name: str, number_tracker: whylogs.core.statistics.NumberTracker = None, string_tracker: whylogs.core.statistics.StringTracker = None, schema_tracker: whylogs.core.statistics.SchemaTracker = None, counters: whylogs.core.statistics.CountersTracker = None, frequent_items: whylogs.util.dsketch.FrequentItemsSketch = None, cardinality_tracker: whylogs.core.statistics.hllsketch.HllSketch = None, constraints: whylogs.core.statistics.constraints.ValueConstraints = None)

Statistics tracking for a column (i.e. a feature)

The primary tracking method is track().

Parameters
  • name (str (required)) – Name of the column profile

  • number_tracker (NumberTracker) – Implements numeric data statistics tracking

  • string_tracker (StringTracker) – Implements string data-type statistics tracking

  • schema_tracker (SchemaTracker) – Implements tracking of schema-related information

  • counters (CountersTracker) – Keep count of various things

  • frequent_items (FrequentItemsSketch) – Keep track of all frequent items, even for mixed datatype features

  • cardinality_tracker (HllSketch) – Track feature cardinality (even for mixed data types)

  • constraints (ValueConstraints) – Static assertions to be applied to numeric data tracked in this column

  • TODO

    • Proper TypedDataConverter type checking

    • Multi-threading/parallelism

track(self, value, character_list=None, token_method=None)

Add value to tracking statistics.

_unique_count_summary(self) whylogs.proto.UniqueCountSummary
to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

ColumnSummary

generate_constraints(self) whylogs.core.statistics.constraints.SummaryConstraints
merge(self, other)

Merge this column profile with another.

Parameters

other (ColumnProfile) –

Returns

merged – A new, merged column profile.

Return type

ColumnProfile

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

ColumnMessage

static from_protobuf(message)

Load from a protobuf message

Returns

column_profile

Return type

ColumnProfile
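
A minimal usage sketch of this class, using only the methods documented above (merging is assumed to require matching column names):

from whylogs.core.columnprofile import ColumnProfile

col = ColumnProfile("age")
for value in [23, 31, 27, "unknown"]:   # mixed types are tracked via the schema tracker
    col.track(value)

other = ColumnProfile("age")            # assumption: merging requires matching names
other.track(40)
merged = col.merge(other)
summary = merged.to_summary()           # protobuf ColumnSummary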

class whylogs.core.columnprofile.MultiColumnProfile(constraints: whylogs.core.statistics.constraints.MultiColumnValueConstraints = None)

Statistics tracking for multiple columns (i.e. features)

The primary tracking method is track().

Parameters

constraints (MultiColumnValueConstraints) – Static assertions to be applied to data tracked between all columns

track(self, column_dict, character_list=None, token_method=None)

TODO: Add column_dict to tracking statistics.

abstract to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

(Multi)ColumnSummary

merge(self, other) MultiColumnProfile

Merge this column profile with another.

Parameters

other (MultiColumnProfile) –

Returns

merged – A new, merged multi column profile.

Return type

MultiColumnProfile

abstract to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

ColumnMessage

abstract static from_protobuf(message)

Load from a protobuf message

Returns

column_profile

Return type

MultiColumnProfile

whylogs.core.datasetprofile

Defines the primary interface class for tracking dataset statistics.

Module Contents
Classes

DatasetProfile

Statistics tracking for a dataset.

Functions

columns_chunk_iterator(iterator, marker: str)

Create an iterator to return column messages in batches

dataframe_profile(df: pandas.DataFrame, name: str = None, timestamp: datetime.datetime = None)

Generate a dataset profile for a dataframe

array_profile(x: numpy.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)

Generate a dataset profile for an array

_create_column_profile_summary_object(number_summary: whylogs.proto.NumberSummary, **kwargs)

Wrapper method for summary constraints update object creation

Attributes

SCHEMA_MAJOR_VERSION

SCHEMA_MINOR_VERSION

logger

cudfDataFrame

COLUMN_CHUNK_MAX_LEN_IN_BYTES

whylogs.core.datasetprofile.SCHEMA_MAJOR_VERSION = 1
whylogs.core.datasetprofile.SCHEMA_MINOR_VERSION = 2
whylogs.core.datasetprofile.logger
whylogs.core.datasetprofile.cudfDataFrame
whylogs.core.datasetprofile.COLUMN_CHUNK_MAX_LEN_IN_BYTES
class whylogs.core.datasetprofile.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, multi_columns: whylogs.core.MultiColumnProfile = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)

Statistics tracking for a dataset.

A dataset refers to a collection of columns.

Parameters
  • name (str) – A human-readable name for the dataset profile. Could be a model name. This is stored under the “name” tag

  • dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.

  • session_timestamp (datetime.datetime) – Timestamp of the logging session

  • columns (dict) – Dictionary lookup of `ColumnProfile`s

  • tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.

  • metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.

  • session_id (str) – The unique session ID run. Should be a UUID.

  • constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.

__getstate__(self)
__setstate__(self, serialized_profile)
property name(self)
property tags(self)
property metadata(self)
property session_timestamp(self)
property session_timestamp_ms(self)

Return the session timestamp value in epoch milliseconds.

property total_row_number(self)
add_output_field(self, field: Union[str, List[str]])
track_metrics(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)

Function to track metrics based on validation data.

Users may also pass the attribute names associated with the target, prediction, and/or score.

Parameters
  • targets (List[Union[str, bool, float, int]]) – actual validated values

  • predictions (List[Union[str, bool, float, int]]) – inferred/predicted values

  • scores (List[float], optional) – associated scores for each prediction; all values set to 1 if not passed

  • target_field (str, optional) – Description

  • prediction_field (str, optional) – Description

  • score_field (str, optional) – Description

  • model_type (ModelType, optional) – Default is the classification type.

track(self, columns, data=None, character_list=None, token_method=None)

Add value(s) to tracking statistics for column(s).

Parameters
  • columns (str, dict) – Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.

  • data (object, None) – Value to track. Specify if columns is a string.
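
A minimal sketch of both calling conventions:

from whylogs.core.datasetprofile import DatasetProfile

profile = DatasetProfile(name="demo")
profile.track("age", 23)                     # single column: data must be supplied
profile.track({"age": 31, "city": "Oslo"})   # dict of column -> value: data is ignored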

track_datum(self, column_name, data, character_list=None, token_method=None)
track_multi_column(self, columns)
track_array(self, x: numpy.ndarray, columns=None)

Track statistics for a numpy array

Parameters
  • x (np.ndarray) – 2D array to track.

  • columns (list) – Optional column labels

track_dataframe(self, df: pandas.DataFrame, character_list=None, token_method=None)

Track statistics for a dataframe

Parameters

df (pandas.DataFrame) – DataFrame to track

to_properties(self)

Return dataset profile related metadata

Returns

properties – The metadata as a protobuf object.

Return type

DatasetProperties

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

DatasetSummary

generate_constraints(self) whylogs.core.statistics.constraints.DatasetConstraints

Assemble a sparse dict of constraints for all features.

Returns

summary – Protobuf constraints message.

Return type

DatasetConstraints

flat_summary(self)

Generate and flatten a summary of the statistics.

See flatten_summary() for a description

_column_message_iterator(self)
chunk_iterator(self)

Generate an iterator to iterate over chunks of data

validate(self)

Sanity check for this object. Raises an AssertionError if invalid

merge(self, other)

Merge this profile with another dataset profile object.

We will use metadata and timestamps from the current DatasetProfile in the result.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) –

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile

_do_merge(self, other)
merge_strict(self, other)

Merge this profile with another dataset profile object. This throws exception if session_id, timestamps and tags don’t match.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) –

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile

serialize_delimited(self) bytes

Write out in delimited format (data is prefixed with the length of the datastream).

This is useful when you are streaming multiple dataset profile objects

Returns

data – A sequence of bytes

Return type

bytes

to_protobuf(self) whylogs.proto.DatasetProfileMessage

Return the object serialized as a protobuf message

Returns

message

Return type

DatasetProfileMessage

write_protobuf(self, protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) None

Write the dataset profile to disk in binary format

Parameters
  • protobuf_path (str) – local path or any path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how. The parent directory must already exist

  • delimited_file (bool, optional) – whether to prefix the data with the length of output or not. Default is True

static read_protobuf(protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) DatasetProfile

Parse a protobuf file and return a DatasetProfile object

Parameters
Returns

whylogs.DatasetProfile object from the protobuf

Return type

DatasetProfile
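
A round-trip sketch using write_protobuf and read_protobuf ("profile.bin" is a hypothetical path):

from whylogs.core.datasetprofile import DatasetProfile

profile = DatasetProfile(name="demo")
profile.track("age", 23)

profile.write_protobuf("profile.bin")                  # delimited by default
restored = DatasetProfile.read_protobuf("profile.bin")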

static from_protobuf(message: whylogs.proto.DatasetProfileMessage) DatasetProfile

Load from a protobuf message

Parameters

message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()

Returns

dataset_profile

Return type

DatasetProfile

static from_protobuf_string(data: bytes) DatasetProfile

Deserialize a serialized DatasetProfileMessage

Parameters

data (bytes) – The serialized message

Returns

profile – The deserialized dataset profile

Return type

DatasetProfile

static _parse_delimited_generator(data: bytes)
static parse_delimited_single(data: bytes, pos=0)

Parse a single delimited entry from a byte stream.

Parameters
  • data (bytes) – The bytestream

  • pos (int) – The starting position. Default is zero

Returns

  • pos (int) – Current position in the stream after parsing

  • profile (DatasetProfile) – A dataset profile

static parse_delimited(data: bytes)

Parse delimited data (i.e. data prefixed with the message length).

Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.

Parameters

data (bytes) – The input byte stream

Returns

profiles – List of all Dataset profile objects

Return type

list
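
For example, two delimited payloads can be concatenated and recovered together:

from whylogs.core.datasetprofile import DatasetProfile

a, b = DatasetProfile(name="a"), DatasetProfile(name="b")
payload = a.serialize_delimited() + b.serialize_delimited()
profiles = DatasetProfile.parse_delimited(payload)   # list of two DatasetProfile objects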

apply_summary_constraints(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)
apply_table_shape_constraints(self, table_shape_constraints: Optional[whylogs.core.statistics.constraints.SummaryConstraints] = None)
whylogs.core.datasetprofile.columns_chunk_iterator(iterator, marker: str)

Create an iterator to return column messages in batches

Parameters
  • iterator – An iterator which returns protobuf column messages

  • marker – Value used to mark a group of column messages

whylogs.core.datasetprofile.dataframe_profile(df: pandas.DataFrame, name: str = None, timestamp: datetime.datetime = None)

Generate a dataset profile for a dataframe

Parameters
  • df (pandas.DataFrame) – Dataframe to track, treated as a complete dataset.

  • name (str) – Name of the dataset

  • timestamp (datetime.datetime, float) – Timestamp of the dataset. Defaults to current UTC time. Can be a datetime or UTC epoch seconds.

Returns

prof

Return type

DatasetProfile
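
A minimal usage sketch:

import pandas as pd
from whylogs.core.datasetprofile import dataframe_profile

df = pd.DataFrame({"age": [23, 31, 27], "city": ["Oslo", "Bergen", "Oslo"]})
prof = dataframe_profile(df, name="demo")   # one complete dataset profile for df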

whylogs.core.datasetprofile.array_profile(x: numpy.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)

Generate a dataset profile for an array

Parameters
  • x (np.ndarray) – Array-like object to track. Will be treated as a full dataset

  • name (str) – Name of the dataset

  • timestamp (datetime.datetime) – Timestamp of the dataset. Defaults to current UTC time

  • columns (list) – Optional column labels

Returns

prof

Return type

DatasetProfile

whylogs.core.datasetprofile._create_column_profile_summary_object(number_summary: whylogs.proto.NumberSummary, **kwargs)

Wrapper method for summary constraints update object creation

Parameters
  • number_summary (NumberSummary) – Summary object generated from NumberTracker. Used to unpack the metrics as separate items in the dictionary

  • kwargs (Summary objects or datasketches objects) – Used to update specific constraints that need additional calculations

Return type

Anonymous object containing all of the metrics as fields with their corresponding values

whylogs.core.flatten_datasetprofile
Module Contents
Functions

flatten_summary(dataset_summary: whylogs.proto.DatasetSummary) → dict

Flatten a DatasetSummary

_quantile_strings(quantiles: list)

flatten_dataset_quantiles(dataset_summary: whylogs.proto.DatasetSummary)

Flatten quantiles from a dataset summary

flatten_dataset_string_quantiles(dataset_summary: whylogs.proto.DatasetSummary)

Flatten quantiles from a dataset summary

flatten_dataset_histograms(dataset_summary: whylogs.proto.DatasetSummary)

Flatten histograms from a dataset summary

flatten_dataset_frequent_strings(dataset_summary: whylogs.proto.DatasetSummary)

Flatten frequent strings summaries from a dataset summary

get_dataset_frame(dataset_summary: whylogs.proto.DatasetSummary, mapping: dict = None)

Get a dataframe from scalar values flattened from a dataset summary

Attributes

TYPENUM_COLUMN_NAMES

SCALAR_NAME_MAPPING

whylogs.core.flatten_datasetprofile.TYPENUM_COLUMN_NAMES
whylogs.core.flatten_datasetprofile.SCALAR_NAME_MAPPING
whylogs.core.flatten_datasetprofile.flatten_summary(dataset_summary: whylogs.proto.DatasetSummary) dict

Flatten a DatasetSummary

Parameters

dataset_summary (DatasetSummary) – Summary to flatten

Returns

data

A dictionary with the following keys:

summary (pandas.DataFrame)

Per-column summary statistics

hist (pandas.Series)

Series of histogram Series with (column name, histogram) key, value pairs. Histograms are formatted as a pandas.Series

frequent_strings (pandas.Series)

Series of frequent string counts with (column name, counts) key, value pairs. counts are a pandas Series.

Return type

dict

Notes

Some relevant info on the summary mapping:

>>> from whylogs.core.datasetprofile import SCALAR_NAME_MAPPING
>>> import json
>>> print(json.dumps(SCALAR_NAME_MAPPING, indent=2))
whylogs.core.flatten_datasetprofile._quantile_strings(quantiles: list)
whylogs.core.flatten_datasetprofile.flatten_dataset_quantiles(dataset_summary: whylogs.proto.DatasetSummary)

Flatten quantiles from a dataset summary

whylogs.core.flatten_datasetprofile.flatten_dataset_string_quantiles(dataset_summary: whylogs.proto.DatasetSummary)

Flatten quantiles from a dataset summary

whylogs.core.flatten_datasetprofile.flatten_dataset_histograms(dataset_summary: whylogs.proto.DatasetSummary)

Flatten histograms from a dataset summary

whylogs.core.flatten_datasetprofile.flatten_dataset_frequent_strings(dataset_summary: whylogs.proto.DatasetSummary)

Flatten frequent strings summaries from a dataset summary

whylogs.core.flatten_datasetprofile.get_dataset_frame(dataset_summary: whylogs.proto.DatasetSummary, mapping: dict = None)

Get a dataframe from scalar values flattened from a dataset summary

Parameters
  • dataset_summary (DatasetSummary) – The dataset summary.

  • mapping (dict, optional) – Override the default variable mapping.

Returns

summary – Scalar values, flattened and re-named according to mapping

Return type

pd.DataFrame
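
A sketch combining dataframe_profile (documented above) with this flattening helper:

import pandas as pd
from whylogs.core.datasetprofile import dataframe_profile
from whylogs.core.flatten_datasetprofile import get_dataset_frame

prof = dataframe_profile(pd.DataFrame({"age": [23, 31, 27]}), name="demo")
frame = get_dataset_frame(prof.to_summary())   # flattened scalar statistics per column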

whylogs.core.image_profiling
Module Contents
Classes

TrackImage

A class that computes image features and visits profiles so that image features can be sketched.

Functions

image_loader(path: str = None) → PIL.Image.Image

get_pil_image_statistics(img: PIL.Image.Image, channels: List[str] = _IMAGE_HSV_CHANNELS, image_stats: List[str] = _STATS_PROPERTIES) → Dict

Compute statistics data for a PIL Image

get_pil_image_metadata(img: PIL.Image.Image) → Dict

Grab statistics data from a PIL ImageStats.Stat

image_based_metadata(img)

Attributes

logger

ImageType

DEFAULT_IMAGE_FEATURES

_DEFAULT_TAGS_ATTRIBUTES

_IMAGE_HSV_CHANNELS

_STATS_PROPERTIES

_DEFAULT_STAT_ATTRIBUTES

_METADATA_DEFAULT_ATTRIBUTES

whylogs.core.image_profiling.logger
whylogs.core.image_profiling.ImageType
whylogs.core.image_profiling.DEFAULT_IMAGE_FEATURES = []
whylogs.core.image_profiling._DEFAULT_TAGS_ATTRIBUTES = ['ImagePixelWidth', 'ImagePixelHeight', 'Colorspace']
whylogs.core.image_profiling._IMAGE_HSV_CHANNELS = ['Hue', 'Saturation', 'Brightness']
whylogs.core.image_profiling._STATS_PROPERTIES = ['mean', 'stddev']
whylogs.core.image_profiling._DEFAULT_STAT_ATTRIBUTES
whylogs.core.image_profiling._METADATA_DEFAULT_ATTRIBUTES
whylogs.core.image_profiling.image_loader(path: str = None) PIL.Image.Image
class whylogs.core.image_profiling.TrackImage(filepath: str = None, img: PIL.Image.Image = None, feature_transforms: List[Callable] = DEFAULT_IMAGE_FEATURES, feature_name: str = '', metadata_attributes: Union[str, List[str]] = _METADATA_DEFAULT_ATTRIBUTES)

A class that computes image features and visits profiles so that image features can be sketched.

feature_name

Name given to this image feature; it will prefix all image-based features

Type

str

feature_transforms

Feature transforms to be applied to image data.

Type

List[Callable]

img

the PIL.Image

Type

PIL.Image

metadata_attributes

metadata attributes to track

Type

TYPE

__call__(self, profiles)

Call method to add image data and metadata to associated profiles.

Parameters

profiles (Union[List[DatasetProfile], DatasetProfile]) – DatasetProfile
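
A minimal usage sketch; "example.jpg" is a hypothetical image path:

from whylogs.core.datasetprofile import DatasetProfile
from whylogs.core.image_profiling import TrackImage

profile = DatasetProfile(name="images")
TrackImage(filepath="example.jpg")(profile)   # sketches image statistics and metadata into the profile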

whylogs.core.image_profiling.get_pil_image_statistics(img: PIL.Image.Image, channels: List[str] = _IMAGE_HSV_CHANNELS, image_stats: List[str] = _STATS_PROPERTIES) Dict

Compute statistics data for a PIL Image

Parameters

img (ImageType) – PIL Image

Returns

Dict of metadata

Return type

Dict

whylogs.core.image_profiling.get_pil_image_metadata(img: PIL.Image.Image) Dict

Grab statistics data from a PIL ImageStats.Stat

Parameters

img (ImageType) – PIL Image

Returns

Dict of metadata

Return type

Dict

whylogs.core.image_profiling.image_based_metadata(img)
whylogs.core.model_profile
Module Contents
Classes

ModelProfile

Model class for sketching metrics of model outputs

Attributes

SUPPORTED_TYPES

whylogs.core.model_profile.SUPPORTED_TYPES = ['binary', 'multiclass']
class whylogs.core.model_profile.ModelProfile(output_fields=None, metrics: whylogs.core.metrics.model_metrics.ModelMetrics = None)

Model class for sketching metrics of model outputs

metrics

the model metrics object

Type

ModelMetrics

model_type

Type of model: CLASSIFICATION, REGRESSION, UNKNOWN, etc.

Type

ModelType

output_fields

list of fields that map to model outputs

Type

list

add_output_field(self, field: str)
compute_metrics(self, targets, predictions, scores=None, model_type: whylogs.proto.ModelType = None, target_field=None, prediction_field=None, score_field=None)

Compute and track metrics for confusion_matrix

Parameters
  • targets (List) – targets (or actuals) for validation; if these are floats, the model is assumed to be a regression model

  • predictions (List) – predictions (or inferred values)

  • scores (List, optional) – associated scores for each prediction (for binary and multiclass problems)

  • target_field (str, optional) –

  • prediction_field (str, optional) –

  • score_field (str, optional (for binary and multiclass problems)) –

Raises

NotImplementedError
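
A minimal sketch for a classification-style model, using only the signature documented above:

from whylogs.core.model_profile import ModelProfile

mp = ModelProfile()
mp.compute_metrics(
    targets=["cat", "dog", "cat"],
    predictions=["cat", "dog", "dog"],
    scores=[0.9, 0.8, 0.55],
)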

to_protobuf(self)
classmethod from_protobuf(cls, message: whylogs.proto.ModelProfileMessage)
merge(self, model_profile)
whylogs.core.summaryconverters

Library module defining functions for generating summaries

Module Contents
Functions

from_sketch(sketch: datasketches.update_theta_sketch, num_std_devs: float = 1)

Generate a protobuf summary message from a datasketches theta sketch

from_string_sketch(sketch: datasketches.frequent_strings_sketch)

Generate a protobuf summary message from a string sketch

quantiles_from_sketch(sketch: datasketches.kll_floats_sketch, quantiles=None)

Calculate quantiles from a data sketch

single_quantile_from_sketch(sketch: datasketches.kll_floats_sketch, quantile: float)

Calculate the specified quantile from a data sketch

_calculate_bins(end: float, start: float, n: int, avg_per_bucket: float, max_buckets: int)

histogram_from_sketch(sketch: datasketches.kll_floats_sketch, max_buckets: int = None, avg_per_bucket: int = None)

Generate a summary of a kll_floats_sketch, including a histogram

entropy_from_column_summary(summary: whylogs.proto.ColumnSummary, histogram: datasketches.kll_floats_sketch)

Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary

ks_test_compute_p_value(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)

Compute the Kolmogorov-Smirnov test p-value of two continuous distributions.

compute_kl_divergence(target_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage], reference_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage])

Calculates the KL divergence between a target feature and a reference feature.

_compute_kl_divergence_continuous_distributions(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)

Calculates the estimated KL divergence for two continuous distributions.

_compute_kl_divergence_discrete_distributions(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)

Calculates the estimated KL divergence for two discrete distributions.

compute_chi_squared_test_p_value(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)

Calculates the Chi-Squared test p-value for two discrete distributions.

Attributes

MAX_HIST_BUCKETS

HIST_AVG_NUMBER_PER_BUCKET

QUANTILES

logger

whylogs.core.summaryconverters.MAX_HIST_BUCKETS = 30
whylogs.core.summaryconverters.HIST_AVG_NUMBER_PER_BUCKET = 4.0
whylogs.core.summaryconverters.QUANTILES = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]
whylogs.core.summaryconverters.logger
whylogs.core.summaryconverters.from_sketch(sketch: datasketches.update_theta_sketch, num_std_devs: float = 1)

Generate a protobuf summary message from a datasketches theta sketch

Parameters
  • sketch – Theta sketch to summarize

  • num_std_devs – Number of standard deviations for calculating bounds

Returns

summary

Return type

UniqueCountSummary

whylogs.core.summaryconverters.from_string_sketch(sketch: datasketches.frequent_strings_sketch)

Generate a protobuf summary message from a string sketch

Parameters

sketch – Frequent strings sketch

Returns

summary

Return type

FrequentStringsSummary

whylogs.core.summaryconverters.quantiles_from_sketch(sketch: datasketches.kll_floats_sketch, quantiles=None)

Calculate quantiles from a data sketch

Parameters
  • sketch (kll_floats_sketch) – Data sketch

  • quantiles (list-like) – Override the default quantiles. Should be a list of values from 0 to 1 inclusive.

whylogs.core.summaryconverters.single_quantile_from_sketch(sketch: datasketches.kll_floats_sketch, quantile: float)

Calculate the specified quantile from a data sketch

Parameters
  • sketch (kll_floats_sketch) – Data sketch

  • quantile (float) – Override the default quantiles to a single quantile. Should be a value from 0 to 1 inclusive.

Return type

Anonymous object with one field equal to the quantile value

whylogs.core.summaryconverters._calculate_bins(end: float, start: float, n: int, avg_per_bucket: float, max_buckets: int)
whylogs.core.summaryconverters.histogram_from_sketch(sketch: datasketches.kll_floats_sketch, max_buckets: int = None, avg_per_bucket: int = None)

Generate a summary of a kll_floats_sketch, including a histogram

Parameters
  • sketch (kll_floats_sketch) – Data sketch

  • max_buckets (int) – Override the default maximum number of buckets

  • avg_per_bucket (int) – Override the default target number of items per bucket.

Returns

histogram – Protobuf histogram message

Return type

HistogramSummary
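
A sketch of building a quantiles sketch with the datasketches package and summarizing it:

from datasketches import kll_floats_sketch
from whylogs.core.summaryconverters import histogram_from_sketch

sk = kll_floats_sketch(256)
for x in range(1000):
    sk.update(float(x))

hist = histogram_from_sketch(sk, max_buckets=10)   # HistogramSummary protobuf message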

whylogs.core.summaryconverters.entropy_from_column_summary(summary: whylogs.proto.ColumnSummary, histogram: datasketches.kll_floats_sketch)

Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary. Can be used for both continuous and discrete types of data.

Parameters
  • summary (ColumnSummary) – Protobuf summary message

  • histogram (datasketches.kll_floats_sketch) – Data sketch for quantiles

Returns

entropy – Estimated entropy value, np.nan if the inferred data type of the column is not categorical or numeric

Return type

float

whylogs.core.summaryconverters.ks_test_compute_p_value(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)

Compute the Kolmogorov-Smirnov test p-value of two continuous distributions. Uses the quantile values and the corresponding CDFs to calculate the approximate KS statistic. Only applicable to continuous distributions. The null hypothesis expects the samples to come from the same distribution.

Parameters
  • target_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the target distribution’s values

  • reference_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the reference (expected) distribution’s values. Can be generated from a theoretical distribution, or another sample for the same feature.

Returns

p_value (float) – The estimated p-value from the parametrized KS test, applied on the target and reference distributions’ kll_floats_sketch summaries
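
A sketch of the full flow, building both quantile sketches with the datasketches package (the random data here is illustrative only):

import numpy as np
from datasketches import kll_floats_sketch
from whylogs.core.summaryconverters import ks_test_compute_p_value

target, reference = kll_floats_sketch(256), kll_floats_sketch(256)
for x in np.random.normal(size=1000):
    target.update(float(x))
for x in np.random.normal(size=1000):
    reference.update(float(x))

p_value = ks_test_compute_p_value(target, reference)  # large p-value: same distribution is plausible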

whylogs.core.summaryconverters.compute_kl_divergence(target_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage], reference_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage])

Calculates the KL divergence between a target feature and a reference feature. Applicable to both continuous and discrete distributions. Uses the pmf and the datasketches.kll_floats_sketch to calculate the KL divergence in the continuous case. Uses the top frequent items to calculate the KL divergence in the discrete case.

Parameters
  • target_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The target distribution. Should be a datasketches.kll_floats_sketch if the target distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the target distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution must be of the same type.

  • reference_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The reference distribution. Should be a datasketches.kll_floats_sketch if the reference distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the reference distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution must be of the same type.

Returns

kl_divergence (float) – The estimated value of the KL divergence between the target and the reference feature

whylogs.core.summaryconverters._compute_kl_divergence_continuous_distributions(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)

Calculates the estimated KL divergence for two continuous distributions. Uses the datasketches.kll_floats_sketch sketch to calculate the KL divergence based on the PMFs. Only applicable to continuous distributions.

Parameters
  • target_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the target feature’s distribution.

  • reference_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the reference feature’s distribution.

Returns

kl_divergence (float) – The estimated KL divergence between two continuous features.

whylogs.core.summaryconverters._compute_kl_divergence_discrete_distributions(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)

Calculates the estimated KL divergence for two discrete distributions. Uses the frequent items summary to calculate the estimated frequencies of items in each distribution. Only applicable to discrete distributions.

Parameters
  • target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

  • reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

Returns

kl_divergence (float) – The estimated KL divergence between two discrete features.

whylogs.core.summaryconverters.compute_chi_squared_test_p_value(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)

Calculates the Chi-Squared test p-value for two discrete distributions. Uses the top frequent items summary, unique count estimate and total count estimate for each feature, to calculate the estimated Chi-Squared statistic. Applicable only to discrete distributions.

Parameters
  • target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

  • reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.

Returns

p_value (float) – The estimated p-value from the Chi-Squared test, applied on the target and reference distributions’ frequent and unique items summaries

Package Contents
Classes

TrackBB

ColumnProfile

Statistics tracking for a column (i.e. a feature)

MultiColumnProfile

Statistics tracking for multiple columns (i.e. features)

DatasetProfile

Statistics tracking for a dataset.

TrackImage

A class that computes image features and visits profiles so that image features can be sketched.

Attributes

BB_ATTRIBUTES

METADATA_DEFAULT_ATTRIBUTES

__ALL__

whylogs.core.BB_ATTRIBUTES = ['annotation_count', 'annotation_density', 'area_coverage', 'bb_width', 'bb_height', 'bb_area',...
class whylogs.core.TrackBB(filepath: str = None, obj: Dict = None, feature_transforms: Optional[List[Callable]] = None, feature_names: str = '')
calculate_metrics(self)
__call__(self, profiles)
class whylogs.core.ColumnProfile(name: str, number_tracker: whylogs.core.statistics.NumberTracker = None, string_tracker: whylogs.core.statistics.StringTracker = None, schema_tracker: whylogs.core.statistics.SchemaTracker = None, counters: whylogs.core.statistics.CountersTracker = None, frequent_items: whylogs.util.dsketch.FrequentItemsSketch = None, cardinality_tracker: whylogs.core.statistics.hllsketch.HllSketch = None, constraints: whylogs.core.statistics.constraints.ValueConstraints = None)

Statistics tracking for a column (i.e. a feature)

The primary tracking method is track().

Parameters
  • name (str (required)) – Name of the column profile

  • number_tracker (NumberTracker) – Implements numeric data statistics tracking

  • string_tracker (StringTracker) – Implements string data-type statistics tracking

  • schema_tracker (SchemaTracker) – Implements tracking of schema-related information

  • counters (CountersTracker) – Keep count of various things

  • frequent_items (FrequentItemsSketch) – Keep track of all frequent items, even for mixed datatype features

  • cardinality_tracker (HllSketch) – Track feature cardinality (even for mixed data types)

  • constraints (ValueConstraints) – Static assertions to be applied to numeric data tracked in this column

  • TODO

    • Proper TypedDataConverter type checking

    • Multi-threading/parallelism

track(self, value, character_list=None, token_method=None)

Add value to tracking statistics.

_unique_count_summary(self) whylogs.proto.UniqueCountSummary
to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

ColumnSummary

generate_constraints(self) whylogs.core.statistics.constraints.SummaryConstraints
merge(self, other)

Merge this column profile with another.

Parameters

other (ColumnProfile) –

Returns

merged – A new, merged column profile.

Return type

ColumnProfile

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

ColumnMessage

static from_protobuf(message)

Load from a protobuf message

Returns

column_profile

Return type

ColumnProfile

class whylogs.core.MultiColumnProfile(constraints: whylogs.core.statistics.constraints.MultiColumnValueConstraints = None)

Statistics tracking for multiple columns (i.e. features)

The primary tracking method is track().

Parameters

constraints (MultiColumnValueConstraints) – Static assertions to be applied to data tracked between all columns

track(self, column_dict, character_list=None, token_method=None)

TODO: Add column_dict to tracking statistics.

abstract to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

(Multi)ColumnSummary

merge(self, other) MultiColumnProfile

Merge this column profile with another.

Parameters

other (MultiColumnProfile) –

Returns

merged – A new, merged multi column profile.

Return type

MultiColumnProfile

abstract to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

ColumnMessage

abstract static from_protobuf(message)

Load from a protobuf message

Returns

column_profile

Return type

MultiColumnProfile

class whylogs.core.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, multi_columns: whylogs.core.MultiColumnProfile = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)

Statistics tracking for a dataset.

A dataset refers to a collection of columns.

Parameters
  • name (str) – A human-readable name for the dataset profile. Could be a model name. This is stored under the “name” tag

  • dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.

  • session_timestamp (datetime.datetime) – Timestamp of the logging session

  • columns (dict) – Dictionary lookup of `ColumnProfile`s

  • tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.

  • metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.

  • session_id (str) – The unique session ID run. Should be a UUID.

  • constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.

__getstate__(self)
__setstate__(self, serialized_profile)
property name(self)
property tags(self)
property metadata(self)
property session_timestamp(self)
property session_timestamp_ms(self)

Return the session timestamp value in epoch milliseconds.

property total_row_number(self)
add_output_field(self, field: Union[str, List[str]])
track_metrics(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)

Function to track metrics based on validation data.

Users may also pass the attribute names associated with the target, prediction, and/or score.

Parameters
  • targets (List[Union[str, bool, float, int]]) – actual validated values

  • predictions (List[Union[str, bool, float, int]]) – inferred/predicted values

  • scores (List[float], optional) – associated scores for each prediction; all values set to 1 if not passed

  • target_field (str, optional) – Description

  • prediction_field (str, optional) – Description

  • score_field (str, optional) – Description

  • model_type (ModelType, optional) – Default is the classification type.

track(self, columns, data=None, character_list=None, token_method=None)

Add value(s) to tracking statistics for column(s).

Parameters
  • columns (str, dict) – Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.

  • data (object, None) – Value to track. Specify if columns is a string.

track_datum(self, column_name, data, character_list=None, token_method=None)
track_multi_column(self, columns)
track_array(self, x: numpy.ndarray, columns=None)

Track statistics for a numpy array

Parameters
  • x (np.ndarray) – 2D array to track.

  • columns (list) – Optional column labels

track_dataframe(self, df: pandas.DataFrame, character_list=None, token_method=None)

Track statistics for a dataframe

Parameters

df (pandas.DataFrame) – DataFrame to track

to_properties(self)

Return dataset profile related metadata

Returns

properties – The metadata as a protobuf object.

Return type

DatasetProperties

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

DatasetSummary

generate_constraints(self) whylogs.core.statistics.constraints.DatasetConstraints

Assemble a sparse dict of constraints for all features.

Returns

summary – Protobuf constraints message.

Return type

DatasetConstraints

flat_summary(self)

Generate and flatten a summary of the statistics.

See flatten_summary() for a description

_column_message_iterator(self)
chunk_iterator(self)

Generate an iterator to iterate over chunks of data

validate(self)

Sanity check for this object. Raises an AssertionError if invalid

merge(self, other)

Merge this profile with another dataset profile object.

We will use metadata and timestamps from the current DatasetProfile in the result.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) –

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile

_do_merge(self, other)
merge_strict(self, other)

Merge this profile with another dataset profile object. This throws exception if session_id, timestamps and tags don’t match.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) –

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile

serialize_delimited(self) bytes

Write out in delimited format (data is prefixed with the length of the datastream).

This is useful when you are streaming multiple dataset profile objects

Returns

data – A sequence of bytes

Return type

bytes

to_protobuf(self) whylogs.proto.DatasetProfileMessage

Return the object serialized as a protobuf message

Returns

message

Return type

DatasetProfileMessage

write_protobuf(self, protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) None

Write the dataset profile to disk in binary format

Parameters
  • protobuf_path (str) – local path or any path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how. The parent directory must already exist

  • delimited_file (bool, optional) – whether to prefix the data with the length of output or not. Default is True

static read_protobuf(protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) DatasetProfile

Parse a protobuf file and return a DatasetProfile object

Parameters
Returns

whylogs.DatasetProfile object from the protobuf

Return type

DatasetProfile

static from_protobuf(message: whylogs.proto.DatasetProfileMessage) DatasetProfile

Load from a protobuf message

Parameters

message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()

Returns

dataset_profile

Return type

DatasetProfile

static from_protobuf_string(data: bytes) DatasetProfile

Deserialize a serialized DatasetProfileMessage

Parameters

data (bytes) – The serialized message

Returns

profile – The deserialized dataset profile

Return type

DatasetProfile

static _parse_delimited_generator(data: bytes)
static parse_delimited_single(data: bytes, pos=0)

Parse a single delimited entry from a byte stream.

Parameters
  • data (bytes) – The bytestream

  • pos (int) – The starting position. Default is zero

Returns

  • pos (int) – Current position in the stream after parsing

  • profile (DatasetProfile) – A dataset profile

static parse_delimited(data: bytes)

Parse delimited data (i.e. data prefixed with the message length).

Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.

Parameters

data (bytes) – The input byte stream

Returns

profiles – List of all Dataset profile objects

Return type

list

apply_summary_constraints(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)
apply_table_shape_constraints(self, table_shape_constraints: Optional[whylogs.core.statistics.constraints.SummaryConstraints] = None)
whylogs.core.METADATA_DEFAULT_ATTRIBUTES
class whylogs.core.TrackImage(filepath: str = None, img: PIL.Image.Image = None, feature_transforms: List[Callable] = DEFAULT_IMAGE_FEATURES, feature_name: str = '', metadata_attributes: Union[str, List[str]] = _METADATA_DEFAULT_ATTRIBUTES)

A class that computes image features and visits profiles so that image features can be sketched.

feature_name

Name given to this image feature; it will prefix all image-based features

Type

str

feature_transforms

Feature transforms to be applied to image data.

Type

List[Callable]

img

the PIL.Image

Type

PIL.Image

metadata_attributes

metadata attributes to track

Type

TYPE

__call__(self, profiles)

Call method to add image data and metadata to associated profiles.

Parameters

profiles (Union[List[DatasetProfile], DatasetProfile]) – DatasetProfile

whylogs.core.__ALL__
whylogs.features
Submodules
whylogs.features.autosegmentation
Module Contents
Functions

_entropy(series: pandas.Series, normalized: bool = True) → numpy.float64

Entropy calculation. If normalized, use log cardinality.

_weighted_entropy(df: pandas.DataFrame, split_columns: List[Optional[str]], target_column_name: str, normalized: bool = True)

Entropy calculation. If normalized, use log cardinality.

_information_gain_ratio(df: pandas.DataFrame, prev_split_columns: List[Optional[str]], column_name: str, target_column_name: str, normalized: bool = True)

Entropy calculation. If normalized, use log cardinality.

_find_best_split(df: pandas.DataFrame, prev_split_columns: List[str], valid_column_names: List[str], target_column_name: str)

_estimate_segments(df: pandas.DataFrame, target_field: str = None, max_segments: int = 30) → Optional[Union[List[Dict], List[str]]]

Estimates the most important features and values on which to segment

whylogs.features.autosegmentation._entropy(series: pandas.Series, normalized: bool = True) numpy.float64

Entropy calculation. If normalized, use log cardinality.

whylogs.features.autosegmentation._weighted_entropy(df: pandas.DataFrame, split_columns: List[Optional[str]], target_column_name: str, normalized: bool = True)

Entropy calculation. If normalized, use log cardinality.

whylogs.features.autosegmentation._information_gain_ratio(df: pandas.DataFrame, prev_split_columns: List[Optional[str]], column_name: str, target_column_name: str, normalized: bool = True)

Entropy calculation. If normalized, use log cardinality.

whylogs.features.autosegmentation._find_best_split(df: pandas.DataFrame, prev_split_columns: List[str], valid_column_names: List[str], target_column_name: str)
whylogs.features.autosegmentation._estimate_segments(df: pandas.DataFrame, target_field: str = None, max_segments: int = 30) Optional[Union[List[Dict], List[str]]]

Estimates the most important features and values on which to segment data profiling using entropy-based methods.

If no target column provided, maximum entropy column is substituted.

Parameters
  • df – the dataframe of data to profile

  • target_field – target field (optional)

  • max_segments – upper threshold for total combinations of segments, default 30

Returns

a list of segmentation feature names

whylogs.features.transforms
Module Contents
Classes

ComposeTransforms

Outputs the composition of each transformation passed in transforms

Brightness

Outputs the Brightness of each pixel in the image

Saturation

Outputs the saturation of each pixel in the image

Resize

Helper Transform to resize images.

Hue

SimpleBlur

Simple blur amount computation based on the variance of the Laplacian

Attributes

logger

ImageType

whylogs.features.transforms.logger
whylogs.features.transforms.ImageType
class whylogs.features.transforms.ComposeTransforms(transforms: List, name=None)

Outputs the composition of each transformation passed in transforms

__call__(self, x)
__repr__(self)

Return repr(self).

class whylogs.features.transforms.Brightness

Outputs the Brightness of each pixel in the image

__call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) numpy.ndarray
Parameters

img (Union[Image, np.ndarray]) – Either a PIL image or numpy array with int8 values

Returns

Converted image.

Return type

np.ndarray

Deleted Parameters:

pic (PIL Image or numpy.ndarray): Image to be converted to tensor.

__repr__(self)

Return repr(self).

class whylogs.features.transforms.Saturation

Outputs the saturation of each pixel in the image

__call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) numpy.ndarray
Parameters

img (Union[Image, np.ndarray]) – Either a PIL image or numpy array with int8 values

Returns

(1,number_pixels) array for saturation values for the image

Return type

np.ndarray

__repr__(self)

Return repr(self).

class whylogs.features.transforms.Resize(size)

Helper Transform to resize images.

size

Description

Type

TYPE

__call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) numpy.ndarray
Parameters

img (Union[ImageType, np.ndarray]) – Description

Returns

Description

Return type

np.ndarray

__repr__(self)

Return repr(self).

class whylogs.features.transforms.Hue
__call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) numpy.ndarray
Parameters

img (Union[Image, np.ndarray]) – Either a PIL image or numpy array with int8 values

Returns

(1,number_pixels) array for hue values for the image

Return type

np.ndarray

__repr__(self)

Return repr(self).

class whylogs.features.transforms.SimpleBlur

Simple blur amount computation based on the variance of the Laplacian. An overall metric of how blurry the image is; there is no absolute scale.

__call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) float
Parameters

img (Union[Image, np.ndarray]) – Either a PIL image or numpy array with int8 values

Returns

Variance of the Laplacian of the image.

Return type

float

__repr__(self)

Return repr(self).

Package Contents
whylogs.features._IMAGE_FEATURES = ['Hue', 'Brightness', 'Saturation']
whylogs.io
Submodules
whylogs.io.file_loader
Module Contents
Functions

valid_file(fname: str)

Simple check whether the extension is one of the implemented ones

extension_file(path: str)

Check the encoding format based on the magic number

image_loader(path: str)

Tries to load an image using the PIL lib

json_loader(path: str = None) → Union[Dict, list]

Loads json or jsonl data

file_loader(path: str, valid_file: Callable[[str], bool] = valid_file) → Any

Factory for file data

Attributes

EXTENSIONS

IMAGE_EXTENSIONS

PD_EXCEL_FORMATS

whylogs.io.file_loader.EXTENSIONS = ['.csv', '.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.jsonl', '.json', '.pgm', '.tif', '.tiff',...
whylogs.io.file_loader.IMAGE_EXTENSIONS = ['.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp', '.gif']
whylogs.io.file_loader.PD_EXCEL_FORMATS = ['.xls', '.xlsx', '.xlsm', '.xlsb', '.odf', '.ods', '.odt']
whylogs.io.file_loader.valid_file(fname: str)

Simple check whether the extension is one of the implemented ones

Parameters

fname (str) – file path

Returns

bool

whylogs.io.file_loader.extension_file(path: str)

Check the encoding format based on the magic number; if the file has no magic number, we simply use the extension. More advanced analysis of file content is needed, potentially extending to a lib like libmagic

Parameters

path (str) – File path

Returns

file_extension_given (str) – extension of the encoded data

magic_data (dict) – any magic data information available, including magic number (bytes), mime_type (str), and name (str)

whylogs.io.file_loader.image_loader(path: str)

Tries to load an image using the PIL lib

Parameters

path (str) – path to image files

Returns

image data and image encoding format

Return type

PIL.Image.Image

whylogs.io.file_loader.json_loader(path: str = None) Union[Dict, list]

Loads json or jsonl data

Parameters

path (str, optional) – path to file

Returns

objs (Union[Dict, list]) – a list or dict of JSON data

json_format – the format of the file (json or jsonl)

whylogs.io.file_loader.file_loader(path: str, valid_file: Callable[[str], bool] = valid_file) Any

Factory for file data

Parameters
  • path (str) – path to file

  • valid_file (Callable[[str], bool], optional) – Optional valid file check,

Returns

Tuple of (DataFrame, image data (PIL format), or Dict) together with magic_data, a dict of magic number data

Return type

data

Raises

NotImplementedError – Description
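
A minimal sketch; "data.csv" is a hypothetical path, and the unpacking assumes the two-part Returns shape described above:

from whylogs.io.file_loader import file_loader

data, magic_data = file_loader("data.csv")   # dispatches on extension / magic number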

whylogs.io.local_dataset
Module Contents
Classes

Dataset

Helper class that provides a standard way to create an ABC using inheritance.

LocalDataset

Helper class that provides a standard way to create an ABC using inheritance.

class whylogs.io.local_dataset.Dataset(root_folder: str = '', feature_transforms: Optional[List[Callable]] = None)

Bases: abc.ABC

Helper class that provides a standard way to create an ABC using inheritance.

abstract __getitem__(self, index: int) Any
abstract __len__(self) int
__repr__(self) str

Return repr(self).

class whylogs.io.local_dataset.LocalDataset(root_folder, loader: Callable[[str], Any] = file_loader, extensions: List[str] = EXTENSIONS, feature_transforms: Optional[List[Callable]] = None, valid_file: Optional[Callable[[str], bool]] = valid_file)

Bases: Dataset

Helper class that provides a standard way to create an ABC using inheritance.

_find_folder_feature(self) None
_init_dataset(self) List[Tuple[str, int]]
__getitem__(self, index: int) Tuple[Any, Any]
__len__(self)
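
A minimal usage sketch; "images/" is a hypothetical root folder, and the expected folder layout is an assumption here:

from whylogs.io.local_dataset import LocalDataset

dataset = LocalDataset("images/")   # items are loaded with file_loader by default
item, target = dataset[0]           # __getitem__ returns a Tuple[Any, Any]
print(len(dataset))
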
Package Contents
Classes

LocalDataset

Helper class that provides a standard way to create an ABC using inheritance.

Attributes

__ALL__

class whylogs.io.LocalDataset(root_folder, loader: Callable[[str], Any] = file_loader, extensions: List[str] = EXTENSIONS, feature_transforms: Optional[List[Callable]] = None, valid_file: Optional[Callable[[str], bool]] = valid_file)

Bases: Dataset

Helper class that provides a standard way to create an ABC using inheritance.

_find_folder_feature(self) None
_init_dataset(self) List[Tuple[str, int]]
__getitem__(self, index: int) Tuple[Any, Any]
__len__(self)
whylogs.io.__ALL__
whylogs.logs

Convenience module for displaying/configuring python logs for whylogs

Package Contents
Functions

display_logging(level='DEBUG', root_logger=False)

Convenience utility for setting whylogs to print logs to stdout.

whylogs.logs.display_logging(level='DEBUG', root_logger=False)

Convenience utility for setting whylogs to print logs to stdout.

Parameters
  • level (str) – Logging level

  • root_logger (bool, default=False) – Redirect to the root logger.
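
For example:

from whylogs.logs import display_logging

display_logging(level="INFO")   # print whylogs logs to stdout at INFO level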

whylogs.mlflow
Submodules
whylogs.mlflow.model_wrapper
Module Contents
Classes

ModelWrapper

Attributes

logger

PyFuncOutput

whylogs.mlflow.model_wrapper.logger
whylogs.mlflow.model_wrapper.PyFuncOutput
class whylogs.mlflow.model_wrapper.ModelWrapper(model)

Bases: object

create_logger(self)
predict(self, data: pandas.DataFrame) PyFuncOutput

Wrapper around https://www.mlflow.org/docs/latest/_modules/mlflow/pyfunc.html#PyFuncModel.predict. This allows us to capture inputs and predictions into whylogs.

whylogs.mlflow.patcher
Module Contents
Classes

WhyLogsRun

Functions

_new_mlflow_conda_env(path=None, additional_conda_deps=None, additional_pip_deps=None, additional_conda_channels=None, install_mlflow=True)

_new_add_to_model(model, loader_module, data=None, code=None, env=None, **kwargs)

Replaces MLflow’s original add_to_model

new_model_log(**kwargs)

Hijack the mlflow.models.Model.log method and upload the .whylogs.yaml configuration to the model path

enable_mlflow(session=None) → bool

Enable whylogs in mlflow module via mlflow.whylogs.

disable_mlflow()

Attributes

logger

_mlflow

_original_end_run

_active_whylogs

_is_patched

_original_mlflow_conda_env

_original_add_to_model

_original_model_log

WHYLOG_YAML

whylogs.mlflow.patcher.logger
whylogs.mlflow.patcher._mlflow
whylogs.mlflow.patcher._original_end_run
whylogs.mlflow.patcher._active_whylogs = []
whylogs.mlflow.patcher._is_patched = False
whylogs.mlflow.patcher._original_mlflow_conda_env
whylogs.mlflow.patcher._original_add_to_model
whylogs.mlflow.patcher._original_model_log
class whylogs.mlflow.patcher.WhyLogsRun(session=None)

Bases: object

_session
_active_run_id
_loggers :Dict[str, whylogs.app.logger.Logger]
_create_logger(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None)
log_pandas(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None)

Log the statistics of a Pandas dataframe. Note that this method is additive within a run: calling it repeatedly with the same dataset name will not generate a new profile; instead, the data will be aggregated into the existing profile.

To create a new profile, specify a different dataset_name.

Parameters
  • df – the Pandas dataframe to log

  • dataset_name – the name of the dataset (Optional). If not specified, the experiment name is used

log(self, features: Optional[Dict[str, any]] = None, feature_name: Optional[str] = None, value: any = None, dataset_name: Optional[str] = None)

Logs a collection of features or a single feature (must specify one or the other).

Parameters
  • features – a map of key-value feature pairs for model input

  • feature_name – name of a single feature. Cannot be specified if ‘features’ is specified

  • value – value of a single feature. Cannot be specified if ‘features’ is specified

  • dataset_name – the name of the dataset. If not specified, we fall back to using the experiment name

_get_or_create_logger(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None)
_close(self)
whylogs.mlflow.patcher._new_mlflow_conda_env(path=None, additional_conda_deps=None, additional_pip_deps=None, additional_conda_channels=None, install_mlflow=True)
whylogs.mlflow.patcher._new_add_to_model(model, loader_module, data=None, code=None, env=None, **kwargs)

Replaces MLflow’s original add_to_model https://github.com/mlflow/mlflow/blob/4e68f960d4520ade6b64a28c297816f622adc83e/mlflow/pyfunc/__init__.py#L242

Accepts the same signature as MLFlow’s original add_to_model call. We inject our loader module.

We also inject whylogs into the Conda environment by patching _mlflow_conda_env.

Parameters
  • model – Existing model.

  • loader_module – The module to be used to load the model.

  • data – Path to the model data.

  • code – Path to the code dependencies.

  • env – Conda environment.

  • kwargs – Additional key-value pairs to include in the pyfunc flavor specification. Values must be YAML-serializable.

Returns

Updated model configuration.

whylogs.mlflow.patcher.WHYLOG_YAML = .whylogs.yaml
whylogs.mlflow.patcher.new_model_log(**kwargs)

Hijack the mlflow.models.Model.log method and upload the .whylogs.yaml configuration to the model path. This allows us to pick up the configuration later under the /opt/ml/model/.whylogs.yaml path.

whylogs.mlflow.patcher.enable_mlflow(session=None) bool

Enable whylogs in mlflow module via mlflow.whylogs.

Returns

True if MLFlow has been patched. False otherwise.

Example of whylogs and MLFlow
import mlflow
import whylogs

whylogs.enable_mlflow()

import numpy as np
import pandas as pd
pdf = pd.DataFrame(
    data=[[1, 2, 3, 4, True, "x", bytes([1])]],
    columns=["b", "d", "a", "c", "e", "g", "f"],
    dtype=object,  # np.object is deprecated; plain object keeps mixed types
)

active_run = mlflow.start_run()

# log a Pandas dataframe under default name
mlflow.whylogs.log_pandas(pdf)

# log a Pandas dataframe with custom name
mlflow.whylogs.log_pandas(pdf, "another dataset")

# Finish the MLFlow run
mlflow.end_run()
whylogs.mlflow.patcher.disable_mlflow()
whylogs.mlflow.sklearn
Module Contents
Functions

_load_pyfunc(path: str)

whylogs.mlflow.sklearn._load_pyfunc(path: str)
Package Contents
Functions

disable_mlflow()

enable_mlflow(session=None) → bool

Enable whylogs in mlflow module via mlflow.whylogs.

list_whylogs_runs(experiment_id: str, dataset_name: str = 'default')

List all the runs from an experiment that contain whylogs

get_run_profiles(run_id: str, dataset_name: str = 'default', client=None)

Retrieve all whylogs DatasetProfile for a given run and a given dataset name.

get_experiment_profiles(experiment_id: str, dataset_name: str = 'default')

Retrieve all whylogs profiles for a given experiment. This only returns Active Runs at the moment.

whylogs.mlflow.disable_mlflow()
whylogs.mlflow.enable_mlflow(session=None) bool

Enable whylogs in mlflow module via mlflow.whylogs.

Returns

True if MLFlow has been patched. False otherwise.

Example of whylogs and MLFlow
import mlflow
import whylogs

whylogs.enable_mlflow()

import numpy as np
import pandas as pd
pdf = pd.DataFrame(
    data=[[1, 2, 3, 4, True, "x", bytes([1])]],
    columns=["b", "d", "a", "c", "e", "g", "f"],
    dtype=object,  # np.object is deprecated; plain object keeps mixed types
)

active_run = mlflow.start_run()

# log a Pandas dataframe under default name
mlflow.whylogs.log_pandas(pdf)

# log a Pandas dataframe with custom name
mlflow.whylogs.log_pandas(pdf, "another dataset")

# Finish the MLFlow run
mlflow.end_run()
whylogs.mlflow.list_whylogs_runs(experiment_id: str, dataset_name: str = 'default')

List all the runs from an experiment that contain whylogs

Return type

typing.List[mlflow.entities.Run]

Parameters
  • experiment_id – the experiment id

  • dataset_name – the name of the dataset. Default to “default”

whylogs.mlflow.get_run_profiles(run_id: str, dataset_name: str = 'default', client=None)

Retrieve all whylogs DatasetProfile for a given run and a given dataset name.

Parameters
  • client – an mlflow.tracking.MlflowClient instance (optional)

  • run_id – the run id

  • dataset_name – the dataset name within a run. If not set, use the default value “default”

Return type

typing.List[whylogs.DatasetProfile]

whylogs.mlflow.get_experiment_profiles(experiment_id: str, dataset_name: str = 'default')

Retrieve all whylogs profiles for a given experiment. This only returns Active Runs at the moment.

Return type

typing.List[whylogs.DatasetProfile]

Parameters
  • experiment_id – the experiment ID string

  • dataset_name – the dataset name within a run. If not set, use the default value “default”
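
A retrieval sketch based on the functions above (the experiment id is hypothetical):

from whylogs.mlflow import get_experiment_profiles, list_whylogs_runs

runs = list_whylogs_runs(experiment_id="0")
profiles = get_experiment_profiles(experiment_id="0")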

whylogs.proto

Auto-generated protobuf class definitions.

Protobuf allows us to serialize/deserialize classes across languages

whylogs.util

Utilities for whylogs

Submodules
whylogs.util.data

Utility functions for interacting with data

Module Contents
Functions

getter(x, k: str, *args)

get an attribute (from an object) or key (from a dict-like object)

remap(x, mapping: dict)

Flatten a nested dictionary/object according to a specified name mapping.

_remap(x, mapping: dict, y: dict)

get_valid_filename(s)

Return the given string converted to a string that can be used for a clean filename.

whylogs.util.data.getter(x, k: str, *args)

get an attribute (from an object) or key (from a dict-like object)

getter(x, k) raises KeyError if k is not present

getter(x, k, default) returns default if k is not present

This is a convenience function that allows you to interact in the same way with an object or a dictionary

Parameters
  • x (object, dict) – Item to get attribute from

  • k (str) – Key or attribute name

  • default (optional) – Default value if k not present

Returns

val – Associated value

Return type

object
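
For example, mirroring the behavior described above:

from whylogs.util.data import getter

d = {"name": "whylogs"}
getter(d, "name")         # -> "whylogs"
getter(d, "missing", 42)  # -> 42 (default returned instead of KeyError)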

whylogs.util.data.remap(x, mapping: dict)

Flatten a nested dictionary/object according to a specified name mapping.

Parameters
  • x (object, dict) –

    An object or dict which can be treated as a nested dictionary, where attributes can be accessed as:

    attr = x.a.b[‘key_name’][‘other_Name’].d

    Indexing list values is not implemented, e.g.:

    x.a.b[3].d[‘key_name’]

  • mapping (dict) –

    Nested dictionary specifying the mapping. ONLY values specified in the mapping will be returned. For example:

    {'a': {
        'b': {
            'c': 'new_name'
        }
    }}

    could flatten x.a.b.c or x.a[‘b’][‘c’] to x[‘new_name’]

Returns

flat – A flattened ordered dictionary of values

Return type

OrderedDict
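
A sketch of the flattening described by the mapping example above:

from whylogs.util.data import remap

x = {"a": {"b": {"c": 1}}}
mapping = {"a": {"b": {"c": "new_name"}}}
remap(x, mapping)  # -> OrderedDict([('new_name', 1)])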

whylogs.util.data._remap(x, mapping: dict, y: dict)
whylogs.util.data.get_valid_filename(s)

Return the given string converted to a string that can be used for a clean filename. Remove leading and trailing spaces; convert other spaces to underscores; and remove anything that is not an alphanumeric, dash, underscore, or dot.

>>> from whylogs.util.data import get_valid_filename
>>> get_valid_filename("  Background of tim's 8/1/2019 party!.jpg ")
'Background_of_tims_812019_party.jpg'
whylogs.util.dsketch

Define functions and classes for interfacing with datasketches

Module Contents
Classes

FrequentItemsSketch

A class to implement frequent item counting for mixed data types.

Functions

deserialize_kll_floats_sketch(x: bytes, kind: str = 'float')

Deserialize a KLL floats sketch. Compatible with whylogs-java

deserialize_frequent_strings_sketch(x: bytes)

Deserialize a frequent strings sketch. Compatible with whylogs-java

whylogs.util.dsketch.deserialize_kll_floats_sketch(x: bytes, kind: str = 'float')

Deserialize a KLL floats sketch. Compatible with whylogs-java

whylogs histograms are serialized as kll floats sketches

Parameters
  • x (bytes) – Serialized sketch

  • kind (str, optional) – Specify type of sketch: ‘float’ or ‘int’

Returns

sketch – If x is an empty sketch, return None, else return the deserialized sketch.

Return type

kll_floats_sketch, kll_ints_sketch, or None

whylogs.util.dsketch.deserialize_frequent_strings_sketch(x: bytes)

Deserialize a frequent strings sketch. Compatible with whylogs-java

Wrapper for datasketches.frequent_strings_sketch.deserialize

Parameters

x (bytes) – Serialized sketch

Returns

sketch – If x is an empty string sketch, returns None, else returns the deserialized string sketch

Return type

datasketches.frequent_strings_sketch, None

class whylogs.util.dsketch.FrequentItemsSketch(lg_max_k: int = None, sketch: datasketches.frequent_strings_sketch = None)

A class to implement frequent item counting for mixed data types.

Wraps datasketches.frequent_strings_sketch by encoding numbers as strings since the datasketches python implementation does not implement frequent number tracking.

Parameters
  • lg_max_k (int, optional) – Parameter controlling the size and accuracy of the sketch. A larger number increases accuracy and the memory requirements for the sketch

  • sketch (datasketches.frequent_strings_sketch, optional) – Initialize with an existing frequent strings sketch

DEFAULT_MAX_ITEMS_SIZE = 128
DEFAULT_ERROR_TYPE
get_apriori_error(self, lg_max_map_size: int, estimated_total_weight: int)

Return an apriori estimate of the uncertainty for various parameters

Parameters
  • lg_max_map_size (int) – The lg_max_k value

  • estimated_total_weight – Total weight (see FrequentItemsSketch.get_total_weight())

Returns

error – Approximate uncertainty

Return type

float

get_epsilon_for_lg_size(self, lg_max_map_size: int)
get_estimate(self, item)
get_lower_bound(self, item)
get_upper_bound(self, item)
get_frequent_items(self, err_type: datasketches.frequent_items_error_type = None, threshold: int = 0, decode: bool = True)

Retrieve the frequent items.

Parameters
  • err_type (datasketches.frequent_items_error_type) – Override default error type

  • threshold (int) – Minimum count for returned items

  • decode (bool (default=True)) – Decode the returned values. Internally, all items are encoded as strings.

Returns

items – A list of tuples of items: [(item, count)]

Return type

list

get_num_active_items(self)
get_serialized_size_bytes(self)
get_sketch_epsilon(self)
get_total_weight(self)
is_empty(self)
merge(self, other)

Merge the item counts of this sketch with another.

This object will not be modified. This operation is commutative.

Parameters

other (FrequentItemsSketch) – The other sketch

copy(self)
Returns

sketch – A copy of this sketch

Return type

FrequentItemsSketch

serialize(self)

Serialize this sketch as a bytes string.

See also FrequentItemsSketch.deserialize()

Returns

data – Serialized object.

Return type

bytes

to_string(self, print_items=False)
update(self, x, weight=1)

Track an item.

Parameters
  • x (object) – Item to track

  • weight (int) – Number of times the item appears

to_summary(self, max_items=30, min_count=1)

Generate a protobuf summary. Returns None if there are no frequent items.

Parameters
  • max_items (int) – Maximum number of items to return. The most frequent items will be returned

  • min_count (int) – Minimum count for all returned items

Returns

summary – Protobuf summary message

Return type

FrequentItemsSummary

to_protobuf(self)

Generate a protobuf representation of this object

static from_protobuf(message: whylogs.proto.FrequentItemsSketchMessage)

Initialize a FrequentItemsSketch from a protobuf FrequentItemsSketchMessage

static _encode_item(x)
static deserialize(x: bytes)

Deserialize a frequent numbers sketch.

If x is an empty sketch, None is returned
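
A minimal usage sketch of the methods documented above (exact counts come from the datasketches backend):

from whylogs.util.dsketch import FrequentItemsSketch

sketch = FrequentItemsSketch()
for item in ["cat", "dog", "cat", 3, 3]:
    sketch.update(item)  # numbers are encoded as strings internally

items = sketch.get_frequent_items()  # [(item, count), ...]

# Sketches are mergeable and serializable
restored = FrequentItemsSketch.deserialize(sketch.serialize())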

whylogs.util.protobuf

Functions for interacting with protobuf

Module Contents
Functions

message_to_json(x: google.protobuf.message, **kwargs)

A wrapper for google.protobuf.json_format.MessageToJson

message_to_dict(x: google.protobuf.message)

Convert a protobuf message to a dictionary

_varint_delim_reader(fp)

_varint_delim_iterator(f)

Return an iterator to read delimited protobuf messages. The iterator will return protobuf messages one by one as raw bytes objects.

multi_msg_reader(f, msg_class)

Return an iterator to iterate through protobuf messages in a multi-message protobuf file.

read_multi_msg(f, msg_class)

Wrapper for multi_msg_reader() which reads all the messages and returns them as a list.

_encode_one_msg(msg: google.protobuf.message)

_write_multi_msg(msgs: list, fp)

write_multi_msg(msgs: list, f)

Write a list (or iterator) of protobuf messages to a file.

repr_message(x: google.protobuf.message.Message, indent=2, display=True)

Print or generate string preview of a protobuf message. This is mainly to get a preview of the attribute names and structure of a protobuf message class.

_repr_message(x, level=0, msg='', display=True, indent=2)

whylogs.util.protobuf.message_to_json(x: google.protobuf.message, **kwargs)

A wrapper for google.protobuf.json_format.MessageToJson

Currently a very thin wrapper: x and kwargs are passed directly to MessageToJson.

whylogs.util.protobuf.message_to_dict(x: google.protobuf.message)

Convert a protobuf message to a dictionary

A thin wrapper around the google built-in function.

whylogs.util.protobuf._varint_delim_reader(fp)
whylogs.util.protobuf._varint_delim_iterator(f)

Return an iterator to read delimited protobuf messages. The iterator will return protobuf messages one by one as raw bytes objects.

whylogs.util.protobuf.multi_msg_reader(f, msg_class)

Return an iterator to iterate through protobuf messages in a multi-message protobuf file.

See also: write_multi_msg()

Parameters
  • f (str, file-object) – Filename or open file object to read from

  • msg_class (class) – The Protobuf message class, gets instantiated with a call to msg_class()

Returns

Iterator which returns protobuf messages

Return type

msg_iterator

whylogs.util.protobuf.read_multi_msg(f, msg_class)

Wrapper for multi_msg_reader() which reads all the messages and returns them as a list.

whylogs.util.protobuf._encode_one_msg(msg: google.protobuf.message)
whylogs.util.protobuf._write_multi_msg(msgs: list, fp)
whylogs.util.protobuf.write_multi_msg(msgs: list, f)

Write a list (or iterator) of protobuf messages to a file.

The multi-message file format is a binary format of repeated records:

<varint MessageBytesSize><message>

where MessageBytesSize is len(message) in bytes.

Parameters
  • msgs (list, iterable) – Protobuf messages to write to disk

  • f (str, file-object) – Filename or open binary file object to write to
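
A write/read round trip using the functions above (DatasetProfileMessage is one example of a protobuf message class):

from whylogs.proto import DatasetProfileMessage
from whylogs.util.protobuf import multi_msg_reader, write_multi_msg

# msgs would be a list of protobuf messages, e.g. DatasetProfileMessage objects
msgs = [DatasetProfileMessage(), DatasetProfileMessage()]
write_multi_msg(msgs, "messages.bin")

for msg in multi_msg_reader("messages.bin", DatasetProfileMessage):
    print(msg)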

whylogs.util.protobuf.repr_message(x: google.protobuf.message.Message, indent=2, display=True)

Print or generate string preview of a protobuf message. This is mainly to get a preview of the attribute names and structure of a protobuf message class.

Parameters
  • x (google.protobuf.message.Message) – Message to preview

  • indent (int) – Indentation

  • display (bool) – If True, print the message and return None. Else, return a string.

Returns

msg – If display == False, return the message, else return None.

Return type

str, None

whylogs.util.protobuf._repr_message(x, level=0, msg='', display=True, indent=2)
whylogs.util.stats

Statistical functions used by whylogs

Module Contents
Functions

is_discrete(num_records: int, cardinality: int, p=0.15)

Estimate whether a feature is discrete given the number of records observed and the cardinality (number of unique values)

Attributes

CARDINALITY_SLOP

whylogs.util.stats.CARDINALITY_SLOP = 1
whylogs.util.stats.is_discrete(num_records: int, cardinality: int, p=0.15)

Estimate whether a feature is discrete given the number of records observed and the cardinality (number of unique values)

The default assumption is that features are not discrete.

Parameters
  • num_records (int) – The number of observed records

  • cardinality (int) – Number of unique observed values

Returns

discrete – Whether the feature is discrete

Return type

bool
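
For intuition, a sketch of two typical calls (the exact cutoff depends on p and CARDINALITY_SLOP):

from whylogs.util.stats import is_discrete

# Few unique values relative to the record count suggests a discrete feature
is_discrete(num_records=10000, cardinality=12)    # likely True
# Nearly-unique values suggest a continuous feature
is_discrete(num_records=10000, cardinality=9800)  # likely False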

whylogs.util.time

Functions for interacting with timestamps and datetime objects

Module Contents
Functions

to_utc_ms(dt: datetime.datetime) → Optional[int]

Convert a datetime object to UTC epoch milliseconds

from_utc_ms(utc: Optional[int]) → Optional[datetime.datetime]

Convert a UTC epoch milliseconds timestamp to a datetime object

whylogs.util.time.to_utc_ms(dt: datetime.datetime) Optional[int]

Convert a datetime object to UTC epoch milliseconds

Returns

timestamp_ms – Timestamp

Return type

int

whylogs.util.time.from_utc_ms(utc: Optional[int]) Optional[datetime.datetime]

Convert a UTC epoch milliseconds timestamp to a datetime object

Parameters

utc (int) – Timestamp

Returns

dt – Datetime object

Return type

datetime.datetime
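
A round-trip sketch:

import datetime

from whylogs.util.time import from_utc_ms, to_utc_ms

dt = datetime.datetime(2021, 6, 1, 12, 0, 0, tzinfo=datetime.timezone.utc)
ms = to_utc_ms(dt)            # 1622548800000
round_trip = from_utc_ms(ms)  # back to a datetime object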

whylogs.util.util_functions
Module Contents
Functions

encode_to_integers(values, uniques)

whylogs.util.util_functions.encode_to_integers(values, uniques)
whylogs.util.varint

Varint encoder/decoder

varints are a common encoding for variable length integer data, used in libraries such as sqlite, protobuf, v8, and more. Here’s a quick and dirty module to help avoid reimplementing the same thing over and over again.

Taken from https://github.com/fmoo/python-varint/blob/master/varint.py

MIT License

Module Contents
Functions

_byte(b)

encode(number)

Pack number into varint bytes

decode_stream(stream)

Read a varint from stream. Returns None if an EOF is encountered

decode_bytes(buf)

Read a varint from buf bytes

_read_one(stream)

Read a byte from the file (as an integer)

whylogs.util.varint._byte(b)
whylogs.util.varint.encode(number)

Pack number into varint bytes

whylogs.util.varint.decode_stream(stream)

Read a varint from stream. Returns None if an EOF is encountered

whylogs.util.varint.decode_bytes(buf)

Read a varint from buf bytes

whylogs.util.varint._read_one(stream)

Read a byte from the file (as an integer) raises EOFError if the stream ends while reading bytes.
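
For example, encoding and decoding a small integer:

from whylogs.util import varint

data = varint.encode(300)  # b'\xac\x02'
assert varint.decode_bytes(data) == 300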

whylogs.viz
Subpackages
whylogs.viz.matplotlib
Submodules
whylogs.viz.matplotlib.visualizer
Module Contents
Classes

MatplotlibProfileVisualizer

Functions

array_creation(char_histos, bins, char_list)

class whylogs.viz.matplotlib.visualizer.MatplotlibProfileVisualizer

Bases: whylogs.viz.BaseProfileVisualizer

available_plots(self)

Returns available plots for the matplotlib framework.

_init_data_preprocessing(self, profiles)
_init_theming(self)
static _chart_theming()

Applies theming needed for each chart.

_prof_data(self, variable)
_summary_data_preprocessing(self, variable)

Applies general data preprocessing for each chart.

_confirm_profile_data(self)

Checks that profiles and profile data are already set.

plot_token_length(self, variable, ts_format='%d-%b-%y', **kwargs)

Plots token length data.

plot_char_pos(self, variable, character_list=None, ts_format='%d-%b-%y', **kwargs)

Plots character position data.

plot_string_length(self, variable, ts_format='%d-%b-%y', **kwargs)

Plots string length data.

plot_string(self, variable, character_list, ts_format='%d-%b-%y', **kwargs)

Plots string-related data.

plot_distribution(self, variable, ts_format='%d-%b-%y', **kwargs)

Plots a distribution chart.

plot_missing_values(self, variable, ts_format='%d-%b-%y', **kwargs)

Plots a Missing Value to Total Count ratio chart.

plot_uniqueness(self, variable, ts_format='%d-%b-%y', **kwargs)

Plots an Estimated Unique Values chart.

plot_data_types(self, variable, ts_format='%d-%b-%y', **kwargs)

Plots an Inferred Data Types chart.

whylogs.viz.matplotlib.visualizer.array_creation(char_histos, bins, char_list)
whylogs.viz.utils
Submodules
whylogs.viz.utils.profile_viz_calculations
Module Contents
Functions

__calculate_variance(profile_jsons, feature_name)

Calculates variance for a single feature

__calculate_coefficient_of_variation(profile_jsons, feature_name)

Calculates coefficient of variation for a single feature

__calculate_sum(profile_jsons, feature_name)

Calculates sum for a single feature

__calculate_quantile_statistics(feature, profile_jsons, feature_name)

Calculates quantile statistics for a single feature

add_drift_val_to_ref_profile_json(target_profile, reference_profile, reference_profile_json)

Calculates the drift value for a reference profile based on the profile type and inserts it into the reference profile

add_feature_statistics(feature, profile_json, feature_name)

Calculates different statistics for a feature

Attributes

categorical_types

whylogs.viz.utils.profile_viz_calculations.categorical_types
whylogs.viz.utils.profile_viz_calculations.__calculate_variance(profile_jsons, feature_name)

Calculates variance for a single feature

Parameters
  • profile_jsons (Profile summary serialized json) –

  • feature_name (Name of feature) –

Returns

variance

Return type

Calculated variance for feature

whylogs.viz.utils.profile_viz_calculations.__calculate_coefficient_of_variation(profile_jsons, feature_name)

Calculates coefficient of variation for a single feature

Parameters
  • profile_jsons (Profile summary serialized json) –

  • feature_name (Name of feature) –

Returns

coefficient_of_variation

Return type

Calculated coefficient of variation for feature

whylogs.viz.utils.profile_viz_calculations.__calculate_sum(profile_jsons, feature_name)

Calculates sum for a single feature

Parameters
  • profile_jsons (Profile summary serialized json) –

  • feature_name (Name of feature) –

Returns

sum

Return type

Calculated sum for feature

whylogs.viz.utils.profile_viz_calculations.__calculate_quantile_statistics(feature, profile_jsons, feature_name)

Calculates quantile statistics for a single feature

Parameters
  • profile_jsons (Profile summary serialized json) –

  • feature_name (Name of feature) –

Returns

quantile_statistics

Return type

Calculated quantile statistics for feature

whylogs.viz.utils.profile_viz_calculations.add_drift_val_to_ref_profile_json(target_profile, reference_profile, reference_profile_json)

Calculates the drift value for a reference profile based on the profile type and inserts it into the reference profile

Parameters
  • target_profile (Target profile) –

  • reference_profile (Reference profile) –

  • reference_profile_json (Reference profile summary serialized json) –

Returns

reference_profile_json

Return type

Reference profile summary serialized json with drift value for every feature

whylogs.viz.utils.profile_viz_calculations.add_feature_statistics(feature, profile_json, feature_name)

Calculates different values for feature statistics

Parameters
  • feature

  • profile_json (Profile summary serialized json) –

  • feature_name (Name of feature) –

Returns

feature

Return type

Feature data with appended values for statistics report

Submodules
whylogs.viz.base
Module Contents
Classes

BaseProfileVisualizer

class whylogs.viz.base.BaseProfileVisualizer(framework=None, visualizer=None)
set_profiles(self, profiles)
plot_distribution(self, variable, **kwargs)

Plots a distribution chart.

plot_missing_values(self, variable, **kwargs)

Plots a Missing Value to Total Count ratio chart.

plot_uniqueness(self, variable, **kwargs)

Plots an Estimated Unique Values chart.

plot_data_types(self, variable, **kwargs)

Plots an Inferred Data Types chart.

plot_string_length(self, variable, **kwargs)

Plots string length data.

plot_token_length(self, variable, character_list, **kwargs)

Plots token length data.

plot_char_pos(self, variable, character_list, **kwargs)

Plots character position data.

plot_string(self, variable, character_list, **kwargs)

Plots string-related data.

available_plots(self)

Returns available plots for the selected framework.

whylogs.viz.browser_viz
Module Contents
Functions

is_wsl()

profile_viewer(profiles: List[whylogs.core.DatasetProfile] = None, reference_profiles: List[whylogs.core.DatasetProfile] = None, output_path=None) → str

Open a profile viewer in your default browser.

Attributes

_MY_DIR

logger

whylogs.viz.browser_viz._MY_DIR
whylogs.viz.browser_viz.logger
whylogs.viz.browser_viz.is_wsl()
whylogs.viz.browser_viz.profile_viewer(profiles: List[whylogs.core.DatasetProfile] = None, reference_profiles: List[whylogs.core.DatasetProfile] = None, output_path=None) str

Open a profile viewer in your default browser.

whylogs.viz.jupyter_notebook_viz
Module Contents
Classes

NotebookProfileVisualizer

Attributes

_MY_DIR

logger

numerical_types

whylogs.viz.jupyter_notebook_viz._MY_DIR
whylogs.viz.jupyter_notebook_viz.logger
whylogs.viz.jupyter_notebook_viz.numerical_types
class whylogs.viz.jupyter_notebook_viz.NotebookProfileVisualizer
SUMMARY_REPORT_TEMPLATE_NAME = index-hbs-cdn-all-in-for-jupyter-notebook.html
DOUBLE_HISTOGRAM_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-distribution-chart.html
DISTRIBUTION_CHART_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-bar-chart.html
DIFFERENCED_CHART_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-differenced-chart.html
FEATURE_STATISTICS_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-feature-summary-statistics.html
CONSTRAINTS_REPORT_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-constraints-report.html
PAGE_SIZES
__get_template_path(self, html_file_name)
__get_compiled_template(self, template_name)
__display_feature_chart(self, feature_names, template_name, preferred_cell_height=None)
__display_rendered_template(self, template, template_name, height)
set_profiles(self, target_profile: whylogs.core.DatasetProfile = None, reference_profile: whylogs.core.DatasetProfile = None)
summary_drift_report(self, preferred_cell_height=None)
double_histogram(self, feature_names, preferred_cell_height=None)
distribution_chart(self, feature_names, preferred_cell_height=None)
difference_distribution_chart(self, feature_names, preferred_cell_height=None)
feature_statistics(self, feature_name, profile='reference', preferred_cell_height=None)
constraints_report(self, constraints, preferred_cell_height=None)
download(self, html, preferred_path=None, html_file_name=None)
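
A usage sketch, assuming two DatasetProfile objects built with the library as shown elsewhere on this page:

import pandas as pd

from whylogs import DatasetProfile
from whylogs.viz import NotebookProfileVisualizer

target = DatasetProfile(name="target")
target.track_dataframe(pd.DataFrame({"a": [1, 2, 3]}))
reference = DatasetProfile(name="reference")
reference.track_dataframe(pd.DataFrame({"a": [2, 3, 4]}))

viz = NotebookProfileVisualizer()
viz.set_profiles(target_profile=target, reference_profile=reference)
viz.summary_drift_report()  # renders an HTML drift report in the notebook
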
whylogs.viz.visualizer
Module Contents
Classes

ProfileVisualizer

class whylogs.viz.visualizer.ProfileVisualizer(framework='matplotlib')

Bases: whylogs.viz.base.BaseProfileVisualizer

__subclass_framework(self, framework='matplotlib')
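
A usage sketch (passing the profiles as a list is an assumption based on the set_profiles signature above):

import pandas as pd

from whylogs import DatasetProfile
from whylogs.viz import ProfileVisualizer

profile = DatasetProfile(name="example")
profile.track_dataframe(pd.DataFrame({"my_feature": [1.0, 2.0, 2.0, 5.0]}))

viz = ProfileVisualizer()    # defaults to the matplotlib framework
viz.set_profiles([profile])
viz.plot_distribution("my_feature")
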
Package Contents
Classes

NotebookProfileVisualizer

BaseProfileVisualizer

ProfileVisualizer

Functions

profile_viewer(profiles: List[whylogs.core.DatasetProfile] = None, reference_profiles: List[whylogs.core.DatasetProfile] = None, output_path=None) → str

Open a profile viewer in your default browser.

Attributes

__ALL__

whylogs.viz.profile_viewer(profiles: List[whylogs.core.DatasetProfile] = None, reference_profiles: List[whylogs.core.DatasetProfile] = None, output_path=None) str

Open a profile viewer in your default browser.

class whylogs.viz.NotebookProfileVisualizer
SUMMARY_REPORT_TEMPLATE_NAME = index-hbs-cdn-all-in-for-jupyter-notebook.html
DOUBLE_HISTOGRAM_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-distribution-chart.html
DISTRIBUTION_CHART_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-bar-chart.html
DIFFERENCED_CHART_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-differenced-chart.html
FEATURE_STATISTICS_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-feature-summary-statistics.html
CONSTRAINTS_REPORT_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-constraints-report.html
PAGE_SIZES
__get_template_path(self, html_file_name)
__get_compiled_template(self, template_name)
__display_feature_chart(self, feature_names, template_name, preferred_cell_height=None)
__display_rendered_template(self, template, template_name, height)
set_profiles(self, target_profile: whylogs.core.DatasetProfile = None, reference_profile: whylogs.core.DatasetProfile = None)
summary_drift_report(self, preferred_cell_height=None)
double_histogram(self, feature_names, preferred_cell_height=None)
distribution_chart(self, feature_names, preferred_cell_height=None)
difference_distribution_chart(self, feature_names, preferred_cell_height=None)
feature_statistics(self, feature_name, profile='reference', preferred_cell_height=None)
constraints_report(self, constraints, preferred_cell_height=None)
download(self, html, preferred_path=None, html_file_name=None)
class whylogs.viz.BaseProfileVisualizer(framework=None, visualizer=None)
set_profiles(self, profiles)
plot_distribution(self, variable, **kwargs)

Plots a distribution chart.

plot_missing_values(self, variable, **kwargs)

Plots a Missing Value to Total Count ratio chart.

plot_uniqueness(self, variable, **kwargs)

Plots an Estimated Unique Values chart.

plot_data_types(self, variable, **kwargs)

Plots an Inferred Data Types chart.

plot_string_length(self, variable, **kwargs)

Plots string length data.

plot_token_length(self, variable, character_list, **kwargs)

Plots token length data.

plot_char_pos(self, variable, character_list, **kwargs)

Plots character position data.

plot_string(self, variable, character_list, **kwargs)

Plots string-related data.

available_plots(self)

Returns available plots for the selected framework.

class whylogs.viz.ProfileVisualizer(framework='matplotlib')

Bases: whylogs.viz.base.BaseProfileVisualizer

__subclass_framework(self, framework='matplotlib')
whylogs.viz.__ALL__
whylogs.whylabs_client

Utils related to optional communication with Whylabs APIs

Submodules
whylogs.whylabs_client.wrapper
Module Contents
Functions

_get_whylabs_client() → whylabs_client.apis.SessionsApi

_get_or_create_log_client() → whylabs_client.api.log_api.LogApi

start_session() → None

upload_profile(dataset_profile: whylogs.core.DatasetProfile) → None

_upload_whylabs(dataset_profile, dataset_timestamp, profile_path)

_upload_guest_session(dataset_timestamp: int, profile_path: str)

end_session() → Optional[str]

Attributes

whylabs_api_endpoint

configuration

_session_token

_logger

_api_key

_api_log_client

whylogs.whylabs_client.wrapper.whylabs_api_endpoint
whylogs.whylabs_client.wrapper.configuration
whylogs.whylabs_client.wrapper._session_token
whylogs.whylabs_client.wrapper._logger
whylogs.whylabs_client.wrapper._api_key
whylogs.whylabs_client.wrapper._api_log_client
whylogs.whylabs_client.wrapper._get_whylabs_client() whylabs_client.apis.SessionsApi
whylogs.whylabs_client.wrapper._get_or_create_log_client() whylabs_client.api.log_api.LogApi
whylogs.whylabs_client.wrapper.start_session() None
whylogs.whylabs_client.wrapper.upload_profile(dataset_profile: whylogs.core.DatasetProfile) None
whylogs.whylabs_client.wrapper._upload_whylabs(dataset_profile, dataset_timestamp, profile_path)
whylogs.whylabs_client.wrapper._upload_guest_session(dataset_timestamp: int, profile_path: str)
whylogs.whylabs_client.wrapper.end_session() Optional[str]
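
A session sketch using the public wrapper functions (WhyLabs credentials/configuration are assumed to be set up):

from whylogs import DatasetProfile
from whylogs.whylabs_client import end_session, start_session, upload_profile

profile = DatasetProfile(name="example")

start_session()
upload_profile(profile)
result = end_session()  # Optional[str] per the signature above
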
Package Contents
Functions

end_session() → Optional[str]

start_session() → None

upload_profile(dataset_profile: whylogs.core.DatasetProfile) → None

Attributes

__ALL__

whylogs.whylabs_client.end_session() Optional[str]
whylogs.whylabs_client.start_session() None
whylogs.whylabs_client.upload_profile(dataset_profile: whylogs.core.DatasetProfile) None
whylogs.whylabs_client.__ALL__

Submodules

whylogs._version

whylogs version number.

Module Contents
whylogs._version.__version__ = 0.7.8

Package Contents

Classes

SessionConfig

Config for a whylogs session.

WriterConfig

Config for whylogs writers

ColumnProfile

Statistics tracking for a column (i.e. a feature)

DatasetProfile

Statistics tracking for a dataset.

Functions

get_or_create_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)

Retrieve the current active global session.

reset_default_session()

Reset and deactivate the global whylogs logging session.

start_whylabs_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)

enable_mlflow(session=None) → bool

Enable whylogs in mlflow module via mlflow.whylogs.

Attributes

__version__

whylogs.__version__ = 0.7.8
class whylogs.SessionConfig(project: str, pipeline: str, writers: List[WriterConfig], metadata: Optional[MetadataConfig] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = 1, report_progress: bool = False)

Config for a whylogs session.

See also SessionConfigSchema

Parameters
  • project (str) – Project associated with this whylogs session

  • pipeline (str) – Name of the associated data pipeline

  • writers (list) – A list of WriterConfig objects defining writer outputs

  • metadata (MetadataConfig) – A MetadataConfiguration object. If none, will replace with default.

  • verbose (bool, default=False) – Output verbosity

  • with_rotation_time (str, default=None) – Rotate profiles over time; takes the rotation interval unit: “s” for seconds, “m” for minutes, “h” for hours, “d” for days

  • cache_size (int, default=1) – How many dataset profiles to cache in the logger during rotation

to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream)

Load config from yaml

Parameters

stream (str, file-obj) – String or file-like object to load yaml from

Returns

config – Generated config

Return type

SessionConfig

class whylogs.WriterConfig(type: str, formats: Optional[List[str]] = None, output_path: Optional[str] = None, path_template: Optional[str] = None, filename_template: Optional[str] = None, data_collection_consent: Optional[bool] = None, transport_parameters: Optional[TransportParameterConfig] = None)

Config for whylogs writers

Parameters
  • type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’

  • formats (list) – All output formats. See ALL_SUPPORTED_FORMATS

  • output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’

  • path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_PATH_TEMPLATE

  • filename_template (str, optional) – Templatized output filename using standard python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE

to_yaml(self, stream=None)

Serialize this config to YAML

Parameters

stream – If None (default) return a string, else dump the yaml into this stream.

static from_yaml(stream, **kwargs)

Load config from yaml

Parameters
  • stream (str, file-obj) – String or file-like object to load yaml from

  • kwargs – ignored

Returns

config – Generated config

Return type

WriterConfig
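
A YAML round-trip sketch for a local writer (the formats and output path shown are illustrative):

from whylogs import WriterConfig

writer = WriterConfig(type="local", formats=["json"], output_path="output")
yaml_str = writer.to_yaml()                  # serialize to a YAML string
restored = WriterConfig.from_yaml(yaml_str)  # load it back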

whylogs.get_or_create_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)

Retrieve the current active global session.

If no active session exists, attempt to load config and create a new session.

If an active session exists, return the session without loading new config.

Returns

The global active session

Return type

Session
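
A typical way to obtain and use the session (the logger usage shown follows the v0 session API and is assumed from the broader library, not from the signatures on this page):

import pandas as pd

import whylogs

session = whylogs.get_or_create_session()
df = pd.DataFrame({"a": [1, 2, 3]})

with session.logger(dataset_name="my_dataset") as logger:
    logger.log_dataframe(df)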

whylogs.reset_default_session()

Reset and deactivate the global whylogs logging session.

whylogs.start_whylabs_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)
class whylogs.ColumnProfile(name: str, number_tracker: whylogs.core.statistics.NumberTracker = None, string_tracker: whylogs.core.statistics.StringTracker = None, schema_tracker: whylogs.core.statistics.SchemaTracker = None, counters: whylogs.core.statistics.CountersTracker = None, frequent_items: whylogs.util.dsketch.FrequentItemsSketch = None, cardinality_tracker: whylogs.core.statistics.hllsketch.HllSketch = None, constraints: whylogs.core.statistics.constraints.ValueConstraints = None)

Statistics tracking for a column (i.e. a feature)

The primary method for adding data is track().

Parameters
  • name (str (required)) – Name of the column profile

  • number_tracker (NumberTracker) – Implements numeric data statistics tracking

  • string_tracker (StringTracker) – Implements string data-type statistics tracking

  • schema_tracker (SchemaTracker) – Implements tracking of schema-related information

  • counters (CountersTracker) – Keep count of various things

  • frequent_items (FrequentItemsSketch) – Keep track of all frequent items, even for mixed datatype features

  • cardinality_tracker (HllSketch) – Track feature cardinality (even for mixed data types)

  • constraints (ValueConstraints) – Static assertions to be applied to numeric data tracked in this column

  • TODO

    • Proper TypedDataConverter type checking

    • Multi-threading/parallelism

track(self, value, character_list=None, token_method=None)

Add value to tracking statistics.

_unique_count_summary(self) whylogs.proto.UniqueCountSummary
to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

ColumnSummary

generate_constraints(self) whylogs.core.statistics.constraints.SummaryConstraints
merge(self, other)

Merge this columnprofile with another.

Parameters

other (ColumnProfile) –

Returns

merged – A new, merged column profile.

Return type

ColumnProfile

to_protobuf(self)

Return the object serialized as a protobuf message

Returns

message

Return type

ColumnMessage

static from_protobuf(message)

Load from a protobuf message

Returns

column_profile

Return type

ColumnProfile

class whylogs.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, multi_columns: whylogs.core.MultiColumnProfile = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)

Statistics tracking for a dataset.

A dataset refers to a collection of columns.

Parameters
  • name (str) – A human readable name for the dataset profile. Could be model name. This is stored under “name” tag

  • dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.

  • session_timestamp (datetime.datetime) – Timestamp of the logging session

  • columns (dict) – Dictionary lookup of `ColumnProfile`s

  • tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.

  • metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.

  • session_id (str) – The unique session ID for the run. Should be a UUID.

  • constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.

__getstate__(self)
__setstate__(self, serialized_profile)
property name(self)
property tags(self)
property metadata(self)
property session_timestamp(self)
property session_timestamp_ms(self)

Return the session timestamp value in epoch milliseconds.

property total_row_number(self)
add_output_field(self, field: Union[str, List[str]])
track_metrics(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)

Function to track metrics based on validation data.

The user may also pass the attribute names associated with the target, prediction, and/or score.

Parameters
  • targets (List[Union[str, bool, float, int]]) – actual validated values

  • predictions (List[Union[str, bool, float, int]]) – inferred/predicted values

  • scores (List[float], optional) – associated scores for each prediction; all values are set to 1 if not passed

  • target_field (str, optional) – attribute name associated with the target

  • prediction_field (str, optional) – attribute name associated with the prediction

  • score_field (str, optional) – attribute name associated with the score

  • model_type (ModelType, optional) – Default is the classification model type.

track(self, columns, data=None, character_list=None, token_method=None)

Add value(s) to tracking statistics for column(s).

Parameters
  • columns (str, dict) – Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied; otherwise, data is ignored.

  • data (object, None) – Value to track. Specify if columns is a string.

track_datum(self, column_name, data, character_list=None, token_method=None)
track_multi_column(self, columns)
track_array(self, x: numpy.ndarray, columns=None)

Track statistics for a numpy array

Parameters
  • x (np.ndarray) – 2D array to track.

  • columns (list) – Optional column labels

track_dataframe(self, df: pandas.DataFrame, character_list=None, token_method=None)

Track statistics for a dataframe

Parameters

df (pandas.DataFrame) – DataFrame to track
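
For example, tracking a small dataframe and then generating a summary (to_summary is documented below):

import pandas as pd

from whylogs import DatasetProfile

profile = DatasetProfile(name="example")
profile.track_dataframe(pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}))
summary = profile.to_summary()  # protobuf DatasetSummary message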

to_properties(self)

Return dataset profile related metadata

Returns

properties – The metadata as a protobuf object.

Return type

DatasetProperties

to_summary(self)

Generate a summary of the statistics

Returns

summary – Protobuf summary message.

Return type

DatasetSummary

generate_constraints(self) whylogs.core.statistics.constraints.DatasetConstraints

Assemble a sparse dict of constraints for all features.

Returns

summary – Protobuf constraints message.

Return type

DatasetConstraints

flat_summary(self)

Generate and flatten a summary of the statistics.

See flatten_summary() for a description

_column_message_iterator(self)
chunk_iterator(self)

Generate an iterator to iterate over chunks of data

validate(self)

Sanity check for this object. Raises an AssertionError if invalid

merge(self, other)

Merge this profile with another dataset profile object.

We will use metadata and timestamps from the current DatasetProfile in the result.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) –

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile

_do_merge(self, other)
merge_strict(self, other)

Merge this profile with another dataset profile object. This throws exception if session_id, timestamps and tags don’t match.

This operation will drop the metadata from the ‘other’ profile object.

Parameters

other (DatasetProfile) –

Returns

merged – New, merged DatasetProfile

Return type

DatasetProfile

serialize_delimited(self) bytes

Write out in delimited format (data is prefixed with the length of the datastream).

This is useful when you are streaming multiple dataset profile objects

Returns

data – A sequence of bytes

Return type

bytes

to_protobuf(self) whylogs.proto.DatasetProfileMessage

Return the object serialized as a protobuf message

Returns

message

Return type

DatasetProfileMessage

write_protobuf(self, protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) None

Write the dataset profile to disk in binary format

Parameters
  • protobuf_path (str) – local path or any path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how. The parent directory must already exist

  • delimited_file (bool, optional) – whether to prefix the data with the length of output or not. Default is True

static read_protobuf(protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) DatasetProfile

Parse a protobuf file and return a DatasetProfile object

Parameters
  • protobuf_path (str) – the path to the protobuf file

  • delimited_file (bool, optional) – whether the data is length-delimited. Default is True

Returns

whylogs.DatasetProfile object from the protobuf

Return type

DatasetProfile
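
Round-tripping a profile to disk with the two methods above:

from whylogs import DatasetProfile

profile = DatasetProfile(name="example")
profile.write_protobuf("profile.bin")                   # delimited by default
restored = DatasetProfile.read_protobuf("profile.bin")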

static from_protobuf(message: whylogs.proto.DatasetProfileMessage) DatasetProfile

Load from a protobuf message

Parameters

message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()

Returns

dataset_profile

Return type

DatasetProfile

static from_protobuf_string(data: bytes) DatasetProfile

Deserialize a serialized DatasetProfileMessage

Parameters

data (bytes) – The serialized message

Returns

profile – The deserialized dataset profile

Return type

DatasetProfile

static _parse_delimited_generator(data: bytes)
static parse_delimited_single(data: bytes, pos=0)

Parse a single delimited entry from a byte stream

Parameters
  • data (bytes) – The bytestream

  • pos (int) – The starting position. Default is zero

Returns

  • pos (int) – Current position in the stream after parsing

  • profile (DatasetProfile) – A dataset profile

static parse_delimited(data: bytes)

Parse delimited data (i.e. data prefixed with the message length).

Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.

Parameters

data (bytes) – The input byte stream

Returns

profiles – List of all Dataset profile objects

Return type

list

apply_summary_constraints(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)
apply_table_shape_constraints(self, table_shape_constraints: Optional[whylogs.core.statistics.constraints.SummaryConstraints] = None)
whylogs.enable_mlflow(session=None) bool

Enable whylogs in mlflow module via mlflow.whylogs.

Returns

True if MLFlow has been patched. False otherwise.

Example of whylogs and MLFlow
import mlflow
import whylogs

whylogs.enable_mlflow()

import numpy as np
import pandas as pd
pdf = pd.DataFrame(
    data=[[1, 2, 3, 4, True, "x", bytes([1])]],
    columns=["b", "d", "a", "c", "e", "g", "f"],
    dtype=object,  # np.object is deprecated; plain object keeps mixed types
)

active_run = mlflow.start_run()

# log a Pandas dataframe under default name
mlflow.whylogs.log_pandas(pdf)

# log a Pandas dataframe with custom name
mlflow.whylogs.log_pandas(pdf, "another dataset")

# Finish the MLFlow run
mlflow.end_run()