whylogs documentation can be found at docs.whylabs.ai¶
Visit docs.whylabs.ai for up-to-date documentation
whylogs API reference¶
Profile and monitor your ML data pipeline end-to-end
whylogs is a library for building insights into your data and minimizing data monitoring issues in order to maintain quality and improve communication between teams. To learn more about generating, validating, documenting, and profiling your data, read our intro and our Getting Started guide.
Attention
This site is a work in progress. If you have questions, ask them in our Slack channel!
Overview¶
Introduction¶
whylogs is an open source data quality library that uses advanced data science statistics to log and monitor data for your AI/ML application. whylogs is designed to scale with your MLOps workflow, from local development to production terabyte-size datasets.
Whether you are running an experimentation or production pipeline, understanding the properties of the data that flows through your application is critical to the success of your ML project. whylogs enables advanced statistical collection using lightweight techniques, such as building sketches for data, that enable complex monitoring and data quality checks for your pipeline.
Key Features¶
Data Insight: whylogs provides complex statistics across different stages of your ML/AI pipelines and applications.
Scalability: whylogs scales with your system, from local development mode to live production systems in multi-node clusters, and works well with batch and streaming architectures.
Lightweight: whylogs produces small mergeable lightweight outputs in a variety of formats, using sketching algorithms and summarizing statistics.
Unified data instrumentation: To enable data engineering pipelines and ML pipelines to share a common framework for tracking data quality and drifts, the whylogs library supports multiple languages and integrations.
Observability: In addition to supporting traditional monitoring approaches, whylogs data can support advanced ML-focused analytics, error analysis, and data quality and data drift detection.
Getting Started¶
The whylogs library comes with a quickstart CLI to help you initialize the configuration. You can also use the API directly without going through the CLI.
Quick Start¶
Install the Library¶
Install our library in a Python 3.6+ environment.
pip install whylogs
Configuration¶
To get started, you can generate a simple configuration file with the whylogs CLI:
whylogs init
A whylogs config file contains the following parameters:
project sets the name of the project.
pipeline specifies the pipeline to be used.
verbose sets output verbosity. Its default value is false.
writers specifies how and where output is stored, using path and filename templates that take the following variables:
project
pipeline
dataset_name
dataset_timestamp
session_timestamp
An example config file can be found here.
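Pending that link, here is a minimal sketch of what such a file might look like. The project name, pipeline name, and paths below are placeholders, not defaults:

    project: example-project
    pipeline: example-pipeline
    verbose: false
    writers:
      - type: local
        output_path: output
        formats:
          - json
          - flat
          - protobuf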
whylogs.app.config.load_config() loads your config file. It attempts to load files at the following paths, in order:
The path set in the WHYLOGS_CONFIG environment variable
The current directory's .whylogs.yaml file
~/.whylogs.yaml (in the home directory)
/opt/whylogs/.whylogs.yaml
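As a brief sketch of using this in code — load_config() returns None when no valid file is found, so it is worth guarding the result before building a session from it:

    from whylogs.app.config import load_config
    from whylogs.app.session import session_from_config

    config = load_config()  # searches the paths listed above, in order
    if config is None:
        raise FileNotFoundError("No .whylogs.yaml found; run `whylogs init` first")
    session = session_from_config(config)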
Using whylogs API¶
Initialize a Logging Session¶
An example script for creating a logging session can be found here.
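Since the linked script is not reproduced here, the following sketch shows the typical pattern (the dataset name and data are illustrative): create a global session, open a logger as a context manager, and log a pandas dataframe.

    import pandas as pd
    from whylogs.app.session import get_or_create_session

    session = get_or_create_session()  # picks up .whylogs.yaml if present
    df = pd.DataFrame({"feature_a": [1, 2, 3], "feature_b": ["x", "y", "z"]})

    with session.logger(dataset_name="demo-dataset") as ylog:
        ylog.log_dataframe(df)  # configured writers emit the profile when the logger closes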
Create a Logger¶
Loggers log statistical information about your data. They have the following parameters:
dataset_name sets the name of the dataset, to be used in DatasetProfile metadata and generated filenames.
dataset_timestamp sets a timestamp for the data.
session_timestamp sets a timestamp for the creation of the session.
writers provides a list of writers that will be used to write out the DatasetProfile.
verbose sets the verbosity of the output.
For more information, see the documentation for the logger class.
This example code uses logger options to control the output location.
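As a hedged sketch of those options (the names are illustrative), the dataset name and timestamp passed to the logger flow into the writer's path and filename templates:

    import datetime
    from whylogs.app.session import get_or_create_session

    session = get_or_create_session()
    with session.logger(
        dataset_name="daily-batch",  # appears in DatasetProfile metadata and generated filenames
        dataset_timestamp=datetime.datetime(2021, 6, 1, tzinfo=datetime.timezone.utc),
    ) as ylog:
        ylog.log(feature_name="latency_ms", value=42.0)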
Configure a Writer¶
Writers write the statistics gathered by the logger into an output file. They use the following parameters to create output file paths:
output_path sets the location where output files will be stored. Use a directory path if your writer type = 'local', or a key prefix for type = 's3'.
formats lists all supported output formats.
path_template optionally sets an output path using Python string templates.
filename_template optionally sets output filenames using Python string templates.
dataset_timestamp sets a timestamp for the data.
session_timestamp sets a timestamp for the creation of the session.
For more information, see the documentation for the writer class.
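For example, a writer can also be configured programmatically with WriterConfig; this sketch mirrors the parameters above (the paths and templates are placeholders):

    from whylogs.app.config import WriterConfig

    writer_config = WriterConfig(
        type="local",                          # or "s3", with output_path as a key prefix
        formats=["json", "flat", "protobuf"],
        output_path="output",
        path_template="$name/$session_id",     # optional Python string template
        filename_template="dataset_profile",   # optional
    )
    print(writer_config.to_yaml())  # serialize the config to YAML for inspection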
Output whylogs data¶
whylogs supports the following output formats:
Protobuf is a lightweight binary format that maps one-to-one with the memory representation of a whylogs object. Use this format if you plan to apply advanced transformations to whylogs output.
JSON displays the protobuf data in JSON format.
Flat outputs multiple files with both CSV and JSON content to represent different views of the data, including histograms, upperbound, lowerbound, and frequent values.
WhyLabs Platform Sandbox¶
Check out WhyLabs Platform Sandbox to see how whylogs can be used for large-scale data monitoring and visualization in enterprise settings.
Concepts¶
A batch is a collection of datapoints, often grouped by time.
In batch mode, whylogs processes a dataset in batches.
A dataset is a collection of related data that will be analyzed together. whylogs accepts tabular data: each column of the table represents a particular variable, and each row represents a record of the dataset. When used alongside a statistical model, the dataset often represents features as columns, with additional columns for the output. More complex data formats will be supported in the future.
A DatasetProfile is a collection of summary statistics and related metadata for a dataset that whylogs has processed.
Data Sketches are a class of algorithms that efficiently extract information from large or streaming datasets in a single pass. This term is sometimes used to refer specifically to the Apache DataSketches project.
A logger represents the whylogs tracking object for a given dataset (in batch mode) or a collection of data points (in streaming mode). A logger is always associated with a timestamp for its creation and a timestamp for the dataset. Different loggers may write to different storage systems using different output formats.
Metadata is data that describes either a dataset or information from whylogs’ processing of the dataset.
The output formats whylogs supports are protobuf, JSON, and flat. Protobuf is a lightweight binary format that maps one-to-one with the memory representation of a whylogs object. JSON displays the protobuf data in JSON format. Flat outputs multiple files with both CSV and JSON content to represent different views of the data, including histograms, upperbound, lowerbound, and frequent values. To apply advanced transformations on whylogs output, we recommend using Protobuf.
A pipeline consists of the components data moves through, as well as any infrastructure associated with those components. A project may have multiple ML pipelines, but it’s common to have one pipeline for a multi-stage project.
Project refers to the project name. A whylogs project is usually associated with one or more ML models. When logging a dataset without a specified name, the system defaults to the project name.
A record is an observation of data. whylogs represents this as a map of keys (string data - feature names) to values (numerical/textual data).
A session represents your configuration for how your application interacts with whylogs, including logger configuration, input and output formats. Using a single session for your application is recommended.
Storage systems: whylogs supports output to local storage and AWS S3.
In streaming mode, whylogs processes individual data points.
Summary statistics are metrics that describe, or summarize, a set of observations.
License¶
Apache License
Version 2.0, January 2004
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
Definitions.
“License” shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
“Licensor” shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
“Legal Entity” shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, “control” means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
“You” (or “Your”) shall mean an individual or Legal Entity exercising permissions granted by this License.
“Source” form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
“Object” form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
“Work” shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
“Derivative Works” shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
“Contribution” shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, “submitted” means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as “Not a Contribution.”
“Contributor” shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
You must give any other recipients of the Work or Derivative Works a copy of this License; and
You must cause any modified files to carry prominent notices stating that You changed the files; and
You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
If the Work includes a “NOTICE” text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following boilerplate notice, with the fields enclosed by brackets “[]” replaced with your own identifying information. (Don’t include the brackets!) The text should be enclosed in the appropriate comment syntax for the file format. We also recommend that a file or class name and description of purpose be included on the same “printed page” as the copyright notice for easier identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
API Reference¶
This page contains auto-generated API reference documentation.
whylogs¶
Subpackages¶
whylogs.app¶
The whylogs client application API
Submodules¶
whylogs.app.config¶
Classes/functions for configuring the whylogs app
- class whylogs.app.config.WriterType¶
Bases:
enum.Enum
Generic enumeration.
Derive from this class to define new enumerations.
- local¶
- s3¶
- whylabs¶
- mlflow¶
- whylogs.app.config.SUPPORTED_WRITERS¶
- whylogs.app.config.WHYLOGS_YML = .whylogs.yaml¶
- whylogs.app.config.ALL_SUPPORTED_FORMATS¶
Supported output formats for whylogs writer configuration
- whylogs.app.config.SegmentTag¶
- whylogs.app.config.SegmentTags¶
- class whylogs.app.config.TransportParameterConfig(endpoint_url: str, aws_access_key_id: str, aws_secret_access_key: str, region_name: str, verify: str)¶
- class whylogs.app.config.TransportParameterConfigSchema¶
Bases:
marshmallow.Schema
Marshmallow schema for WriterConfig class.
- endpoint_url¶
- aws_access_key_id¶
- aws_secret_access_key¶
- region_name¶
- verify¶
- make_writer(self, data, **kwargs)¶
- class whylogs.app.config.WriterConfig(type: str, formats: Optional[List[str]] = None, output_path: Optional[str] = None, path_template: Optional[str] = None, filename_template: Optional[str] = None, data_collection_consent: Optional[bool] = None, transport_parameters: Optional[TransportParameterConfig] = None)¶
Config for whylogs writers
See also:
- Parameters
type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’
formats (list) – All output formats. See ALL_SUPPORTED_FORMATS
output_path (str) – Prefix of where to output files. A directory for type = 'local', or key prefix for type = 's3'
path_template (str, optional) – Templatized path output using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_PATH_TEMPLATE
filename_template (str, optional) – Templatized output filename using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE
- to_yaml(self, stream=None)¶
Serialize this config to YAML
- Parameters
stream – If None (default) return a string, else dump the yaml into this stream.
- static from_yaml(stream, **kwargs)¶
Load config from yaml
- Parameters
stream (str, file-obj) – String or file-like object to load yaml from
kwargs – ignored
- Returns
config – Generated config
- Return type
WriterConfig
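A quick YAML round trip, assuming WriterConfig keeps its constructor arguments as attributes, might look like:

    from whylogs.app.config import WriterConfig

    original = WriterConfig(type="local", formats=["protobuf"], output_path="output")
    yaml_text = original.to_yaml()            # returns a string when stream is None
    restored = WriterConfig.from_yaml(yaml_text)
    assert restored.output_path == "output"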
- class whylogs.app.config.MetadataConfig(type: str, output_path: str, input_path: Optional[str] = '', path_template: Optional[str] = None)¶
Config for whylogs metadata
See also:
- Parameters
type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’
output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’
input_path (str) – Path to search for pre-calculated segment files. Paths separated by ‘:’.
path_template (str, optional) – Templatized path output using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See whylogs.app.writers.Writer.template_params() for a list of available identifiers. Default = whylogs.app.metadata_writer.DEFAULT_PATH_TEMPLATE
- to_yaml(self, stream=None)¶
Serialize this config to YAML
- Parameters
stream – If None (default) return a string, else dump the yaml into this stream.
- static from_yaml(stream, **kwargs)¶
Load config from yaml
- Parameters
stream (str, file-obj) – String or file-like object to load yaml from
kwargs – ignored
- Returns
config – Generated config
- Return type
MetadataConfig
- class whylogs.app.config.SessionConfig(project: str, pipeline: str, writers: List[WriterConfig], metadata: Optional[MetadataConfig] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = 1, report_progress: bool = False)¶
Config for a whylogs session.
See also
SessionConfigSchema
- Parameters
project (str) – Project associated with this whylogs session
pipeline (str) – Name of the associated data pipeline
writers (list) – A list of WriterConfig objects defining writer outputs
metadata (MetadataConfig) – A MetadataConfiguration object. If none, will replace with default.
verbose (bool, default=False) – Output verbosity
with_rotation_time (str, default=None) – Rotation interval for profiles. Takes an overall rotation interval: "s" for seconds, "m" for minutes, "h" for hours, "d" for days
cache_size (int, default=1) – How many dataset profiles to cache in the logger during rotation
- to_yaml(self, stream=None)¶
Serialize this config to YAML
- Parameters
stream – If None (default) return a string, else dump the yaml into this stream.
- static from_yaml(stream)¶
Load config from yaml
- Parameters
stream (str, file-obj) – String or file-like object to load yaml from
- Returns
config – Generated config
- Return type
SessionConfig
- class whylogs.app.config.WriterConfigSchema¶
Bases:
marshmallow.Schema
Marshmallow schema for WriterConfig class.
- type¶
- formats¶
- output_path¶
- path_template¶
- filename_template¶
- transport_parameters¶
- make_writer(self, data, **kwargs)¶
- class whylogs.app.config.MetadataConfigSchema¶
Bases:
marshmallow.Schema
Marshmallow schema for MetadataConfig class.
- type¶
- output_path¶
- input_path¶
- path_template¶
- make_metadata(self, data, **kwargs)¶
- class whylogs.app.config.SessionConfigSchema¶
Bases:
marshmallow.Schema
Marshmallow schema for SessionConfig class.
- project¶
- pipeline¶
- with_rotation_time¶
- cache¶
- verbose¶
- writers¶
- metadata¶
- make_session(self, data, **kwargs)¶
- whylogs.app.config.load_config(path_to_config: str = None)¶
Load logging configuration, from disk and from the environment.
Config is loaded by attempting to load files in the following order. The first valid file will be used:
Path set in the WHYLOGS_CONFIG environment variable
Current directory's .whylogs.yaml file
~/.whylogs.yaml (home directory)
The /opt/whylogs/.whylogs.yaml path
- Returns
config – Config for the logger, if a valid config file is found, else returns None.
- Return type
SessionConfig, None
whylogs.app.logger¶
Class and functions for whylogs logging
- whylogs.app.logger.SegmentTag¶
- whylogs.app.logger.Segment¶
- whylogs.app.logger._TAG_PREFIX = whylogs.tag.¶
- whylogs.app.logger._TAG_KEY = key¶
- whylogs.app.logger._TAG_VALUE = value¶
- whylogs.app.logger.logger¶
- class whylogs.app.logger.Logger(session_id: str, dataset_name: str, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Optional[Dict[str, str]] = None, metadata: Optional[Dict[str, str]] = None, writers: Optional[List[whylogs.app.writers.Writer]] = None, metadata_writer: Optional[whylogs.app.metadata_writer.MetadataWriter] = None, verbose: bool = False, with_rotation_time: Optional[str] = None, interval: int = 1, cache_size: int = 1, segments: Optional[Union[List[Segment], List[str], str]] = None, profile_full_dataset: bool = False, constraints: Optional[whylogs.core.statistics.constraints.DatasetConstraints] = None)¶
Class for logging whylogs statistics.
- Parameters
session_id – The session ID value. Should be set by the Session object
dataset_name – The name of the dataset. Gets included in the DatasetProfile metadata and can be used in generated filenames.
dataset_timestamp – Optional. The timestamp that the logger represents
session_timestamp – Optional. The time the session was created
tags – Optional. Dictionary of key, value for aggregating data upstream
metadata – Optional. Dictionary of key, value. Useful for debugging (associated with every single dataset profile)
writers – Optional. List of Writer objects used to write out the data
metadata_writer – Optional. MetadataWriter object used to write non-profile information
with_rotation_time – Optional. Log rotation interval, consisting of digits with a unit specification, e.g. 30s, 2h, d. Units are seconds ("s"), minutes ("m"), hours ("h"), or days ("d"). Output filenames will have a suffix reflecting the rotation interval.
interval – Deprecated: Interval multiplier for with_rotation_time, defaults to 1.
verbose – enable debug logging
cache_size – dataprofiles to cache
segments –
- Can be either:
Autosegmentation source, one of [“auto”, “local”]
List of tag key value pairs for tracking data segments
List of tag keys for which we will track every value
None, no segments will be used
profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset.
constraints – static assertions to be applied to streams and summaries.
- __enter__(self)¶
- __exit__(self, exc_type, exc_val, exc_tb)¶
- property profile(self) whylogs.core.DatasetProfile ¶
- Returns
the last backing dataset profile
- Return type
DatasetProfile
- tracking_checks(self)¶
- property segmented_profiles(self) Dict[str, whylogs.core.DatasetProfile] ¶
- Returns
the last backing dataset profile
- Return type
Dict[str, DatasetProfile]
- get_segment(self, segment: Segment) Optional[whylogs.core.DatasetProfile] ¶
- set_segments(self, segments: Union[List[Segment], List[str], str]) None ¶
- _retrieve_local_segments(self) Union[List[Segment], List[str], str] ¶
Retrieves local segments
- _intialize_profiles(self, dataset_timestamp: Optional[datetime.datetime] = datetime.datetime.now(datetime.timezone.utc)) None ¶
- _set_rotation(self, with_rotation_time: str = None)¶
- rotate_when(self, time)¶
- should_rotate(self)¶
- _rotate_time(self)¶
Rotate with time and add a suffix
- flush(self, rotation_suffix: Optional[str] = None)¶
Synchronously perform all remaining write tasks
- full_profile_check(self) bool ¶
Returns a bool to determine if the unsegmented dataset should be profiled.
- close(self) Optional[whylogs.core.DatasetProfile] ¶
Flush and close out the logger, outputs the last profile
- Returns
the result dataset profile. None if the logger is closed
- log(self, features: Optional[Dict[str, any]] = None, feature_name: Optional[str] = None, value: any = None, character_list: Optional[str] = None, token_method: Optional[Callable] = None)¶
Logs a collection of features or a single feature (must specify one or the other).
- Parameters
features – a map of key value feature for model input
feature_name – name of a single feature. Cannot be specified if ‘features’ is specified
value – value of a single feature. Cannot be specified if ‘features’ is specified
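A small sketch of both call styles (the feature names are invented):

    from whylogs.app.session import get_or_create_session

    session = get_or_create_session()
    with session.logger(dataset_name="log-example") as ylog:
        ylog.log(features={"age": 31, "country": "US"})  # several features at once
        ylog.log(feature_name="score", value=0.87)       # one feature by name and value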
- log_segment_datum(self, feature_name, value, character_list: str = None, token_method: Optional[Callable] = None)¶
- log_metrics(self, targets, predictions, scores=None, model_type: whylogs.proto.ModelType = None, target_field=None, prediction_field=None, score_field=None)¶
- log_image(self, image, feature_transforms: Optional[List[Callable]] = None, metadata_attributes: Optional[List[str]] = METADATA_DEFAULT_ATTRIBUTES, feature_name: str = '')¶
API to track an image, either in PIL format or as an input path
- Parameters
feature_name – name of the feature
metadata_attributes – metadata attributes to extract for the images
feature_transforms – a list of callables to transform the input into metrics
- log_local_dataset(self, root_dir, folder_feature_name='folder_feature', image_feature_transforms=None, show_progress=False)¶
Log a local folder dataset. It will log data from the files, along with file structure data like metadata and magic numbers. If the folder has a single layer of child folders, this will pick up the folder names as a segmented feature
- Parameters
show_progress – showing the progress bar
image_feature_transforms – image transform that you would like to use with the image log
root_dir (str) – directory where dataset is located.
folder_feature_name (str, optional) – Name for the subfolder features, i.e. class, store etc.
- log_annotation(self, annotation_data)¶
Log structured annotation data, i.e. JSON-like structures
- Parameters
annotation_data (Dict or List) – Description
- log_csv(self, filepath_or_buffer: Union[str, pathlib.Path, IO[AnyStr]], segments: Optional[Union[List[Segment], List[str]]] = None, profile_full_dataset: bool = False, **kwargs)¶
Log a CSV file. This supports the same parameters as the pandas.read_csv() function.
- Parameters
filepath_or_buffer – the path to the CSV or a CSV buffer
segments – define either a list of segment keys or a list of segments tags: [ {“key”:<featurename>,”value”: <featurevalue>},… ]
profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset
**kwargs – keyword arguments passed through to pandas.read_csv
- log_dataframe(self, df, segments: Optional[Union[List[Segment], List[str]]] = None, profile_full_dataset: bool = False)¶
Generate and log a whylogs DatasetProfile from a pandas dataframe
- Parameters
df – the Pandas dataframe to log
segments – specify the tag key value pairs for segments
profile_full_dataset – when segmenting a dataset, an option to keep the full unsegmented profile of the dataset
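For illustration, a sketch of segmented dataframe logging, assuming a "store" column worth tracking per value:

    import pandas as pd
    from whylogs.app.session import get_or_create_session

    session = get_or_create_session()
    df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 12, 7]})

    with session.logger(dataset_name="sales") as ylog:
        # track every value of the "store" key as its own segment,
        # and keep the full unsegmented profile as well
        ylog.log_dataframe(df, segments=["store"], profile_full_dataset=True)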
- log_segments(self, data)¶
- log_segments_keys(self, data)¶
- log_fixed_segments(self, data)¶
- log_df_segment(self, df, segment: Segment)¶
- is_active(self)¶
Return the boolean state of the logger
- static _prefix_segment_tags(segment_key_values)¶
- whylogs.app.logger.hash_segment(seg: List[Dict]) str ¶
whylogs.app.metadata_writer¶
- whylogs.app.metadata_writer.DEFAULT_PATH_TEMPLATE = $name/metadata¶
- whylogs.app.metadata_writer.logger¶
- class whylogs.app.metadata_writer.MetadataWriter(output_path: str, input_path: Optional[str] = '', path_template: Optional[str] = None, writer_type: Optional[str] = 'local')¶
Class for writing metadata to disk
- Parameters
output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’
path_template (str, optional) – Templatized path output using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See MetadataWriter.template_params() for a list of available identifiers. Default = DEFAULT_PATH_TEMPLATE
- path_suffix(self, name) str ¶
Generate a path string for an output path from the given arguments by applying the path templating defined in self.path_template
- autosegmentation_write(self, name: str, segments: Union[List[Dict], List[str]]) None ¶
- autosegmentation_read(self)¶
- whylogs.app.metadata_writer.metadata_from_config(config: whylogs.app.config.MetadataConfig)¶
Construct a whylogs MetadataWriter from a MetadataConfig
- Returns
metadata_writer – whylogs metadata writer
- Return type
MetadataWriter
whylogs.app.output_formats¶
Define available output formats
- class whylogs.app.output_formats.OutputFormat¶
Bases:
enum.Enum
List of output formats that we support.
- json¶
Output as a JSON object. This is a deeply nested structure.
- flat¶
Output as “flat” files. This will generate multiple output files.
- protobuf¶
Output as a binary protobuf file. This is the most compact format.
- whylogs.app.output_formats.SUPPORTED_OUTPUT_FORMATS¶
whylogs.app.session¶
whylogs logging session
- class whylogs.app.session._LoggerKey¶
Create a new logger or return an existing one for a given dataset name. If no dataset_name is specified, we default to project name
- Parameters
metadata –
dataset_name – str Name of the dataset. Default is the project name
dataset_timestamp – datetime.datetime, optional The timestamp associated with the dataset. Could be the timestamp for the batch, or the timestamp for the window that you are tracking
tags – dict Tag the data with groupable information. For example, you might want to tag your data with the stage information (development, testing, production etc…)
metadata – dict Useful to debug the data source. You can associate non-groupable information in this field such as hostname,
session_timestamp – datetime.datetime, optional Override the timestamp associated with the session. Normally you shouldn’t need to override this value
segments – Can be either: - List of tag key value pairs for tracking data segments - List of tag keys for whylogs to split up the data in the backend
- dataset_name :Optional[str]¶
- dataset_timestamp :Optional[datetime.datetime]¶
- session_timestamp :Optional[datetime.datetime]¶
- tags :Dict[str, str]¶
- metadata :Dict[str, str]¶
- segments :Optional[Union[List[Dict], List[str]]]¶
- profile_full_dataset :bool = False¶
- with_rotation_time :str¶
- cache_size :int = 1¶
- constraints :whylogs.core.statistics.constraints.DatasetConstraints¶
- whylogs.app.session.defaultLoggerArgs¶
- class whylogs.app.session.Session(project: Optional[str] = None, pipeline: Optional[str] = None, writers: Optional[List[whylogs.app.writers.Writer]] = None, metadata_writer: Optional[whylogs.app.metadata_writer.MetadataWriter] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = None, report_progress: bool = False)¶
- Parameters
project (str) – The project name. We will default to the project name when logging a dataset if the dataset name is not specified
pipeline (str) – Name of the pipeline associated with this session
writers (list) – configuration for the output writers. This is where the log data will go
verbose (bool) – enable verbose logging or not. Default is False
- __enter__(self)¶
- __exit__(self, tpe, value, traceback)¶
- __repr__(self)¶
Return repr(self).
- get_config(self)¶
- is_active(self)¶
- logger(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, segments: Optional[Union[List[Dict], List[str], str]] = None, profile_full_dataset: bool = False, with_rotation_time: str = None, cache_size: int = 1, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None) whylogs.app.logger.Logger ¶
Create a new logger or return an existing one for a given dataset name. If no dataset_name is specified, we default to project name
- Parameters
dataset_name – name of the dataset
dataset_timestamp – timestamp of the dataset. Default to now
session_timestamp – timestamp of the session. Inherits from the session
tags – metadata associated with the profile
metadata – same as tags. Will be deprecated
segments – slice of data that the profile belongs to
profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset
with_rotation_time – rotation time in minutes or hours (“1m”, “1h”)
cache_size – size of the segment cache
constraints – whylogs constraints to monitor against
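As a sketch of the rotation-related options above (the interval and names are illustrative):

    from whylogs.app.session import get_or_create_session

    session = get_or_create_session()
    ylog = session.logger(
        dataset_name="clickstream",
        with_rotation_time="1h",  # rotate the profile every hour
        cache_size=1,
    )
    ylog.log(feature_name="clicks", value=3)
    session.close()  # deactivates the session and flushes all loggers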
- get_logger(self, dataset_name: str = None)¶
- log_dataframe(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, segments: Optional[Union[List[Dict], List[str], str]] = None, profile_full_dataset: bool = False, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None) Optional[whylogs.core.DatasetProfile] ¶
Perform statistics calculations and log a pandas dataframe
- Parameters
df – the dataframe to profile
dataset_name – name of the dataset
dataset_timestamp – the timestamp for the dataset
session_timestamp – the timestamp for the session. Override the default one
tags – the tags for the profile. Useful when merging
metadata – information about this current profile. Can be discarded when merging
segments – Can be either: - Autosegmentation source, one of [“auto”, “local”] - List of tag key value pairs for tracking data segments - List of tag keys for which we will track every value - None, no segments will be used
profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset
- Returns
a dataset profile if the session is active
- profile_dataframe(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None) Optional[whylogs.core.DatasetProfile] ¶
Profile a Pandas dataframe without actually writing data to disk. This is useful when you just want to quickly capture and explore a dataset profile.
- Parameters
df – the dataframe to profile
dataset_name – name of the dataset
dataset_timestamp – the timestamp for the dataset
session_timestamp – the timestamp for the session. Override the default one
tags – the tags for the profile. Useful when merging
metadata – information about this current profile. Can be discarded when merging
- Returns
a dataset profile if the session is active
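A minimal sketch of quick, write-free profiling (the dataframe is made up):

    import pandas as pd
    from whylogs.app.session import get_or_create_session

    session = get_or_create_session()
    df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
    profile = session.profile_dataframe(df, dataset_name="adhoc-exploration")
    # profile is a DatasetProfile for in-memory exploration (or None if the
    # session is inactive); nothing is written to the configured writers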
- new_profile(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None) Optional[whylogs.core.DatasetProfile] ¶
Create an empty dataset profile with the metadata from the session.
- Parameters
dataset_name – name of the dataset
dataset_timestamp – the timestamp for the dataset
session_timestamp – the timestamp for the session. Override the default one
tags – the tags for the profile. Useful when merging
metadata – information about this current profile. Can be discarded when merging
- Returns
a dataset profile if the session is active
- estimate_segments(self, df: pandas.DataFrame, name: str, target_field: str = None, max_segments: int = 30, dry_run: bool = False) Optional[Union[List[Dict], List[str]]] ¶
Estimates the most important features and values on which to segment data profiling using entropy-based methods.
- Parameters
df – the dataframe of data to profile
name – name for discovery in the logger, automatically applied to loggers with the same dataset_name
target_field – target field (optional)
max_segments – upper threshold for total combinations of segments, default 30
dry_run – run the calculation but do not write results to metadata
- Returns
a list of segmentation feature names
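A hedged usage sketch — dry_run=True computes candidate segments without writing anything to metadata:

    import pandas as pd
    from whylogs.app.session import get_or_create_session

    session = get_or_create_session()
    df = pd.DataFrame({"store": ["a", "b", "a"], "sales": [10, 7, 12]})
    segments = session.estimate_segments(df, name="sales", max_segments=30, dry_run=True)
    print(segments)  # candidate segmentation feature names, or None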
- close(self)¶
Deactivate this session and flush all associated loggers
- remove_logger(self, dataset_name: str)¶
Remove a logger from the dataset. This is called by the logger when it’s being closed
- Parameters
dataset_name – the name of the dataset, used to identify the logger
- Returns
None
- whylogs.app.session._use_whylabs_client = False¶
- whylogs.app.session.session_from_config(config: whylogs.app.config.SessionConfig = None, config_path: Optional[str] = '') Session ¶
Construct a whylogs session from a SessionConfig or from a config_path
- whylogs.app.session._session¶
- whylogs.app.session.reset_default_session()¶
Reset and deactivate the global whylogs logging session.
- whylogs.app.session.start_whylabs_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)¶
- whylogs.app.session.get_or_create_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)¶
Retrieve the current active global session.
If no active session exists, attempt to load config and create a new session.
If an active session exists, return the session without loading new config.
- Returns
The global active session
- Return type
Session
- whylogs.app.session.get_session()¶
Retrieve the logging session without altering or activating it.
- Returns
session – The global session
- Return type
Session
- whylogs.app.session.get_logger()¶
Retrieve the global session logger
- Returns
ylog – The global session logger
- Return type
whylogs.app.logger.Logger
whylogs.app.utils¶
- whylogs.app.utils._NO_ASYNC = WHYLOGS_NO_ASYNC¶
- whylogs.app.utils._logger¶
- whylogs.app.utils._threads :List[threading.Thread] = []¶
- whylogs.app.utils._timer_threads :List[threading.Thread] = []¶
- whylogs.app.utils.timer_wrap(func, interval, *args, **kwargs)¶
- whylogs.app.utils._do_wrap(func)¶
- whylogs.app.utils.async_wrap(func, *args, **kwargs)¶
- Parameters
func – the coroutine to run in an asyncio loop
- Returns
a thread for the coroutine
- Return type
threading.Thread
- whylogs.app.utils._wait_for_children()¶
Wait for the child process to complete. This is to ensure that we write out the log files before the parent process finishes
whylogs.app.writers¶
Classes for writing whylogs output
- whylogs.app.writers.DEFAULT_PATH_TEMPLATE = $name/$session_id¶
- whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE = dataset_profile¶
- whylogs.app.writers.logger¶
- class whylogs.app.writers.Writer(output_path: str, formats: List[str], path_template: Optional[str] = None, filename_template: Optional[str] = None, transport_params: Optional[whylogs.app.config.TransportParameterConfig] = None)¶
Bases:
abc.ABC
Class for writing to disk
- Parameters
output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’
formats (list) – All output formats. See whylogs.app.config.ALL_SUPPORTED_FORMATS
path_template (str, optional) – Templatized path output using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifiers. Default = DEFAULT_PATH_TEMPLATE
filename_template (str, optional) – Templatized output filename using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifiers. Default = DEFAULT_FILENAME_TEMPLATE
- close(self)¶
- abstract write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: str = None)¶
Abstract method to write a dataset profile to disk. Must be implemented
- path_suffix(self, profile: whylogs.core.DatasetProfile)¶
Generate a path string for an output path from a dataset profile by applying the path templating defined in self.path_template
- file_name(self, profile: whylogs.core.DatasetProfile, file_extension: str, rotation_suffix: Optional[str] = None)¶
For a given DatasetProfile, generate an output filename based on the templating defined in self.filename_template
- static template_params(profile: whylogs.core.DatasetProfile) dict ¶
Return a dictionary of dataset profile metadata which can be used for generating templatized variables or paths.
- Parameters
profile (DatasetProfile) – The dataset profile
- Returns
params – Variables which can be substituted into a template string.
- Return type
dict
Notes
Template params:
name: name of the dataset
session_timestamp: session time in UTC epoch milliseconds
dataset_timestamp: timestamp for the data in UTC epoch ms
session_id: unique identifier for the session
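“Standard python string templates” here means Python’s string.Template; the following sketch shows how these identifiers get substituted (the values are hypothetical, of the kind template_params() returns):

    from string import Template

    params = {
        "name": "my-dataset",
        "session_timestamp": 1622505600000,
        "dataset_timestamp": 1622505600000,
        "session_id": "8d62b8f1",
    }
    path = Template("$name/$session_id").substitute(params)
    print(path)  # my-dataset/8d62b8f1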
- class whylogs.app.writers.LocalWriter(output_path: str, formats: List[str], path_template: str, filename_template: str)¶
Bases:
Writer
whylogs Writer class that can write to disk.
See Writer for a description of arguments
- write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None)¶
Write a dataset profile to disk
- _do_write(self, profile, rotation_suffix: Optional[str] = None, **kwargs)¶
- _write_json(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None)¶
Write a JSON summary of the dataset profile to disk
- _write_flat(self, profile: whylogs.core.DatasetProfile, indent: int = 4, rotation_suffix: Optional[str] = None)¶
Write output data for flat format
- Parameters
profile (DatasetProfile) – the dataset profile to output
indent (int) – The JSON indentation to use. Default is 4
- _write_protobuf(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None)¶
Write a protobuf serialization of the DatasetProfile to disk
- ensure_path(self, suffix: str, addition_part: Optional[str] = None) str ¶
Ensure that a path exists, creating it if not
- class whylogs.app.writers.S3Writer(output_path: str, formats: List[str], path_template: str = None, filename_template: str = None)¶
Bases:
Writer
whylogs Writer class that can write to S3.
See Writer for a description of arguments
- write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: str = None)¶
Write a dataset profile to S3
- _do_write(self, profile, rotation_suffix: str = None, **kwargs)¶
- _write_json(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None, transport_params: Optional[dict] = None)¶
Write a dataset profile JSON summary to disk
- _write_flat(self, profile: whylogs.core.DatasetProfile, indent: int = 4, rotation_suffix: Optional[str] = None, transport_params: Optional[dict] = None)¶
Write output data for flat format
- Parameters
profile (DatasetProfile) – the dataset profile to output
indent (int) – The JSON indentation to use. Default is 4
- _write_protobuf(self, profile: whylogs.core.DatasetProfile, rotation_suffix: Optional[str] = None, transport_params: Optional[dict] = None)¶
Write a DatasetProfile protobuf serialization to S3
- class whylogs.app.writers.MlFlowWriter(output_path: str, formats: List[str], path_template: str = None, filename_template: str = None)¶
Bases:
Writer
Class for writing to disk
- Parameters
output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’
formats (list) – All output formats. See whylogs.app.config.ALL_SUPPORTED_FORMATS
path_template (str, optional) – Templatized path output using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifiers. Default = DEFAULT_PATH_TEMPLATE
filename_template (str, optional) – Templatized output filename using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifiers. Default = DEFAULT_FILENAME_TEMPLATE
- write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: str = None)¶
Write a dataset profile to MLFlow path
- static _write_protobuf(profile: whylogs.core.DatasetProfile, rotation_suffix: str = None, **kwargs)¶
Write a protobuf serialization of the dataset profile to MLflow in binary format
- class whylogs.app.writers.WhyLabsWriter(output_path='', formats=None)¶
Bases:
Writer
Class for writing to disk
- Parameters
output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’
formats (list) – All output formats. See whylogs.app.config.ALL_SUPPORTED_FORMATS
path_template (str, optional) – Templatized path output using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifiers. Default = DEFAULT_PATH_TEMPLATE
filename_template (str, optional) – Templatized output filename using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See Writer.template_params() for a list of available identifiers. Default = DEFAULT_FILENAME_TEMPLATE
- write(self, profile: whylogs.core.DatasetProfile, rotation_suffix: str = None)¶
Write a dataset profile to WhyLabs
- static _write_protobuf(profile: whylogs.core.DatasetProfile)¶
Write a protobuf profile to WhyLabs
- whylogs.app.writers.writer_from_config(config: whylogs.app.config.WriterConfig)¶
Construct a whylogs Writer from a WriterConfig
- Returns
writer – whylogs writer
- Return type
Writer
Package Contents¶
- whylogs.app.load_config(path_to_config: str = None)¶
Load logging configuration, from disk and from the environment.
Config is loaded by attempting to load files in the following order. The first valid file will be used:
Path set in the WHYLOGS_CONFIG environment variable
Current directory's .whylogs.yaml file
~/.whylogs.yaml (home directory)
The /opt/whylogs/.whylogs.yaml path
- Returns
config – Config for the logger, if a valid config file is found, else returns None.
- Return type
SessionConfig, None
- class whylogs.app.Logger(session_id: str, dataset_name: str, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Optional[Dict[str, str]] = None, metadata: Optional[Dict[str, str]] = None, writers: Optional[List[whylogs.app.writers.Writer]] = None, metadata_writer: Optional[whylogs.app.metadata_writer.MetadataWriter] = None, verbose: bool = False, with_rotation_time: Optional[str] = None, interval: int = 1, cache_size: int = 1, segments: Optional[Union[List[Segment], List[str], str]] = None, profile_full_dataset: bool = False, constraints: Optional[whylogs.core.statistics.constraints.DatasetConstraints] = None)¶
Class for logging whylogs statistics.
- Parameters
session_id – The session ID value. Should be set by the Session object
dataset_name – The name of the dataset. Gets included in the DatasetProfile metadata and can be used in generated filenames.
dataset_timestamp – Optional. The timestamp that the logger represents
session_timestamp – Optional. The time the session was created
tags – Optional. Dictionary of key, value for aggregating data upstream
metadata – Optional. Dictionary of key, value. Useful for debugging (associated with every single dataset profile)
writers – Optional. List of Writer objects used to write out the data
metadata_writer – Optional. MetadataWriter object used to write non-profile information
with_rotation_time – Optional. Log rotation interval, consisting of digits with a unit specification, e.g. 30s, 2h, d. Units are seconds ("s"), minutes ("m"), hours ("h"), or days ("d"). Output filenames will have a suffix reflecting the rotation interval.
interval – Deprecated: Interval multiplier for with_rotation_time, defaults to 1.
verbose – enable debug logging
cache_size – dataprofiles to cache
segments –
- Can be either:
Autosegmentation source, one of [“auto”, “local”]
List of tag key value pairs for tracking data segments
List of tag keys for which we will track every value
None, no segments will be used
profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset.
constraints – static assertions to be applied to streams and summaries.
- __enter__(self)¶
- __exit__(self, exc_type, exc_val, exc_tb)¶
- property profile(self) whylogs.core.DatasetProfile ¶
- Returns
the last backing dataset profile
- Return type
DatasetProfile
- tracking_checks(self)¶
- property segmented_profiles(self) Dict[str, whylogs.core.DatasetProfile] ¶
- Returns
the last backing dataset profile
- Return type
Dict[str, DatasetProfile]
- get_segment(self, segment: Segment) Optional[whylogs.core.DatasetProfile] ¶
- set_segments(self, segments: Union[List[Segment], List[str], str]) None ¶
- _retrieve_local_segments(self) Union[List[Segment], List[str], str] ¶
Retrieves local segments
- _intialize_profiles(self, dataset_timestamp: Optional[datetime.datetime] = datetime.datetime.now(datetime.timezone.utc)) None ¶
- _set_rotation(self, with_rotation_time: str = None)¶
- rotate_when(self, time)¶
- should_rotate(self)¶
- _rotate_time(self)¶
Rotate with time and add a suffix
- flush(self, rotation_suffix: Optional[str] = None)¶
Synchronously perform all remaining write tasks
- full_profile_check(self) bool ¶
Returns a bool to determine if the unsegmented dataset should be profiled.
- close(self) Optional[whylogs.core.DatasetProfile] ¶
Flush and close out the logger, outputs the last profile
- Returns
the result dataset profile. None if the logger is closed
- log(self, features: Optional[Dict[str, any]] = None, feature_name: Optional[str] = None, value: any = None, character_list: Optional[str] = None, token_method: Optional[Callable] = None)¶
Logs a collection of features or a single feature (must specify one or the other).
- Parameters
features – a map of key value feature for model input
feature_name – name of a single feature. Cannot be specified if ‘features’ is specified
value – value of a single feature. Cannot be specified if ‘features’ is specified
- log_segment_datum(self, feature_name, value, character_list: str = None, token_method: Optional[Callable] = None)¶
- log_metrics(self, targets, predictions, scores=None, model_type: whylogs.proto.ModelType = None, target_field=None, prediction_field=None, score_field=None)¶
- log_image(self, image, feature_transforms: Optional[List[Callable]] = None, metadata_attributes: Optional[List[str]] = METADATA_DEFAULT_ATTRIBUTES, feature_name: str = '')¶
API to track an image, either in PIL format or as an input path
- Parameters
feature_name – name of the feature
metadata_attributes – metadata attributes to extract for the images
feature_transforms – a list of callables to transform the input into metrics
- log_local_dataset(self, root_dir, folder_feature_name='folder_feature', image_feature_transforms=None, show_progress=False)¶
Log a local folder dataset. It will log data from the files, along with file structure data like metadata and magic numbers. If the folder has a single layer of child folders, this will pick up the folder names as a segmented feature
- Parameters
show_progress – showing the progress bar
image_feature_transforms – image transform that you would like to use with the image log
root_dir (str) – directory where dataset is located.
folder_feature_name (str, optional) – Name for the subfolder features, i.e. class, store etc.
- log_annotation(self, annotation_data)¶
Log structured annotation data, i.e. JSON-like structures
- Parameters
annotation_data (Dict or List) – Description
- log_csv(self, filepath_or_buffer: Union[str, pathlib.Path, IO[AnyStr]], segments: Optional[Union[List[Segment], List[str]]] = None, profile_full_dataset: bool = False, **kwargs)¶
Log a CSV file. This supports the same parameters as the pandas.read_csv function.
- Parameters
filepath_or_buffer – the path to the CSV or a CSV buffer
segments – define either a list of segment keys or a list of segments tags: [ {“key”:<featurename>,”value”: <featurevalue>},… ]
profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset
**kwargs – keyword arguments passed through to pandas.read_csv
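A minimal sketch using the demo CSV name that ships with the examples; extra keyword arguments such as sep are forwarded to pandas.read_csv:

from whylogs import get_or_create_session

session = get_or_create_session()
with session.logger(dataset_name="lending_club") as logger:
    # track every value of the home_ownership column as a segment
    logger.log_csv("lending_club_1000.csv", segments=["home_ownership"], sep=",")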
- log_dataframe(self, df, segments: Optional[Union[List[Segment], List[str]]] = None, profile_full_dataset: bool = False)¶
Generate and log a whylogs DatasetProfile from a pandas dataframe
- Parameters
df – the Pandas dataframe to log
segments – specify the tag key value pairs for segments
profile_full_dataset – when segmenting the dataset, an option to keep the full unsegmented profile of the dataset
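A minimal sketch that segments by a column key while keeping the full profile:

import pandas as pd
from whylogs import get_or_create_session

df = pd.DataFrame({"store": ["A", "A", "B"], "sales": [10.0, 12.5, 7.0]})
session = get_or_create_session()
with session.logger(dataset_name="sales", profile_full_dataset=True) as logger:
    logger.log_dataframe(df, segments=["store"])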
- log_segments(self, data)¶
- log_segments_keys(self, data)¶
- log_fixed_segments(self, data)¶
- log_df_segment(self, df, segment: Segment)¶
- is_active(self)¶
Return the boolean state of the logger
- static _prefix_segment_tags(segment_key_values)¶
- class whylogs.app.Session(project: Optional[str] = None, pipeline: Optional[str] = None, writers: Optional[List[whylogs.app.writers.Writer]] = None, metadata_writer: Optional[whylogs.app.metadata_writer.MetadataWriter] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = None, report_progress: bool = False)¶
- Parameters
project (str) – The project name. We will default to the project name when logging a dataset if the dataset name is not specified
pipeline (str) – Name of the pipeline associated with this session
writers (list) – configuration for the output writers. This is where the log data will go
verbose (bool) – whether to enable verbose logging. Default is
False
- __enter__(self)¶
- __exit__(self, tpe, value, traceback)¶
- __repr__(self)¶
Return repr(self).
- get_config(self)¶
- is_active(self)¶
- logger(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, segments: Optional[Union[List[Dict], List[str], str]] = None, profile_full_dataset: bool = False, with_rotation_time: str = None, cache_size: int = 1, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None) whylogs.app.logger.Logger ¶
Create a new logger or return an existing one for a given dataset name. If no dataset_name is specified, we default to project name
- Parameters
dataset_name – name of the dataset
dataset_timestamp – timestamp of the dataset. Defaults to now
session_timestamp – timestamp of the session. Inherits from the session
tags – metadata associated with the profile
metadata – same as tags. Will be deprecated
segments – slice of data that the profile belongs to
profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset
with_rotation_time – rotation time in minutes or hours (“1m”, “1h”)
cache_size – size of the segment cache
constraints – whylogs constraints to monitor against
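A minimal sketch of a session-managed logger with hourly rotation (the dataset name and values are illustrative):

from whylogs import get_or_create_session

session = get_or_create_session()
# the profile is written out and reset every hour while the logger stays open
with session.logger(dataset_name="purchases", with_rotation_time="1h", cache_size=1) as logger:
    logger.log({"price": 3.5, "quantity": 2})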
- get_logger(self, dataset_name: str = None)¶
- log_dataframe(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, segments: Optional[Union[List[Dict], List[str], str]] = None, profile_full_dataset: bool = False, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None) Optional[whylogs.core.DatasetProfile] ¶
Perform statistics calculations and log a pandas dataframe
- Parameters
df – the dataframe to profile
dataset_name – name of the dataset
dataset_timestamp – the timestamp for the dataset
session_timestamp – the timestamp for the session. Override the default one
tags – the tags for the profile. Useful when merging
metadata – information about this current profile. Can be discarded when merging
segments – Can be either: an autosegmentation source, one of [“auto”, “local”]; a list of tag key value pairs for tracking data segments; a list of tag keys for which we will track every value; or None, in which case no segments will be used
profile_full_dataset – when segmenting dataset, an option to keep the full unsegmented profile of the dataset
- Returns
a dataset profile if the session is active
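Continuing with the session and df from the sketches above, for example (a sketch; to_summary() is assumed from the DatasetProfile API):

profile = session.log_dataframe(df, dataset_name="sales")
if profile is not None:  # None when the session is inactive
    print(profile.to_summary())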
- profile_dataframe(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None) Optional[whylogs.core.DatasetProfile] ¶
Profile a Pandas dataframe without actually writing data to disk. This is useful when you just want to quickly capture and explore a dataset profile.
- Parameters
df – the dataframe to profile
dataset_name – name of the dataset
dataset_timestamp – the timestamp for the dataset
session_timestamp – the timestamp for the session. Override the default one
tags – the tags for the profile. Useful when merging
metadata – information about this current profile. Can be discarded when merging
- Returns
a dataset profile if the session is active
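A quick-exploration sketch, continuing with the same session and df; the flat_summary() accessor on the returned profile is assumed here:

profile = session.profile_dataframe(df, dataset_name="exploration")
# nothing is written to disk; inspect per-column statistics directly
summary = profile.flat_summary()["summary"]
print(summary.head())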
- new_profile(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None, session_timestamp: Optional[datetime.datetime] = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None) Optional[whylogs.core.DatasetProfile] ¶
Create an empty dataset profile with the metadata from the session.
- Parameters
dataset_name – name of the dataset
dataset_timestamp – the timestamp for the dataset
session_timestamp – the timestamp for the session. Override the default one
tags – the tags for the profile. Useful when merging
metadata – information about this current profile. Can be discarded when merging
- Returns
a dataset profile if the session is active
- estimate_segments(self, df: pandas.DataFrame, name: str, target_field: str = None, max_segments: int = 30, dry_run: bool = False) Optional[Union[List[Dict], List[str]]] ¶
Estimates the most important features and values on which to segment data profiling using entropy-based methods.
- Parameters
df – the dataframe of data to profile
name – name for discovery in the logger, automatically applied to loggers with the same dataset_name
target_field – target field (optional)
max_segments – upper threshold for total combinations of segments, default 30
dry_run – run the calculation but do not write results to metadata
- Returns
a list of segmentation feature names
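A sketch of feeding the estimate back into a logger, continuing with the same session and df (the column names are illustrative):

segments = session.estimate_segments(df, name="sales", target_field="sales", max_segments=10)
with session.logger(dataset_name="sales", segments=segments) as logger:
    logger.log_dataframe(df)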
- close(self)¶
Deactivate this session and flush all associated loggers
- remove_logger(self, dataset_name: str)¶
Remove a logger from the dataset. This is called by the logger when it’s being closed
- Parameters
dataset_name – the name of the dataset, used to identify the logger
- Returns
None
- class whylogs.app.SessionConfig(project: str, pipeline: str, writers: List[WriterConfig], metadata: Optional[MetadataConfig] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = 1, report_progress: bool = False)¶
Config for a whylogs session.
See also
SessionConfigSchema
- Parameters
project (str) – Project associated with this whylogs session
pipeline (str) – Name of the associated data pipeline
writers (list) – A list of WriterConfig objects defining writer outputs
metadata (MetadataConfig) – A MetadataConfiguration object. If none, will replace with default.
verbose (bool, default=False) – Output verbosity
with_rotation_time (str, default=None) – rotate profiles with time; takes the overall rotation interval: “s” for seconds, “m” for minutes, “h” for hours, “d” for days
cache_size (int, default=1) – sets how many data profiles to cache in the logger during rotation
- to_yaml(self, stream=None)¶
Serialize this config to YAML
- Parameters
stream – If None (default) return a string, else dump the yaml into this stream.
- static from_yaml(stream)¶
Load config from yaml
- Parameters
stream (str, file-obj) – String or file-like object to load yaml from
- Returns
config – Generated config
- Return type
SessionConfig
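A minimal construction-and-serialization sketch (project and pipeline names are illustrative):

from whylogs.app import SessionConfig, WriterConfig

writer = WriterConfig(type="local", formats=["json"], output_path="whylogs-output")
config = SessionConfig(project="demo-project", pipeline="demo-pipeline", writers=[writer])
print(config.to_yaml())  # with no stream argument, returns a YAML string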
- class whylogs.app.WriterConfig(type: str, formats: Optional[List[str]] = None, output_path: Optional[str] = None, path_template: Optional[str] = None, filename_template: Optional[str] = None, data_collection_consent: Optional[bool] = None, transport_parameters: Optional[TransportParameterConfig] = None)¶
Config for whylogs writers
See also:
WriterConfigSchema
- Parameters
type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’
formats (list) – All output formats. See
ALL_SUPPORTED_FORMATS
output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’
path_template (str, optional) – Templatized path output using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See
whylogs.app.writers.Writer.template_params()
for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_PATH_TEMPLATE
filename_template (str, optional) – Templatized output filename using standard Python string templates. Variables are accessed via $identifier or ${identifier}. See
whylogs.app.writers.Writer.template_params()
for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE
- to_yaml(self, stream=None)¶
Serialize this config to YAML
- Parameters
stream – If None (default) return a string, else dump the yaml into this stream.
- static from_yaml(stream, **kwargs)¶
Load config from yaml
- Parameters
stream (str, file-obj) – String or file-like object to load yaml from
kwargs – ignored
- Returns
config – Generated config
- Return type
WriterConfig
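A round-trip sketch for WriterConfig:

from whylogs.app import WriterConfig

yaml_text = WriterConfig(type="local", formats=["protobuf"], output_path="output").to_yaml()
restored = WriterConfig.from_yaml(yaml_text)  # accepts a string or file-like object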
- whylogs.app.__ALL__¶
whylogs.cli¶
Submodules¶
whylogs.cli.cli¶
- whylogs.cli.cli._set_up_logger()¶
- whylogs.cli.cli.cli(verbose)¶
Welcome to whylogs CLI!
Supported basic commands:
whylogs init : create a new whylogs project configuration
- whylogs.cli.cli.main()¶
whylogs.cli.cli_text¶
- whylogs.cli.cli_text.INTRO_MESSAGE = Multiline-String¶
(ASCII-art whylogs banner)
Welcome to whylogs!
Join us on our community Slack at http://join.slack.whylabs.ai/
This CLI will guide you through initializing a basic whylogs configuration.
- whylogs.cli.cli_text.DOING_NOTHING_ABORTING = Doing nothing. Aborting¶
- whylogs.cli.cli_text.OVERRIDE_CONFIRM = Would you like to proceed with the above path?¶
- whylogs.cli.cli_text.EMPTY_PATH_WARNING = WARNING: we will override the content in the non-empty path¶
- whylogs.cli.cli_text.BEGIN_WORKFLOW = Multiline-String¶
Great. We will now generate the default configuration for whylogs
We'll need a few details from you before we can proceed
- whylogs.cli.cli_text.PIPELINE_DESCRIPTION = "Pipeline" is a series of one or multiple datasets to build a single model or application. A...¶
- whylogs.cli.cli_text.PROJECT_NAME_PROMPT = Project name (alphanumeric, dash, and underscore characters only)¶
- whylogs.cli.cli_text.PROJECT_DESCRIPTION = "Project" is a collection of related datasets that are used for multiple models or applications.¶
- whylogs.cli.cli_text.DATETIME_EXPLANATION = Multiline-String¶
whylogs can break down the data by time for you
This will enable users to run time-based analysis
- whylogs.cli.cli_text.DATETIME_COLUMN_PROMPT = What is the name of the datetime feature (leave blank to skip)?¶
- whylogs.cli.cli_text.SKIP_DATETIME = Skip grouping by datetime¶
- whylogs.cli.cli_text.DATETIME_FORMAT_PROMPT = What is the format of the column? Leave blank to use datetimeutil to parse¶
- whylogs.cli.cli_text.INITIAL_PROFILING_CONFIRM = Would you like to run an initial profiling job?¶
- whylogs.cli.cli_text.DATA_SOURCE_MESSAGE = Select data source:¶
- whylogs.cli.cli_text.PROFILE_OVERRIDE_CONFIRM = Profile path already exists. This will override existing data¶
- whylogs.cli.cli_text.DATA_WILL_BE_OVERRIDDEN = Previous profile data will be overridden¶
- whylogs.cli.cli_text.OBSERVATORY_EXPLANATION = Multiline-String¶
WhyLabs Platform can visualize your statistics. This will require the CLI to upload
your statistics to a WhyLabs endpoint. Your original data (CSV file) will remain local.
- whylogs.cli.cli_text.RUN_PROFILING = Run whylogs profiling...¶
- whylogs.cli.cli_text.GENERATE_NOTEBOOKS = Generate Jupyter notebooks¶
- whylogs.cli.cli_text.DONE = Done¶
whylogs.cli.demo_cli¶
- whylogs.cli.demo_cli._LENDING_CLUB_CSV = lending_club_1000.csv¶
- whylogs.cli.demo_cli._EXAMPLE_REPO = https://github.com/whylabs/whylogs-examples.git¶
- whylogs.cli.demo_cli._set_up_logger()¶
- whylogs.cli.demo_cli.NAME_FORMAT¶
- whylogs.cli.demo_cli.init(project_dir)¶
Initialize and configure a new whylogs project.
This guided input walks the user through setting up a new project and also on-boards a new developer in an existing project.
It scaffolds directories, sets up notebooks, creates a project file, and appends to a .gitignore file.
- whylogs.cli.demo_cli.profile_csv(session_config: whylogs.app.SessionConfig, project_dir: str) str ¶
- whylogs.cli.demo_cli.cli(verbose)¶
Welcome to whylogs Demo CLI!
Supported commands:
whylogs-demo init : create a demo whylogs project with example data and notebooks
- whylogs.cli.demo_cli.main()¶
whylogs.cli.init¶
- whylogs.cli.init.LENDING_CLUB_CSV = lending_club_1000.csv¶
- whylogs.cli.init.NAME_FORMAT¶
- whylogs.cli.init.init(project_dir)¶
Initialize and configure a new whylogs project.
This guided input walks the user through setting up a new project and also onboards a new developer in an existing project.
It scaffolds directories, sets up notebooks, creates a project file, and appends to a .gitignore file.
whylogs.cli.utils¶
- whylogs.cli.utils.echo(message: Union[str, list], **styles)¶
Package Contents¶
- whylogs.cli.cli(verbose)¶
Welcome to whylogs CLI!
Supported basic commands:
whylogs init : create a new whylogs project configuration
- whylogs.cli.main()¶
- whylogs.cli.demo_main()¶
- whylogs.cli.__ALL__¶
whylogs.core¶
Subpackages¶
whylogs.core.metrics¶
whylogs.core.metrics.confusion_matrix¶
- whylogs.core.metrics.confusion_matrix.SUPPORTED_TYPES = ['binary', 'multiclass']¶
- whylogs.core.metrics.confusion_matrix.MODEL_METRICS_MAX_LABELS = 256¶
- whylogs.core.metrics.confusion_matrix.MODEL_METRICS_LABEL_SIZE_WARNING_THRESHOLD = 64¶
- whylogs.core.metrics.confusion_matrix._logger¶
- class whylogs.core.metrics.confusion_matrix.ConfusionMatrix(labels: List[str] = None, prediction_field: str = None, target_field: str = None, score_field: str = None)¶
Confusion Matrix Class to hold labels and matrix data.
- labels¶
list of labels in a sorted order
- prediction_field¶
name of the prediction field
- target_field¶
name of the target field
- score_field¶
name of the score field
- confusion_matrix¶
Confusion Matrix kept as matrix of NumberTrackers
- Type
numpy.ndarray
- labels¶
list of labels for the confusion_matrix axes
- Type
List[str]
- add(self, predictions: List[Union[str, int, bool]], targets: List[Union[str, int, bool]], scores: List[float])¶
Adds predictions and targets to the confusion matrix, with scores.
- Parameters
predictions (List[Union[str, int, bool]]) –
targets (List[Union[str, int, bool]]) –
scores (List[float]) –
- Raises
NotImplementedError – in case targets do not fall into binary or multiclass support
ValueError – in case of missing validation or predictions
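A minimal sketch of accumulating labeled predictions (the labels and values are illustrative):

from whylogs.core.metrics.confusion_matrix import ConfusionMatrix

cm = ConfusionMatrix(labels=["cat", "dog"])
cm.add(predictions=["cat", "dog", "dog"],
       targets=["cat", "cat", "dog"],
       scores=[0.9, 0.6, 0.8])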
- merge(self, other_cm)¶
Merge two separate confusion matrices which may or may not overlap in labels.
- Parameters
other_cm (Optional[ConfusionMatrix]) – confusion_matrix to merge with self
- Returns
merged confusion_matrix
- Return type
ConfusionMatrix
- to_protobuf(self)¶
Convert to protobuf
- Returns
the serialized protobuf message
- Return type
ScoreMatrixMessage
- classmethod from_protobuf(cls, message: whylogs.proto.ScoreMatrixMessage)¶
- whylogs.core.metrics.confusion_matrix._merge_CM(old_conf_matrix: ConfusionMatrix, new_conf_matrix: ConfusionMatrix)¶
Merges two confusion matrices with distinct or overlapping labels
- Parameters
old_conf_matrix (ConfusionMatrix) –
new_conf_matrix (ConfusionMatrix) – Will be overridden
whylogs.core.metrics.model_metrics¶
- class whylogs.core.metrics.model_metrics.ModelMetrics(confusion_matrix: whylogs.core.metrics.confusion_matrix.ConfusionMatrix = None, regression_metrics: whylogs.core.metrics.regression_metrics.RegressionMetrics = None, nlp_metrics: whylogs.core.metrics.nlp_metrics.NLPMetrics = None, model_type: whylogs.proto.ModelType = ModelType.UNKNOWN)¶
Container class for various model-related metrics
- confusion_matrix¶
ConfusionMatrix which keeps track of counts with NumberTracker
- Type
ConfusionMatrix
- regression_metrics¶
RegressionMetrics keeps track of common regression metrics in case the targets are continuous.
- Type
RegressionMetrics
- to_protobuf(self) whylogs.proto.ModelMetricsMessage ¶
- classmethod from_protobuf(cls, message)¶
- init_or_get_model_type(self, scores) whylogs.proto.ModelType ¶
- compute_confusion_matrix(self, predictions: List[Union[str, int, bool, float]], targets: List[Union[str, int, bool, float]], scores: List[float] = None, target_field: str = None, prediction_field: str = None, score_field: str = None)¶
Computes the confusion matrix; if one is already present, merges it with the old one.
- Parameters
predictions (List[Union[str, int, bool]]) –
targets (List[Union[str, int, bool]]) –
scores (List[float], optional) –
target_field (str, optional) –
prediction_field (str, optional) –
score_field (str, optional) –
- compute_regression_metrics(self, predictions: List[Union[float, int]], targets: List[Union[float, int]], target_field: str = None, prediction_field: str = None)¶
- merge(self, other)¶
whylogs.core.metrics.nlp_metrics¶
- whylogs.core.metrics.nlp_metrics.logger¶
- class whylogs.core.metrics.nlp_metrics.NLPMetrics(prediction_field: str = None, target_field: str = None)¶
- update(self, predictions: Union[List[str], str], targets: Union[List[str]], transform=None) None ¶
Adds predictions and targets for the computation of NLP metrics.
- Parameters
predictions (Union[str,List[str]]) –
targets (Union[List[str],str]) –
- merge(self, other: NLPMetrics) NLPMetrics ¶
Merge two separate NLP metrics
- Parameters
other – nlp metrics to merge with self
- Returns
merged nlp metrics
- Return type
NLPMetrics
- to_protobuf(self) whylogs.proto.NLPMetricsMessage ¶
Convert to protobuf
- Returns
Protobuf Message
- Return type
NLPMetricsMessage
- classmethod from_protobuf(cls: NLPMetrics, message: whylogs.proto.NLPMetricsMessage)¶
whylogs.core.metrics.regression_metrics¶
- whylogs.core.metrics.regression_metrics.SUPPORTED_TYPES = regression¶
- class whylogs.core.metrics.regression_metrics.RegressionMetrics(prediction_field: str = None, target_field: str = None)¶
- add(self, predictions: List[float], targets: List[float])¶
Adds predictions and targets for the computation of regression metrics.
- Parameters
predictions (List[float]) –
targets (List[float]) –
- mean_absolute_error(self)¶
- mean_squared_error(self)¶
- root_mean_squared_error(self)¶
- merge(self, other)¶
Merge two separate regression metrics objects.
- Parameters
other – regression metrics to merge with self
- Returns
merged regression metrics
- Return type
RegressionMetrics
- to_protobuf(self)¶
Convert to protobuf
- Returns
Protobuf Message
- Return type
RegressionMetricsMessage
- classmethod from_protobuf(cls, message: whylogs.proto.RegressionMetricsMessage)¶
whylogs.core.statistics¶
Define classes for tracking statistics
whylogs.core.statistics.datatypes¶
Define classes for tracking statistics for various data types
whylogs.core.statistics.datatypes.floattracker¶
- class whylogs.core.statistics.datatypes.floattracker.FloatTracker(min: float = None, max: float = None, sum: float = None, count: int = None)¶
Track statistics for floating point numbers
- Parameters
min (float) – Current min value
max (float) – Current max value
sum (float) – Sum of the numbers
count (int) – Total count of numbers
- update(self, value: float)¶
Add a number to the tracking statistics
- add_integers(self, tracker)¶
Copy data from an IntTracker into this object, overwriting the current values.
- Parameters
tracker (IntTracker) –
- mean(self)¶
Calculate the current mean
- merge(self, other)¶
Merge this tracker with another.
- Parameters
other (FloatTracker) – The other float tracker
- Returns
merged – A new float tracker
- Return type
FloatTracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
DoublesMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
number_tracker
- Return type
FloatTracker
whylogs.core.statistics.datatypes.integertracker¶
- class whylogs.core.statistics.datatypes.integertracker.IntTracker(min: int = None, max: int = None, sum: int = None, count: int = None)¶
Track statistics for integers
- Parameters
min – Current min value
max – Current max value
sum – Sum of the numbers
count – Total count of numbers
- DEFAULTS¶
- set_defaults(self)¶
Set attribute values to defaults
- mean(self)¶
Calculate the current mean. Returns None if self.count = 0
- update(self, value)¶
Add a number to the tracking statistics
- merge(self, other)¶
Merge values of another IntTracker with this one.
- Parameters
other (IntTracker) – Other tracker
- Returns
new – New, merged tracker
- Return type
IntTracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
LongsMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
number_tracker
- Return type
IntTracker
whylogs.core.statistics.datatypes.variancetracker¶
- class whylogs.core.statistics.datatypes.variancetracker.VarianceTracker(count=0, sum=0.0, mean=0.0)¶
Class that implements variance estimates for streaming data and for batched data.
- Parameters
count – Number of tracked elements
sum – Sum of all numbers
mean – Current estimate of the mean
- update(self, new_value)¶
Add a number to tracking estimates
Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
- Parameters
new_value (int, float) –
- stddev(self)¶
Return an estimate of the sample standard deviation
- variance(self)¶
Return an estimate of the sample variance
- merge(self, other: VarianceTracker)¶
Merge statistics from another VarianceTracker into this one
See: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
- Parameters
other (VarianceTracker) – Other variance tracker
- Returns
merged – A new variance tracker from the merged statistics
- Return type
VarianceTracker
- copy(self)¶
Return a copy of this tracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
VarianceMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
variance_tracker
- Return type
VarianceTracker
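A sketch of the streaming update and parallel merge described above:

from whylogs.core.statistics.datatypes import VarianceTracker

left, right = VarianceTracker(), VarianceTracker()
for x in (1.0, 2.0, 3.0):
    left.update(x)   # Welford's online update
for x in (4.0, 5.0):
    right.update(x)
merged = left.merge(right)  # parallel-algorithm merge of the two streams
print(merged.variance(), merged.stddev())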
- class whylogs.core.statistics.datatypes.FloatTracker(min: float = None, max: float = None, sum: float = None, count: int = None)¶
Track statistics for floating point numbers
- Parameters
min (float) – Current min value
max (float) – Current max value
sum (float) – Sum of the numbers
count (int) – Total count of numbers
- update(self, value: float)¶
Add a number to the tracking statistics
- add_integers(self, tracker)¶
Copy data from an IntTracker into this object, overwriting the current values.
- Parameters
tracker (IntTracker) –
- mean(self)¶
Calculate the current mean
- merge(self, other)¶
Merge this tracker with another.
- Parameters
other (FloatTracker) – The other float tracker
- Returns
merged – A new float tracker
- Return type
FloatTracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
DoublesMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
number_tracker
- Return type
FloatTracker
- class whylogs.core.statistics.datatypes.IntTracker(min: int = None, max: int = None, sum: int = None, count: int = None)¶
Track statistics for integers
- Parameters
min – Current min value
max – Current max value
sum – Sum of the numbers
count – Total count of numbers
- DEFAULTS¶
- set_defaults(self)¶
Set attribute values to defaults
- mean(self)¶
Calculate the current mean. Returns None if self.count = 0
- update(self, value)¶
Add a number to the tracking statistics
- merge(self, other)¶
Merge values of another IntTracker with this one.
- Parameters
other (IntTracker) – Other tracker
- Returns
new – New, merged tracker
- Return type
IntTracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
LongsMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
number_tracker
- Return type
IntTracker
- class whylogs.core.statistics.datatypes.VarianceTracker(count=0, sum=0.0, mean=0.0)¶
Class that implements variance estimates for streaming data and for batched data.
- Parameters
count – Number of tracked elements
sum – Sum of all numbers
mean – Current estimate of the mean
- update(self, new_value)¶
Add a number to tracking estimates
Based on https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm
- Parameters
new_value (int, float) –
- stddev(self)¶
Return an estimate of the sample standard deviation
- variance(self)¶
Return an estimate of the sample variance
- merge(self, other: VarianceTracker)¶
Merge statistics from another VarianceTracker into this one
See: https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm
- Parameters
other (VarianceTracker) – Other variance tracker
- Returns
merged – A new variance tracker from the merged statistics
- Return type
VarianceTracker
- copy(self)¶
Return a copy of this tracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
VarianceMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
variance_tracker
- Return type
VarianceTracker
- whylogs.core.statistics.datatypes.__ALL__¶
whylogs.core.statistics.constraints¶
- whylogs.core.statistics.constraints.TYPES¶
- whylogs.core.statistics.constraints.logger¶
- whylogs.core.statistics.constraints._try_parse_strftime_format(strftime_val: str, format: str) Optional[datetime.datetime] ¶
Return whether the string is in a strftime format.
- Parameters
strftime_val (str) – string to check for date
format – format to check if strftime_val can be parsed
- Returns
None if not parseable, otherwise the parsed datetime.datetime object
- whylogs.core.statistics.constraints._try_parse_dateutil(dateutil_val: str, ref_val=None) Optional[datetime.datetime] ¶
Return whether the string can be interpreted as a date.
- Parameters
dateutil_val (str) – string to check for date
ref_val (any) – not used, interface design requirement
- Returns
None if not parseable, otherwise the parsed datetime.datetime object
- whylogs.core.statistics.constraints._try_parse_json(json_string: str, ref_val=None) Optional[dict] ¶
Return whether the string can be interpreted as json.
- Parameters
json_string (str) – string to check for json
ref_val (any) – not used, interface design requirement
- Returns
None if not parseable, otherwise the parsed json object
- whylogs.core.statistics.constraints._matches_json_schema(json_data: Union[str, dict], json_schema: Union[str, dict]) bool ¶
Return whether the provided json matches the provided schema.
- Parameters
json_data – json object to check
json_schema – schema to check if the json object matches it
- Returns
True if the json data matches the schema, False otherwise
- whylogs.core.statistics.constraints.MAX_SET_DISPLAY_MESSAGE_LENGTH = 20¶
Dict indexed by constraint operator.
These help translate from constraint schema to language-specific functions that are faster to evaluate. This is just a form of currying, and I chose to bind the boolean comparison operator first.
- whylogs.core.statistics.constraints._value_funcs¶
- whylogs.core.statistics.constraints._summary_funcs1¶
- whylogs.core.statistics.constraints._summary_funcs2¶
- whylogs.core.statistics.constraints._multi_column_value_funcs¶
- class whylogs.core.statistics.constraints.ValueConstraint(op: whylogs.proto.Op, value=None, regex_pattern: str = None, apply_function=None, name: str = None, verbose=False)¶
ValueConstraints express a binary boolean relationship between an implied numeric value and a literal. When associated with a ColumnProfile, the relation is evaluated for every incoming value that is processed by whylogs.
- Parameters
op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between static value and incoming stream. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.
value ((one-of)) – When value is provided, regex_pattern must be None. Static value to compare against incoming stream using operator specified in op.
regex_pattern ((one-of)) – When regex_pattern is provided, value must be None. Regex pattern to use when MATCH or NOMATCH operations are used.
apply_function – To be supplied only when using the APPLY_FUNC operation. If the apply_function requires an argument, supply it in the value param.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- property name(self)¶
- update(self, v) bool ¶
- apply_func_validate(self, value) str ¶
- merge(self, other) ValueConstraint ¶
- static from_protobuf(msg: whylogs.proto.ValueConstraintMsg) ValueConstraint ¶
- to_protobuf(self) whylogs.proto.ValueConstraintMsg ¶
- report(self)¶
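A minimal sketch of a value constraint that every processed value must satisfy (the constraint name is illustrative):

from whylogs.core.statistics.constraints import ValueConstraint
from whylogs.proto import Op

# fails for any incoming value >= 100
below_100 = ValueConstraint(Op.LT, 100, name="value below 100")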
- class whylogs.core.statistics.constraints.SummaryConstraint(first_field: str, op: whylogs.proto.Op, value=None, upper_value=None, quantile_value: Union[int, float] = None, second_field: str = None, third_field: str = None, reference_set: Union[List[Any], Set[Any], datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage] = None, name: str = None, verbose=False)¶
Summary constraints specify a relationship between a summary field and a static value, or between two summary fields, e.g. ‘min’ < 6, ‘std_dev’ < 2.17, ‘min’ > ‘avg’.
- Parameters
first_field (str) – Name of field in NumberSummary that will be compared against either a second field or a static value.
op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between summary values. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.
value ((one-of)) – Static value to be compared against summary field specified in first_field. Only one of value or second_field should be supplied.
upper_value ((one-of)) – Only to be supplied when using Op.BTWN. Static upper boundary value to be compared against summary field specified in first_field. Only one of upper_value or third_field should be supplied.
second_field ((one-of)) – Name of second field in NumberSummary to be compared against summary field specified in first_field. Only one of value or second_field should be supplied.
third_field ((one-of)) – Only to be supplied when op == Op.BTWN. Name of third field in NumberSummary, used as an upper boundary, to be compared against summary field specified in first_field. Only one of upper_value or third_field should be supplied.
reference_set ((one-of)) – Only to be supplied when using set operations or distributional measures. Used as a reference set to be compared with the column's distinct values, or an instance of datasketches.kll_floats_sketch or ReferenceDistributionDiscreteMessage. Only to be supplied for constraints on distributional measures, such as the KS test, KL divergence, and the Chi-Squared test.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- property name(self)¶
- _get_field_name(self)¶
- _get_value_or_field(self)¶
- _get_constraint_type(self)¶
- _check_and_init_table_shape_constraint(self, reference_set)¶
- _check_and_init_valid_set_constraint(self, reference_set)¶
- _check_and_init_distributional_measure_constraint(self, reference_set)¶
- _check_and_init_between_constraint(self)¶
- _get_str_from_ref_set(self) str ¶
- _try_cast_set(self) Set[Any] ¶
- _get_string_and_numbers_sets(self)¶
- _create_theta_sketch(self, ref_set: set = None)¶
- update(self, update_summary: object) bool ¶
- merge(self, other) SummaryConstraint ¶
- _check_if_summary_constraint_message_is_valid(msg: whylogs.proto.SummaryConstraintMsg)¶
- static from_protobuf(msg: whylogs.proto.SummaryConstraintMsg) SummaryConstraint ¶
- to_protobuf(self) whylogs.proto.SummaryConstraintMsg ¶
- report(self)¶
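A minimal sketch of a constraint on a profiled summary field (Op.GE is assumed from the Op enum):

from whylogs.core.statistics.constraints import SummaryConstraint
from whylogs.proto import Op

# the profiled minimum of the feature must be non-negative
non_negative_min = SummaryConstraint("min", Op.GE, 0)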
- class whylogs.core.statistics.constraints.ValueConstraints(constraints: Mapping[str, ValueConstraint] = None)¶
- static from_protobuf(msg: whylogs.proto.ValueConstraintMsgs) ValueConstraints ¶
- __getitem__(self, name: str) Optional[ValueConstraint] ¶
- to_protobuf(self) whylogs.proto.ValueConstraintMsgs ¶
- update(self, v)¶
- update_typed(self, v)¶
- merge(self, other) ValueConstraints ¶
- report(self) List[tuple] ¶
- class whylogs.core.statistics.constraints.SummaryConstraints(constraints: Mapping[str, SummaryConstraint] = None)¶
- static from_protobuf(msg: whylogs.proto.SummaryConstraintMsgs) SummaryConstraints ¶
- __getitem__(self, name: str) Optional[SummaryConstraint] ¶
- to_protobuf(self) whylogs.proto.SummaryConstraintMsgs ¶
- update(self, v)¶
- merge(self, other) SummaryConstraints ¶
- report(self) List[tuple] ¶
- class whylogs.core.statistics.constraints.MultiColumnValueConstraint(dependent_columns: Union[str, List[str], Tuple[str], numpy.ndarray], op: whylogs.proto.Op, reference_columns: Union[str, List[str], Tuple[str], numpy.ndarray] = None, internal_dependent_cols_op: whylogs.proto.Op = None, value=None, name: str = None, verbose: bool = False)¶
Bases:
ValueConstraint
ValueConstraints express a binary boolean relationship between an implied numeric value and a literal. When associated with a ColumnProfile, the relation is evaluated for every incoming value that is processed by whylogs.
- Parameters
op (whylogs.proto.Op (required)) – Enumeration of binary comparison operator applied between static value and incoming stream. Enum values are mapped to operators like ‘==’, ‘<’, and ‘<=’, etc.
value ((one-of)) – When value is provided, regex_pattern must be None. Static value to compare against incoming stream using operator specified in op.
regex_pattern ((one-of)) – When regex_pattern is provided, value must be None. Regex pattern to use when MATCH or NOMATCH operations are used.
apply_function – To be supplied only when using the APPLY_FUNC operation. If the apply_function requires an argument, supply it in the value param.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- property name(self)¶
- update(self, column_values_dictionary)¶
- merge(self, other) MultiColumnValueConstraint ¶
- static from_protobuf(msg: whylogs.proto.MultiColumnValueConstraintMsg) MultiColumnValueConstraint ¶
- to_protobuf(self) whylogs.proto.MultiColumnValueConstraintMsg ¶
- class whylogs.core.statistics.constraints.MultiColumnValueConstraints(constraints: Mapping[str, MultiColumnValueConstraint] = None)¶
Bases:
ValueConstraints
- static from_protobuf(msg: whylogs.proto.ValueConstraintMsgs) MultiColumnValueConstraints ¶
- to_protobuf(self) whylogs.proto.ValueConstraintMsgs ¶
- class whylogs.core.statistics.constraints.DatasetConstraints(props: whylogs.proto.DatasetProperties, value_constraints: Mapping[str, ValueConstraints] = None, summary_constraints: Mapping[str, SummaryConstraints] = None, table_shape_constraints: Mapping[str, SummaryConstraints] = None, multi_column_value_constraints: Optional[MultiColumnValueConstraints] = None)¶
- __getitem__(self, key)¶
- static from_protobuf(msg: whylogs.proto.DatasetConstraintMsg) DatasetConstraints ¶
- static from_json(data: str) DatasetConstraints ¶
- to_protobuf(self) whylogs.proto.DatasetConstraintMsg ¶
- to_json(self) str ¶
- report(self)¶
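A sketch wiring constraints into a logging session; the None passed for DatasetProperties and the list-per-feature mapping follow common whylogs examples and are assumptions here:

import pandas as pd
from whylogs import get_or_create_session
from whylogs.core.statistics.constraints import DatasetConstraints, ValueConstraint
from whylogs.proto import Op

dc = DatasetConstraints(None, value_constraints={"age": [ValueConstraint(Op.GT, 0)]})
df = pd.DataFrame({"age": [25, 42, -1]})
session = get_or_create_session()
with session.logger(dataset_name="people", constraints=dc) as logger:
    logger.log_dataframe(df)
print(dc.report())  # per-constraint totals and failure counts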
- whylogs.core.statistics.constraints._check_between_constraint_valid_initialization(lower_value, upper_value, lower_field, upper_field)¶
- whylogs.core.statistics.constraints._set_between_constraint_default_name(field, lower_value, upper_value, lower_field, upper_field)¶
- whylogs.core.statistics.constraints._format_set_values_for_display(reference_set)¶
- whylogs.core.statistics.constraints.stddevBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)¶
Defines a summary constraint on the standard deviation of a feature. The standard deviation can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (one-of)) – Represents the lower value limit of the interval for the standard deviation. If lower_value is supplied, then upper_value must also be supplied, and none of lower_field and upper_field should be provided.
upper_value (numeric (one-of)) – Represents the upper value limit of the interval for the standard deviation. If upper_value is supplied, then lower_value must also be supplied, and none of lower_field and upper_field should be provided.
lower_field (str (one-of)) – Represents the lower field limit of the interval for the standard deviation. The lower field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as a lower bound. If lower_field is supplied, then upper_field must also be supplied, and none of lower_value and upper_value should be provided.
upper_field (str (one-of)) – Represents the upper field limit of the interval for the standard deviation. The upper field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as an upper bound. If upper_field is supplied, then lower_field must also be supplied, and none of lower_value and upper_value should be provided.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint defining an interval of values for the standard deviation of a feature
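A sketch of the two supported calling styles, value bounds versus field bounds:

from whylogs.core.statistics.constraints import stddevBetweenConstraint

by_value = stddevBetweenConstraint(lower_value=0.5, upper_value=2.0)
by_field = stddevBetweenConstraint(lower_field="min", upper_field="max")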
- whylogs.core.statistics.constraints.meanBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)¶
Defines a summary constraint on the mean (average) of a feature. The mean can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (one-of)) – Represents the lower value limit of the interval for the mean. If lower_value is supplied, then upper_value must also be supplied, and none of lower_field and upper_field should be provided.
upper_value (numeric (one-of)) – Represents the upper value limit of the interval for the mean. If upper_value is supplied, then lower_value must also be supplied, and none of lower_field and upper_field should be provided.
lower_field (str (one-of)) – Represents the lower field limit of the interval for the mean. The lower field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as a lower bound. If lower_field is supplied, then upper_field must also be supplied, and none of lower_value and upper_value should be provided.
upper_field (str (one-of)) – Represents the upper field limit of the interval for the mean. The upper field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as an upper bound. If upper_field is supplied, then lower_field must also be supplied, and none of lower_value and upper_value should be provided.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint defining an interval of values for the mean of a feature
- whylogs.core.statistics.constraints.minBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)¶
Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (one-of)) – Represents the lower value limit of the interval for the minimum. If lower_value is supplied, then upper_value must also be supplied, and none of lower_field and upper_field should be provided.
upper_value (numeric (one-of)) – Represents the upper value limit of the interval for the minimum. If upper_value is supplied, then lower_value must also be supplied, and none of lower_field and upper_field should be provided.
lower_field (str (one-of)) – Represents the lower field limit of the interval for the minimum. The lower field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as a lower bound. If lower_field is supplied, then upper_field must also be supplied, and none of lower_value and upper_value should be provided.
upper_field (str (one-of)) – Represents the upper field limit of the interval for the minimum. The upper field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as an upper bound. If upper_field is supplied, then lower_field must also be supplied, and none of lower_value and upper_value should be provided.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint defining an interval of values for the minimum value of a feature
- whylogs.core.statistics.constraints.minGreaterThanEqualConstraint(value=None, field=None, name=None, verbose=False)¶
Defines a summary constraint on the minimum value of a feature. The minimum can be defined to be greater than or equal to some value, or greater than or equal to the values of another summary field of the same feature, such as the mean (average).
- Parameters
value (numeric (one-of)) – Represents the value which should be compared to the minimum value of the specified feature, for checking the greater than or equal to constraint. Only one of value and field should be supplied.
field (str (one-of)) – The field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used for checking the greater than or equal to constraint. Only one of field and value should be supplied.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the minimum value to be greater than
or equal to some value / summary field
- whylogs.core.statistics.constraints.maxBetweenConstraint(lower_value=None, upper_value=None, lower_field=None, upper_field=None, name=None, verbose=False)¶
Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be between two values, or between the values of two other summary fields of the same feature, such as the minimum and the maximum. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (one-of)) – Represents the lower value limit of the interval for the maximum. If lower_value is supplied, then upper_value must also be supplied, and none of lower_field and upper_field should be provided.
upper_value (numeric (one-of)) – Represents the upper value limit of the interval for the maximum. If upper_value is supplied, then lower_value must also be supplied, and none of lower_field and upper_field should be provided.
lower_field (str (one-of)) – Represents the lower field limit of the interval for the maximum. The lower field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as a lower bound. If lower_field is supplied, then upper_field must also be supplied, and none of lower_value and upper_value should be provided.
upper_field (str (one-of)) – Represents the upper field limit of the interval for the maximum. The upper field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used as an upper bound. If upper_field is supplied, then lower_field must also be supplied, and none of lower_value and upper_value should be provided.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint defining an interval of values for the maximum value of a feature
- whylogs.core.statistics.constraints.maxLessThanEqualConstraint(value=None, field=None, name=None, verbose=False)¶
Defines a summary constraint on the maximum value of a feature. The maximum can be defined to be less than or equal to some value, or less than or equal to the values of another summary field of the same feature, such as the mean (average).
- Parameters
value (numeric (one-of)) – Represents the value which should be compared to the maximum value of the specified feature, for checking the less than or equal to constraint. Only one of value and field should be supplied.
field (str (one-of)) – The field is a string representing a summary field e.g. min, mean, max, stddev, etc., for which the value will be used for checking the less than or equal to constraint. Only one of field and value should be supplied.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the maximum value to be less than
or equal to some value / summary field
- whylogs.core.statistics.constraints.distinctValuesInSetConstraint(reference_set: Set[Any], name=None, verbose=False)¶
Defines a summary constraint on the distinct values of a feature. All of the distinct values should belong in the user-provided set of reference values, reference_set. Useful for categorical features, for checking if the set of values present in a feature is contained in the set of expected categories.
- Parameters
reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If at least one of the distinct values of the feature is not in the user specified set reference_set, then the constraint will fail.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature
to belong in a user supplied set of values
- whylogs.core.statistics.constraints.distinctValuesEqualSetConstraint(reference_set: Set[Any], name=None, verbose=False)¶
Defines a summary constraint on the distinct values of a feature. The set of the distinct values should be equal to the user-provided set of reference values, reference_set. Useful for categorical features, for checking if the set of values present in a feature is the same as the set of expected categories.
- Parameters
reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If the distinct values of the feature are not equal to the user specified set reference_set, then the constraint will fail.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature
to be equal to a user supplied set of values
- whylogs.core.statistics.constraints.distinctValuesContainSetConstraint(reference_set: Set[Any], name=None, verbose=False)¶
Defines a summary constraint on the distinct values of a feature. The set of user-supplied reference values, reference_set should be a subset of the set of distinct values for the current feature. Useful for categorical features, for checking if the set of values present in a feature is a superset of the set of expected categories.
- Parameters
reference_set (Set[Any] (required)) – Represents the set of reference (expected) values for a feature. The provided values can be of any type. If at least one of the values of the reference set, specified in reference_set, is not contained in the set of distinct values of the feature, then the constraint will fail.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the distinct values of a feature
to be a super set of the user supplied set of values
- whylogs.core.statistics.constraints.columnValuesInSetConstraint(value_set: Set[Any], name=None, verbose=False)¶
Defines a value constraint with set operations on the values of a single feature. The values of the feature should all be in the set of user-supplied values, specified in value_set. Useful for categorical features, for checking if the values in a feature belong in a predefined set.
- Parameters
value_set (Set[Any] (required)) – Represents the set of expected values for a feature. The provided values can be of any type. Each value in the feature is checked against the constraint. The total number of failures equals the number of values not in the provided set value_set.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
ValueConstraint - a value constraint specifying a constraint on the values of a feature
to be drawn from a predefined set of values.
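A minimal sketch for a categorical feature (the value set is illustrative):

from whylogs.core.statistics.constraints import columnValuesInSetConstraint

# each processed value outside the set counts as one failure
status_check = columnValuesInSetConstraint(value_set={"active", "inactive", "pending"})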
- whylogs.core.statistics.constraints.containsEmailConstraint(regex_pattern: str = None, name=None, verbose=False)¶
Defines a value constraint with email regex matching operations on the values of a single feature. The constraint defines a default email regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing email addresses.
- Parameters
regex_pattern (str (optional)) – User-defined email regex pattern. If supplied, will override the default email regex pattern provided by whylogs.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for email regex matching of the values of a single feature
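A quick sketch of a regex-based value constraint, assuming the same DatasetConstraints/session wiring as the previous example; the data, column name, and dataset name are hypothetical.
import pandas as pd
from whylogs import get_or_create_session
from whylogs.core.statistics.constraints import DatasetConstraints, containsEmailConstraint

emails = pd.DataFrame({"email": ["alice@example.com", "not-an-email"]})
# No regex_pattern is supplied, so the default email pattern is used.
dc = DatasetConstraints(None, value_constraints={"email": [containsEmailConstraint()]})

session = get_or_create_session()
session.log_dataframe(emails, "emails.data", constraints=dc)
print(dc.report())  # "not-an-email" should register as a failure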
- whylogs.core.statistics.constraints.containsCreditCardConstraint(regex_pattern: str = None, name=None, verbose=False)¶
Defines a value constraint with credit card number regex matching operations on the values of a single feature. The constraint defines a default credit card number regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing credit card numbers.
- Parameters
regex_pattern (str (optional)) – User-defined credit card number regex pattern. If supplied, will override the default credit card number regex pattern provided by whylogs.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for credit card number regex matching of the values of a single feature
- whylogs.core.statistics.constraints.dateUtilParseableConstraint(name=None, verbose=False)¶
Defines a value constraint which checks if the values of a single feature can be parsed by the dateutil parser. Useful for checking if the date time values of a feature are compatible with dateutil.
- Parameters
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s values are dateutil parseable
- whylogs.core.statistics.constraints.jsonParseableConstraint(name=None, verbose=False)¶
Defines a value constraint which checks if the values of a single feature are JSON parseable. Useful for checking if the values of a feature can be parsed as JSON.
- Parameters
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s values are JSON parseable
- whylogs.core.statistics.constraints.matchesJsonSchemaConstraint(json_schema, name=None, verbose=False)¶
Defines a value constraint which checks if the values of a single feature match a user-provided JSON schema. Useful for checking if the values of a feature conform to a predefined JSON schema.
- Parameters
json_schema (Union[str, dict] (required)) – A string or dictionary of key-value pairs representing the expected JSON schema.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s values match a user-provided JSON schema
- whylogs.core.statistics.constraints.strftimeFormatConstraint(format, name=None, verbose=False)¶
Defines a value constraint which checks if the values of a single feature are strftime parsable.
- Parameters
format (str (required)) – A string representing the expected strftime format for parsing the values.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s values are strftime parseable
- whylogs.core.statistics.constraints.containsSSNConstraint(regex_pattern: str = None, name=None, verbose=False)¶
Defines a value constraint with social security number (SSN) matching operations on the values of a single feature. The constraint defines a default SSN regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing SSNs.
- Parameters
regex_pattern (str (optional)) – User-defined SSN regex pattern. If supplied, will override the default SSN regex pattern provided by whylogs.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for SSN regex matching of the values of a single feature
- whylogs.core.statistics.constraints.containsURLConstraint(regex_pattern: str = None, name=None, verbose=False)¶
Defines a value constraint with URL regex matching operations on the values of a single feature. The constraint defines a default URL regex pattern, but a user-defined pattern can be supplied to override it. Useful for checking the validity of features with values representing URL addresses.
- Parameters
regex_pattern (str (optional)) – User-defined URL regex pattern. If supplied, will override the default URL regex pattern provided by whylogs.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for URL regex matching of the values of a single feature
- whylogs.core.statistics.constraints.stringLengthEqualConstraint(length: int, name=None, verbose=False)¶
Defines a value constraint which checks if the string values of a single feature have a predefined length.
- Parameters
length (int (required)) – A numeric value which represents the expected length of the string values in the specified feature.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
ValueConstraint - a value constraint for checking if a feature’s string values have a predefined length
- whylogs.core.statistics.constraints.stringLengthBetweenConstraint(lower_value: int, upper_value: int, name=None, verbose=False)¶
Defines a value constraint which checks if the string values’ length of a single feature is in some predefined interval.
- Parameters
lower_value (int (required)) – A numeric value which represents the expected lower bound of the length of the string values in the specified feature.
upper_value (int (required)) – A numeric value which represents the expected upper bound of the length of the string values in the specified feature.
name (str) – Name of the constraint used for reporting
verbose (bool (optional)) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
ValueConstraint - a value constraint for checking if a feature’s string values’
length is in a predefined interval
- whylogs.core.statistics.constraints.quantileBetweenConstraint(quantile_value: Union[int, float], lower_value: Union[int, float], upper_value: Union[int, float], name=None, verbose: bool = False)¶
Defines a summary constraint on the n-th quantile value of a numeric feature. The n-th quantile can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
quantile_value (numeric (required)) – The n-th quantile for which the constraint will be executed
lower_value (numeric (required)) – Represents the lower value limit of the interval for the n-th quantile.
upper_value (numeric (required)) – Represents the upper value limit of the interval for the n-th quantile.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a closed interval of valid values
for the n-th quantile value of a specific feature
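Unlike value constraints, summary constraints are checked against a profile's summary. A minimal sketch, assuming the apply_summary_constraints method documented for DatasetProfile later in this reference; the data, column name, and dataset name are hypothetical.
import pandas as pd
from whylogs import get_or_create_session
from whylogs.core.statistics.constraints import DatasetConstraints, quantileBetweenConstraint

df = pd.DataFrame({"latency_ms": [10, 12, 9, 11, 14]})
# The median (0.5 quantile) should fall in the closed interval [5, 20].
qc = quantileBetweenConstraint(quantile_value=0.5, lower_value=5, upper_value=20)
dc = DatasetConstraints(None, summary_constraints={"latency_ms": [qc]})

session = get_or_create_session()
profile = session.log_dataframe(df, "latency.data", constraints=dc)
profile.apply_summary_constraints()
print(dc.report())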
- whylogs.core.statistics.constraints.columnUniqueValueCountBetweenConstraint(lower_value: int, upper_value: int, name=None, verbose: bool = False)¶
Defines a summary constraint on the cardinality of a specific feature. The cardinality can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking the unique count of values for discrete features.
- Parameters
lower_value (numeric (required)) – Represents the lower value limit of the interval for the feature cardinality.
upper_value (numeric (required)) – Represents the upper value limit of the interval for the feature cardinality.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a closed interval
for the valid cardinality of a specific feature
- whylogs.core.statistics.constraints.columnUniqueValueProportionBetweenConstraint(lower_fraction: float, upper_fraction: float, name=None, verbose: bool = False)¶
Defines a summary constraint on the proportion of unique values of a specific feature. The proportion of unique values can be defined to be between two values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking the frequency of unique values for discrete features.
- Parameters
lower_fraction (fraction between 0 and 1 (required)) – Represents the lower fraction limit of the interval for the feature unique value proportion.
upper_fraction (fraction between 0 and 1 (required)) – Represents the upper fraction limit of the interval for the feature unique value proportion.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a closed interval
for the valid proportion of unique values of a specific feature
- whylogs.core.statistics.constraints.columnExistsConstraint(column: str, name=None, verbose=False)¶
Defines a constraint on the data set schema. Checks if the user-supplied column, identified by column, is present in the data set schema.
- Parameters
column (str (required)) – Represents the name of the column to be checked for existence in the data set.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint which checks the existence of a column
in the current data set.
- whylogs.core.statistics.constraints.numberOfRowsConstraint(n_rows: int, name=None, verbose=False)¶
Defines a constraint on the data set schema. Checks if the number of rows in the data set equals the user-supplied number of rows.
- Parameters
n_rows (int (required)) – Represents the user-supplied expected number of rows.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
SummaryConstraint - a summary constraint which checks the number of rows in the data set
- whylogs.core.statistics.constraints.columnsMatchSetConstraint(reference_set: Set[str], name=None, verbose=False)¶
Defines a constraint on the data set schema. Checks if the set of columns in the data set is equal to the user-supplied set of expected columns.
- Parameters
reference_set (Set[str] (required)) – Represents the expected columns in the current data set.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint which checks if the column set
of the current data set matches the expected column set
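A sketch combining the three table-shape constraints above, assuming DatasetConstraints accepts a table_shape_constraints argument and that apply_table_shape_constraints (documented for DatasetProfile later in this reference) checks them; the data and names are hypothetical.
import pandas as pd
from whylogs import get_or_create_session
from whylogs.core.statistics.constraints import (
    DatasetConstraints,
    columnExistsConstraint,
    columnsMatchSetConstraint,
    numberOfRowsConstraint,
)

df = pd.DataFrame({"user_id": [1, 2, 3], "age": [31, 27, 44]})
shape = [
    columnExistsConstraint("user_id"),
    numberOfRowsConstraint(n_rows=3),
    columnsMatchSetConstraint({"user_id", "age"}),
]
dc = DatasetConstraints(None, table_shape_constraints=shape)

session = get_or_create_session()
profile = session.log_dataframe(df, "shape.data", constraints=dc)
profile.apply_table_shape_constraints()
print(dc.report())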
- whylogs.core.statistics.constraints.columnMostCommonValueInSetConstraint(value_set: Set[Any], name=None, verbose=False)¶
Defines a summary constraint on the most common value of a feature. The most common value of the feature should be in the set of user-supplied values, value_set. Useful for categorical features, for checking if the most common value of a feature belongs in an expected set of common categories.
- Parameters
value_set (Set[Any] (required)) – Represents the set of expected values for a feature. The provided values can be of any type. If the most common value of the feature is not in the values of the user-specified value_set, the constraint will fail.
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a constraint on the most common value of a feature
to belong to a set of user-specified expected values
- whylogs.core.statistics.constraints.columnValuesNotNullConstraint(name=None, verbose=False)¶
Defines a non-null summary constraint on the value of a feature. Useful for features for which there is no tolerance for missing values. The constraint will fail if there is at least one missing value in the specified feature.
- Parameters
name (str) – The name of the constraint.
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining that no missing values
are allowed for the specified feature
- whylogs.core.statistics.constraints.missingValuesProportionBetweenConstraint(lower_fraction: float, upper_fraction: float, name: str = None, verbose: bool = False)¶
Defines a summary constraint on the proportion of missing values of a specific feature. The proportion of missing values can be defined to be between two frequency values. The defined interval is a closed interval, which includes both of its limit points. Useful for checking features with expected amounts of missing values.
- Parameters
lower_fraction (fraction between 0 and 1 (required)) – Represents the lower fraction limit of the interval for the feature missing value proportion.
upper_fraction (fraction between 0 and 1 (required)) – Represents the upper fraction limit of the interval for the feature missing value proportion.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining a closed interval
for the valid proportion of missing values of a specific feature
- whylogs.core.statistics.constraints.columnValuesTypeEqualsConstraint(expected_type: Union[whylogs.proto.InferredType, int], name=None, verbose: bool = False)¶
Defines a summary constraint on the type of the feature values. The type of values should be equal to the user-provided expected type.
- Parameters
expected_type (Union[InferredType, int]) –
whylogs.proto.InferredType.Type – Enumeration of allowed inferred data types. If supplied as an integer value, should be one of:
UNKNOWN = 0, NULL = 1, FRACTIONAL = 2, INTEGRAL = 3, BOOLEAN = 4, STRING = 5
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining that the feature values type should be
equal to a user-provided expected type
- whylogs.core.statistics.constraints.columnValuesTypeInSetConstraint(type_set: Set[int], name=None, verbose: bool = False)¶
Defines a summary constraint on the type of the feature values. The type of values should be in the set of user-provided expected types.
- Parameters
type_set (Set[int]) –
whylogs.proto.InferredType.Type – Enumeration of allowed inferred data types. If supplied as an integer value, should be one of:
UNKNOWN = 0, NULL = 1, FRACTIONAL = 2, INTEGRAL = 3, BOOLEAN = 4, STRING = 5
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining that the feature values type should be
in the set of user-provided expected types
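A sketch of the two type constraints, using the InferredType enumeration from whylogs.proto referenced in the signatures above and the same session/apply_summary_constraints wiring as earlier sketches; the column names and data are hypothetical.
import pandas as pd
from whylogs import get_or_create_session
from whylogs.core.statistics.constraints import (
    DatasetConstraints,
    columnValuesTypeEqualsConstraint,
    columnValuesTypeInSetConstraint,
)
from whylogs.proto import InferredType

df = pd.DataFrame({"price": [9.99, 12.5], "quantity": [3, 4]})
eq = columnValuesTypeEqualsConstraint(expected_type=InferredType.Type.FRACTIONAL)
in_set = columnValuesTypeInSetConstraint(
    type_set={InferredType.Type.INTEGRAL, InferredType.Type.FRACTIONAL}
)
dc = DatasetConstraints(None, summary_constraints={"price": [eq], "quantity": [in_set]})

session = get_or_create_session()
profile = session.log_dataframe(df, "types.data", constraints=dc)
profile.apply_summary_constraints()
print(dc.report())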
- whylogs.core.statistics.constraints.approximateEntropyBetweenConstraint(lower_value: Union[int, float], upper_value: float, name=None, verbose=False)¶
Defines a summary constraint specifying the expected interval of the feature's estimated entropy. The defined interval is a closed interval, which includes both of its limit points.
- Parameters
lower_value (numeric (required)) – Represents the lower value limit of the interval for the feature’s estimated entropy.
upper_value (numeric (required)) – Represents the upper value limit of the interval for the feature’s estimated entropy.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint defining the interval of valid values
of the feature’s estimated entropy
- whylogs.core.statistics.constraints.parametrizedKSTestPValueGreaterThanConstraint(reference_distribution: Union[List[float], numpy.ndarray], p_value=0.05, name=None, verbose=False)¶
Defines a summary constraint specifying the expected upper limit of the p-value for rejecting the null hypothesis of the KS test. Can be used only for continuous data.
- Parameters
reference_distribution (Array-like) – Represents the reference distribution for calculating the KS test p_value of the column. Should be an array-like object of floating point numbers. Only numeric distributions are accepted.
p_value (float) – Represents the reference p_value to compare with the p_value of the test. Should be between 0 and 1, inclusive.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint specifying the upper limit of the
KS test p-value for rejecting the null hypothesis
- whylogs.core.statistics.constraints.columnKLDivergenceLessThanConstraint(reference_distribution: Union[List[Any], numpy.ndarray], threshold: float = 0.5, name=None, verbose: bool = False)¶
Defines a summary constraint specifying the expected upper limit of the threshold for the KL divergence of the specified feature.
- Parameters
reference_distribution (Array-like) – Represents the reference distribution for calculating the KL divergence of the column. Should be an array-like object containing either floating point numbers (numeric distribution) or integers, strings and booleans (categorical distribution), but not a mix of both. Both numeric and categorical distributions are accepted.
threshold (float) – Represents the threshold value; if the KL divergence exceeds it, the constraint will fail.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint specifying the upper threshold of the
feature’s KL divergence
- whylogs.core.statistics.constraints.columnChiSquaredTestPValueGreaterThanConstraint(reference_distribution: Union[List[Any], numpy.ndarray, Mapping[str, int]], p_value: float = 0.05, name=None, verbose: bool = False)¶
Defines a summary constraint specifying the expected upper limit of the p-value for rejecting the null hypothesis of the Chi-Squared test. Can be used only for discrete data.
- Parameters
reference_distribution (Array-like) – Represents the reference distribution for calculating the Chi-Squared test. Should be an array-like object of integer, string or boolean values, or a mapping of type key: value where the keys are the items and the values are the per-item counts. Only categorical distributions are accepted.
p_value (float) – Represents the reference p_value to compare with the p_value of the test. Should be between 0 and 1, inclusive.
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
SummaryConstraint - a summary constraint specifying the upper limit of the
Chi-Squared test p-value for rejecting the null hypothesis
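A sketch of the two distributional constraints above: the KS test takes a continuous reference sample, while the chi-squared test takes categorical per-item counts. The reference data and column names here are synthetic and hypothetical; the session wiring is the same as in earlier sketches.
import numpy as np
import pandas as pd
from whylogs import get_or_create_session
from whylogs.core.statistics.constraints import (
    DatasetConstraints,
    columnChiSquaredTestPValueGreaterThanConstraint,
    parametrizedKSTestPValueGreaterThanConstraint,
)

ks = parametrizedKSTestPValueGreaterThanConstraint(
    reference_distribution=np.random.normal(0.0, 1.0, 1000), p_value=0.05
)
chi = columnChiSquaredTestPValueGreaterThanConstraint(
    reference_distribution={"cat": 50, "dog": 30, "bird": 20}, p_value=0.05
)
dc = DatasetConstraints(None, summary_constraints={"score": [ks], "species": [chi]})

df = pd.DataFrame({"score": np.random.normal(0.0, 1.0, 100),
                   "species": np.random.choice(["cat", "dog", "bird"], 100)})
session = get_or_create_session()
profile = session.log_dataframe(df, "dist.data", constraints=dc)
profile.apply_summary_constraints()
print(dc.report())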
- whylogs.core.statistics.constraints.columnValuesAGreaterThanBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is greater than the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be greater than the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesAGreaterThanEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is greater than or equal to the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be greater than or equal to the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesALessThanBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is less than the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be less than the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesALessThanEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is less than or equal to the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be less than or equal to the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesAEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is equal to the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be equal to the corresponding values of column B
- whylogs.core.statistics.constraints.columnValuesANotEqualBConstraint(column_A: str, column_B: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that each value in column A, specified in column_A, is different from the corresponding value of column B, specified in column_B in the same row.
- Parameters
column_A (str) – The name of column A
column_B (str) – The name of column B
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Returns
MultiColumnValueConstraint - multi-column value constraint specifying that values from column A
should always be different from the corresponding values of column B
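A sketch of one multi-column value constraint, assuming DatasetConstraints accepts a multi_column_value_constraints list (an assumption; the exact parameter is not shown in this reference). Data and column names are hypothetical.
import pandas as pd
from whylogs import get_or_create_session
from whylogs.core.statistics.constraints import (
    DatasetConstraints,
    columnValuesAGreaterThanBConstraint,
)

# Each row's "total" should strictly exceed the same row's "subtotal".
mc = columnValuesAGreaterThanBConstraint(column_A="total", column_B="subtotal")
dc = DatasetConstraints(None, multi_column_value_constraints=[mc])

df = pd.DataFrame({"total": [10, 12, 5], "subtotal": [8, 11, 7]})  # last row fails
session = get_or_create_session()
session.log_dataframe(df, "orders.data", constraints=dc)
print(dc.report())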
- whylogs.core.statistics.constraints.sumOfRowValuesOfMultipleColumnsEqualsConstraint(columns: Union[List[str], Set[str], numpy.array], value: Union[float, int, str], name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that the sum of the values in each row of the provided columns, specified in columns, should be equal to the user-predefined value, specified in value, or to the corresponding value of another column, which will be specified with a name in the value parameter.
- Parameters
columns (List[str]) – List of columns for which the sum of row values should equal the provided value
value (Union[float, int, str]) – Numeric value to compare with the sum of the column row values, or a string indicating a column name for which the row value will be compared with the sum
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
MultiColumnValueConstraint - specifying the expected value of the sum of the values in multiple columns
- whylogs.core.statistics.constraints.columnPairValuesInSetConstraint(column_A: str, column_B: str, value_set: Set[Tuple[Any, Any]], name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that the pair of values of columns A and B should be in a user-predefined set of expected pairs of values.
- Parameters
column_A (str) – The name of the first column
column_B (str) – The name of the second column
value_set (Set[Tuple[Any, Any]]) – A set of expected pairs of values for the columns A and B, in that order
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
MultiColumnValueConstraint - specifying the expected set of value pairs of two columns in the data set
- whylogs.core.statistics.constraints.columnValuesUniqueWithinRow(column_A: str, name=None, verbose: bool = False)¶
Defines a multi-column value constraint which specifies that the values of column A should be unique within each row of the data set.
- Parameters
column_A (str) – The name of the column for which it is expected that the values are unique within each row
name (str) – Name of the constraint used for reporting
verbose (bool) – If true, log every application of this constraint that fails. Useful to identify specific streaming values that fail the constraint.
- Return type
MultiColumnValueConstraint - specifying that the provided column’s values are unique within each row
whylogs.core.statistics.counterstracker¶
Class to keep track of the counts of various data types
- class whylogs.core.statistics.counterstracker.CountersTracker(count=0, true_count=0)¶
Class to keep track of the counts of various data types
- Parameters
count (int, optional) – Current number of objects
true_count (int, optional) – Number of boolean values
null_count (int, optional) – Number of nulls encountered
- increment_count(self)¶
Add one to the count of total objects
- increment_bool(self)¶
Add one to the boolean count
- increment_null(self)¶
Add one to the null count
- merge(self, other)¶
Merge another counter tracker with this one
- Returns
new_tracker – The merged tracker
- Return type
CountersTracker
- to_protobuf(self, null_count=0)¶
Return the object serialized as a protobuf message
- static from_protobuf(message: whylogs.proto.Counters)¶
Load from a protobuf message
- Returns
counters
- Return type
CountersTracker
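A minimal sketch of using the tracker directly; every call below is documented in this section, and the counts chosen are arbitrary.
from whylogs.core.statistics.counterstracker import CountersTracker

tracker = CountersTracker()
tracker.increment_count()   # two objects seen...
tracker.increment_count()
tracker.increment_bool()    # ...one of which was a boolean

other = CountersTracker(count=3)
merged = tracker.merge(other)                  # merging returns a new tracker
msg = merged.to_protobuf()                     # serialize for storage/transport
restored = CountersTracker.from_protobuf(msg)  # and load it back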
whylogs.core.statistics.hllsketch¶
- whylogs.core.statistics.hllsketch.DEFAULT_LG_K = 12¶
- class whylogs.core.statistics.hllsketch.HllSketch(lg_k=None, sketch=None)¶
- update(self, value)¶
- merge(self, other)¶
- get_estimate(self)¶
- get_lower_bound(self, num_std_devs: int = 1)¶
- get_upper_bound(self, num_std_devs: int = 1)¶
- to_protobuf(self)¶
- _serialize_item(self, x)¶
- is_empty(self)¶
- static from_protobuf(message: whylogs.proto.HllSketchMessage)¶
- to_summary(self, num_std_devs=1)¶
whylogs.core.statistics.numbertracker¶
Class to track statistics for numeric data.
- whylogs.core.statistics.numbertracker.DEFAULT_HIST_K = 256¶
- whylogs.core.statistics.numbertracker.logger¶
- class whylogs.core.statistics.numbertracker.NumberTracker(variance: whylogs.core.statistics.datatypes.VarianceTracker = None, floats: whylogs.core.statistics.datatypes.FloatTracker = None, ints: whylogs.core.statistics.datatypes.IntTracker = None, theta_sketch: whylogs.core.statistics.thetasketch.ThetaSketch = None, histogram: datasketches.kll_floats_sketch = None)¶
Class to track statistics for numeric data.
- Parameters
variance – Tracker to follow the variance
floats – Float tracker for tracking all floats
ints – Integer tracker
- variance¶
See above
- floats¶
See above
- ints¶
See above
- theta_sketch¶
Sketch which tracks approximate cardinality
- Type
whylabs.logs.core.statistics.thetasketch.ThetaSketch
- property count(self)¶
- track(self, number)¶
Add a number to statistics tracking
- Parameters
number (int, float) – A numeric value
- merge(self, other)¶
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- static from_protobuf(message: whylogs.proto.NumbersMessage)¶
Load from a protobuf message
- Returns
number_tracker
- Return type
NumberTracker
- to_summary(self)¶
Construct a NumberSummary message
- Returns
summary – Summary of the tracker statistics
- Return type
NumberSummary
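A minimal sketch of tracking a stream of numbers and producing a summary; the values are arbitrary, and every call is documented in this section.
from whylogs.core.statistics.numbertracker import NumberTracker

nt = NumberTracker()
for x in [1.5, 2.0, 3.25, 4.0]:
    nt.track(x)

summary = nt.to_summary()   # NumberSummary protobuf: counts, min/max, quantiles, ...
restored = NumberTracker.from_protobuf(nt.to_protobuf())  # round-trip via protobuf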
whylogs.core.statistics.schematracker¶
Track information about a column’s schema and present datatypes
- whylogs.core.statistics.schematracker.Type¶
- class whylogs.core.statistics.schematracker.SchemaTracker(type_counts: dict = None, legacy_null_count=0)¶
Track information about a column’s schema and present datatypes
- Parameters
type_counts (dict) – If specified, a dictionary containing information about the counts of all data types.
- UNKNOWN_TYPE¶
- NULL_TYPE¶
- CANDIDATE_MIN_FRAC = 0.7¶
- _non_null_type_counts(self)¶
- track(self, item_type)¶
Track an item type
- get_count(self, item_type)¶
Return the count of a given item type
- infer_type(self)¶
Generate a guess at what type the tracked values are.
- Returns
type_guess – The guessed type. See InferredType.Type for candidates
- Return type
object
- _get_most_popular_type(self, total_count)¶
- merge(self, other)¶
Merge another schema tracker with this and return a new one. Does not alter this object.
- Parameters
other (SchemaTracker) –
- Returns
merged – Merged tracker
- Return type
SchemaTracker
- copy(self)¶
Return a copy of this tracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
SchemaMessage
- static from_protobuf(message, legacy_null_count=0)¶
Load from a protobuf message
- Returns
schema_tracker
- Return type
SchemaTracker
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
SchemaSummary
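A minimal sketch of tracking item types and inferring the column type, using the InferredType enumeration from whylogs.proto; the counts are arbitrary.
from whylogs.core.statistics.schematracker import SchemaTracker
from whylogs.proto import InferredType

st = SchemaTracker()
for _ in range(8):
    st.track(InferredType.Type.INTEGRAL)
st.track(InferredType.Type.STRING)

print(st.get_count(InferredType.Type.INTEGRAL))  # 8
guess = st.infer_type()  # INTEGRAL dominates, clearing CANDIDATE_MIN_FRAC (0.7)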
whylogs.core.statistics.stringtracker¶
Track statistics for character positions within a string
Track statistics for strings
- whylogs.core.statistics.stringtracker.MAX_ITEMS_SIZE = 128¶
- whylogs.core.statistics.stringtracker.MAX_SUMMARY_ITEMS = 100¶
- whylogs.core.statistics.stringtracker.logger¶
- class whylogs.core.statistics.stringtracker.CharPosTracker(character_list: str = None)¶
Track statistics for character positions within a string
- Parameters
character_list (str) – string containing all characters to be tracked; this list can include specific unicode characters to track.
- update(self, value: str, character_list: str = None) None ¶
update
- Parameters
value (str) – utf-16 string
character_list (str, optional) – use a specific character_list for the tracked string. Note that modifying it from a previously saved choice will reset the character position map, since it no longer has the same context.
- merge(self, other: CharPosTracker) CharPosTracker ¶
Merges two Char Pos Frequency Maps
- Parameters
other (CharPosTracker) – to be merged
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- static from_protobuf(message: whylogs.proto.CharPosMessage)¶
Load from a CharPosMessage protobuf message
- Return type
CharPosTracker
- to_summary(self)¶
- class whylogs.core.statistics.stringtracker.StringTracker(count: int = None, items: datasketches.frequent_strings_sketch = None, theta_sketch: whylogs.core.statistics.thetasketch.ThetaSketch = None, length: whylogs.core.statistics.numbertracker.NumberTracker = None, token_length: whylogs.core.statistics.numbertracker.NumberTracker = None, char_pos_tracker: CharPosTracker = None, token_method: Callable[[], List[str]] = None)¶
Track statistics for strings
- Parameters
count (int) – Total number of processed values
items (frequent_strings_sketch) – Sketch for tracking string counts
theta_sketch (ThetaSketch) – Sketch for approximate cardinality tracking
length (NumberTracker) – tracks the distribution of length of strings
token_length (NumberTracker) – counts tokens per sentence
token_method (function) – method used to turn a string into tokens
char_pos_tracker (CharPosTracker) –
- update(self, value: str, character_list=None, token_method=None)¶
Add a string to the tracking statistics.
If value is None, nothing will be done
- merge(self, other)¶
Merge the values of this string tracker with another
- Parameters
other (StringTracker) – The other StringTracker
- Returns
new – Merged values
- Return type
StringTracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
StringsMessage
- static from_protobuf(message: whylogs.proto.StringsMessage)¶
Load from a protobuf message
- Returns
string_tracker
- Return type
StringTracker
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
StringsSummary
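A minimal sketch of string tracking; the values are arbitrary, and a None value is ignored by update as documented above.
from whylogs.core.statistics.stringtracker import StringTracker

st = StringTracker()
for s in ["alpha", "beta", "gamma", "beta", None]:
    st.update(s)

summary = st.to_summary()           # StringsSummary: frequent items, unique count, lengths
merged = st.merge(StringTracker())  # merging returns a new tracker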
whylogs.core.statistics.thetasketch¶
A sketch for approximate cardinality tracking.
Generate a summary protobuf message from a thetasketch based on numeric values
- whylogs.core.statistics.thetasketch._copy_union(union)¶
- class whylogs.core.statistics.thetasketch.ThetaSketch(theta_sketch=None, union=None, compact_theta=None)¶
A sketch for approximate cardinality tracking.
A wrapper class for datasketches.update_theta_sketch which implements merging for updatable theta sketches.
Currently, datasketches only implements merging for compact (read-only) theta sketches.
- update(self, value)¶
Update the statistics tracking
- Parameters
value (object) – Value to follow
- merge(self, other)¶
Merge another ThetaSketch with this one, returning a new object
- Parameters
other (ThetaSketch) – Other theta sketch
- Returns
new – New theta sketch with merged statistics
- Return type
ThetaSketch
- get_result(self)¶
Generate a theta sketch
- Returns
compact_sketch – Read-only compact theta sketch with full statistics.
- Return type
datasketches.compact_theta_sketch
- serialize(self)¶
Serialize this object.
Note that serialization only preserves the object approximately.
- Returns
msg – Serialized to bytes
- Return type
bytes
- static deserialize(msg: bytes)¶
Deserialize from a serialized message.
- Parameters
msg (bytes) – Serialized object. Can be a serialized version of: ThetaSketch, datasketches.update_theta_sketch, or datasketches.compact_theta_sketch.
- Returns
sketch – ThetaSketch object
- Return type
ThetaSketch
- to_summary(self, num_std_devs=1)¶
Generate a summary protobuf message
- Parameters
num_std_devs (float) – For estimating bounds
- Returns
summary – Summary protobuf message
- Return type
UniqueCountSummary
- whylogs.core.statistics.thetasketch.numbers_summary(sketch: ThetaSketch, num_std_devs=1)¶
Generate a summary protobuf message from a thetasketch based on numeric values
- Parameters
sketch –
num_std_devs (float) – For estimating bounds
- Returns
summary – Summary protobuf message
- Return type
UniqueCountSummary
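A minimal sketch of cardinality tracking and merging; get_estimate is a method of the datasketches compact sketch returned by get_result, and the values are arbitrary.
from whylogs.core.statistics.thetasketch import ThetaSketch

a, b = ThetaSketch(), ThetaSketch()
for v in ["u1", "u2", "u3"]:
    a.update(v)
for v in ["u3", "u4"]:
    b.update(v)

merged = a.merge(b)                        # new ThetaSketch holding the union
print(merged.get_result().get_estimate())  # ~4.0 distinct values
restored = ThetaSketch.deserialize(merged.serialize())  # byte round-trip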
- class whylogs.core.statistics.CountersTracker(count=0, true_count=0)¶
Class to keep track of the counts of various data types
- Parameters
count (int, optional) – Current number of objects
true_count (int, optional) – Number of boolean values
null_count (int, optional) – Number of nulls encountered
- increment_count(self)¶
Add one to the count of total objects
- increment_bool(self)¶
Add one to the boolean count
- increment_null(self)¶
Add one to the null count
- merge(self, other)¶
Merge another counter tracker with this one
- Returns
new_tracker – The merged tracker
- Return type
CountersTracker
- to_protobuf(self, null_count=0)¶
Return the object serialized as a protobuf message
- static from_protobuf(message: whylogs.proto.Counters)¶
Load from a protobuf message
- Returns
counters
- Return type
CountersTracker
- class whylogs.core.statistics.NumberTracker(variance: whylogs.core.statistics.datatypes.VarianceTracker = None, floats: whylogs.core.statistics.datatypes.FloatTracker = None, ints: whylogs.core.statistics.datatypes.IntTracker = None, theta_sketch: whylogs.core.statistics.thetasketch.ThetaSketch = None, histogram: datasketches.kll_floats_sketch = None)¶
Class to track statistics for numeric data.
- Parameters
variance – Tracker to follow the variance
floats – Float tracker for tracking all floats
ints – Integer tracker
- variance¶
See above
- floats¶
See above
- ints¶
See above
- theta_sketch¶
Sketch which tracks approximate cardinality
- Type
whylabs.logs.core.statistics.thetasketch.ThetaSketch
- property count(self)¶
- track(self, number)¶
Add a number to statistics tracking
- Parameters
number (int, float) – A numeric value
- merge(self, other)¶
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- static from_protobuf(message: whylogs.proto.NumbersMessage)¶
Load from a protobuf message
- Returns
number_tracker
- Return type
NumberTracker
- to_summary(self)¶
Construct a NumberSummary message
- Returns
summary – Summary of the tracker statistics
- Return type
NumberSummary
- class whylogs.core.statistics.SchemaTracker(type_counts: dict = None, legacy_null_count=0)¶
Track information about a column’s schema and present datatypes
- Parameters
type_counts (dict) – If specified, a dictionary containing information about the counts of all data types.
- UNKNOWN_TYPE¶
- NULL_TYPE¶
- CANDIDATE_MIN_FRAC = 0.7¶
- _non_null_type_counts(self)¶
- track(self, item_type)¶
Track an item type
- get_count(self, item_type)¶
Return the count of a given item type
- infer_type(self)¶
Generate a guess at what type the tracked values are.
- Returns
type_guess – The guessed type. See InferredType.Type for candidates
- Return type
object
- _get_most_popular_type(self, total_count)¶
- merge(self, other)¶
Merge another schema tracker with this and return a new one. Does not alter this object.
- Parameters
other (SchemaTracker) –
- Returns
merged – Merged tracker
- Return type
SchemaTracker
- copy(self)¶
Return a copy of this tracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
SchemaMessage
- static from_protobuf(message, legacy_null_count=0)¶
Load from a protobuf message
- Returns
schema_tracker
- Return type
SchemaTracker
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
SchemaSummary
- class whylogs.core.statistics.StringTracker(count: int = None, items: datasketches.frequent_strings_sketch = None, theta_sketch: whylogs.core.statistics.thetasketch.ThetaSketch = None, length: whylogs.core.statistics.numbertracker.NumberTracker = None, token_length: whylogs.core.statistics.numbertracker.NumberTracker = None, char_pos_tracker: CharPosTracker = None, token_method: Callable[[], List[str]] = None)¶
Track statistics for strings
- Parameters
count (int) – Total number of processed values
items (frequent_strings_sketch) – Sketch for tracking string counts
theta_sketch (ThetaSketch) – Sketch for approximate cardinality tracking
length (NumberTracker) – tracks the distribution of length of strings
token_length (NumberTracker) – counts tokens per sentence
token_method (function) – method used to turn a string into tokens
char_pos_tracker (CharPosTracker) –
- update(self, value: str, character_list=None, token_method=None)¶
Add a string to the tracking statistics.
If value is None, nothing will be done
- merge(self, other)¶
Merge the values of this string tracker with another
- Parameters
other (StringTracker) – The other StringTracker
- Returns
new – Merged values
- Return type
StringTracker
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
StringsMessage
- static from_protobuf(message: whylogs.proto.StringsMessage)¶
Load from a protobuf message
- Returns
string_tracker
- Return type
StringTracker
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
StringsSummary
- class whylogs.core.statistics.ThetaSketch(theta_sketch=None, union=None, compact_theta=None)¶
A sketch for approximate cardinality tracking.
A wrapper class for datasketches.update_theta_sketch which implements merging for updatable theta sketches.
Currently, datasketches only implements merging for compact (read-only) theta sketches.
- update(self, value)¶
Update the statistics tracking
- Parameters
value (object) – Value to follow
- merge(self, other)¶
Merge another ThetaSketch with this one, returning a new object
- Parameters
other (ThetaSketch) – Other theta sketch
- Returns
new – New theta sketch with merged statistics
- Return type
ThetaSketch
- get_result(self)¶
Generate a theta sketch
- Returns
compact_sketch – Read-only compact theta sketch with full statistics.
- Return type
datasketches.compact_theta_sketch
- serialize(self)¶
Serialize this object.
Note that serialization only preserves the object approximately.
- Returns
msg – Serialized to bytes
- Return type
bytes
- static deserialize(msg: bytes)¶
Deserialize from a serialized message.
- Parameters
msg (bytes) – Serialized object. Can be a serialized version of: ThetaSketch, datasketches.update_theta_sketch, or datasketches.compact_theta_sketch.
- Returns
sketch – ThetaSketch object
- Return type
ThetaSketch
- to_summary(self, num_std_devs=1)¶
Generate a summary protobuf message
- Parameters
num_std_devs (float) – For estimating bounds
- Returns
summary – Summary protobuf message
- Return type
UniqueCountSummary
- whylogs.core.statistics.__ALL__¶
whylogs.core.types¶
whylogs.core.types.typeddataconverter¶
TODO: implement this using something other than yaml
A class to coerce types on data.
- whylogs.core.types.typeddataconverter.TYPES¶
- whylogs.core.types.typeddataconverter.TYPENUM_TO_NAME¶
- whylogs.core.types.typeddataconverter.INTEGRAL_TYPES¶
- whylogs.core.types.typeddataconverter.FLOAT_TYPES¶
- class whylogs.core.types.typeddataconverter.TypedDataConverter¶
A class to coerce types on data.
To see available types:
>>> from whylogs.core.types.typeddataconverter import TYPES
>>> print("\n".join(sorted(TYPES.keys())))
- static convert(data)¶
Convert data to a typed value
If a data is a string, parse data with yaml. Else, return data unchanged
Note: this method is very slow, since it relies on the complex and python-based implementation of yaml.
- static _is_array_like(value)¶
- static _are_nulls(value)¶
- static get_type(typed_data)¶
Extract the data type of a value. See typeddataconverter.TYPES for available types.
- Parameters
typed_data – Data processed by TypedDataConverter.convert
- Returns
dtype
- Return type
TYPES
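A minimal sketch of type coercion; the expectation that a numeric-looking string parses to a float follows from the yaml-based convert described above.
from whylogs.core.types.typeddataconverter import TYPES, TypedDataConverter

value = TypedDataConverter.convert("3.14")   # strings are parsed with yaml -> 3.14
dtype = TypedDataConverter.get_type(value)   # one of the entries in TYPES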
- class whylogs.core.types.TypedDataConverter¶
A class to coerce types on data.
To see available types:
>>> from whylogs.core.types.typeddataconverter import TYPES
>>> print("\n".join(sorted(TYPES.keys())))
- static convert(data)¶
Convert data to a typed value
If a data is a string, parse data with yaml. Else, return data unchanged
Note: this method is very slow, since it relies on the complex and python-based implementation of yaml.
- static _is_array_like(value)¶
- static _are_nulls(value)¶
- static get_type(typed_data)¶
Extract the data type of a value. See typeddataconverter.TYPES for available types.
- Parameters
typed_data – Data processed by TypedDataConverter.convert
- Returns
dtype
- Return type
TYPES
- whylogs.core.types.__ALL__¶
Submodules¶
whylogs.core.annotation_profiling¶
Helper class to compute minimal bounding box intersections and/or IoU
- class whylogs.core.annotation_profiling.Rectangle(boundingBox, confidence=None, labels=None)¶
Helper class to compute minimal bounding box intersections and/or IoU, and minimal stats properties of a bounding box
- area¶
Description
- Type
float
- aspect_ratio¶
Description
- Type
TYPE
- boundingBox¶
Description
- Type
TYPE
- centroid¶
Description
- Type
TYPE
- confidence¶
Description
- Type
TYPE
- height¶
Description
- Type
TYPE
- labels¶
Description
- Type
TYPE
- width¶
Description
- Type
TYPE
- property x1(self)¶
- property x2(self)¶
- property y1(self)¶
- property y2(self)¶
- intersection(self, Rectangle_2)¶
- iou(self, Rectangle_2)¶
- whylogs.core.annotation_profiling.BB_ATTRIBUTES = ['annotation_count', 'annotation_density', 'area_coverage', 'bb_width', 'bb_height', 'bb_area',...¶
whylogs.core.columnprofile¶
Defines the ColumnProfile class for tracking per-column statistics
Statistics tracking for a column (i.e. a feature)
Statistics tracking for multiple columns (i.e. features)
- whylogs.core.columnprofile._TYPES¶
- whylogs.core.columnprofile._NUMERIC_TYPES¶
- whylogs.core.columnprofile._UNIQUE_COUNT_BOUNDS_STD = 1¶
- class whylogs.core.columnprofile.ColumnProfile(name: str, number_tracker: whylogs.core.statistics.NumberTracker = None, string_tracker: whylogs.core.statistics.StringTracker = None, schema_tracker: whylogs.core.statistics.SchemaTracker = None, counters: whylogs.core.statistics.CountersTracker = None, frequent_items: whylogs.util.dsketch.FrequentItemsSketch = None, cardinality_tracker: whylogs.core.statistics.hllsketch.HllSketch = None, constraints: whylogs.core.statistics.constraints.ValueConstraints = None)¶
Statistics tracking for a column (i.e. a feature)
The primary method for tracking data is track().
- Parameters
name (str (required)) – Name of the column profile
number_tracker (NumberTracker) – Implements numeric data statistics tracking
string_tracker (StringTracker) – Implements string data-type statistics tracking
schema_tracker (SchemaTracker) – Implements tracking of schema-related information
counters (CountersTracker) – Keep count of various things
frequent_items (FrequentItemsSketch) – Keep track of all frequent items, even for mixed datatype features
cardinality_tracker (HllSketch) – Track feature cardinality (even for mixed data types)
constraints (ValueConstraints) – Static assertions to be applied to numeric data tracked in this column
TODO –
Proper TypedDataConverter type checking
Multi-threading/parallelism
- track(self, value, character_list=None, token_method=None)¶
Add value to tracking statistics.
- _unique_count_summary(self) whylogs.proto.UniqueCountSummary ¶
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
ColumnSummary
- generate_constraints(self) whylogs.core.statistics.constraints.SummaryConstraints ¶
- merge(self, other)¶
Merge this columnprofile with another.
- Parameters
other (ColumnProfile) –
- Returns
merged – A new, merged column profile.
- Return type
ColumnProfile
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
ColumnMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
column_profile
- Return type
ColumnProfile
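A minimal sketch of per-column tracking with mixed data types; the column name and values are arbitrary, and every call is documented in this section.
from whylogs.core.columnprofile import ColumnProfile

col = ColumnProfile("age")
for v in [31, 27, None, "44"]:   # mixed types: schema and counters keep track
    col.track(v)

summary = col.to_summary()       # ColumnSummary protobuf
restored = ColumnProfile.from_protobuf(col.to_protobuf())
merged = col.merge(restored)     # counts roughly double after the merge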
- class whylogs.core.columnprofile.MultiColumnProfile(constraints: whylogs.core.statistics.constraints.MultiColumnValueConstraints = None)¶
Statistics tracking for multiple columns (i.e. features)
The primary method for tracking data is track().
- Parameters
constraints (MultiColumnValueConstraints) – Static assertions to be applied to data tracked between all columns
- track(self, column_dict, character_list=None, token_method=None)¶
TODO: Add column_dict to tracking statistics.
- abstract to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
(Multi)ColumnSummary
- merge(self, other) MultiColumnProfile ¶
Merge this columnprofile with another.
- Parameters
other (MultiColumnProfile) –
- Returns
merged – A new, merged multi column profile.
- Return type
MultiColumnProfile
- abstract to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
ColumnMessage
- abstract static from_protobuf(message)¶
Load from a protobuf message
- Returns
column_profile
- Return type
MultiColumnProfile
whylogs.core.datasetprofile¶
Defines the primary interface class for tracking dataset statistics.
Statistics tracking for a dataset.
Create an iterator to return column messages in batches
Generate a dataset profile for a dataframe
Generate a dataset profile for an array
Wrapper method for summary constraints update object creation
- whylogs.core.datasetprofile.SCHEMA_MAJOR_VERSION = 1¶
- whylogs.core.datasetprofile.SCHEMA_MINOR_VERSION = 2¶
- whylogs.core.datasetprofile.logger¶
- whylogs.core.datasetprofile.cudfDataFrame¶
- whylogs.core.datasetprofile.COLUMN_CHUNK_MAX_LEN_IN_BYTES¶
- class whylogs.core.datasetprofile.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, multi_columns: whylogs.core.MultiColumnProfile = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)¶
Statistics tracking for a dataset.
A dataset refers to a collection of columns.
- Parameters
name (str) – A human readable name for the dataset profile. Could be a model name. This is stored under the “name” tag
dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.
session_timestamp (datetime.datetime) – Timestamp of the dataset
columns (dict) – Dictionary lookup of `ColumnProfile`s
tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.
metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.
session_id (str) – The unique session ID run. Should be a UUID.
constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.
- __getstate__(self)¶
- __setstate__(self, serialized_profile)¶
- property name(self)¶
- property tags(self)¶
- property metadata(self)¶
- property session_timestamp(self)¶
- property session_timestamp_ms(self)¶
Return the session timestamp value in epoch milliseconds.
- property total_row_number(self)¶
- add_output_field(self, field: Union[str, List[str]])¶
- track_metrics(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)¶
Function to track metrics based on validation data.
Users may also pass the attribute names associated with the target, prediction, and/or score.
- Parameters
targets (List[Union[str, bool, float, int]]) – actual validated values
predictions (List[Union[str, bool, float, int]]) – inferred/predicted values
scores (List[float], optional) – associated scores for each prediction; all values set to 1 if not passed
target_field (str, optional) – Description
prediction_field (str, optional) – Description
score_field (str, optional) – Description
model_type (ModelType, optional) – Default is Classification type.
- track(self, columns, data=None, character_list=None, token_method=None)¶
Add value(s) to tracking statistics for column(s).
- Parameters
columns (str, dict) – Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.
data (object, None) – Value to track. Specify if columns is a string.
- track_datum(self, column_name, data, character_list=None, token_method=None)¶
- track_multi_column(self, columns)¶
- track_array(self, x: numpy.ndarray, columns=None)¶
Track statistics for a numpy array
- Parameters
x (np.ndarray) – 2D array to track.
columns (list) – Optional column labels
- track_dataframe(self, df: pandas.DataFrame, character_list=None, token_method=None)¶
Track statistics for a dataframe
- Parameters
df (pandas.DataFrame) – DataFrame to track
- to_properties(self)¶
Return dataset profile related metadata
- Returns
properties – The metadata as a protobuf object.
- Return type
DatasetProperties
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
DatasetSummary
- generate_constraints(self) whylogs.core.statistics.constraints.DatasetConstraints ¶
Assemble a sparse dict of constraints for all features.
- Returns
summary – Protobuf constraints message.
- Return type
DatasetConstraints
- flat_summary(self)¶
Generate and flatten a summary of the statistics.
See flatten_summary() for a description
- _column_message_iterator(self)¶
- chunk_iterator(self)¶
Generate an iterator to iterate over chunks of data
- validate(self)¶
Sanity check for this object. Raises an AssertionError if invalid
- merge(self, other)¶
Merge this profile with another dataset profile object.
We will use metadata and timestamps from the current DatasetProfile in the result.
This operation will drop the metadata from the ‘other’ profile object.
- Parameters
other (DatasetProfile) –
- Returns
merged – New, merged DatasetProfile
- Return type
DatasetProfile
- _do_merge(self, other)¶
- merge_strict(self, other)¶
Merge this profile with another dataset profile object. This throws exception if session_id, timestamps and tags don’t match.
This operation will drop the metadata from the ‘other’ profile object.
- Parameters
other (DatasetProfile) –
- Returns
merged – New, merged DatasetProfile
- Return type
DatasetProfile
- serialize_delimited(self) bytes ¶
Write out in delimited format (data is prefixed with the length of the datastream).
This is useful when you are streaming multiple dataset profile objects
- Returns
data – A sequence of bytes
- Return type
bytes
- to_protobuf(self) whylogs.proto.DatasetProfileMessage ¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
DatasetProfileMessage
- write_protobuf(self, protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) None ¶
Write the dataset profile to disk in binary format
- Parameters
protobuf_path (str) – local path or any path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how. The parent directory must already exist
delimited_file (bool, optional) – whether to prefix the data with the length of output or not. Default is True
- static read_protobuf(protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) DatasetProfile ¶
Parse a protobuf file and return a DatasetProfile object
- Parameters
protobuf_path (str) – the path of the protobuf data, can be local or any other path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how
delimited_file (bool, optional) – whether the data is delimited or not. Default is True
- Returns
whylogs.DatasetProfile object from the protobuf
- Return type
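A round-trip sketch of the two methods above (the file name is illustrative):
from whylogs import DatasetProfile

profile = DatasetProfile(name="example")
profile.track("feature", 1.0)

profile.write_protobuf("profile.bin")                  # delimited by default
restored = DatasetProfile.read_protobuf("profile.bin")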
- static from_protobuf(message: whylogs.proto.DatasetProfileMessage) DatasetProfile ¶
Load from a protobuf message
- Parameters
message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()
- Returns
dataset_profile
- Return type
- static from_protobuf_string(data: bytes) DatasetProfile ¶
Deserialize a serialized DatasetProfileMessage
- Parameters
data (bytes) – The serialized message
- Returns
profile – The deserialized dataset profile
- Return type
- static _parse_delimited_generator(data: bytes)¶
- static parse_delimited_single(data: bytes, pos=0)¶
Parse a single delimited entry from a byte stream
- Parameters
data (bytes) – The bytestream
pos (int) – The starting position. Default is zero
- Returns
pos (int) – Current position in the stream after parsing
profile (DatasetProfile) – A dataset profile
- static parse_delimited(data: bytes)¶
Parse delimited data (i.e. data prefixed with the message length).
Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.
- Parameters
data (bytes) – The input byte stream
- Returns
profiles – List of all Dataset profile objects
- Return type
list
- apply_summary_constraints(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)¶
- apply_table_shape_constraints(self, table_shape_constraints: Optional[whylogs.core.statistics.constraints.SummaryConstraints] = None)¶
- whylogs.core.datasetprofile.columns_chunk_iterator(iterator, marker: str)¶
Create an iterator to return column messages in batches
- Parameters
iterator – An iterator which returns protobuf column messages
marker – Value used to mark a group of column messages
- whylogs.core.datasetprofile.dataframe_profile(df: pandas.DataFrame, name: str = None, timestamp: datetime.datetime = None)¶
Generate a dataset profile for a dataframe
- Parameters
df (pandas.DataFrame) – Dataframe to track, treated as a complete dataset.
name (str) – Name of the dataset
timestamp (datetime.datetime, float) – Timestamp of the dataset. Defaults to current UTC time. Can be a datetime or UTC epoch seconds.
- Returns
prof
- Return type
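For example, a minimal sketch profiling a small DataFrame:
import pandas as pd
from whylogs.core.datasetprofile import dataframe_profile

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
prof = dataframe_profile(df, name="my-dataset")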
- whylogs.core.datasetprofile.array_profile(x: numpy.ndarray, name: str = None, timestamp: datetime.datetime = None, columns: list = None)¶
Generate a dataset profile for an array
- Parameters
x (np.ndarray) – Array-like object to track. Will be treated as a full dataset
name (str) – Name of the dataset
timestamp (datetime.datetime) – Timestamp of the dataset. Defaults to current UTC time
columns (list) – Optional column labels
- Returns
prof
- Return type
- whylogs.core.datasetprofile._create_column_profile_summary_object(number_summary: whylogs.proto.NumberSummary, **kwargs)¶
Wrapper method for summary constraints update object creation
- Parameters
number_summary (NumberSummary) – Summary object generated from NumberTracker Used to unpack the metrics as separate items in the dictionary
kwargs (Summary objects or datasketches objects) – Used to update specific constraints that need additional calculations
- Return type
Anonymous object containing all of the metrics as fields with their corresponding values
whylogs.core.flatten_datasetprofile¶
Functions for flattening dataset summaries:
- Flatten a DatasetSummary
- Flatten quantiles from a dataset summary
- Flatten string quantiles from a dataset summary
- Flatten histograms from a dataset summary
- Flatten frequent strings summaries from a dataset summary
- Get a dataframe from scalar values flattened from a dataset summary
- whylogs.core.flatten_datasetprofile.TYPENUM_COLUMN_NAMES¶
- whylogs.core.flatten_datasetprofile.SCALAR_NAME_MAPPING¶
- whylogs.core.flatten_datasetprofile.flatten_summary(dataset_summary: whylogs.proto.DatasetSummary) dict ¶
Flatten a DatasetSummary
- Parameters
dataset_summary (DatasetSummary) – Summary to flatten
- Returns
data –
A dictionary with the following keys:
- summary : pandas.DataFrame
Per-column summary statistics
- hist : pandas.Series
Series of histograms with (column name, histogram) key, value pairs. Histograms are formatted as a pandas.Series
- frequent_strings : pandas.Series
Series of frequent string counts with (column name, counts) key, value pairs. counts are a pandas Series.
- Return type
dict
Notes
Some relevant info on the summary mapping:
>>> from whylogs.core.datasetprofile import SCALAR_NAME_MAPPING
>>> import json
>>> print(json.dumps(SCALAR_NAME_MAPPING, indent=2))
- whylogs.core.flatten_datasetprofile._quantile_strings(quantiles: list)¶
- whylogs.core.flatten_datasetprofile.flatten_dataset_quantiles(dataset_summary: whylogs.proto.DatasetSummary)¶
Flatten quantiles from a dataset summary
- whylogs.core.flatten_datasetprofile.flatten_dataset_string_quantiles(dataset_summary: whylogs.proto.DatasetSummary)¶
Flatten string quantiles from a dataset summary
- whylogs.core.flatten_datasetprofile.flatten_dataset_histograms(dataset_summary: whylogs.proto.DatasetSummary)¶
Flatten histograms from a dataset summary
- whylogs.core.flatten_datasetprofile.flatten_dataset_frequent_strings(dataset_summary: whylogs.proto.DatasetSummary)¶
Flatten frequent strings summaries from a dataset summary
- whylogs.core.flatten_datasetprofile.get_dataset_frame(dataset_summary: whylogs.proto.DatasetSummary, mapping: dict = None)¶
Get a dataframe from scalar values flattened from a dataset summary
- Parameters
dataset_summary (DatasetSummary) – The dataset summary.
mapping (dict, optional) – Override the default variable mapping.
- Returns
summary – Scalar values, flattened and re-named according to mapping
- Return type
pd.DataFrame
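A minimal sketch tying these functions together (the DataFrame contents are illustrative):
import pandas as pd
from whylogs.core.datasetprofile import dataframe_profile
from whylogs.core.flatten_datasetprofile import flatten_summary, get_dataset_frame

prof = dataframe_profile(pd.DataFrame({"a": [1, 2, 3]}), name="example")
summary = prof.to_summary()             # DatasetSummary protobuf
flat = flatten_summary(summary)         # dict with "summary", "hist", "frequent_strings"
stats_df = get_dataset_frame(summary)   # per-column scalar statistics as a DataFrame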
whylogs.core.image_profiling¶
Utilities for image profiling:
- TrackImage: a class that computes image features and visits profiles so that image features can be sketched
- Compute statistics data for a PIL Image
- Grab statistics data from a PIL ImageStats.Stat
- whylogs.core.image_profiling.logger¶
- whylogs.core.image_profiling.ImageType¶
- whylogs.core.image_profiling.DEFAULT_IMAGE_FEATURES = []¶
- whylogs.core.image_profiling._DEFAULT_TAGS_ATTRIBUTES = ['ImagePixelWidth', 'ImagePixelHeight', 'Colorspace']¶
- whylogs.core.image_profiling._IMAGE_HSV_CHANNELS = ['Hue', 'Saturation', 'Brightness']¶
- whylogs.core.image_profiling._STATS_PROPERTIES = ['mean', 'stddev']¶
- whylogs.core.image_profiling._DEFAULT_STAT_ATTRIBUTES¶
- whylogs.core.image_profiling._METADATA_DEFAULT_ATTRIBUTES¶
- whylogs.core.image_profiling.image_loader(path: str = None) PIL.Image.Image ¶
- class whylogs.core.image_profiling.TrackImage(filepath: str = None, img: PIL.Image.Image = None, feature_transforms: List[Callable] = DEFAULT_IMAGE_FEATURES, feature_name: str = '', metadata_attributes: Union[str, List[str]] = _METADATA_DEFAULT_ATTRIBUTES)¶
A class that computes image features and visits profiles so that the image features can be sketched.
- feature_name¶
Name given to this image feature; will prefix all image-based features
- Type
str
- feature_transforms¶
Feature transforms to be applied to image data.
- Type
List[Callable]
- img¶
the PIL.Image
- Type
PIL.Image
- metadata_attributes¶
metadata attributes to track
- Type
Union[str, List[str]]
- __call__(self, profiles)¶
Call method to add image data and metadata to associated profiles
- Parameters
profiles (Union[List[DatasetProfile], DatasetProfile]) – DatasetProfile
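A minimal sketch of visiting a profile with TrackImage (the image path is hypothetical):
from whylogs import DatasetProfile
from whylogs.core.image_profiling import TrackImage

profile = DatasetProfile(name="images")
visitor = TrackImage(filepath="example.jpg")  # hypothetical local image
visitor(profile)                              # sketches image features into the profile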
- whylogs.core.image_profiling.get_pil_image_statistics(img: PIL.Image.Image, channels: List[str] = _IMAGE_HSV_CHANNELS, image_stats: List[str] = _STATS_PROPERTIES) Dict ¶
Compute statistics data for a PIL Image
- Parameters
img (ImageType) – PIL Image
- Returns
dictionary of metadata
- Return type
Dict
- whylogs.core.image_profiling.get_pil_image_metadata(img: PIL.Image.Image) Dict ¶
Grab statistics data from a PIL ImageStats.Stat
- Parameters
img (ImageType) – PIL Image
- Returns
dictionary of metadata
- Return type
Dict
- whylogs.core.image_profiling.image_based_metadata(img)¶
whylogs.core.model_profile¶
Model class for sketching metrics for model outputs
- whylogs.core.model_profile.SUPPORTED_TYPES = ['binary', 'multiclass']¶
- class whylogs.core.model_profile.ModelProfile(output_fields=None, metrics: whylogs.core.metrics.model_metrics.ModelMetrics = None)¶
Model class for sketching metrics for model outputs
- metrics¶
the model metrics object
- Type
- model_type¶
Type of model: CLASSIFICATION, REGRESSION, UNKNOWN, etc.
- Type
ModelType
- output_fields¶
list of fields that map to model outputs
- Type
list
- add_output_field(self, field: str)¶
- compute_metrics(self, targets, predictions, scores=None, model_type: whylogs.proto.ModelType = None, target_field=None, prediction_field=None, score_field=None)¶
Compute and track metrics for confusion_matrix
- Parameters
targets (List) – targets (or actuals) for validation; if these are floats, the model is assumed to be a regression-type model
predictions (List) – predictions (or inferred values)
scores (List, optional) – associated scores for each prediction (for binary and multiclass problems)
target_field (str, optional) –
prediction_field (str, optional) –
score_field (str, optional (for binary and multiclass problems)) –
- Raises
NotImplementedError –
- to_protobuf(self)¶
- classmethod from_protobuf(cls, message: whylogs.proto.ModelProfileMessage)¶
- merge(self, model_profile)¶
whylogs.core.summaryconverters¶
Library module defining functions for generating summaries:
- Generate a protobuf summary message from a datasketches theta sketch
- Generate a protobuf summary message from a string sketch
- Calculate quantiles from a data sketch
- Calculate the specified quantile from a data sketch
- Generate a summary of a kll_floats_sketch, including a histogram
- Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary
- Compute the Kolmogorov-Smirnov test p-value of two continuous distributions
- Calculate the KL divergence between a target feature and a reference feature
- Calculate the estimated KL divergence for two continuous distributions
- Calculate the estimated KL divergence for two discrete distributions
- Calculate the Chi-Squared test p-value for two discrete distributions
- whylogs.core.summaryconverters.MAX_HIST_BUCKETS = 30¶
- whylogs.core.summaryconverters.HIST_AVG_NUMBER_PER_BUCKET = 4.0¶
- whylogs.core.summaryconverters.QUANTILES = [0.0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1.0]¶
- whylogs.core.summaryconverters.logger¶
- whylogs.core.summaryconverters.from_sketch(sketch: datasketches.update_theta_sketch, num_std_devs: float = 1)¶
Generate a protobuf summary message from a datasketches theta sketch
- Parameters
sketch – Theta sketch to summarize
num_std_devs – Number of standard deviations for calculating bounds
- Returns
summary
- Return type
UniqueCountSummary
- whylogs.core.summaryconverters.from_string_sketch(sketch: datasketches.frequent_strings_sketch)¶
Generate a protobuf summary message from a string sketch
- Parameters
sketch – Frequent strings sketch
- Returns
summary
- Return type
FrequentStringsSummary
- whylogs.core.summaryconverters.quantiles_from_sketch(sketch: datasketches.kll_floats_sketch, quantiles=None)¶
Calculate quantiles from a data sketch
- Parameters
sketch (kll_floats_sketch) – Data sketch
quantiles (list-like) – Override the default quantiles. Should be a list of values from 0 to 1 inclusive.
- whylogs.core.summaryconverters.single_quantile_from_sketch(sketch: datasketches.kll_floats_sketch, quantile: float)¶
Calculate the specified quantile from a data sketch
- Parameters
sketch (kll_floats_sketch) – Data sketch
quantile (float) – Override the default quantiles to a single quantile. Should be a value from 0 to 1 inclusive.
- Return type
Anonymous object with one field equal to the quantile value
- whylogs.core.summaryconverters._calculate_bins(end: float, start: float, n: int, avg_per_bucket: float, max_buckets: int)¶
- whylogs.core.summaryconverters.histogram_from_sketch(sketch: datasketches.kll_floats_sketch, max_buckets: int = None, avg_per_bucket: int = None)¶
Generate a summary of a kll_floats_sketch, including a histogram
- Parameters
sketch (kll_floats_sketch) – Data sketch
max_buckets (int) – Override the default maximum number of buckets
avg_per_bucket (int) – Override the default target number of items per bucket.
- Returns
histogram – Protobuf histogram message
- Return type
HistogramSummary
- whylogs.core.summaryconverters.entropy_from_column_summary(summary: whylogs.proto.ColumnSummary, histogram: datasketches.kll_floats_sketch)¶
Calculate the estimated entropy for a ColumnProfile, using the ColumnSummary. Can be used for both continuous and discrete types of data.
- Parameters
summary (ColumnSummary) – Protobuf summary message
histogram (datasketches.kll_floats_sketch) – Data sketch for quantiles
- Returns
entropy – Estimated entropy value, np.nan if the inferred data type of the column is not categorical or numeric
- Return type
float
- whylogs.core.summaryconverters.ks_test_compute_p_value(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)¶
Compute the Kolmogorov-Smirnov test p-value of two continuous distributions. Uses the quantile values and the corresponding CDFs to calculate the approximate KS statistic. Only applicable to continuous distributions. The null hypothesis expects the samples to come from the same distribution.
- Parameters
target_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the target distribution’s values
reference_distribution (datasketches.kll_floats_sketch) – A kll_floats_sketch (quantiles sketch) from the reference (expected) distribution’s values Can be generated from a theoretical distribution, or another sample for the same feature.
- Returns
p_value (float)
The estimated p-value from the parametrized KS test, applied on the target and reference distributions’
kll_floats_sketch summaries
- whylogs.core.summaryconverters.compute_kl_divergence(target_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage], reference_distribution: Union[datasketches.kll_floats_sketch, whylogs.proto.ReferenceDistributionDiscreteMessage])¶
Calculates the KL divergence between a target feature and a reference feature. Applicable to both continuous and discrete distributions. Uses the pmf and the datasketches.kll_floats_sketch to calculate the KL divergence in the continuous case. Uses the top frequent items to calculate the KL divergence in the discrete case.
- Parameters
target_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The target distribution. Should be a datasketches.kll_floats_sketch if the target distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the target distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution must be of the same type.
reference_distribution (Union[kll_floats_sketch, ReferenceDistributionDiscreteMessage]) – The reference distribution. Should be a datasketches.kll_floats_sketch if the reference distribution is continuous. Should be a ReferenceDistributionDiscreteMessage if the reference distribution is discrete. Both the target distribution, specified in target_distribution, and the reference distribution, specified in reference_distribution must be of the same type.
- Returns
kl_divergence (float)
The estimated value of the KL divergence between the target and the reference feature
- whylogs.core.summaryconverters._compute_kl_divergence_continuous_distributions(target_distribution: datasketches.kll_floats_sketch, reference_distribution: datasketches.kll_floats_sketch)¶
Calculates the estimated KL divergence for two continuous distributions. Uses the datasketches.kll_floats_sketch sketch to calculate the KL divergence based on the PMFs. Only applicable to continuous distributions.
- Parameters
target_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the target feature’s distribution.
reference_distribution (datasketches.kll_floats_sketch) – The quantiles summary of the reference feature’s distribution.
- Returns
kl_divergence (float)
The estimated KL divergence between two continuous features.
- whylogs.core.summaryconverters._compute_kl_divergence_discrete_distributions(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)¶
Calculates the estimated KL divergence for two discrete distributions. Uses the frequent items summary to calculate the estimated frequencies of items in each distribution. Only applicable to discrete distributions.
- Parameters
target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
- Returns
kl_divergence (float)
The estimated KL divergence between two discrete features.
- whylogs.core.summaryconverters.compute_chi_squared_test_p_value(target_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage, reference_distribution: whylogs.proto.ReferenceDistributionDiscreteMessage)¶
Calculates the Chi-Squared test p-value for two discrete distributions. Uses the top frequent items summary, unique count estimate and total count estimate for each feature, to calculate the estimated Chi-Squared statistic. Applicable only to discrete distributions.
- Parameters
target_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the target feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
reference_distribution (ReferenceDistributionDiscreteMessage) – The summary message of the reference feature’s distribution. Should be a ReferenceDistributionDiscreteMessage containing the frequent items, unique, and total count summaries.
- Returns
p_value (float)
The estimated p-value from the Chi-Squared test, applied on the target and reference distributions’
frequent and unique items summaries
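As a sketch of the continuous case, two kll_floats_sketch summaries can be compared directly with the KS test above (the sample values and sketch size are illustrative):
import datasketches
from whylogs.core.summaryconverters import ks_test_compute_p_value

target = datasketches.kll_floats_sketch(256)
reference = datasketches.kll_floats_sketch(256)
for v in [0.1, 0.2, 0.3, 0.4, 0.5]:
    target.update(v)
for v in [0.1, 0.25, 0.4, 0.55, 0.7]:
    reference.update(v)

p_value = ks_test_compute_p_value(target, reference)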
Package Contents¶
- ColumnProfile: statistics tracking for a column (i.e. a feature)
- MultiColumnProfile: statistics tracking for multiple columns
- DatasetProfile: statistics tracking for a dataset
- TrackImage: a class that computes image features and visits profiles so that image features can be sketched
- whylogs.core.BB_ATTRIBUTES = ['annotation_count', 'annotation_density', 'area_coverage', 'bb_width', 'bb_height', 'bb_area',...¶
- class whylogs.core.TrackBB(filepath: str = None, obj: Dict = None, feature_transforms: Optional[List[Callable]] = None, feature_names: str = '')¶
- calculate_metrics(self)¶
- __call__(self, profiles)¶
- class whylogs.core.ColumnProfile(name: str, number_tracker: whylogs.core.statistics.NumberTracker = None, string_tracker: whylogs.core.statistics.StringTracker = None, schema_tracker: whylogs.core.statistics.SchemaTracker = None, counters: whylogs.core.statistics.CountersTracker = None, frequent_items: whylogs.util.dsketch.FrequentItemsSketch = None, cardinality_tracker: whylogs.core.statistics.hllsketch.HllSketch = None, constraints: whylogs.core.statistics.constraints.ValueConstraints = None)¶
Statistics tracking for a column (i.e. a feature)
The primary method for adding data is track().
- Parameters
name (str (required)) – Name of the column profile
number_tracker (NumberTracker) – Implements numeric data statistics tracking
string_tracker (StringTracker) – Implements string data-type statistics tracking
schema_tracker (SchemaTracker) – Implements tracking of schema-related information
counters (CountersTracker) – Keep count of various things
frequent_items (FrequentItemsSketch) – Keep track of all frequent items, even for mixed datatype features
cardinality_tracker (HllSketch) – Track feature cardinality (even for mixed data types)
constraints (ValueConstraints) – Static assertions to be applied to numeric data tracked in this column
TODO –
Proper TypedDataConverter type checking
Multi-threading/parallelism
- track(self, value, character_list=None, token_method=None)¶
Add value to tracking statistics.
- _unique_count_summary(self) whylogs.proto.UniqueCountSummary ¶
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
ColumnSummary
- generate_constraints(self) whylogs.core.statistics.constraints.SummaryConstraints ¶
- merge(self, other)¶
Merge this column profile with another.
- Parameters
other (ColumnProfile) –
- Returns
merged – A new, merged column profile.
- Return type
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
ColumnMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
column_profile
- Return type
- class whylogs.core.MultiColumnProfile(constraints: whylogs.core.statistics.constraints.MultiColumnValueConstraints = None)¶
Statistics tracking for multiple columns (i.e. features)
The primary method for adding data is track().
- Parameters
constraints (MultiColumnValueConstraints) – Static assertions to be applied to data tracked between all columns
- track(self, column_dict, character_list=None, token_method=None)¶
TODO: Add column_dict to tracking statistics.
- abstract to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
(Multi)ColumnSummary
- merge(self, other) MultiColumnProfile ¶
Merge this column profile with another.
- Parameters
other (MultiColumnProfile) –
- Returns
merged – A new, merged multi column profile.
- Return type
- abstract to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
ColumnMessage
- abstract static from_protobuf(message)¶
Load from a protobuf message
- Returns
column_profile
- Return type
- class whylogs.core.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, multi_columns: whylogs.core.MultiColumnProfile = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)¶
Statistics tracking for a dataset.
A dataset refers to a collection of columns.
- Parameters
name (str) – A human readable name for the dataset profile. Could be model name. This is stored under “name” tag
dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.
session_timestamp (datetime.datetime) – Timestamp for the creation of the session
columns (dict) – Dictionary lookup of `ColumnProfile`s
tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.
metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.
session_id (str) – The unique session ID run. Should be a UUID.
constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.
- __getstate__(self)¶
- __setstate__(self, serialized_profile)¶
- property name(self)¶
- property tags(self)¶
- property metadata(self)¶
- property session_timestamp(self)¶
- property session_timestamp_ms(self)¶
Return the session timestamp value in epoch milliseconds.
- property total_row_number(self)¶
- add_output_field(self, field: Union[str, List[str]])¶
- track_metrics(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)¶
Track metrics based on validation data.
The user may also pass the attribute names associated with the target, prediction, and/or score.
- Parameters
targets (List[Union[str, bool, float, int]]) – actual validated values
predictions (List[Union[str, bool, float, int]]) – inferred/predicted values
scores (List[float], optional) – associated scores for each prediction; all values are set to 1 if not passed
target_field (str, optional) – name of the attribute containing the targets
prediction_field (str, optional) – name of the attribute containing the predictions
score_field (str, optional) – name of the attribute containing the scores
model_type (ModelType, optional) – default is the classification model type.
- track(self, columns, data=None, character_list=None, token_method=None)¶
Add value(s) to tracking statistics for column(s).
- Parameters
columns (str, dict) – Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.
data (object, None) – Value to track. Specify if columns is a string.
- track_datum(self, column_name, data, character_list=None, token_method=None)¶
- track_multi_column(self, columns)¶
- track_array(self, x: numpy.ndarray, columns=None)¶
Track statistics for a numpy array
- Parameters
x (np.ndarray) – 2D array to track.
columns (list) – Optional column labels
- track_dataframe(self, df: pandas.DataFrame, character_list=None, token_method=None)¶
Track statistics for a dataframe
- Parameters
df (pandas.DataFrame) – DataFrame to track
- to_properties(self)¶
Return dataset profile related metadata
- Returns
properties – The metadata as a protobuf object.
- Return type
DatasetProperties
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
DatasetSummary
- generate_constraints(self) whylogs.core.statistics.constraints.DatasetConstraints ¶
Assemble a sparse dict of constraints for all features.
- Returns
summary – Protobuf constraints message.
- Return type
- flat_summary(self)¶
Generate and flatten a summary of the statistics.
See
flatten_summary()
for a description
- _column_message_iterator(self)¶
- chunk_iterator(self)¶
Generate an iterator to iterate over chunks of data
- validate(self)¶
Sanity check for this object. Raises an AssertionError if invalid
- merge(self, other)¶
Merge this profile with another dataset profile object.
We will use metadata and timestamps from the current DatasetProfile in the result.
This operation will drop the metadata from the ‘other’ profile object.
- Parameters
other (DatasetProfile) –
- Returns
merged – New, merged DatasetProfile
- Return type
- _do_merge(self, other)¶
- merge_strict(self, other)¶
Merge this profile with another dataset profile object. This throws an exception if the session_id, timestamps, and tags don’t match.
This operation will drop the metadata from the ‘other’ profile object.
- Parameters
other (DatasetProfile) –
- Returns
merged – New, merged DatasetProfile
- Return type
- serialize_delimited(self) bytes ¶
Write out in delimited format (data is prefixed with the length of the datastream).
This is useful when you are streaming multiple dataset profile objects
- Returns
data – A sequence of bytes
- Return type
bytes
- to_protobuf(self) whylogs.proto.DatasetProfileMessage ¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
DatasetProfileMessage
- write_protobuf(self, protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) None ¶
Write the dataset profile to disk in binary format
- Parameters
protobuf_path (str) – local path or any path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how. The parent directory must already exist
delimited_file (bool, optional) – whether to prefix the data with the length of output or not. Default is True
- static read_protobuf(protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) DatasetProfile ¶
Parse a protobuf file and return a DatasetProfile object
- Parameters
protobuf_path (str) – the path of the protobuf data, can be local or any other path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how
delimited_file (bool, optional) – whether the data is delimited or not. Default is True
- Returns
whylogs.DatasetProfile object from the protobuf
- Return type
- static from_protobuf(message: whylogs.proto.DatasetProfileMessage) DatasetProfile ¶
Load from a protobuf message
- Parameters
message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()
- Returns
dataset_profile
- Return type
- static from_protobuf_string(data: bytes) DatasetProfile ¶
Deserialize a serialized DatasetProfileMessage
- Parameters
data (bytes) – The serialized message
- Returns
profile – The deserialized dataset profile
- Return type
- static _parse_delimited_generator(data: bytes)¶
- static parse_delimited_single(data: bytes, pos=0)¶
Parse a single delimited entry from a byte stream
- Parameters
data (bytes) – The bytestream
pos (int) – The starting position. Default is zero
- Returns
pos (int) – Current position in the stream after parsing
profile (DatasetProfile) – A dataset profile
- static parse_delimited(data: bytes)¶
Parse delimited data (i.e. data prefixed with the message length).
Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.
- Parameters
data (bytes) – The input byte stream
- Returns
profiles – List of all Dataset profile objects
- Return type
list
- apply_summary_constraints(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)¶
- apply_table_shape_constraints(self, table_shape_constraints: Optional[whylogs.core.statistics.constraints.SummaryConstraints] = None)¶
- whylogs.core.METADATA_DEFAULT_ATTRIBUTES¶
- class whylogs.core.TrackImage(filepath: str = None, img: PIL.Image.Image = None, feature_transforms: List[Callable] = DEFAULT_IMAGE_FEATURES, feature_name: str = '', metadata_attributes: Union[str, List[str]] = _METADATA_DEFAULT_ATTRIBUTES)¶
A class that computes image features and visits profiles so that the image features can be sketched.
- feature_name¶
Name given to this image feature; will prefix all image-based features
- Type
str
- feature_transforms¶
Feature transforms to be applied to image data.
- Type
List[Callable]
- img¶
the PIL.Image
- Type
PIL.Image
- metadata_attributes¶
metadata attributes to track
- Type
Union[str, List[str]]
- __call__(self, profiles)¶
Call method to add image data and metadata to associated profiles
- Parameters
profiles (Union[List[DatasetProfile], DatasetProfile]) – DatasetProfile
- whylogs.core.__ALL__¶
whylogs.features
¶
Submodules¶
whylogs.features.autosegmentation¶
Entropy-based automatic segmentation helpers:
- Entropy calculation. If normalized, use log cardinality.
- Weighted entropy calculation across split columns.
- Information gain ratio calculation.
- Estimate the most important features and values on which to segment
- whylogs.features.autosegmentation._entropy(series: pandas.Series, normalized: bool = True) numpy.float64 ¶
Entropy calculation. If normalized, use log cardinality.
- whylogs.features.autosegmentation._weighted_entropy(df: pandas.DataFrame, split_columns: List[Optional[str]], target_column_name: str, normalized: bool = True)¶
Weighted entropy calculation across the given split columns. If normalized, use log cardinality.
- whylogs.features.autosegmentation._information_gain_ratio(df: pandas.DataFrame, prev_split_columns: List[Optional[str]], column_name: str, target_column_name: str, normalized: bool = True)¶
Information gain ratio calculation based on weighted entropy. If normalized, use log cardinality.
- whylogs.features.autosegmentation._find_best_split(df: pandas.DataFrame, prev_split_columns: List[str], valid_column_names: List[str], target_column_name: str)¶
- whylogs.features.autosegmentation._estimate_segments(df: pandas.DataFrame, target_field: str = None, max_segments: int = 30) Optional[Union[List[Dict], List[str]]] ¶
Estimates the most important features and values on which to segment data profiling, using entropy-based methods.
If no target column is provided, the maximum-entropy column is substituted.
- Parameters
df – the dataframe of data to profile
target_field – target field (optional)
max_segments – upper threshold for total combinations of segments, default 30
- Returns
a list of segmentation feature names
whylogs.features.transforms¶
Image feature transforms:
- ComposeTransforms: outputs the composition of each transformation passed in transforms
- Brightness: outputs the brightness of each pixel in the image
- Saturation: outputs the saturation of each pixel in the image
- Resize: helper transform to resize images
- SimpleBlur: simple blur amount computation based on variance of the Laplacian
- whylogs.features.transforms.logger¶
- whylogs.features.transforms.ImageType¶
- class whylogs.features.transforms.ComposeTransforms(transforms: List, name=None)¶
Outputs the composition of each transformation passed in transforms
- __call__(self, x)¶
- __repr__(self)¶
Return repr(self).
- class whylogs.features.transforms.Brightness¶
Outputs the Brightness of each pixel in the image
- __call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) numpy.ndarray ¶
- Parameters
img (Union[Image, np.ndarray]) – Either a PIL image or numpy array with int8 values
- Returns
Converted image.
- Return type
np.ndarray
- __repr__(self)¶
Return repr(self).
- class whylogs.features.transforms.Saturation¶
Outputs the saturation of each pixel in the image
- __call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) numpy.ndarray ¶
- Parameters
img (Union[Image, np.ndarray]) – Either a PIL image or numpy array with int8 values
- Returns
(1,number_pixels) array for saturation values for the image
- Return type
np.ndarray
- __repr__(self)¶
Return repr(self).
- class whylogs.features.transforms.Resize(size)¶
Helper Transform to resize images.
- size¶
Target size for resizing the image.
- Type
TYPE
- __call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) numpy.ndarray ¶
- Parameters
img (Union[ImageType, np.ndarray]) – Image to resize
- Returns
The resized image.
- Return type
np.ndarray
- __repr__(self)¶
Return repr(self).
- class whylogs.features.transforms.Hue¶
- __call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) numpy.ndarray ¶
- Parameters
img (Union[Image, np.ndarray]) – Either a PIL image or numpy array with int8 values
- Returns
(1,number_pixels) array for hue values for the image
- Return type
np.ndarray
- __repr__(self)¶
Return repr(self).
- class whylogs.features.transforms.SimpleBlur¶
Simple blur amount computation based on the variance of the Laplacian. An overall metric of how blurry the image is; it has no absolute scale.
- __call__(self, img: Union[PIL.Image.Image, numpy.ndarray]) float ¶
- Parameters
img (Union[Image, np.ndarray]) – Either a PIL image or numpy array with int8 values
- Returns
Variance of the Laplacian of the image.
- Return type
float
- __repr__(self)¶
Return repr(self).
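A minimal sketch chaining transforms with ComposeTransforms (the image path is hypothetical):
from PIL import Image
from whylogs.features.transforms import Brightness, ComposeTransforms, Resize

img = Image.open("example.jpg")                  # hypothetical local image
pipeline = ComposeTransforms([Resize(50), Brightness()])
brightness = pipeline(img)                       # per-pixel brightness of the resized image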
Package Contents¶
- whylogs.features._IMAGE_FEATURES = ['Hue', 'Brightness', 'Saturation']¶
whylogs.io
¶
Submodules¶
whylogs.io.file_loader
¶
File loading utilities:
- Simple check if an extension is one of the implemented ones
- Check the encoding format based on the magic number
- Try to load an image using the PIL library
- Load json or jsonl data
- Factory for file data
- whylogs.io.file_loader.EXTENSIONS = ['.csv', '.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.jsonl', '.json', '.pgm', '.tif', '.tiff',...¶
- whylogs.io.file_loader.IMAGE_EXTENSIONS = ['.jpg', '.jpeg', '.png', '.ppm', '.bmp', '.pgm', '.tif', '.tiff', '.webp', '.gif']¶
- whylogs.io.file_loader.PD_EXCEL_FORMATS = ['.xls', '.xlsx', '.xlsm', '.xlsb', '.odf', '.ods', '.odt']¶
- whylogs.io.file_loader.valid_file(fname: str)¶
Simple check if the extension is one of the implemented ones
- Parameters
fname (str) – file path
- Returns
bool
- whylogs.io.file_loader.extension_file(path: str)¶
Check the encoding format based on the magic number. If the file has no magic number, we simply use the extension. More advanced analysis of the file content is needed, potentially extending to a library like libmagic.
- Parameters
path (str) – File path
- Returns
file_extension_given (str) – extension of the encoding data
magic_data (dict) – any magic data information available, including magic number (bytes), mime_type (str), and name (str)
- whylogs.io.file_loader.image_loader(path: str)¶
Tries to load an image using the PIL library
- Parameters
path (str) – path to image files
- Returns
image data and image encoding format
- Return type
PIL.Image.Image
- whylogs.io.file_loader.json_loader(path: str = None) Union[Dict, list] ¶
Loads json or jsonl data
- Parameters
path (str, optional) – path to file
- Returns
objs (Union[Dict, list]) – a list or dict of the json data
json_format (str) – format of the file (json or jsonl)
- whylogs.io.file_loader.file_loader(path: str, valid_file: Callable[[str], bool] = valid_file) Any ¶
Factory for file data
- Parameters
path (str) – path to file
valid_file (Callable[[str], bool], optional) – Optional valid-file check
- Returns
data – a tuple of the loaded data (DataFrame, image data in PIL format, or dict) and magic_data (a dict of magic number data)
- Return type
tuple
- Raises
NotImplementedError – if the file type is not supported
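A minimal sketch of the factory (the file path is hypothetical):
from whylogs.io.file_loader import file_loader

# Returns the parsed data plus the magic-number metadata for the file
data, magic_data = file_loader("data.csv")  # hypothetical local file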
whylogs.io.local_dataset
¶Helper classes that provide a standard way to create an ABC using inheritance.
- class whylogs.io.local_dataset.Dataset(root_folder: str = '', feature_transforms: Optional[List[Callable]] = None)¶
Bases:
abc.ABC
Helper class that provides a standard way to create an ABC using inheritance.
- abstract __getitem__(self, index: int) Any ¶
- abstract __len__(self) int ¶
- __repr__(self) str ¶
Return repr(self).
- class whylogs.io.local_dataset.LocalDataset(root_folder, loader: Callable[[str], Any] = file_loader, extensions: List[str] = EXTENSIONS, feature_transforms: Optional[List[Callable]] = None, valid_file: Optional[Callable[[str], bool]] = valid_file)¶
Bases:
Dataset
Helper class that provides a standard way to create an ABC using inheritance.
- _find_folder_feature(self) None ¶
- _init_dataset(self) List[Tuple[str, int]] ¶
- __getitem__(self, index: int) Tuple[Any, Any] ¶
- __len__(self)¶
Package Contents¶
Helper class that provides a standard way to create an ABC using inheritance.
- class whylogs.io.LocalDataset(root_folder, loader: Callable[[str], Any] = file_loader, extensions: List[str] = EXTENSIONS, feature_transforms: Optional[List[Callable]] = None, valid_file: Optional[Callable[[str], bool]] = valid_file)¶
Bases:
Dataset
Helper class that provides a standard way to create an ABC using inheritance.
- _find_folder_feature(self) None ¶
- _init_dataset(self) List[Tuple[str, int]] ¶
- __getitem__(self, index: int) Tuple[Any, Any] ¶
- __len__(self)¶
- whylogs.io.__ALL__¶
whylogs.logs
¶
Convenience module for displaying/configuring python logs for whylogs
Package Contents¶
Convenience utility for setting whylogs to print logs to stdout.
- whylogs.logs.display_logging(level='DEBUG', root_logger=False)¶
Convenience utility for setting whylogs to print logs to stdout.
- Parameters
level (str) – Logging level
root_logger (bool, default=False) – Redirect to the root logger.
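For example:
from whylogs.logs import display_logging

display_logging("INFO")  # print whylogs logs to stdout at INFO level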
whylogs.mlflow
¶
Submodules¶
whylogs.mlflow.model_wrapper
¶- whylogs.mlflow.model_wrapper.logger¶
- whylogs.mlflow.model_wrapper.PyFuncOutput¶
- class whylogs.mlflow.model_wrapper.ModelWrapper(model)¶
Bases:
object
- create_logger(self)¶
- predict(self, data: pandas.DataFrame) PyFuncOutput ¶
Wrapper around https://www.mlflow.org/docs/latest/_modules/mlflow/pyfunc.html#PyFuncModel.predict This allows us to capture input and predictions into whylogs
whylogs.mlflow.patcher
¶
MLflow patching utilities:
- Replaces MLflow’s original add_to_model
- Hijacks the mlflow.models.Model.log method and uploads the .whylogs.yaml configuration to the model path
- Enables whylogs in the mlflow module
- whylogs.mlflow.patcher.logger¶
- whylogs.mlflow.patcher._mlflow¶
- whylogs.mlflow.patcher._original_end_run¶
- whylogs.mlflow.patcher._active_whylogs = []¶
- whylogs.mlflow.patcher._is_patched = False¶
- whylogs.mlflow.patcher._original_mlflow_conda_env¶
- whylogs.mlflow.patcher._original_add_to_model¶
- whylogs.mlflow.patcher._original_model_log¶
- class whylogs.mlflow.patcher.WhyLogsRun(session=None)¶
Bases:
object
- _session¶
- _active_run_id¶
- _loggers :Dict[str, whylogs.app.logger.Logger]¶
- _create_logger(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None)¶
- log_pandas(self, df: pandas.DataFrame, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None)¶
Log the statistics of a Pandas dataframe. Note that this method is additive within a run: calling this method with a specific dataset name will not generate a new profile; instead, data will be aggregated into the existing profile.
To create a new profile, specify a different dataset_name.
- Parameters
df – the Pandas dataframe to log
dataset_name – the name of the dataset (Optional). If not specified, the experiment name is used
- log(self, features: Optional[Dict[str, any]] = None, feature_name: Optional[str] = None, value: any = None, dataset_name: Optional[str] = None)¶
Logs a collection of features or a single feature (must specify one or the other).
- Parameters
features – a map of key value feature for model input
feature_name – name of a single feature. Cannot be specified if ‘features’ is specified
value – value of as single feature. Cannot be specified if ‘features’ is specified
dataset_name – the name of the dataset. If not specified, we fall back to using the experiment name
- _get_or_create_logger(self, dataset_name: Optional[str] = None, dataset_timestamp: Optional[datetime.datetime] = None)¶
- _close(self)¶
- whylogs.mlflow.patcher._new_mlflow_conda_env(path=None, additional_conda_deps=None, additional_pip_deps=None, additional_conda_channels=None, install_mlflow=True)¶
- whylogs.mlflow.patcher._new_add_to_model(model, loader_module, data=None, code=None, env=None, **kwargs)¶
Replaces MLflow’s original add_to_model https://github.com/mlflow/mlflow/blob/4e68f960d4520ade6b64a28c297816f622adc83e/mlflow/pyfunc/__init__.py#L242
Accepts the same signature as MLFlow’s original add_to_model call. We inject our loader module.
We also inject whylogs into the Conda environment by patching _mlflow_conda_env.
- Parameters
model – Existing model.
loader_module – The module to be used to load the model.
data – Path to the model data.
code – Path to the code dependencies.
env – Conda environment.
kwargs – Additional key-value pairs to include in the
pyfunc
flavor specification. Values must be YAML-serializable.
- Returns
Updated model configuration.
- whylogs.mlflow.patcher.WHYLOG_YAML = .whylogs.yaml¶
- whylogs.mlflow.patcher.new_model_log(**kwargs)¶
Hijack the mlflow.models.Model.log method and upload the .whylogs.yaml configuration to the model path This will allow us to pick up the configuration later under /opt/ml/model/.whylogs.yaml path
- whylogs.mlflow.patcher.enable_mlflow(session=None) bool ¶
Enable whylogs in the mlflow module via mlflow.whylogs.
- Returns
True if MLflow has been patched, False otherwise.
Example of whylogs and MLFlow¶
import mlflow
import whylogs

whylogs.enable_mlflow()

import numpy as np
import pandas as pd

pdf = pd.DataFrame(
    data=[[1, 2, 3, 4, True, "x", bytes([1])]],
    columns=["b", "d", "a", "c", "e", "g", "f"],
    dtype=object,  # np.object is deprecated; plain object is equivalent
)

active_run = mlflow.start_run()

# log a Pandas dataframe under default name
mlflow.whylogs.log_pandas(pdf)

# log a Pandas dataframe with custom name
mlflow.whylogs.log_pandas(pdf, "another dataset")

# Finish the MLFlow run
mlflow.end_run()
- whylogs.mlflow.patcher.disable_mlflow()¶
whylogs.mlflow.sklearn
¶
- whylogs.mlflow.sklearn._load_pyfunc(path: str)¶
Package Contents¶
- Enable whylogs in the mlflow module
- List all the runs from an experiment that contains whylogs
- Retrieve all whylogs DatasetProfile for a given run and a given dataset name
- Retrieve all whylogs profiles for a given experiment
- whylogs.mlflow.disable_mlflow()¶
- whylogs.mlflow.enable_mlflow(session=None) bool ¶
Enable whylogs in the mlflow module via mlflow.whylogs.
- Returns
True if MLflow has been patched, False otherwise.
Example of whylogs and MLFlow¶
import mlflow
import whylogs

whylogs.enable_mlflow()

import numpy as np
import pandas as pd

pdf = pd.DataFrame(
    data=[[1, 2, 3, 4, True, "x", bytes([1])]],
    columns=["b", "d", "a", "c", "e", "g", "f"],
    dtype=object,  # np.object is deprecated; plain object is equivalent
)

active_run = mlflow.start_run()

# log a Pandas dataframe under default name
mlflow.whylogs.log_pandas(pdf)

# log a Pandas dataframe with custom name
mlflow.whylogs.log_pandas(pdf, "another dataset")

# Finish the MLFlow run
mlflow.end_run()
- whylogs.mlflow.list_whylogs_runs(experiment_id: str, dataset_name: str = 'default')¶
List all the runs from an experiment that contains whylogs
- Return type
typing.List[mlflow.entities.Run]
- Parameters
experiment_id – the experiment id
dataset_name – the name of the dataset. Default to “default”
- whylogs.mlflow.get_run_profiles(run_id: str, dataset_name: str = 'default', client=None)¶
Retrieve all whylogs DatasetProfile for a given run and a given dataset name.
- Parameters
client –
mlflow.tracking.MlflowClient
run_id – the run id
dataset_name – the dataset name within a run. If not set, use the default value “default”
- Return type
typing.List[whylogs.DatasetProfile]
- whylogs.mlflow.get_experiment_profiles(experiment_id: str, dataset_name: str = 'default')¶
Retrieve all whylogs profiles for a given experiment. This only returns Active Runs at the moment.
- Return type
typing.List[whylogs.DatasetProfile]
- Parameters
experiment_id – the experiment ID string
dataset_name – the dataset name within a run. If not set, use the default value “default”
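A hedged sketch of retrieving profiles from whylogs-enabled MLflow runs (the experiment id is illustrative):
from whylogs.mlflow import get_run_profiles, list_whylogs_runs

runs = list_whylogs_runs(experiment_id="0")       # hypothetical experiment id
profiles = get_run_profiles(runs[0].info.run_id)  # whylogs.DatasetProfile objects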
whylogs.proto
¶
Auto-generated protobuf class definitions.
Protobuf allows us to serialize/deserialize classes across languages
whylogs.util
¶
Utilities for whylogs
Submodules¶
whylogs.util.data
¶Utility functions for interacting with data:
- Get an attribute (from an object) or key (from a dict-like object)
- Flatten a nested dictionary/object according to a specified name mapping
- Return the given string converted to a string that can be used for a clean filename
- whylogs.util.data.getter(x, k: str, *args)¶
Get an attribute (from an object) or key (from a dict-like object).
getter(x, k) raises KeyError if k is not present.
getter(x, k, default) returns default if k is not present.
This is a convenience function that allows you to interact the same way with an object or a dictionary.
- Parameters
x (object, dict) – Item to get attribute from
k (str) – Key or attribute name
default (optional) – Default value if k not present
- Returns
val – Associated value
- Return type
object
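For example:
from whylogs.util.data import getter

d = {"a": 1}
getter(d, "a")      # -> 1
getter(d, "b", 0)   # -> 0; the default is returned instead of raising KeyError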
- whylogs.util.data.remap(x, mapping: dict)¶
Flatten a nested dictionary/object according to a specified name mapping.
- Parameters
x (object, dict) –
An object or dict which can be treated as a nested dictionary, where attributes can be accessed as:
attr = x.a.b[‘key_name’][‘other_Name’].d
- Indexing list values is not implemented, e.g.:
x.a.b[3].d[‘key_name’]
mapping (dict) –
Nested dictionary specifying the mapping. ONLY values specified in the mapping will be returned. For example:
{'a': {'b': {'c': 'new_name'}}}
could flatten x.a.b.c or x.a[‘b’][‘c’] to x[‘new_name’]
- Returns
flat – A flattened ordered dictionary of values
- Return type
OrderedDict
- whylogs.util.data._remap(x, mapping: dict, y: dict)¶
- whylogs.util.data.get_valid_filename(s)¶
Return the given string converted to a string that can be used for a clean filename. Remove leading and trailing spaces; convert other spaces to underscores; and remove anything that is not an alphanumeric, dash, underscore, or dot.
>>> from whylogs.util.data import get_valid_filename
>>> get_valid_filename(" Background of tim's 8/1/2019 party!.jpg ")
whylogs.util.dsketch
¶Define functions and classes for interfacing with datasketches:
- FrequentItemsSketch: a class to implement frequent item counting for mixed data types
- Deserialize a KLL floats sketch. Compatible with whylogs-java
- Deserialize a frequent strings sketch. Compatible with whylogs-java
- whylogs.util.dsketch.deserialize_kll_floats_sketch(x: bytes, kind: str = 'float')¶
Deserialize a KLL floats sketch. Compatible with whylogs-java
whylogs histograms are serialized as kll floats sketches
- Parameters
x (bytes) – Serialized sketch
kind (str, optional) – Specify type of sketch: ‘float’ or ‘int’
- Returns
sketch – If x is an empty sketch, return None, else return the deserialized sketch.
- Return type
kll_floats_sketch, kll_ints_sketch, or None
- whylogs.util.dsketch.deserialize_frequent_strings_sketch(x: bytes)¶
Deserialize a frequent strings sketch. Compatible with whylogs-java
Wrapper for datasketches.frequent_strings_sketch.deserialize
- Parameters
x (bytes) – Serialized sketch
- Returns
sketch – If x is an empty string sketch, returns None, else returns the deserialized string sketch
- Return type
datasketches.frequent_strings_sketch, None
- class whylogs.util.dsketch.FrequentItemsSketch(lg_max_k: int = None, sketch: datasketches.frequent_strings_sketch = None)¶
A class to implement frequent item counting for mixed data types.
Wraps datasketches.frequent_strings_sketch by encoding numbers as strings since the datasketches python implementation does not implement frequent number tracking.
- Parameters
lg_max_k (int, optional) – Parameter controlling the size and accuracy of the sketch. A larger number increases accuracy and the memory requirements for the sketch
sketch (datasketches.frequent_strings_sketch, optional) – Initialize with an existing frequent strings sketch
- DEFAULT_MAX_ITEMS_SIZE = 128¶
- DEFAULT_ERROR_TYPE¶
- get_apriori_error(self, lg_max_map_size: int, estimated_total_weight: int)¶
Return an apriori estimate of the uncertainty for various parameters
- Parameters
lg_max_map_size (int) – The lg_max_k value
estimated_total_weight – Total weight (see
FrequentItems.get_total_weight()
)
- Returns
error – Approximate uncertainty
- Return type
float
- get_epsilon_for_lg_size(self, lg_max_map_size: int)¶
- get_estimate(self, item)¶
- get_lower_bound(self, item)¶
- get_upper_bound(self, item)¶
- get_frequent_items(self, err_type: datasketches.frequent_items_error_type = None, threshold: int = 0, decode: bool = True)¶
Retrieve the frequent items.
- Parameters
err_type (datasketches.frequent_items_error_type) – Override default error type
threshold (int) – Minimum count for returned items
decode (bool (default=True)) – Decode the returned values. Internally, all items are encoded as strings.
- Returns
items – A list of tuples of items:
[(item, count)]
- Return type
list
- get_num_active_items(self)¶
- get_serialized_size_bytes(self)¶
- get_sketch_epsilon(self)¶
- get_total_weight(self)¶
- is_empty(self)¶
- merge(self, other)¶
Merge the item counts of this sketch with another.
This object will not be modified. This operation is commutative.
- Parameters
other (FrequentItemsSketch) – The other sketch
- copy(self)¶
- Returns
sketch – A copy of this sketch
- Return type
- serialize(self)¶
Serialize this sketch as a bytes string.
See also
FrequentItemsSketch.deserialize()
- Returns
data – Serialized object.
- Return type
bytes
- to_string(self, print_items=False)¶
- update(self, x, weight=1)¶
Track an item.
- Parameters
x (object) – Item to track
weight (int) – Number of times the item appears
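A minimal sketch of tracking mixed items and reading them back (the output shape is illustrative):
from whylogs.util.dsketch import FrequentItemsSketch

sketch = FrequentItemsSketch()
for item in ["a", "a", "b", 3, 3, 3]:
    sketch.update(item)
sketch.get_frequent_items()  # e.g. [(3, 3), ("a", 2), ("b", 1)]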
- to_summary(self, max_items=30, min_count=1)¶
Generate a protobuf summary. Returns None if there are no frequent items.
- Parameters
max_items (int) – Maximum number of items to return. The most frequent items will be returned
min_count (int) – Minimum number counts for all returned items
- Returns
summary – Protobuf summary message
- Return type
FrequentItemsSummary
- to_protobuf(self)¶
Generate a protobuf representation of this object
- static from_protobuf(message: whylogs.proto.FrequentItemsSketchMessage)¶
Initialize a FrequentItemsSketch from a protobuf FrequentItemsSketchMessage
- static _encode_item(x)¶
- static deserialize(x: bytes)¶
Deserialize a frequent numbers sketch.
If x is an empty sketch, None is returned
whylogs.util.protobuf
¶Functions for interacting with protobuf:
- A wrapper for google.protobuf.json_format.MessageToJson
- Convert a protobuf message to a dictionary
- Return an iterator to read delimited protobuf messages
- Return an iterator to iterate through protobuf messages in a multi-message protobuf file
- Wrapper for multi_msg_reader() which reads all the messages and returns them as a list
- Write a list (or iterator) of protobuf messages to a file
- Print or generate a string preview of a protobuf message
- whylogs.util.protobuf.message_to_json(x: google.protobuf.message, **kwargs)¶
A wrapper for google.protobuf.json_format.MessageToJson
Currently a very thin wrapper; x and kwargs are just passed to MessageToJson
- whylogs.util.protobuf.message_to_dict(x: google.protobuf.message)¶
Convert a protobuf message to a dictionary
A thin wrapper around the google built-in function.
- whylogs.util.protobuf._varint_delim_reader(fp)¶
- whylogs.util.protobuf._varint_delim_iterator(f)¶
Return an iterator to read delimited protobuf messages. The iterator will return protobuf messages one by one as raw bytes objects.
- whylogs.util.protobuf.multi_msg_reader(f, msg_class)¶
Return an iterator to iterate through protobuf messages in a multi-message protobuf file.
See also: write_multi_msg()
- Parameters
f (str, file-object) – Filename or open file object to read from
msg_class (class) – The Protobuf message class, gets instantiated with a call to msg_class()
- Returns
Iterator which returns protobuf messages
- Return type
msg_iterator
- whylogs.util.protobuf.read_multi_msg(f, msg_class)¶
Wrapper for
multi_msg_reader()
which reads all the messages and returns them as a list.
- whylogs.util.protobuf._encode_one_msg(msg: google.protobuf.message)¶
- whylogs.util.protobuf._write_multi_msg(msgs: list, fp)¶
- whylogs.util.protobuf.write_multi_msg(msgs: list, f)¶
Write a list (or iterator) of protobuf messages to a file.
The multi-message file format is a binary format with:
<varint MessageBytesSize><message>
Which is repeated, where the len(message) in bytes is MessageBytesSize
- Parameters
msgs (list, iterable) – Protobuf messages to write to disk
f (str, file-object) – Filename or open binary file object to write to
- whylogs.util.protobuf.repr_message(x: google.protobuf.message.Message, indent=2, display=True)¶
Print or generate string preview of a protobuf message. This is mainly to get a preview of the attribute names and structure of a protobuf message class.
- Parameters
x (google.protobuf.message.Message) – Message to preview
indent (int) – Indentation
display (bool) – If True, print the message and return None. Else, return a string.
- Returns
msg – If display == False, return the message, else return None.
- Return type
str, None
- whylogs.util.protobuf._repr_message(x, level=0, msg='', display=True, indent=2)¶
whylogs.util.stats
¶Statistical functions used by whylogs:
- Estimate whether a feature is discrete given the number of records
- whylogs.util.stats.CARDINALITY_SLOP = 1¶
- whylogs.util.stats.is_discrete(num_records: int, cardinality: int, p=0.15)¶
Estimate whether a feature is discrete given the number of records observed and the cardinality (number of unique values)
The default assumption is that features are not discrete.
- Parameters
num_records (int) – The number of observed records
cardinality (int) – Number of unique observed values
- Returns
discrete – Whether the feature is discrete
- Return type
bool
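For example, a feature with few unique values across many records would typically be judged discrete, while a feature whose every value is unique would not. A sketch (the exact cutoff depends on p and CARDINALITY_SLOP):
from whylogs.util.stats import is_discrete

is_discrete(num_records=1000, cardinality=10)    # low cardinality: expected True
is_discrete(num_records=1000, cardinality=1000)  # every value unique: expected False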
whylogs.util.time¶
Functions for interacting with timestamps and datetime objects
- whylogs.util.time.to_utc_ms(dt: datetime.datetime) Optional[int] ¶
Convert a datetime object to UTC epoch milliseconds
- Returns
timestamp_ms – Timestamp in UTC epoch milliseconds
- Return type
int
- whylogs.util.time.from_utc_ms(utc: Optional[int]) Optional[datetime.datetime] ¶
Convert a UTC epoch milliseconds timestamp to a datetime object
- Parameters
utc (int) – Timestamp
- Returns
dt – Datetime object
- Return type
datetime.datetime
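A round trip through both helpers, assuming a timezone-aware datetime:
from datetime import datetime, timezone
from whylogs.util.time import to_utc_ms, from_utc_ms

dt = datetime(2021, 1, 1, tzinfo=timezone.utc)
ts_ms = to_utc_ms(dt)          # epoch milliseconds as an int
restored = from_utc_ms(ts_ms)  # back to a datetime object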
whylogs.util.util_functions¶
- whylogs.util.util_functions.encode_to_integers(values, uniques)¶
whylogs.util.varint¶
Varint encoder/decoder
varints are a common encoding for variable length integer data, used in libraries such as sqlite, protobuf, v8, and more. Here’s a quick and dirty module to help avoid reimplementing the same thing over and over again.
Taken from https://github.com/fmoo/python-varint/blob/master/varint.py
MIT License
- whylogs.util.varint._byte(b)¶
- whylogs.util.varint.encode(number)¶
Pack number into varint bytes
- whylogs.util.varint.decode_stream(stream)¶
Read a varint from stream. Returns None if an EOF is encountered
- whylogs.util.varint.decode_bytes(buf)¶
Read a varint from buf bytes
- whylogs.util.varint._read_one(stream)¶
Read a byte from the file (as an integer) raises EOFError if the stream ends while reading bytes.
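Since the API mirrors fmoo/python-varint, a round trip is straightforward:
import io

from whylogs.util import varint

encoded = varint.encode(300)                # pack an int into varint bytes: b'\xac\x02'
assert varint.decode_bytes(encoded) == 300  # decode from a bytes buffer
assert varint.decode_stream(io.BytesIO(encoded)) == 300  # decode from a stream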
whylogs.viz¶
Subpackages¶
whylogs.viz.matplotlib¶
whylogs.viz.matplotlib.visualizer¶
- class whylogs.viz.matplotlib.visualizer.MatplotlibProfileVisualizer¶
Bases:
whylogs.viz.BaseProfileVisualizer
- available_plots(self)¶
Returns available plots for the matplotlib framework.
- _init_data_preprocessing(self, profiles)¶
- _init_theming(self)¶
- static _chart_theming()¶
Applies theming needed for each chart.
- _prof_data(self, variable)¶
- _summary_data_preprocessing(self, variable)¶
Applies general data preprocessing for each chart.
- _confirm_profile_data(self)¶
Checks that profiles and profile data are already set.
- plot_token_length(self, variable, ts_format='%d-%b-%y', **kwargs)¶
Plots token length data.
- plot_char_pos(self, variable, character_list=None, ts_format='%d-%b-%y', **kwargs)¶
Plots character position data.
- plot_string_length(self, variable, ts_format='%d-%b-%y', **kwargs)¶
Plots string length data.
- plot_string(self, variable, character_list, ts_format='%d-%b-%y', **kwargs)¶
Plots string-related data.
- plot_distribution(self, variable, ts_format='%d-%b-%y', **kwargs)¶
Plots a distribution chart.
- plot_missing_values(self, variable, ts_format='%d-%b-%y', **kwargs)¶
Plots a Missing Value to Total Count ratio chart.
- plot_uniqueness(self, variable, ts_format='%d-%b-%y', **kwargs)¶
Plots an Estimated Unique Values chart.
- plot_data_types(self, variable, ts_format='%d-%b-%y', **kwargs)¶
Plots an Inferred Data Types chart.
- whylogs.viz.matplotlib.visualizer.array_creation(char_histos, bins, char_list)¶
whylogs.viz.utils¶
whylogs.viz.utils.profile_viz_calculations¶
- whylogs.viz.utils.profile_viz_calculations.categorical_types¶
- whylogs.viz.utils.profile_viz_calculations.__calculate_variance(profile_jsons, feature_name)¶
Calculates variance for a single feature
- Parameters
profile_jsons (Profile summary serialized json) –
feature_name (Name of feature) –
- Returns
variance
- Return type
Calculated variance for feature
- whylogs.viz.utils.profile_viz_calculations.__calculate_coefficient_of_variation(profile_jsons, feature_name)¶
Calculates coefficient of variation for a single feature
- Parameters
profile_jsons (Profile summary serialized json) –
feature_name (Name of feature) –
- Returns
coefficient_of_variation
- Return type
Calculated coefficient of variation for feature
- whylogs.viz.utils.profile_viz_calculations.__calculate_sum(profile_jsons, feature_name)¶
Calculates sum for a single feature
- Parameters
profile_jsons (Profile summary serialized json) –
feature_name (Name of feature) –
- Returns
sum
- Return type
Calculated sum for feature
- whylogs.viz.utils.profile_viz_calculations.__calculate_quantile_statistics(feature, profile_jsons, feature_name)¶
Calculates quantile statistics for a single feature
- Parameters
feature –
profile_jsons (Profile summary serialized json) –
feature_name (Name of feature) –
- Returns
quantile_statistics
- Return type
Calculated quantile statistics for feature
- whylogs.viz.utils.profile_viz_calculations.add_drift_val_to_ref_profile_json(target_profile, reference_profile, reference_profile_json)¶
Calculates drift value for reference profile based on profile type and inserts that data into reference profile
- Parameters
target_profile (Target profile) –
reference_profile (Reference profile) –
reference_profile_json (Reference profile summary serialized json) –
- Returns
reference_profile_json
- Return type
Reference profile summary serialized json with drift value for every feature
- whylogs.viz.utils.profile_viz_calculations.add_feature_statistics(feature, profile_json, feature_name)¶
Calculates different values for feature statistics
- Parameters
feature –
profile_json (Profile summary serialized json) –
feature_name (Name of feature) –
- Returns
feature
- Return type
Feature data with appended values for statistics report
Submodules¶
whylogs.viz.base¶
- class whylogs.viz.base.BaseProfileVisualizer(framework=None, visualizer=None)¶
- set_profiles(self, profiles)¶
- plot_distribution(self, variable, **kwargs)¶
Plots a distribution chart.
- plot_missing_values(self, variable, **kwargs)¶
Plots a Missing Value to Total Count ratio chart.
- plot_uniqueness(self, variable, **kwargs)¶
Plots an Estimated Unique Values chart.
- plot_data_types(self, variable, **kwargs)¶
Plots an Inferred Data Types chart.
- plot_string_length(self, variable, **kwargs)¶
Plots string length data.
- plot_token_length(self, variable, character_list, **kwargs)¶
Plots token length data.
- plot_char_pos(self, variable, character_list, **kwargs)¶
Plots character position data.
- plot_string(self, variable, character_list, **kwargs)¶
Plots string-related data.
- available_plots(self)¶
Returns available plots for the selected framework.
whylogs.viz.browser_viz¶
- whylogs.viz.browser_viz._MY_DIR¶
- whylogs.viz.browser_viz.logger¶
- whylogs.viz.browser_viz.is_wsl()¶
- whylogs.viz.browser_viz.profile_viewer(profiles: List[whylogs.core.DatasetProfile] = None, reference_profiles: List[whylogs.core.DatasetProfile] = None, output_path=None) str ¶
Open a profile viewer in your default browser
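A minimal sketch, assuming profile is a previously created whylogs.core.DatasetProfile and that the returned string is the path of the generated HTML page:
from whylogs.viz import profile_viewer

# profile: a whylogs.core.DatasetProfile produced earlier (assumption)
html_path = profile_viewer(profiles=[profile], output_path="viewer.html")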
whylogs.viz.jupyter_notebook_viz¶
- whylogs.viz.jupyter_notebook_viz._MY_DIR¶
- whylogs.viz.jupyter_notebook_viz.logger¶
- whylogs.viz.jupyter_notebook_viz.numerical_types¶
- class whylogs.viz.jupyter_notebook_viz.NotebookProfileVisualizer¶
- SUMMARY_REPORT_TEMPLATE_NAME = index-hbs-cdn-all-in-for-jupyter-notebook.html¶
- DOUBLE_HISTOGRAM_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-distribution-chart.html¶
- DISTRIBUTION_CHART_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-bar-chart.html¶
- DIFFERENCED_CHART_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-differenced-chart.html¶
- FEATURE_STATISTICS_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-feature-summary-statistics.html¶
- CONSTRAINTS_REPORT_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-constraints-report.html¶
- PAGE_SIZES¶
- __get_template_path(self, html_file_name)¶
- __get_compiled_template(self, template_name)¶
- __display_feature_chart(self, feature_names, template_name, preferred_cell_height=None)¶
- __display_rendered_template(self, template, template_name, height)¶
- set_profiles(self, target_profile: whylogs.core.DatasetProfile = None, reference_profile: whylogs.core.DatasetProfile = None)¶
- summary_drift_report(self, preferred_cell_height=None)¶
- double_histogram(self, feature_names, preferred_cell_height=None)¶
- distribution_chart(self, feature_names, preferred_cell_height=None)¶
- difference_distribution_chart(self, feature_names, preferred_cell_height=None)¶
- feature_statistics(self, feature_name, profile='reference', preferred_cell_height=None)¶
- constraints_report(self, constraints, preferred_cell_height=None)¶
- download(self, html, preferred_path=None, html_file_name=None)¶
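A sketch of typical notebook usage, assuming target and reference are DatasetProfile objects collected earlier:
from whylogs.viz import NotebookProfileVisualizer

viz = NotebookProfileVisualizer()
viz.set_profiles(target_profile=target, reference_profile=reference)

# Render a drift report comparing the two profiles in the notebook
viz.summary_drift_report()

# Per-feature views take a list of feature names
viz.double_histogram(feature_names=["my_feature"])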
whylogs.viz.visualizer¶
- class whylogs.viz.visualizer.ProfileVisualizer(framework='matplotlib')¶
Bases:
whylogs.viz.base.BaseProfileVisualizer
- __subclass_framework(self, framework='matplotlib')¶
Package Contents¶
- whylogs.viz.profile_viewer(profiles: List[whylogs.core.DatasetProfile] = None, reference_profiles: List[whylogs.core.DatasetProfile] = None, output_path=None) str ¶
Open a profile viewer in your default browser
- class whylogs.viz.NotebookProfileVisualizer¶
- SUMMARY_REPORT_TEMPLATE_NAME = index-hbs-cdn-all-in-for-jupyter-notebook.html¶
- DOUBLE_HISTOGRAM_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-distribution-chart.html¶
- DISTRIBUTION_CHART_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-bar-chart.html¶
- DIFFERENCED_CHART_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-differenced-chart.html¶
- FEATURE_STATISTICS_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-feature-summary-statistics.html¶
- CONSTRAINTS_REPORT_TEMPLATE_NAME = index-hbs-cdn-all-in-jupyter-constraints-report.html¶
- PAGE_SIZES¶
- __get_template_path(self, html_file_name)¶
- __get_compiled_template(self, template_name)¶
- __display_feature_chart(self, feature_names, template_name, preferred_cell_height=None)¶
- __display_rendered_template(self, template, template_name, height)¶
- set_profiles(self, target_profile: whylogs.core.DatasetProfile = None, reference_profile: whylogs.core.DatasetProfile = None)¶
- summary_drift_report(self, preferred_cell_height=None)¶
- double_histogram(self, feature_names, preferred_cell_height=None)¶
- distribution_chart(self, feature_names, preferred_cell_height=None)¶
- difference_distribution_chart(self, feature_names, preferred_cell_height=None)¶
- feature_statistics(self, feature_name, profile='reference', preferred_cell_height=None)¶
- constraints_report(self, constraints, preferred_cell_height=None)¶
- download(self, html, preferred_path=None, html_file_name=None)¶
- class whylogs.viz.BaseProfileVisualizer(framework=None, visualizer=None)¶
- set_profiles(self, profiles)¶
- plot_distribution(self, variable, **kwargs)¶
Plots a distribution chart.
- plot_missing_values(self, variable, **kwargs)¶
Plots a Missing Value to Total Count ratio chart.
- plot_uniqueness(self, variable, **kwargs)¶
Plots an Estimated Unique Values chart.
- plot_data_types(self, variable, **kwargs)¶
Plots an Inferred Data Types chart.
- plot_string_length(self, variable, **kwargs)¶
Plots string length data.
- plot_token_length(self, variable, character_list, **kwargs)¶
Plots token length data.
- plot_char_pos(self, variable, character_list, **kwargs)¶
Plots character position data.
- plot_string(self, variable, character_list, **kwargs)¶
Plots string-related data.
- available_plots(self)¶
Returns available plots for the selected framework.
- class whylogs.viz.ProfileVisualizer(framework='matplotlib')¶
Bases:
whylogs.viz.base.BaseProfileVisualizer
- __subclass_framework(self, framework='matplotlib')¶
- whylogs.viz.__ALL__¶
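A sketch of the matplotlib-backed visualizer, assuming profiles is a list of DatasetProfile objects collected earlier and "my_feature" is a tracked column:
from whylogs.viz import ProfileVisualizer

viz = ProfileVisualizer()  # framework defaults to 'matplotlib'
viz.set_profiles(profiles)

viz.available_plots()                # list plots supported by the framework
viz.plot_distribution("my_feature")  # distribution chart for one variable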
whylogs.whylabs_client¶
Utils related to optional communication with WhyLabs APIs
Submodules¶
whylogs.whylabs_client.wrapper¶
- whylogs.whylabs_client.wrapper.whylabs_api_endpoint¶
- whylogs.whylabs_client.wrapper.configuration¶
- whylogs.whylabs_client.wrapper._session_token¶
- whylogs.whylabs_client.wrapper._logger¶
- whylogs.whylabs_client.wrapper._api_key¶
- whylogs.whylabs_client.wrapper._api_log_client¶
- whylogs.whylabs_client.wrapper._get_whylabs_client() whylabs_client.apis.SessionsApi ¶
- whylogs.whylabs_client.wrapper._get_or_create_log_client() whylabs_client.api.log_api.LogApi ¶
- whylogs.whylabs_client.wrapper.start_session() None ¶
- whylogs.whylabs_client.wrapper.upload_profile(dataset_profile: whylogs.core.DatasetProfile) None ¶
- whylogs.whylabs_client.wrapper._upload_whylabs(dataset_profile, dataset_timestamp, profile_path)¶
- whylogs.whylabs_client.wrapper._upload_guest_session(dataset_timestamp: int, profile_path: str)¶
- whylogs.whylabs_client.wrapper.end_session() Optional[str] ¶
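A sketch of the session flow, assuming profile is a whylogs.core.DatasetProfile created earlier:
from whylogs.whylabs_client import start_session, upload_profile, end_session

start_session()
upload_profile(profile)  # profile: whylogs.core.DatasetProfile (assumption)
result = end_session()   # Optional[str]; may carry a session reference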
Package Contents¶
- whylogs.whylabs_client.end_session() Optional[str] ¶
- whylogs.whylabs_client.start_session() None ¶
- whylogs.whylabs_client.upload_profile(dataset_profile: whylogs.core.DatasetProfile) None ¶
- whylogs.whylabs_client.__ALL__¶
Submodules¶
whylogs._version¶
WhyLabs version number.
Module Contents¶
- whylogs._version.__version__ = 0.7.8¶
Package Contents¶
Classes¶
- SessionConfig – Config for a whylogs session
- WriterConfig – Config for whylogs writers
- ColumnProfile – Statistics tracking for a column (i.e. a feature)
- DatasetProfile – Statistics tracking for a dataset
Functions¶
- get_or_create_session() – Retrieve the current active global session
- reset_default_session() – Reset and deactivate the global whylogs logging session
- start_whylabs_session() –
- enable_mlflow() – Enable whylogs in the mlflow module via mlflow.whylogs
Attributes¶
- whylogs.__version__ = 0.7.8¶
- class whylogs.SessionConfig(project: str, pipeline: str, writers: List[WriterConfig], metadata: Optional[MetadataConfig] = None, verbose: bool = False, with_rotation_time: str = None, cache_size: int = 1, report_progress: bool = False)¶
Config for a whylogs session.
See also
SessionConfigSchema
- Parameters
project (str) – Project associated with this whylogs session
pipeline (str) – Name of the associated data pipeline
writers (list) – A list of WriterConfig objects defining writer outputs
metadata (MetadataConfig) – A MetadataConfiguration object. If none, will replace with default.
verbose (bool, default=False) – Output verbosity
with_rotation_time (str, default=None) – Rotation interval for time-based profile rotation: “s” for seconds, “m” for minutes, “h” for hours, “d” for days
cache_size (int, default=1) – How many dataset profiles to cache in the logger during rotation
- to_yaml(self, stream=None)¶
Serialize this config to YAML
- Parameters
stream – If None (default) return a string, else dump the yaml into this stream.
- static from_yaml(stream)¶
Load config from yaml
- Parameters
stream (str, file-obj) – String or file-like object to load yaml from
- Returns
config – Generated config
- Return type
SessionConfig
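A YAML round-trip sketch for a minimal config (the writer settings are illustrative):
from whylogs import SessionConfig, WriterConfig

config = SessionConfig(
    project="demo-project",
    pipeline="default-pipeline",
    writers=[WriterConfig(type="local", output_path="output")],
)

yaml_str = config.to_yaml()                   # serialize to a YAML string
restored = SessionConfig.from_yaml(yaml_str)  # load it back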
- class whylogs.WriterConfig(type: str, formats: Optional[List[str]] = None, output_path: Optional[str] = None, path_template: Optional[str] = None, filename_template: Optional[str] = None, data_collection_consent: Optional[bool] = None, transport_parameters: Optional[TransportParameterConfig] = None)¶
Config for whylogs writers
See also:
WriterConfigSchema
- Parameters
type (str) – Destination for the writer output, e.g. ‘local’ or ‘s3’
formats (list) – All output formats. See
ALL_SUPPORTED_FORMATS
output_path (str) – Prefix of where to output files. A directory for type = ‘local’, or key prefix for type = ‘s3’
path_template (str, optional) – Templatized path output using standard python string templates. Variables are accessed via $identifier or ${identifier}. See
whylogs.app.writers.Writer.template_params()
for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_PATH_TEMPLATE
filename_template (str, optional) – Templatized output filename using standard python string templates. Variables are accessed via $identifier or ${identifier}. See
whylogs.app.writers.Writer.template_params()
for a list of available identifiers. Default = whylogs.app.writers.DEFAULT_FILENAME_TEMPLATE
- to_yaml(self, stream=None)¶
Serialize this config to YAML
- Parameters
stream – If None (default) return a string, else dump the yaml into this stream.
- static from_yaml(stream, **kwargs)¶
Load config from yaml
- Parameters
stream (str, file-obj) – String or file-like object to load yaml from
kwargs – ignored
- Returns
config – Generated config
- Return type
WriterConfig
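For example, a local writer with templated output paths might look like the following sketch ("protobuf" is an assumed member of ALL_SUPPORTED_FORMATS):
from whylogs import WriterConfig

writer = WriterConfig(
    type="local",
    formats=["protobuf"],
    output_path="output",
    # $project, $pipeline, etc. come from Writer.template_params()
    path_template="$project/$pipeline",
    filename_template="profile.$dataset_timestamp",
)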
- whylogs.get_or_create_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)¶
Retrieve the current active global session.
If no active session exists, attempt to load config and create a new session.
If an active session exists, return the session without loading new config.
- Returns
The global active session
- Return type
Session
- whylogs.reset_default_session()¶
Reset and deactivate the global whylogs logging session.
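A sketch of typical session usage; session.logger() and log_dataframe() are assumed from the wider whylogs v0 API rather than documented on this page:
import pandas as pd

from whylogs import get_or_create_session

session = get_or_create_session()  # loads a .whylogs.yaml config if one is found

# Assumption: the session exposes a logger() context manager with log_dataframe()
with session.logger(dataset_name="demo-dataset") as logger:
    logger.log_dataframe(pd.DataFrame({"a": [1, 2, 3]}))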
- whylogs.start_whylabs_session(path_to_config: Optional[str] = None, report_progress: Optional[bool] = False)¶
- class whylogs.ColumnProfile(name: str, number_tracker: whylogs.core.statistics.NumberTracker = None, string_tracker: whylogs.core.statistics.StringTracker = None, schema_tracker: whylogs.core.statistics.SchemaTracker = None, counters: whylogs.core.statistics.CountersTracker = None, frequent_items: whylogs.util.dsketch.FrequentItemsSketch = None, cardinality_tracker: whylogs.core.statistics.hllsketch.HllSketch = None, constraints: whylogs.core.statistics.constraints.ValueConstraints = None)¶
Statistics tracking for a column (i.e. a feature)
The primary method for adding data is track().
- Parameters
name (str (required)) – Name of the column profile
number_tracker (NumberTracker) – Implements numeric data statistics tracking
string_tracker (StringTracker) – Implements string data-type statistics tracking
schema_tracker (SchemaTracker) – Implements tracking of schema-related information
counters (CountersTracker) – Keep count of various things
frequent_items (FrequentItemsSketch) – Keep track of all frequent items, even for mixed datatype features
cardinality_tracker (HllSketch) – Track feature cardinality (even for mixed data types)
constraints (ValueConstraints) – Static assertions to be applied to numeric data tracked in this column
TODO –
Proper TypedDataConverter type checking
Multi-threading/parallelism
- track(self, value, character_list=None, token_method=None)¶
Add value to tracking statistics.
- _unique_count_summary(self) whylogs.proto.UniqueCountSummary ¶
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
ColumnSummary
- generate_constraints(self) whylogs.core.statistics.constraints.SummaryConstraints ¶
- merge(self, other)¶
Merge this column profile with another.
- Parameters
other (ColumnProfile) –
- Returns
merged – A new, merged column profile.
- Return type
ColumnProfile
- to_protobuf(self)¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
ColumnMessage
- static from_protobuf(message)¶
Load from a protobuf message
- Returns
column_profile
- Return type
ColumnProfile
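A small sketch of tracking and summarizing a single column:
from whylogs import ColumnProfile

col = ColumnProfile("age")
for value in [23, 45, 31]:
    col.track(value)  # add values to the tracking statistics

summary = col.to_summary()                # ColumnSummary protobuf message
merged = col.merge(ColumnProfile("age"))  # merge with another profile of the same column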
- class whylogs.DatasetProfile(name: str, dataset_timestamp: datetime.datetime = None, session_timestamp: datetime.datetime = None, columns: dict = None, multi_columns: whylogs.core.MultiColumnProfile = None, tags: Dict[str, str] = None, metadata: Dict[str, str] = None, session_id: str = None, model_profile: whylogs.core.model_profile.ModelProfile = None, constraints: whylogs.core.statistics.constraints.DatasetConstraints = None)¶
Statistics tracking for a dataset.
A dataset refers to a collection of columns.
- Parameters
name (str) – A human readable name for the dataset profile. Could be model name. This is stored under “name” tag
dataset_timestamp (datetime.datetime) – The timestamp associated with the data (i.e. batch run). Optional.
session_timestamp (datetime.datetime) – Timestamp of the session
columns (dict) – Dictionary lookup of `ColumnProfile`s
tags (dict) – A dictionary of key->value. Can be used upstream for aggregating data. Tags must match when merging with another dataset profile object.
metadata (dict) – Metadata that can store arbitrary string mapping. Metadata is not used when aggregating data and can be dropped when merging with another dataset profile object.
session_id (str) – The unique session ID run. Should be a UUID.
constraints (DatasetConstraints) – Static assertions to be applied to tracked numeric data and profile summaries.
- __getstate__(self)¶
- __setstate__(self, serialized_profile)¶
- property name(self)¶
- property tags(self)¶
- property metadata(self)¶
- property session_timestamp(self)¶
- property session_timestamp_ms(self)¶
Return the session timestamp value in epoch milliseconds.
- property total_row_number(self)¶
- add_output_field(self, field: Union[str, List[str]])¶
- track_metrics(self, targets: List[Union[str, bool, float, int]], predictions: List[Union[str, bool, float, int]], scores: List[float] = None, model_type: whylogs.proto.ModelType = None, target_field: str = None, prediction_field: str = None, score_field: str = None)¶
Function to track metrics based on validation data.
The user may also pass the attribute names associated with the target, prediction, and/or score.
- Parameters
targets (List[Union[str, bool, float, int]]) – actual validated values
predictions (List[Union[str, bool, float, int]]) – inferred/predicted values
scores (List[float], optional) – associated scores for each prediction; all values set to 1 if not passed
target_field (str, optional) – attribute name associated with the targets
prediction_field (str, optional) – attribute name associated with the predictions
score_field (str, optional) – attribute name associated with the scores
model_type (ModelType, optional) – Default is the classification model type.
- track(self, columns, data=None, character_list=None, token_method=None)¶
Add value(s) to tracking statistics for column(s).
- Parameters
columns (str, dict) – Either the name of a column, or a dictionary specifying column names and the data (value) for each column. If a string, data must be supplied. Otherwise, data is ignored.
data (object, None) – Value to track. Specify if columns is a string.
- track_datum(self, column_name, data, character_list=None, token_method=None)¶
- track_multi_column(self, columns)¶
- track_array(self, x: numpy.ndarray, columns=None)¶
Track statistics for a numpy array
- Parameters
x (np.ndarray) – 2D array to track.
columns (list) – Optional column labels
- track_dataframe(self, df: pandas.DataFrame, character_list=None, token_method=None)¶
Track statistics for a dataframe
- Parameters
df (pandas.DataFrame) – DataFrame to track
- to_properties(self)¶
Return dataset profile related metadata
- Returns
properties – The metadata as a protobuf object.
- Return type
DatasetProperties
- to_summary(self)¶
Generate a summary of the statistics
- Returns
summary – Protobuf summary message.
- Return type
DatasetSummary
- generate_constraints(self) whylogs.core.statistics.constraints.DatasetConstraints ¶
Assemble a sparse dict of constraints for all features.
- Returns
summary – Protobuf constraints message.
- Return type
DatasetConstraints
- flat_summary(self)¶
Generate and flatten a summary of the statistics.
See
flatten_summary()
for a description
- _column_message_iterator(self)¶
- chunk_iterator(self)¶
Generate an iterator to iterate over chunks of data
- validate(self)¶
Sanity check for this object. Raises an AssertionError if invalid
- merge(self, other)¶
Merge this profile with another dataset profile object.
We will use metadata and timestamps from the current DatasetProfile in the result.
This operation will drop the metadata from the ‘other’ profile object.
- Parameters
other (DatasetProfile) –
- Returns
merged – New, merged DatasetProfile
- Return type
DatasetProfile
- _do_merge(self, other)¶
- merge_strict(self, other)¶
Merge this profile with another dataset profile object. This throws exception if session_id, timestamps and tags don’t match.
This operation will drop the metadata from the ‘other’ profile object.
- Parameters
other (DatasetProfile) –
- Returns
merged – New, merged DatasetProfile
- Return type
DatasetProfile
- serialize_delimited(self) bytes ¶
Write out in delimited format (data is prefixed with the length of the datastream).
This is useful when you are streaming multiple dataset profile objects
- Returns
data – A sequence of bytes
- Return type
bytes
- to_protobuf(self) whylogs.proto.DatasetProfileMessage ¶
Return the object serialized as a protobuf message
- Returns
message
- Return type
DatasetProfileMessage
- write_protobuf(self, protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) None ¶
Write the dataset profile to disk in binary format
- Parameters
protobuf_path (str) – local path or any path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how. The parent directory must already exist.
delimited_file (bool, optional) – whether to prefix the data with the length of output or not. Default is True
- static read_protobuf(protobuf_path: str, delimited_file: bool = True, transport_parameters: dict = None) DatasetProfile ¶
Parse a protobuf file and return a DatasetProfile object
- Parameters
protobuf_path (str) – the path of the protobuf data, can be local or any other path supported by smart_open: https://github.com/RaRe-Technologies/smart_open#how
delimited_file (bool, optional) – whether the data is delimited or not. Default is True
- Returns
whylogs.DatasetProfile object from the protobuf
- Return type
DatasetProfile
- static from_protobuf(message: whylogs.proto.DatasetProfileMessage) DatasetProfile ¶
Load from a protobuf message
- Parameters
message (DatasetProfileMessage) – The protobuf message. Should match the output of DatasetProfile.to_protobuf()
- Returns
dataset_profile
- Return type
DatasetProfile
- static from_protobuf_string(data: bytes) DatasetProfile ¶
Deserialize a serialized DatasetProfileMessage
- Parameters
data (bytes) – The serialized message
- Returns
profile – The deserialized dataset profile
- Return type
DatasetProfile
- static _parse_delimited_generator(data: bytes)¶
- static parse_delimited_single(data: bytes, pos=0)¶
Parse a single delimited entry from a byte stream.
- Parameters
data (bytes) – The bytestream
pos (int) – The starting position. Default is zero
- Returns
pos (int) – Current position in the stream after parsing
profile (DatasetProfile) – A dataset profile
- static parse_delimited(data: bytes)¶
Parse delimited data (i.e. data prefixed with the message length).
Java protobuf writes delimited messages, which is convenient for storing multiple dataset profiles. This means that the main data is prefixed with the length of the message.
- Parameters
data (bytes) – The input byte stream
- Returns
profiles – List of all Dataset profile objects
- Return type
list
- apply_summary_constraints(self, summary_constraints: Optional[Mapping[str, whylogs.core.statistics.constraints.SummaryConstraints]] = None)¶
- apply_table_shape_constraints(self, table_shape_constraints: Optional[whylogs.core.statistics.constraints.SummaryConstraints] = None)¶
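Putting the pieces together, a sketch of profiling a dataframe and round-tripping the profile through the delimited format:
import pandas as pd

from whylogs import DatasetProfile

profile = DatasetProfile("my-dataset")
profile.track_dataframe(pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}))

data = profile.serialize_delimited()                # length-prefixed bytes
restored = DatasetProfile.parse_delimited(data)[0]  # returns a list of profiles; take the first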
- whylogs.enable_mlflow(session=None) bool ¶
Enable whylogs in the mlflow module via mlflow.whylogs.
- Returns
True if MLFlow has been patched. False otherwise.
Example of whylogs and MLFlow¶
import mlflow
import pandas as pd

import whylogs

whylogs.enable_mlflow()

pdf = pd.DataFrame(
    data=[[1, 2, 3, 4, True, "x", bytes([1])]],
    columns=["b", "d", "a", "c", "e", "g", "f"],
    dtype=object,
)

active_run = mlflow.start_run()

# log a Pandas dataframe under default name
mlflow.whylogs.log_pandas(pdf)

# log a Pandas dataframe with custom name
mlflow.whylogs.log_pandas(pdf, "another dataset")

# Finish the MLFlow run
mlflow.end_run()
Created with sphinx-autoapi