whylogs.util.dsketch

Define functions and classes for interfacing with datasketches

Module Contents

Classes

FrequentItemsSketch

A class to implement frequent item counting for mixed data types.

Functions

deserialize_kll_floats_sketch(x: bytes, kind: str = 'float')

Deserialize a KLL floats sketch. Compatible with whylogs-java

deserialize_frequent_strings_sketch(x: bytes)

Deserialize a frequent strings sketch. Compatible with whylogs-java

whylogs.util.dsketch.deserialize_kll_floats_sketch(x: bytes, kind: str = 'float')

Deserialize a KLL floats sketch. Compatible with whylogs-java

whylogs histograms are serialized as KLL floats sketches.

Parameters
  • x (bytes) – Serialized sketch

  • kind (str, optional) – Specify type of sketch: ‘float’ or ‘int’

Returns

sketch – If x is an empty sketch, return None, else return the deserialized sketch.

Return type

kll_floats_sketch, kll_ints_sketch, or None

whylogs.util.dsketch.deserialize_frequent_strings_sketch(x: bytes)

Deserialize a frequent strings sketch. Compatible with whylogs-java

Wrapper for datasketches.frequent_strings_sketch.deserialize

Parameters

x (bytes) – Serialized sketch

Returns

sketch – If x is an empty string sketch, returns None, else returns the deserialized string sketch

Return type

datasketches.frequent_strings_sketch, None

class whylogs.util.dsketch.FrequentItemsSketch(lg_max_k: int = None, sketch: datasketches.frequent_strings_sketch = None)

A class to implement frequent item counting for mixed data types.

Wraps datasketches.frequent_strings_sketch and encodes numbers as strings, because the datasketches Python implementation does not provide frequent number tracking.

Parameters
  • lg_max_k (int, optional) – Parameter controlling the size and accuracy of the sketch. A larger value increases both the accuracy and the memory footprint of the sketch

  • sketch (datasketches.frequent_strings_sketch, optional) – Initialize with an existing frequent strings sketch
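The string-encoding idea described above can be illustrated with a minimal, standard-library-only sketch. It assumes a JSON-based encoding, which is an assumption and not confirmed by this page; encode_item and decode_item are hypothetical helpers, and an exact Counter stands in for the approximate frequent strings sketch.

```python
import json
from collections import Counter


def encode_item(x):
    """Encode a JSON-serializable item as a string (hypothetical helper)."""
    return json.dumps(x)


def decode_item(s):
    """Invert encode_item, recovering the original value and its type."""
    return json.loads(s)


# Exact counting stands in for the approximate sketch in this illustration.
counts = Counter()
for item in [1, "1", 2.5, "a", 1, "a", "a"]:
    counts[encode_item(item)] += 1

# Decode the string keys back into their original types.
decoded = [(decode_item(key), n) for key, n in counts.most_common()]
```

Because the integer 1 encodes as 1 while the string "1" encodes with surrounding quotes, the two are counted separately and keep their types after decoding.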

DEFAULT_MAX_ITEMS_SIZE = 128
DEFAULT_ERROR_TYPE
get_apriori_error(self, lg_max_map_size: int, estimated_total_weight: int)

Return an a priori estimate of the uncertainty for the given parameters

Parameters
  • lg_max_map_size (int) – The lg_max_k value

  • estimated_total_weight (int) – Estimated total weight (see FrequentItemsSketch.get_total_weight())

Returns

error – Approximate uncertainty

Return type

float
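As a rough worked example of what such an estimate looks like, the sketch below assumes the relationship used by the Apache DataSketches frequent items implementation, where epsilon is 3.5 divided by the maximum map size and the a priori error is epsilon times the estimated total weight. The constant and both function names are assumptions for illustration; this page does not state the formula.

```python
# Assumption: epsilon factor used by the Apache DataSketches frequent items sketch.
EPSILON_FACTOR = 3.5


def epsilon_for_lg_size(lg_max_map_size: int) -> float:
    """Relative error for a map holding 2**lg_max_map_size entries."""
    return EPSILON_FACTOR / (1 << lg_max_map_size)


def apriori_error(lg_max_map_size: int, estimated_total_weight: int) -> float:
    """Absolute error bound expected before any data has been seen."""
    return epsilon_for_lg_size(lg_max_map_size) * estimated_total_weight


# A map of 2**10 entries and one million updates of weight 1.
error = apriori_error(10, 1_000_000)
```

Under these assumptions, increasing lg_max_map_size by one halves the expected error, at the cost of roughly doubling the map size.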

get_epsilon_for_lg_size(self, lg_max_map_size: int)
get_estimate(self, item)
get_lower_bound(self, item)
get_upper_bound(self, item)
get_frequent_items(self, err_type: datasketches.frequent_items_error_type = None, threshold: int = 0, decode: bool = True)

Retrieve the frequent items.

Parameters
  • err_type (datasketches.frequent_items_error_type) – Override default error type

  • threshold (int) – Minimum count for returned items

  • decode (bool, optional) – Decode the returned values (default True). Internally, all items are encoded as strings.

Returns

items – A list of (item, count) tuples

Return type

list

get_num_active_items(self)
get_serialized_size_bytes(self)
get_sketch_epsilon(self)
get_total_weight(self)
is_empty(self)
merge(self, other)

Merge the item counts of this sketch with another.

This object will not be modified. This operation is commutative.

Parameters

other (FrequentItemsSketch) – The other sketch
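The note that this object is not modified suggests that the merged result comes back as a new sketch. The semantics can be illustrated with exact counters from the standard library; merge_counts is a hypothetical stand-in and not part of whylogs, and a real sketch merge is approximate.

```python
from collections import Counter


def merge_counts(a: Counter, b: Counter) -> Counter:
    """Return a new Counter with summed counts; both inputs are left untouched."""
    return a + b


left = Counter({"a": 3, "b": 1})
right = Counter({"a": 2, "c": 5})

merged = merge_counts(left, right)
swapped = merge_counts(right, left)  # commutative: same result in either order
```

Here left and right keep their original counts, and merged equals swapped.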

copy(self)

Returns

sketch – A copy of this sketch

Return type

FrequentItemsSketch

serialize(self)

Serialize this sketch as a byte string.

See also FrequentItemsSketch.deserialize()

Returns

data – Serialized object.

Return type

bytes

to_string(self, print_items=False)
update(self, x, weight=1)

Track an item.

Parameters
  • x (object) – Item to track

  • weight (int) – Number of times the item appears

to_summary(self, max_items=30, min_count=1)

Generate a protobuf summary. Returns None if there are no frequent items.

Parameters
  • max_items (int) – Maximum number of items to return. The most frequent items will be returned

  • min_count (int) – Minimum count required for an item to be returned

Returns

summary – Protobuf summary message

Return type

FrequentItemsSummary
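How max_items and min_count interact can be sketched with plain Python. This assumes that items are first filtered by min_count and then truncated to the max_items most frequent; that ordering is an assumption, and select_for_summary is a hypothetical helper that returns tuples in place of a protobuf message.

```python
from collections import Counter
from typing import List, Optional, Tuple


def select_for_summary(
    counts: Counter, max_items: int = 30, min_count: int = 1
) -> Optional[List[Tuple[str, int]]]:
    """Pick the items a summary would report, or None if nothing qualifies."""
    # most_common() yields items from highest to lowest count.
    kept = [(item, n) for item, n in counts.most_common() if n >= min_count]
    kept = kept[:max_items]
    return kept or None


counts = Counter({"a": 5, "b": 2, "c": 1})
top_two = select_for_summary(counts, max_items=2, min_count=2)
nothing = select_for_summary(counts, min_count=10)
```

With these inputs, top_two holds the entries for "a" and "b", and nothing is None because no item reaches a count of 10.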

to_protobuf(self)

Generate a protobuf representation of this object

static from_protobuf(message: whylogs.proto.FrequentItemsSketchMessage)

Initialize a FrequentItemsSketch from a protobuf FrequentItemsSketchMessage

static _encode_item(x)
static deserialize(x: bytes)

Deserialize a frequent items sketch.

If x is an empty sketch, None is returned