Ecommerce Dataset

The Ecommerce dataset contains transaction information of several products for a popular grocery supermarket in India. It contains features such as the product’s description, category, market price and user rating.

The original data was sourced from Kaggle’s BigBasket Entire Product List. Additional transformations were then applied to the source data, such as oversampling and feature creation/engineering.

License: CC BY-NC-SA 4.0

Usage

You can follow this guide to see how to use the ecommerce dataset:

Versions and Data Partitions

Currently the dataset contains one version: base. The task for the base version is to classify whether an incoming product should be offered a discount, given product features such as history of items sold, user rating, category and market price. The base version contains two partitions: Baseline and Inference.

base

  • Baseline
    • Number of instances: 34743

    • Number of features: 19
      • Input Features: 5

      • Target Features: 1

      • Prediction Features: 2

      • Extra Features: 11

    • Period: from 2022-08-09 to 2022-08-16

  • Inference
    • Number of instances: 86899

    • Number of features: 19
      • Input Features: 5

      • Target Features: 1

      • Prediction Features: 2

      • Extra Features: 11

    • Period: from 2022-08-19 to 2022-09-08

There are 11 possible categories for a given product. To reach the desired dataset size, the original data was oversampled for each category with the Random Over-Sampling Examples (ROSE) technique (Menardi and Torelli, 2014).
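To make the oversampling step concrete, here is a minimal sketch of a ROSE-style smoothed bootstrap for the numeric columns of a single category. It is not the code used to build this dataset: the function name `rose_oversample`, the toy values, and the Gaussian kernel with a rule-of-thumb bandwidth are assumptions made for illustration.

```python
import numpy as np


def rose_oversample(X, n_new, seed=0):
    """Draw n_new synthetic rows from a Gaussian kernel centred on random rows of X.

    X is a 2-D array of numeric features for one category (hypothetical input).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape

    # Rule-of-thumb bandwidth per feature (an assumption of this sketch).
    h = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4)) * X.std(axis=0, ddof=1)

    # Step 1: pick rows uniformly at random, with replacement.
    idx = rng.integers(0, n, size=n_new)

    # Step 2: perturb each picked row with Gaussian noise scaled by h.
    return X[idx] + rng.normal(size=(n_new, d)) * h


# Toy example: market_price and rating for four products in one category.
X_cat = np.array([[120.0, 4.1], [95.0, 3.8], [240.0, 4.5], [60.0, 3.2]])
synthetic = rose_oversample(X_cat, n_new=10)
print(synthetic.shape)  # (10, 2)
```

Unlike plain duplication, each synthetic row lands near an original row rather than exactly on it, which is the main idea behind the smoothed bootstrap.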

The original data did not contain date and time information. For this dataset, the data was artificially partitioned into separate days in a preprocessing stage.
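One way such an artificial partition could be produced is sketched below. The exact scheme used for this dataset is not documented here, so the even split across days and the helper name `assign_days` are assumptions.

```python
from datetime import date, timedelta


def assign_days(n_rows, start, end):
    """Return one date per row, spreading rows as evenly as possible over [start, end]."""
    n_days = (end - start).days + 1
    # Row i goes to day floor(i * n_days / n_rows), so day sizes differ by at most one.
    return [start + timedelta(days=i * n_days // n_rows) for i in range(n_rows)]


# Baseline partition: 34743 rows between 2022-08-09 and 2022-08-16 (8 days).
days = assign_days(34743, date(2022, 8, 9), date(2022, 8, 16))
print(days[0], days[-1], len(set(days)))  # 2022-08-09 2022-08-16 8
```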

Features Description

Target Features

These are features that are typically targeted for prediction/classification.

| Feature | Description | Type | Present in Versions |
| --- | --- | --- | --- |
| output_discount | if the purchased product had a discount, bool | Target | all |

Prediction Features

These features are the outputs of a given ML model. They can be the predicted class itself or scores such as uncertainty, probability and confidence.

| Feature | Description | Type | Present in Versions |
| --- | --- | --- | --- |
| output_prediction | Random Forest model’s prediction for target variable output_discount | Prediction | all |
| output_score | Class probability for the predicted class | Prediction | all |

output_prediction and output_score were obtained by training a Random Forest model with the scikit-learn library. The data used to train the model was separated beforehand and is not present in this dataset.

The remaining partitions (baseline and inference) were each oversampled and split into separate days.
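The relationship between the two prediction features can be illustrated with a small scikit-learn sketch. The synthetic data, the hyperparameters and the variable names are assumptions for illustration; the model actually used for this dataset and its training data are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: two numeric inputs and a boolean target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))
y_train = X_train[:, 0] + X_train[:, 1] > 0
X_new = rng.normal(size=(5, 2))

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Class probabilities, one column per entry of clf.classes_.
proba = clf.predict_proba(X_new)

# output_prediction: the most probable class for each row.
output_prediction = clf.classes_[proba.argmax(axis=1)]
# output_score: the probability assigned to that predicted class.
output_score = proba.max(axis=1)

print(output_prediction, output_score)
```

Because the score is the probability of the predicted class, it never drops below 0.5 for a binary target.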

Extra Features

These are extra features that do not belong to any of the previous categories but still contain relevant information about the data.

| Feature | Description | Type | Present in Versions |
| --- | --- | --- | --- |
| category.Baby Care | Binarized category feature for Baby Care class | Extra | all |
| category.Bakery, Cakes and Dairy | Binarized category feature for Bakery, Cakes and Dairy class | Extra | all |
| category.Beauty and Hygiene | Binarized category feature for Beauty and Hygiene class | Extra | all |
| category.Beverages | Binarized category feature for Beverages class | Extra | all |
| category.Cleaning and Household | Binarized category feature for Cleaning and Household class | Extra | all |
| category.Eggs, Meat and Fish | Binarized category feature for Eggs, Meat and Fish class | Extra | all |
| category.Foodgrains, Oil and Masala | Binarized category feature for Foodgrains, Oil and Masala class | Extra | all |
| category.Fruits and Vegetables | Binarized category feature for Fruits and Vegetables class | Extra | all |
| category.Gourmet and World Food | Binarized category feature for Gourmet and World Food class | Extra | all |
| category.Kitchen, Garden and Pets | Binarized category feature for Kitchen, Garden and Pets class | Extra | all |
| category.Snacks and Branded Foods | Binarized category feature for Snacks and Branded Foods class | Extra | all |
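Binarized category columns of this shape can be derived from the string-valued category feature with one-hot encoding. The sketch below uses pandas on a toy frame; whether the dataset stores the indicators as integers or booleans is not stated, so the `dtype=int` choice is an assumption.

```python
import pandas as pd

# Toy frame with the original string-valued category column.
df = pd.DataFrame({"category": ["Beverages", "Baby Care", "Beverages"]})

# One indicator column per category, named "category.<class>".
dummies = pd.get_dummies(df["category"], prefix="category", prefix_sep=".", dtype=int)
df = pd.concat([df, dummies], axis=1)

print(list(df.columns))
# ['category', 'category.Baby Care', 'category.Beverages']
```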

Input Features

These are the input features that were used to train the model and to generate the prediction features.

| Feature | Description | Type | Present in Versions |
| --- | --- | --- | --- |
| product | text description of the product, str | Input | all |
| sales_last_week | number of items sold in the last week, int | Input | all |
| market_price | the product’s market price, float | Input | all |
| rating | the user’s rating for the product at time of purchase, float | Input | all |
| category | the product’s category, str | Input | all |

The sales_last_week feature was derived from each product’s total count across the complete dataset and was created for demonstration purposes.
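A count-based feature like this can be sketched in a few lines. The product names below are hypothetical, and using the raw count directly as the feature value is an assumption, since the exact derivation is not specified.

```python
from collections import Counter

# Hypothetical product descriptions after oversampling (duplicates are expected).
products = ["Green Tea 100 g", "Baby Wipes", "Green Tea 100 g", "Green Tea 100 g"]

# Count how often each product appears in the complete dataset.
counts = Counter(products)

# Assumption of this sketch: the feature is the raw count itself.
sales_last_week = [counts[p] for p in products]
print(sales_last_week)  # [3, 1, 3, 3]
```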

License

CC BY-NC-SA 4.0

References

  • Giovanna Menardi and Nicola Torelli. Training and assessing classification rules with imbalanced data. Data Mining and Knowledge Discovery, 28:92–122, 2014.