The Dataset Object#

The Dataset is a container for the data and its relevant ML metadata, such as special column roles (e.g. label, index, categorical columns). It lets checks take the relevant context into account during validation and stores that context in a convenient form. It is a basic building block in deepchecks.

Class Properties#

The common properties are:

  • label - The target values that the model is trying to predict.

  • cat_features - List of features that should be treated as categorical. If not specified explicitly, they will be inferred automatically.

  • index_name - If the dataset has a meaningful unique index, defining it as such will enable more validations to run.

  • date_name - A date column representing the sample.

  • features - Specifies the columns used by the model for training. Use it to define only a subset of the columns in the data as features. If not supplied, all columns that aren’t explicitly specified as label, date, or index are considered features.

  • name - The dataset name to present in the displays instead of Train or Test.

The Dataset’s metadata properties are all optional. Check out the API Reference for more details.
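The default rule for features can be illustrated with a small sketch. The helper default_features below is hypothetical and is not part of deepchecks; it only mirrors the rule stated above, that every column not assigned the label, index, or date role counts as a feature.

```python
import pandas as pd


def default_features(df, label=None, index_name=None, date_name=None):
    """Hypothetical helper (not a deepchecks API): return the columns that
    are not claimed by the label, index, or date roles."""
    reserved = {name for name in (label, index_name, date_name) if name is not None}
    return [col for col in df.columns if col not in reserved]


df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "feature1": [0.1, 0.3, 0.2, 0.6],
    "feature2": [4, 5, 6, 7],
    "class": [1, 2, 1, 2],
})

features = default_features(df, label="class", index_name="id")
print(features)  # ['feature1', 'feature2']
```

When no roles are given at all, the same rule treats every column as a feature, which is one reason to declare the label and index explicitly.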

Dataset API Reference#

Dataset

Dataset wraps pandas DataFrame together with ML related metadata.

Creating a Dataset#

From a Pandas DataFrame#

The default Dataset constructor expects a pd.DataFrame. The remaining properties are optional, but if your data has date, index, or label columns, defining them enables more validation checks.

>>> import pandas as pd
>>> from deepchecks.tabular import Dataset  # import path may differ between deepchecks versions
>>> d = {"id": [1,2,3,4],
...      "feature1": [0.1,0.3,0.2,0.6],
...      "feature2": [4,5,6,7],
...      "categorical_feature": [0,0,0,1],
...      "class": [1,2,1,2]}
>>> df = pd.DataFrame(d)
>>> ds = Dataset(df, label="class", index_name="id", cat_features=["categorical_feature"])

From Numpy Arrays#

A Dataset can be created using a 2D numpy array for features and 1D numpy array for the labels. The features array is mandatory, and the labels array is optional.

>>> features = np.array([[0.25, 0.3, 0.3], [0.14, 0.75, 0.3], [0.23, 0.39, 0.1]])
>>> labels = np.array([0.1, 0.1, 0.7])
>>> ds_with_labels = Dataset.from_numpy(features, labels)
>>> ds_without_labels = Dataset.from_numpy(features)

It is also possible to assign names to the features and the label:

>>> Dataset.from_numpy(
...     features, labels,
...     columns=['feat1', 'feat2', 'feat3'],
...     label_name='target'
... )

The rest of the Dataset’s properties can also be passed as regular keyword arguments:

>>> Dataset.from_numpy(
...     features, labels,
...     columns=['feat1', 'feat2', 'feat3'],
...     label_name='target',
...     max_float_categories=10
... )
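For intuition, a call like the ones above corresponds roughly to assembling a DataFrame by hand and naming the label column. The sketch below uses only numpy and pandas; it is an assumption about the resulting layout, not the deepchecks implementation, and the final Dataset(...) call is left as a comment so the snippet runs without deepchecks installed.

```python
import numpy as np
import pandas as pd

features = np.array([[0.25, 0.3, 0.3], [0.14, 0.75, 0.3], [0.23, 0.39, 0.1]])
labels = np.array([0.1, 0.1, 0.7])

# Name the feature columns, then attach the labels as one more column.
df = pd.DataFrame(features, columns=["feat1", "feat2", "feat3"])
df["target"] = labels

print(df.shape)          # (3, 4)
print(list(df.columns))  # ['feat1', 'feat2', 'feat3', 'target']

# With deepchecks installed, this frame could then be wrapped as:
# ds = Dataset(df, label="target")
```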

Useful Functions#

Train Test Split#

Internally uses sklearn.model_selection.train_test_split (so the same arguments can be passed), copies the metadata to each part of the split, and returns two Datasets.

>>> train_ds, test_ds = ds.train_test_split(stratify=True)
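The idea of splitting the rows while carrying the metadata along can be sketched without deepchecks. The function split_with_metadata and the metadata dictionary below are hypothetical, and the split uses DataFrame.sample in place of scikit-learn's train_test_split to keep the example free of extra dependencies.

```python
import pandas as pd


def split_with_metadata(df, metadata, test_size=0.25, random_state=0):
    """Hypothetical sketch: split rows, give each part its own metadata copy."""
    test_df = df.sample(frac=test_size, random_state=random_state)
    train_df = df.drop(test_df.index)
    return (train_df, dict(metadata)), (test_df, dict(metadata))


df = pd.DataFrame({
    "id": list(range(8)),
    "feature1": [0.1, 0.3, 0.2, 0.6, 0.5, 0.9, 0.4, 0.7],
    "class": [1, 2, 1, 2, 1, 2, 1, 2],
})
metadata = {"label": "class", "index_name": "id"}

(train_df, train_meta), (test_df, test_meta) = split_with_metadata(df, metadata)
print(len(train_df), len(test_df))  # 6 2
```

Both parts describe their columns identically, so a check that compares train against test reads the same label and index roles on each side.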

Copy#

Copy creates a new Dataset from a new pd.DataFrame while reusing the metadata of an existing Dataset instance. This can be useful for implementing data splits independently, or for comparing datasets when new data arrives in the same known format as the existing data.

>>> new_ds = ds.copy(new_df)
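As a rough mental model of what reusing the metadata means, the sketch below defines a small stand-in container. MiniDataset is hypothetical and far simpler than the deepchecks Dataset; it only shows the metadata fields being kept while the underlying frame is swapped.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional

import pandas as pd


@dataclass
class MiniDataset:
    """Hypothetical stand-in for a data-plus-metadata container."""
    data: pd.DataFrame
    label: Optional[str] = None
    index_name: Optional[str] = None
    cat_features: List[str] = field(default_factory=list)

    def copy(self, new_data: pd.DataFrame) -> "MiniDataset":
        # Keep every metadata field, swap only the underlying frame.
        return replace(self, data=new_data, cat_features=list(self.cat_features))


old = MiniDataset(
    pd.DataFrame({"id": [1, 2], "color": [0, 1], "class": [1, 2]}),
    label="class", index_name="id", cat_features=["color"],
)
new_df = pd.DataFrame({"id": [3, 4], "color": [1, 1], "class": [2, 2]})
new = old.copy(new_df)

print(new.label, new.index_name, new.cat_features)  # class id ['color']
```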

Working with Class Parameters#

We can work directly with the Dataset object, to inspect its defined features and label:

>>> ds.features
['feature1', 'feature2', 'categorical_feature']
>>> ds.label_name
'class'

Get its internal pd.DataFrame:

>>> ds.data
   id  feature1  feature2  categorical_feature  class
0   1       0.1         4                    0      1
1   2       0.3         5                    0      2
2   3       0.2         6                    0      1
3   4       0.6         7                    1      2

Or extract directly only the feature columns or only the label column from within it:

>>> ds.features_columns
   feature1  feature2  categorical_feature
0       0.1         4                    0
1       0.3         5                    0
2       0.2         6                    0
3       0.6         7                    1
>>> ds.label_col
0    1
1    2
2    1
3    2
Name: class, dtype: int64

Inferring Categorical Features#

Warning

It is highly recommended to explicitly state the categorical features or to set their column type to category. Otherwise, the inherent limitations of automatic type detection may cause inconsistencies (misdetection, different detection between train and test, etc.) and require tuning and adaptations.

If the parameter cat_features was not passed explicitly, the following inference logic will run on the columns to determine which are classified as categorical:

  1. If the dtype of any existing column is category, then all columns of type category, and only those, are considered categorical.

  2. Otherwise, a heuristic is used to deduce the type. For a column to be inferred as categorical, both of the following conditions must be met:

    • The ratio between the number of unique values and the number of samples is <= max_categorical_ratio

    • The number of unique values in the column is <= max_categories_allowed_per_column_type

Check the API Reference for deepchecks.utils.type_inference.infer_categorical_features() for more details.
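The two rules can be restated as a short sketch. The function infer_categorical_sketch, its max_categories parameter name, and both default thresholds are assumptions chosen for illustration; they are not the deepchecks implementation or its defaults.

```python
import pandas as pd


def infer_categorical_sketch(df, max_categorical_ratio=0.01, max_categories=30):
    """Hypothetical restatement of the two rules; thresholds are illustrative."""
    # Rule 1: if any column has the 'category' dtype, trust dtypes exclusively.
    by_dtype = [c for c in df.columns if isinstance(df[c].dtype, pd.CategoricalDtype)]
    if by_dtype:
        return by_dtype

    n_samples = len(df)
    if n_samples == 0:
        return []

    # Rule 2: otherwise require both a low unique-to-sample ratio
    # and a small absolute number of unique values.
    inferred = []
    for col in df.columns:
        n_unique = df[col].nunique(dropna=True)
        if n_unique / n_samples <= max_categorical_ratio and n_unique <= max_categories:
            inferred.append(col)
    return inferred


df = pd.DataFrame({
    "flag": [0, 1] * 500,                    # 2 unique values in 1000 rows
    "value": [float(i) for i in range(1000)],  # all values unique
})
heuristic_result = infer_categorical_sketch(df)
print(heuristic_result)  # ['flag']

# Adding one 'category' column switches to rule 1, so 'flag' is no longer picked.
df_typed = df.assign(bucket=pd.Series(["a", "b"] * 500, dtype="category"))
dtype_result = infer_categorical_sketch(df_typed)
print(dtype_result)  # ['bucket']
```

The second call shows why mixing explicit category dtypes with columns left to the heuristic can surprise: once any column is typed as category, untyped columns are not inferred.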