Dataset

class Dataset

Dataset wraps a pandas DataFrame together with ML-related metadata.

The Dataset class provides additional data and methods for easy access to metadata relevant to training or validating an ML model.

Parameters
df : Any
An object that can be cast to a pandas DataFrame, containing data relevant to training or validating an ML model.

label : t.Union[Hashable, pd.Series, pd.DataFrame, np.ndarray], default: None

Label column, provided either as the name of an existing column in the DataFrame or as a label object holding the label data (a pandas Series/DataFrame or a numpy array), which is concatenated to the data in the DataFrame. When label data is passed, the label name is set as follows:
  • Series: the series name, or 'target' if the name is empty.
  • DataFrame: a single column is expected, and its name is used.
  • numpy: 'target'.
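The naming rules for label data can be sketched as a small function. This is only an illustration of the rules as described here: `resolve_label_name` is a hypothetical helper, not part of the Dataset API, and the library's own handling of edge cases may differ.

```python
import numpy as np
import pandas as pd


def resolve_label_name(label, default="target"):
    """Hypothetical helper mirroring the documented label-naming rules."""
    if isinstance(label, pd.Series):
        # Series: use its name, falling back to the default when it is empty.
        return label.name if label.name not in (None, "") else default
    if isinstance(label, pd.DataFrame):
        # DataFrame: exactly one column is expected, and its name is used.
        if label.shape[1] != 1:
            raise ValueError("label DataFrame must have exactly one column")
        return label.columns[0]
    if isinstance(label, np.ndarray):
        # numpy array: it carries no name, so the default is used.
        return default
    # Anything else is treated as the name of an existing column.
    return label


named = resolve_label_name(pd.Series([0, 1], name="y"))
unnamed = resolve_label_name(pd.Series([0, 1]))
from_frame = resolve_label_name(pd.DataFrame({"y": [0, 1]}))
from_array = resolve_label_name(np.array([0, 1]))
```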

features : t.Optional[t.Sequence[Hashable]], default: None

List of names for the feature columns in the DataFrame.

cat_features : t.Optional[t.Sequence[Hashable]], default: None

List of names for the categorical features in the DataFrame. To disable categorical feature inference, pass cat_features=[].

index_name : t.Optional[Hashable], default: None

Name of the index column in the dataframe. If set_index_from_dataframe_index is True and index_name is not None, index will be created from the dataframe index level with the given name. If index levels have no names, an int must be used to select the appropriate level by order.

set_index_from_dataframe_index : bool, default: False

If set to true, index will be created from the dataframe index instead of dataframe columns (default). If index_name is None, first level of the index will be used in case of a multilevel index.

datetime_name : t.Optional[Hashable], default: None

Name of the datetime column in the dataframe. If set_datetime_from_dataframe_index is True and datetime_name is not None, date will be created from the dataframe index level with the given name. If index levels have no names, an int must be used to select the appropriate level by order.

set_datetime_from_dataframe_index : bool, default: False

If set to true, date will be created from the dataframe index instead of dataframe columns (default). If datetime_name is None, first level of the index will be used in case of a multilevel index.
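Selecting an index level by name or by position, as index_name and datetime_name do when the index or datetime is taken from the dataframe index, can be illustrated with plain pandas. This sketch shows pandas' own `get_level_values` mechanism; whether Dataset uses exactly this call internally is an assumption.

```python
import pandas as pd

# An unnamed two-level index: levels can only be addressed by position.
df = pd.DataFrame(
    {"value": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)]),
)
by_position = df.index.get_level_values(1)

# Once the levels are named, they can be addressed by name instead.
df.index = df.index.set_names(["group", "row"])
by_name = df.index.get_level_values("row")
```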

convert_datetime : bool, default: True

If set to true, date will be converted to datetime using pandas.to_datetime.

datetime_args : t.Optional[t.Dict], default: None

Arguments passed to pandas.to_datetime when converting the datetime column. See https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html for details.
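As an illustration of what a datetime_args dictionary might contain, the snippet below converts Unix timestamps with pandas.to_datetime directly. It assumes the dictionary is forwarded as keyword arguments, which the description suggests but does not state explicitly.

```python
import pandas as pd

# Raw values stored as seconds since the Unix epoch.
raw = pd.Series([1609459200, 1609545600])

# Assumed to be forwarded to pandas.to_datetime as keyword arguments.
datetime_args = {"unit": "s"}
converted = pd.to_datetime(raw, **datetime_args)
```

Other pandas.to_datetime keywords, such as format, could be supplied in the same way.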

max_categorical_ratio : float, default: 0.01

The max ratio of unique values in a column in order for it to be inferred as a categorical feature.

max_categories : int, default: 30

The maximum number of categories in a column in order for it to be inferred as a categorical feature.

max_float_categories : int, default: 5

The maximum number of categories in a float column in order for it to be inferred as a categorical feature.
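One plausible reading of how the three thresholds interact is sketched below in plain Python. The function `looks_categorical` is hypothetical, and combining the ratio and count limits with a logical AND is an assumption; the library's actual inference rule may differ in detail.

```python
def looks_categorical(values, max_categorical_ratio=0.01,
                      max_categories=30, max_float_categories=5):
    """Hypothetical check combining the three documented thresholds."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    n_unique = len(set(present))
    # Float columns are held to the stricter cap on distinct values.
    is_float = any(isinstance(v, float) for v in present)
    cap = max_float_categories if is_float else max_categories
    # Assumption: both the ratio limit and the count limit must hold.
    ratio = n_unique / len(present)
    return ratio <= max_categorical_ratio and n_unique <= cap


few_ints = [i % 3 for i in range(1000)]            # 3 distinct values in 1000 rows
six_floats = [float(i % 6) for i in range(1000)]   # 6 distinct float values
short_column = [i % 3 for i in range(100)]         # ratio 0.03 exceeds 0.01
```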

label_type : str, default: None

Used to assume the target model type if it cannot be found on the model. Possible values: 'classification_label', 'regression_label'. If None, the label type is inferred from the label using the is_categorical logic.

__init__(df: Any, label: Optional[Union[Hashable, Series, DataFrame, ndarray]] = None, features: Optional[Sequence[Hashable]] = None, cat_features: Optional[Sequence[Hashable]] = None, index_name: Optional[Hashable] = None, set_index_from_dataframe_index: bool = False, datetime_name: Optional[Hashable] = None, set_datetime_from_dataframe_index: bool = False, convert_datetime: bool = True, datetime_args: Optional[Dict] = None, max_categorical_ratio: float = 0.01, max_categories: int = 30, max_float_categories: int = 5, label_type: Optional[str] = None)
__new__(*args, **kwargs)

Attributes

Dataset.cat_features

Return list of categorical feature names.

Dataset.classes

Return the classes from the label column as a sorted list.

Dataset.columns_info

Return the role and logical type of each column.

Dataset.data

Return the data of dataset.

Dataset.datetime_col

Return the datetime column if it exists.

Dataset.datetime_name

If datetime column exists, return its name.

Dataset.features

Return list of feature names.

Dataset.features_columns

Return a DataFrame containing only the features defined in the dataset; raise an error if no features are defined.

Dataset.index_col

Return index column.

Dataset.index_name

If index column exists, return its name.

Dataset.label_col

Return a Series of the label defined in the dataset; raise an error if no label is defined.

Dataset.label_name

If label column exists, return its name.

Dataset.label_type

Return the label type.

Dataset.n_samples

Return number of samples in dataframe.

Dataset.numerical_features

Return list of numerical feature names.

Methods

Dataset.assert_datetime()

Check if datetime is defined and if not raise error.

Dataset.assert_features()

Check if features are defined (not empty) and if not raise error.

Dataset.assert_index()

Check if index is defined and if not raise error.

Dataset.assert_label()

Check if label is defined and if not raise error.

Dataset.cast_to_dataset(obj)

Verify Dataset or transform to Dataset.

Dataset.copy(new_data)

Create a copy of this Dataset with new data.

Dataset.datasets_share_categorical_features(...)

Verify that all provided datasets share same categorical features.

Dataset.datasets_share_date(*datasets)

Verify that all provided datasets share same date column.

Dataset.datasets_share_features(*datasets)

Verify that all provided datasets share same features.

Dataset.datasets_share_index(*datasets)

Verify that all provided datasets share same index column.

Dataset.datasets_share_label(*datasets)

Verify that all provided datasets share same label column.

Dataset.from_numpy(*args[, columns, label_name])

Create Dataset instance from numpy arrays.

Dataset.get_datetime_column_from_index(...)

Retrieve the datetime info from the index if _set_datetime_from_dataframe_index is True.

Dataset.is_categorical(col_name)

Check if uniques are few enough to count as categorical.

Dataset.is_sampled(n_samples)

Return True if the number of samples in the dataset would decrease when sampled with n_samples samples.

Dataset.len_when_sampled(n_samples)

Return the number of samples the sampled dataframe would have if this dataset were sampled with n_samples samples.
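The relationship between is_sampled and len_when_sampled can be illustrated with two free functions. These are hypothetical stand-ins rather than the methods themselves, and they assume sampling without replacement, so the result can never exceed the original row count.

```python
def sampled_length(total_rows, n_samples):
    """Hypothetical stand-in: rows remaining after sampling without replacement."""
    return min(total_rows, n_samples)


def will_shrink(total_rows, n_samples):
    """Hypothetical stand-in: True when sampling reduces the row count."""
    return sampled_length(total_rows, n_samples) < total_rows


large_case = (sampled_length(5000, 1000), will_shrink(5000, 1000))
small_case = (sampled_length(300, 1000), will_shrink(300, 1000))
```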

Dataset.sample(n_samples[, replace, ...])

Create a copy of the dataset object, with the internal dataframe being a sample of the original dataframe.

Dataset.select([columns, ignore_columns, ...])

Filter dataset columns by given params.

Dataset.train_test_split([train_size, ...])

Split dataset into random train and test datasets.