# Mixed Data Types

This notebook provides an overview of how to use and interpret the mixed data types check.

## What are Mixed Data Types?

A column has mixed data types when it contains both string values and numeric values (the numeric values may be stored as a numeric type or as strings such as “42.90”). This may indicate a problem in the data collection pipeline, or it may cause problems when training the model.

This check searches for columns that contain a mix of string and numeric values, and returns those columns along with the ratio of each type.
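
To make the definition concrete, here is a minimal sketch of how the string and number ratios of a single column could be computed. It is an illustration only, not the deepchecks implementation: the helper names `is_number_like` and `mixed_type_ratios` are hypothetical, and treating any value that `float()` accepts as a number (while skipping `None` and booleans) is an assumption of this sketch.

```python
from typing import Dict, Iterable


def is_number_like(value: object) -> bool:
    """Return True for numeric values and for strings that parse as a float."""
    # Assumption of this sketch: None and booleans are not counted as numbers.
    if value is None or isinstance(value, bool):
        return False
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def mixed_type_ratios(values: Iterable[object]) -> Dict[str, float]:
    """Return the share of string and number-like values, ignoring None.

    An empty dict means the column is not mixed.
    """
    present = [v for v in values if v is not None]
    if not present:
        return {}
    numbers = sum(is_number_like(v) for v in present)
    # A column holding only one kind of value is not mixed.
    if numbers in (0, len(present)):
        return {}
    numbers_ratio = numbers / len(present)
    return {"strings": 1 - numbers_ratio, "numbers": numbers_ratio}


column = ["a", "b", 42, "42.90"]
print(mixed_type_ratios(column))  # {'strings': 0.5, 'numbers': 0.5}
```

Note that the numeric string `"42.90"` is counted as a number here, which matches the definition above.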

## Run the Check

We will run the check on the adult dataset, which can be downloaded from the UCI Machine Learning Repository and is also available in deepchecks.tabular.datasets. To show the check’s result, we first introduce some data type mixing into the dataset.

import pandas as pd
import numpy as np

# Prepare functions to insert mixed data types

def insert_new_values_types(col: pd.Series, ratio_to_replace: float, values_list):
    # Cast to object so the column can hold values of different types
    col = col.to_numpy().astype(object)
    indices_to_replace = np.random.choice(range(len(col)), int(len(col) * ratio_to_replace), replace=False)
    new_values = np.random.choice(values_list, len(indices_to_replace))
    col[indices_to_replace] = new_values
    return col

def insert_string_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, ['a', 'b', 'c'])

def insert_numeric_string_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, ['1.0', '1', '10394.33'])

def insert_number_types(col: pd.Series, ratio_to_replace):
    return insert_new_values_types(col, ratio_to_replace, [66, 99.9])

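# Load dataset and insert some data type mixing (ratios chosen to match the result below)

from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import MixedDataTypes
from deepchecks.tabular.datasets.classification import adult

adult_df, _ = adult.load_data(as_train_test=True, data_format='Dataframe')
adult_df['workclass'] = insert_numeric_string_types(adult_df['workclass'], ratio_to_replace=0.01)
adult_df['education'] = insert_number_types(adult_df['education'], ratio_to_replace=0.1)
adult_df['age'] = insert_string_types(adult_df['age'], ratio_to_replace=0.5)
adult_dataset = Dataset(adult_df, cat_features=['workclass', 'education'])
check = MixedDataTypes()
result = check.run(adult_dataset)
result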


#### Mixed Data Types

Detect columns which contain a mix of numerical and string values.

* showing only the top 10 columns, you can change it using n_top_columns param

| | age | workclass | education |
|---|---|---|---|
| strings | 50% | 99% | 90% |
| numbers | 50% | 1% | 10% |

## Define a Condition

We can define a condition that requires the ratio of the “rare type” (the less common of the two types, numeric or string) to fall outside a given range. The range represents the danger zone. When the ratio is below the lower bound, the rarer type is presumably contamination, but a negligible one. When the ratio is above the upper bound, the column is presumably meant to contain both numbers and strings. When the ratio falls inside the range, there is a real chance that the rarer data type will cause problems for model training and inference.

check = MixedDataTypes().add_condition_rare_type_ratio_not_in_range((0.01, 0.2))
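
The range logic described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: the function name `rare_type_ratio_in_range` is hypothetical, and treating both bounds as exclusive is an assumption, so values that sit exactly on a bound may be handled differently by the actual condition.

```python
from typing import Tuple


def rare_type_ratio_in_range(
    strings_ratio: float,
    numbers_ratio: float,
    ratio_range: Tuple[float, float] = (0.01, 0.2),
) -> bool:
    """Return True when the rarer type's ratio lies inside the danger zone."""
    lower, upper = ratio_range
    # The "rare type" is whichever of the two types is less common.
    rare_ratio = min(strings_ratio, numbers_ratio)
    # Exclusive bounds are an assumption of this sketch.
    return lower < rare_ratio < upper


# education: 10% numbers, inside the danger zone
print(rare_type_ratio_in_range(0.90, 0.10))    # True
# age: an even split, presumably an intentionally mixed column
print(rare_type_ratio_in_range(0.50, 0.50))    # False
# hypothetical column with 0.5% numbers: negligible contamination
print(rare_type_ratio_in_range(0.995, 0.005))  # False
```

A column for which this sketch returns `True` is one the condition would be expected to flag.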