
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "checks_gallery/tabular/integrity/plot_mixed_data_types.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_checks_gallery_tabular_integrity_plot_mixed_data_types.py:


Mixed Data Types
****************

This notebook provides an overview of how to use and understand the mixed data types check.

**Structure:**

* `What are Mixed Data Types? <#what-are-mixed-data-types>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

What are Mixed Data Types?
==========================
A column has mixed data types when it contains both string values and numeric
values (either as a numeric type or as a numeric string such as "42.90").
This may indicate a problem in the data collection pipeline, or cause problems
during model training.

This check searches for columns with a mix of strings and numeric values and
returns them along with the ratio of each type.

Run the Check
=============
We will run the check on the adult dataset, which can be downloaded from the
`UCI machine learning repository `_ and is also available in
`deepchecks.tabular.datasets`. To show the check's result, we first introduce
some data type mixing into it.

.. GENERATED FROM PYTHON SOURCE LINES 28-61

.. code-block:: default


    import pandas as pd
    import numpy as np
    from deepchecks.tabular.datasets.classification import adult

    # Prepare functions to insert mixed data types

    def insert_new_values_types(col: pd.Series, ratio_to_replace: float, values_list):
        # Cast to object so the column can hold values of different types
        col = col.to_numpy().astype(object)
        # Pick a random subset of positions and overwrite them with the new values
        indices_to_replace = np.random.choice(range(len(col)), int(len(col) * ratio_to_replace), replace=False)
        new_values = np.random.choice(values_list, len(indices_to_replace))
        col[indices_to_replace] = new_values
        return col


    def insert_string_types(col: pd.Series, ratio_to_replace):
        return insert_new_values_types(col, ratio_to_replace, ['a', 'b', 'c'])


    def insert_numeric_string_types(col: pd.Series, ratio_to_replace):
        return insert_new_values_types(col, ratio_to_replace, ['1.0', '1', '10394.33'])


    def insert_number_types(col: pd.Series, ratio_to_replace):
        return insert_new_values_types(col, ratio_to_replace, [66, 99.9])


    # Load dataset and insert some data type mixing
    adult_df, _ = adult.load_data(as_train_test=True, data_format='Dataframe')
    adult_df['workclass'] = insert_numeric_string_types(adult_df['workclass'], ratio_to_replace=0.01)
    adult_df['education'] = insert_number_types(adult_df['education'], ratio_to_replace=0.1)
    adult_df['age'] = insert_string_types(adult_df['age'], ratio_to_replace=0.5)


.. GENERATED FROM PYTHON SOURCE LINES 62-71

.. code-block:: default


    from deepchecks.tabular import Dataset
    from deepchecks.tabular.checks import MixedDataTypes

    adult_dataset = Dataset(adult_df, cat_features=['workclass', 'education'])
    check = MixedDataTypes()
    result = check.run(adult_dataset)
    result


.. raw:: html

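    <h4>Mixed Data Types</h4>
    <p>Detect columns which contain a mix of numerical and string values.</p>
    <h5>Additional Outputs</h5>
    <p>* showing only the top 10 columns, you can change it using n_top_columns param</p>
    <table>
      <thead>
        <tr><th></th><th>age</th><th>workclass</th><th>education</th></tr>
      </thead>
      <tbody>
        <tr><th>strings</th><td>50%</td><td>99%</td><td>90%</td></tr>
        <tr><th>numbers</th><td>50%</td><td>1%</td><td>10%</td></tr>
      </tbody>
    </table>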


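To make the reported ratios more concrete, the following is a minimal pandas
sketch of how the share of strings and numbers in a single column could be
computed. It is an illustration under our own assumptions, not the
implementation used by ``MixedDataTypes``: the helper names
``is_number_like`` and ``type_ratios`` are hypothetical, strings that parse
with ``float()`` are counted as numbers, and every other non-missing value is
counted as a string.

```python
import numbers

import pandas as pd


def is_number_like(value) -> bool:
    """Return True for numeric values and for strings that parse as a float.

    Hypothetical helper for illustration only; deepchecks may classify
    values differently.
    """
    if isinstance(value, bool):
        return False
    if isinstance(value, numbers.Number):
        return True
    if isinstance(value, str):
        try:
            float(value)
            return True
        except ValueError:
            return False
    return False


def type_ratios(col: pd.Series) -> dict:
    """Return the share of string-like and number-like values in a column."""
    values = col.dropna()
    total = len(values)
    if total == 0:
        return {'strings': 0.0, 'numbers': 0.0}
    numeric_count = sum(is_number_like(v) for v in values)
    # Assumption: anything that is not number-like is counted as a string
    return {
        'strings': (total - numeric_count) / total,
        'numbers': numeric_count / total,
    }


# Five plain strings, two numbers, and one numeric string ("42.90")
sample = pd.Series(['a', 'b', 'c', 7, 3.5, '42.90', 'd', 'e'], dtype=object)
ratios = type_ratios(sample)
print(ratios)
```

In this sample the numeric string ``'42.90'`` is grouped with the numbers,
which mirrors the definition of mixed data types given at the top of this page.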
.. GENERATED FROM PYTHON SOURCE LINES 72-79

Define a Condition
==================
We can define a condition that enforces that the ratio of the "rare type" (the
less common type, either numeric or string) is not in a given range. The range
represents the danger zone: when the ratio is below the lower bound, it is
presumably contamination, but a negligible one; when the ratio is above the
upper bound, the column is presumably meant to contain both numbers and string
values. When the ratio falls inside the range, there is a real chance that the
rarer data type will cause problems for model training and inference.

.. GENERATED FROM PYTHON SOURCE LINES 79-83

.. code-block:: default


    check = MixedDataTypes().add_condition_rare_type_ratio_not_in_range((0.01, 0.2))
    result = check.run(adult_dataset)
    result.show(show_additional_outputs=False)


.. raw:: html

    <h4>Mixed Data Types</h4>


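The rule behind the condition can be sketched in a few lines of plain Python.
This is only our reading of the description above, not the deepchecks source:
the function name ``rare_type_ratio_in_range`` is hypothetical, and treating
both bounds as exclusive is an assumption. The input ratios are the rounded
values reported in the check output of the previous section.

```python
def rare_type_ratio_in_range(strings_ratio: float, numbers_ratio: float,
                             ratio_range=(0.01, 0.2)) -> bool:
    """Return True when the less common type's ratio lies inside the range.

    Hypothetical helper; exclusive bounds are an assumption of this sketch.
    """
    rare_ratio = min(strings_ratio, numbers_ratio)
    lower, upper = ratio_range
    return lower < rare_ratio < upper


# (strings ratio, numbers ratio) per column, rounded as in the output above
columns = {
    'age': (0.50, 0.50),
    'workclass': (0.99, 0.01),
    'education': (0.90, 0.10),
}

# Columns whose rare type falls in the danger zone would fail the condition
flagged = [
    name for name, (strings, numbers) in columns.items()
    if rare_type_ratio_in_range(strings, numbers)
]
print(flagged)
```

Under these assumptions only ``education`` is flagged: the rare type in
``age`` is too common to be an accident, and the rare type in ``workclass``
sits at the lower bound and is treated as negligible.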
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 3.852 seconds)


.. _sphx_glr_download_checks_gallery_tabular_integrity_plot_mixed_data_types.py:

.. only :: html

  .. container:: sphx-glr-footer
     :class: sphx-glr-footer-example


     .. container:: sphx-glr-download sphx-glr-download-python

        :download:`Download Python source code: plot_mixed_data_types.py `

     .. container:: sphx-glr-download sphx-glr-download-jupyter

        :download:`Download Jupyter notebook: plot_mixed_data_types.ipynb `


.. only:: html

  .. rst-class:: sphx-glr-signature

     `Gallery generated by Sphinx-Gallery `_