.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "user-guide/tabular/auto_tutorials/plot_quickstart_in_5_minutes.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_user-guide_tabular_auto_tutorials_plot_quickstart_in_5_minutes.py:


Quickstart in 5 minutes
***********************
To run your first Deepchecks suite, all you need is the data and model
that you wish to validate. More specifically, you need:

* Your train and test data (in Pandas DataFrames or NumPy arrays)
* (optional) A :doc:`supported model ` (including XGBoost, scikit-learn models, and many more).
  Required only for checks that use the model's predictions.

To run your first suite on your data and model, you need only a few lines of
code, which start here: `Define a Dataset Object <#define-a-dataset-object>`__.

If you don't have deepchecks installed yet, install it first. The ``!`` prefix
below is Jupyter/IPython shell syntax, so run this snippet in a notebook cell:

.. code:: python

    # If you don't have deepchecks installed yet:
    import sys
    !{sys.executable} -m pip install deepchecks -U --quiet #--user

.. GENERATED FROM PYTHON SOURCE LINES 28-32

Load Data, Split Train-Val, and Train a Simple Model
====================================================
For this guide we'll use the simple iris dataset and train a simple random
forest model for multiclass classification:

.. GENERATED FROM PYTHON SOURCE LINES 32-50

.. code-block:: default

    # General imports
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    from deepchecks.tabular.datasets.classification import iris

    # Load Data
    iris_df = iris.load_data(data_format='Dataframe', as_train_test=False)
    label_col = 'target'
    df_train, df_test = train_test_split(iris_df, stratify=iris_df[label_col], random_state=0)

    # Train Model
    rf_clf = RandomForestClassifier(random_state=0)
    rf_clf.fit(df_train.drop(label_col, axis=1), df_train[label_col]);

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    RandomForestClassifier(random_state=0)

.. GENERATED FROM PYTHON SOURCE LINES 51-58

Define a Dataset Object
=======================
Initialize the Dataset object, stating the relevant metadata about the dataset
(e.g. the name of the label column).

Check out the Dataset's attributes to see which additional special columns can
be declared and used (e.g. date column, index column).

.. GENERATED FROM PYTHON SOURCE LINES 58-67

.. code-block:: default

    from deepchecks.tabular import Dataset

    # We explicitly state that this dataset has no categorical features,
    # otherwise they will be automatically inferred.
    # If the dataset has categorical features, the best practice is to pass
    # a list with their names.
    ds_train = Dataset(df_train, label=label_col, cat_features=[])
    ds_test = Dataset(df_test, label=label_col, cat_features=[])

.. GENERATED FROM PYTHON SOURCE LINES 68-76

Run a Deepchecks Suite
======================
Run the full suite
------------------
Use ``full_suite``, a collection of (most of) the prebuilt checks.

Check out the :doc:`when should you use ` deepchecks guide for more info about
the existing suites and when to use them.

.. GENERATED FROM PYTHON SOURCE LINES 76-81

.. code-block:: default

    from deepchecks.tabular.suites import full_suite

    suite = full_suite()

.. GENERATED FROM PYTHON SOURCE LINES 82-85

.. code-block:: default

    suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Full Suite: | | 0/36 [00:00

[Rendered HTML output: Full Suite]

.. GENERATED FROM PYTHON SOURCE LINES 86-90

Run the integrity suite
-----------------------
If you haven't started modeling yet and have only a single dataset, you
can use the ``single_dataset_integrity`` suite. As the output below shows,
this suite is deprecated in favor of the ``data_integrity`` suite:

.. GENERATED FROM PYTHON SOURCE LINES 90-96

.. code-block:: default

    from deepchecks.tabular.suites import single_dataset_integrity

    integ_suite = single_dataset_integrity()
    integ_suite.run(ds_train)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    the single_dataset_integrity suite is deprecated, use the data_integrity suite instead
    Data Integrity Suite: | | 0/10 [00:00

[Rendered HTML output: Data Integrity Suite]
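
To give a feel for what integrity-style checks look at, here is a minimal
plain-Python sketch of two ideas that appear in the suite listing later in this
guide: a duplicate-row ratio (compare ``DataDuplicates``) and single-valued
columns (compare ``IsSingleValue``). This is not deepchecks' implementation.
The function names, the column names, and the exact definition of the
duplicate ratio are assumptions made for illustration only.

.. code-block:: python

    from collections import Counter


    def duplicate_ratio(rows):
        """Fraction of rows that repeat an earlier row (assumed definition)."""
        if not rows:
            return 0.0
        counts = Counter(tuple(row) for row in rows)
        # Every occurrence beyond the first one counts as a duplicate.
        n_duplicates = sum(count - 1 for count in counts.values())
        return n_duplicates / len(rows)


    def single_value_columns(rows, column_names):
        """Names of columns in which every row holds the same value."""
        return [
            name
            for idx, name in enumerate(column_names)
            if len({row[idx] for row in rows}) == 1
        ]


    # Hypothetical toy table: the third row repeats the first one,
    # and the last column never changes.
    toy_columns = ["sepal_length", "sepal_width", "source"]
    toy_rows = [
        (5.1, 3.5, "lab"),
        (4.9, 3.0, "lab"),
        (5.1, 3.5, "lab"),
    ]

    print(duplicate_ratio(toy_rows))
    print(single_value_columns(toy_rows, toy_columns))

A real suite run reports such findings per check, together with the conditions
defined on each check, rather than as bare numbers.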

.. GENERATED FROM PYTHON SOURCE LINES 97-104

Run a Deepchecks Check
======================
If you want to run a specific check, you can just import it and run it directly.

Check out the :doc:`Check tabular examples ` in the examples or the
:doc:`API Reference ` for more info about the existing checks and their parameters.

.. GENERATED FROM PYTHON SOURCE LINES 104-107

.. code-block:: default

    from deepchecks.tabular.checks import TrainTestLabelDrift

.. GENERATED FROM PYTHON SOURCE LINES 108-113

.. code-block:: default

    check = TrainTestLabelDrift()
    result = check.run(ds_train, ds_test)
    result

[Rendered HTML output: Train Test Label Drift]
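
Label drift compares the label distribution in the train set with the one in
the test set. For a categorical label such as the iris target, the check's
result value names Cramer's V as its method. The following is a minimal sketch
of the plain, uncorrected Cramer's V on a train/test-by-class contingency
table. It is an illustration under stated assumptions, not deepchecks' code:
the function name ``cramers_v`` is hypothetical, and the library's own score
may differ (for example, through corrections this sketch omits).

.. code-block:: python

    from collections import Counter
    from math import sqrt


    def cramers_v(train_labels, test_labels):
        """Uncorrected Cramer's V between split membership and label."""
        train_counts = Counter(train_labels)
        test_counts = Counter(test_labels)
        classes = set(train_counts) | set(test_counts)
        n_train, n_test = len(train_labels), len(test_labels)
        n = n_train + n_test
        if n_train == 0 or n_test == 0 or len(classes) < 2:
            return 0.0  # association is undefined here; treat as "no drift"

        # Chi-squared statistic of the 2 x k table (rows: train, test).
        chi2 = 0.0
        for cls in classes:
            col_total = train_counts[cls] + test_counts[cls]
            for observed, row_total in (
                (train_counts[cls], n_train),
                (test_counts[cls], n_test),
            ):
                expected = row_total * col_total / n
                chi2 += (observed - expected) ** 2 / expected

        # With two rows, min(rows, columns) - 1 equals 1, so V = sqrt(chi2 / n).
        return sqrt(chi2 / n)


    # Same class proportions in both splits: no association.
    print(cramers_v(["a", "a", "b", "b"], ["a", "b"]))  # 0.0
    # Completely disjoint labels: maximal association.
    print(cramers_v(["a", "a", "a"], ["b", "b", "b"]))  # 1.0

A score near 0 means the label distribution looks the same in both splits,
which is what a stratified split is designed to produce.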


.. GENERATED FROM PYTHON SOURCE LINES 114-115

You can also inspect the result value, which has a check-dependent structure:

.. GENERATED FROM PYTHON SOURCE LINES 115-118

.. code-block:: default

    result.value

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    {'Drift score': 0.0, 'Method': "Cramer's V"}

.. GENERATED FROM PYTHON SOURCE LINES 119-126

Edit an Existing Suite
======================
Inspect suite and remove condition
----------------------------------
We can see that the Feature Label Correlation check failed, both for test and
for train. Since this is a very simple dataset with few features and this
behavior is not necessarily problematic, we will remove the existing conditions
for the PPS (Predictive Power Score).

.. GENERATED FROM PYTHON SOURCE LINES 126-131

.. code-block:: default

    # Let's first print the suite to find the conditions that we want to change:
    suite

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Full Suite: [
        0: PerformanceReport
            Conditions:
                0: Train-Test scores relative degradation is not greater than 0.1
        1: RocReport(excluded_classes=[])
            Conditions:
                0: AUC score for all the classes is not less than 0.7
        2: ConfusionMatrixReport
        3: SegmentPerformance(feature_1=petal width (cm), feature_2=petal length (cm))
        4: TrainTestPredictionDrift
            Conditions:
                0: categorical drift score <= 0.15 and numerical drift score <= 0.075
        5: SimpleModelComparison
            Conditions:
                0: Model performance gain over simple model is not less than 10%
        6: ModelErrorAnalysis
            Conditions:
                0: The performance difference of the detected segments must not be greater than 5%
        7: CalibrationScore
        8: RegressionSystematicError
            Conditions:
                0: Bias ratio is not greater than 0.01
        9: RegressionErrorDistribution
            Conditions:
                0: Kurtosis value is not less than -0.1
        10: UnusedFeatures
            Conditions:
                0: Number of high variance unused features is not greater than 5
        11: BoostingOverfit
            Conditions:
                0: Test score over iterations doesn't decline by more than 5% from the best score
        12: ModelInferenceTime
            Conditions:
                0: Average model inference time for one sample is not greater than 0.001
        13: DatasetsSizeComparison
            Conditions:
                0: Test-Train size ratio is not smaller than 0.01
        14: NewLabelTrainTest
            Conditions:
                0: Number of new label values is not greater than 0
        15: CategoryMismatchTrainTest
            Conditions:
                0: Ratio of samples with a new category is not greater than 0%
        16: StringMismatchComparison
            Conditions:
                0: No new variants allowed in test data
        17: DateTrainTestLeakageDuplicates
            Conditions:
                0: Date leakage ratio is not greater than 0%
        18: DateTrainTestLeakageOverlap
            Conditions:
                0: Date leakage ratio is not greater than 0%
        19: IndexTrainTestLeakage
            Conditions:
                0: Ratio of leaking indices is not greater than 0%
        20: IdentifierLeakage(ppscore_params={})
            Conditions:
                0: Identifier columns PPS is not greater than 0
        21: TrainTestSamplesMix
            Conditions:
                0: Percentage of test data samples that appear in train data not greater than 10%
        22: FeatureLabelCorrelationChange(ppscore_params={})
            Conditions:
                0: Train-Test features' Predictive Power Score difference is not greater than 0.2
                1: Train features' Predictive Power Score is not greater than 0.7
        23: TrainTestFeatureDrift
            Conditions:
                0: categorical drift score <= 0.2 and numerical drift score <= 0.1
        24: TrainTestLabelDrift
            Conditions:
                0: categorical drift score <= 0.2 and numerical drift score <= 0.1 for label drift
        25: WholeDatasetDrift
            Conditions:
                0: Drift value is not greater than 0.25
        26: IsSingleValue
            Conditions:
                0: Does not contain only a single value
        27: SpecialCharacters
            Conditions:
                0: Ratio of entirely special character samples not greater than 0.1%
        28: MixedNulls
            Conditions:
                0: Not more than 1 different null types
        29: MixedDataTypes
            Conditions:
                0: Rare data types in column are either more than 10% or less than 1% of the data
        30: StringMismatch
            Conditions:
                0: No string variants
        31: DataDuplicates
            Conditions:
                0: Duplicate data ratio is not greater than 0%
        32: StringLengthOutOfBounds
            Conditions:
                0: Ratio of outliers not greater than 0% string length outliers
        33: ConflictingLabels
            Conditions:
                0: Ambiguous sample ratio is not greater than 0%
        34: OutlierSampleDetection
        35: FeatureLabelCorrelation(ppscore_params={})
            Conditions:
                0: Features' Predictive Power Score is not greater than 0.8
    ]

Check indices depend on the suite's contents, so use the index shown in your
own printout. In the listing above, ``FeatureLabelCorrelation`` is at index 35,
whereas index 6, used in the snippet below, is ``ModelErrorAnalysis``.

.. GENERATED FROM PYTHON SOURCE LINES 132-137

.. code-block:: default

    # now we can use the check's index and the condition's number to remove it:
    print(suite[6])
    suite[6].remove_condition(0)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    ModelErrorAnalysis
        Conditions:
            0: The performance difference of the detected segments must not be greater than 5%

.. GENERATED FROM PYTHON SOURCE LINES 138-142

.. code-block:: default

    # print and see that the condition was removed
    suite[6]

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    ModelErrorAnalysis

.. GENERATED FROM PYTHON SOURCE LINES 143-151

If we now re-run the suite, all of the existing conditions will pass.

*Note: the check we manipulated will still run as part of the Suite. However,
it won't appear in the Conditions Summary, since it no longer has any
conditions defined on it. You can still see its display results in the
Additional Outputs section.*

**For more info about working with conditions, see the detailed configuring conditions guide.**

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 5.780 seconds)

.. _sphx_glr_download_user-guide_tabular_auto_tutorials_plot_quickstart_in_5_minutes.py:

.. only :: html

  .. container:: sphx-glr-footer
     :class: sphx-glr-footer-example

     .. container:: sphx-glr-download sphx-glr-download-python

        :download:`Download Python source code: plot_quickstart_in_5_minutes.py `

     .. container:: sphx-glr-download sphx-glr-download-jupyter

        :download:`Download Jupyter notebook: plot_quickstart_in_5_minutes.ipynb `

.. only:: html

  .. rst-class:: sphx-glr-signature

     `Gallery generated by Sphinx-Gallery `_