.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tabular/auto_tutorials/quickstarts/plot_quickstart_in_5_minutes.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tabular_auto_tutorials_quickstarts_plot_quickstart_in_5_minutes.py:

.. _quick_full_suite:

Full Suite Quickstart
************************************

To run your first Deepchecks suite, all you need is the data and model that you wish to validate.
More specifically, you need:

* Your train and test data (in pandas DataFrames or NumPy arrays)
* (optional) A :ref:`tabular__supported_models` (including XGBoost, scikit-learn models, and many more).
  Required only for checks that use the model's predictions.

Running your first suite on your own data and model takes only a few lines of code,
starting at `Define a Dataset Object <#define-a-dataset-object>`__.

.. code:: python

    # If you don't have deepchecks installed yet (run this in a Jupyter/IPython cell):
    import sys
    !{sys.executable} -m pip install deepchecks -U --quiet #--user

.. GENERATED FROM PYTHON SOURCE LINES 30-34

Load Data, Split Train-Val, and Train a Simple Model
====================================================

For this guide, we'll use the iris dataset and train a simple random forest model
for multiclass classification:

.. GENERATED FROM PYTHON SOURCE LINES 34-52

.. code-block:: default

    # General imports
    import numpy as np
    import pandas as pd

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    from deepchecks.tabular.datasets.classification import iris

    # Load Data
    iris_df = iris.load_data(data_format='Dataframe', as_train_test=False)
    label_col = 'target'
    df_train, df_test = train_test_split(iris_df, stratify=iris_df[label_col], random_state=0)

    # Train Model
    rf_clf = RandomForestClassifier(random_state=0)
    rf_clf.fit(df_train.drop(label_col, axis=1), df_train[label_col]);

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    RandomForestClassifier(random_state=0)

.. GENERATED FROM PYTHON SOURCE LINES 53-60

Define a Dataset Object
=======================

Initialize the Dataset object, stating the relevant metadata about the dataset
(e.g. the name of the label column).

Check out the Dataset's attributes to see which additional special columns
can be declared and used (e.g. date column, index column).

.. GENERATED FROM PYTHON SOURCE LINES 60-69

.. code-block:: default

    from deepchecks.tabular import Dataset

    # We explicitly state that this dataset has no categorical features; otherwise they will be automatically inferred.
    # If the dataset has categorical features, the best practice is to pass a list with their names.

    ds_train = Dataset(df_train, label=label_col, cat_features=[])
    ds_test = Dataset(df_test, label=label_col, cat_features=[])

.. GENERATED FROM PYTHON SOURCE LINES 70-78

Run a Deepchecks Suite
======================

Run the full suite
------------------

Use ``full_suite``, a collection of (most of) the prebuilt checks.

Check out the :ref:`when you should use ` deepchecks guide
for more info about the existing suites and when to use them.

.. GENERATED FROM PYTHON SOURCE LINES 78-83

.. code-block:: default

    from deepchecks.tabular.suites import full_suite

    suite = full_suite()

.. GENERATED FROM PYTHON SOURCE LINES 84-87

.. code-block:: default

    suite.run(train_dataset=ds_train, test_dataset=ds_test, model=rf_clf)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Full Suite: | | 0/35 [Time: 00:00]
    Full Suite: |# | 1/35 [Time: 00:00, Check=Train Test Performance]
    Full Suite: |### | 3/35 [Time: 00:00, Check=Confusion Matrix Report]
    Full Suite: |##### | 5/35 [Time: 00:00, Check=Simple Model Comparison]
    Full Suite: |###### | 6/35 [Time: 00:05, Check=Weak Segments Performance]
    Full Suite: |######### | 9/35 [Time: 00:05, Check=Unused Features]
    Full Suite: |################### | 19/35 [Time: 00:05, Check=Train Test Samples Mix]
    Full Suite: |####################### | 23/35 [Time: 00:06, Check=Multivariate Drift]
    Full Suite: |################################ | 32/35 [Time: 00:06, Check=Outlier Sample Detection]

.. raw:: html

    Full Suite

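Conceptually, a suite is an ordered collection of checks, and each check may carry conditions that turn its computed value into a pass/fail outcome. The toy sketch below illustrates only that idea. ``ToyCheck``, ``run_suite``, the two check functions and the thresholds are hypothetical names and values made up for illustration; none of this is the deepchecks API or its implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Rows = List[Tuple]


@dataclass
class ToyCheck:
    """Hypothetical stand-in for a check: computes a value, then applies its conditions."""
    name: str
    compute: Callable[[Rows, Rows], float]
    # Each condition is (description, predicate on the computed value).
    conditions: List[Tuple[str, Callable[[float], bool]]] = field(default_factory=list)

    def run(self, train: Rows, test: Rows) -> Dict:
        value = self.compute(train, test)
        outcomes = {desc: predicate(value) for desc, predicate in self.conditions}
        return {"check": self.name, "value": value, "conditions": outcomes}


def run_suite(checks: List[ToyCheck], train: Rows, test: Rows) -> List[Dict]:
    """Run every check in order and collect the results."""
    return [check.run(train, test) for check in checks]


def size_ratio(train: Rows, test: Rows) -> float:
    """Test set size relative to train set size."""
    return len(test) / len(train)


def samples_mix(train: Rows, test: Rows) -> float:
    """Fraction of test rows that also appear in train."""
    train_set = set(train)
    return sum(row in train_set for row in test) / len(test)


# Tiny made-up data: each row is (feature_1, feature_2, label).
train = [(5.1, 3.5, 0), (4.9, 3.0, 0), (6.2, 2.9, 1), (5.9, 3.0, 2)]
test = [(5.1, 3.5, 0), (6.7, 3.1, 1)]

toy_suite = [
    ToyCheck("size ratio", size_ratio, [("ratio > 0.01", lambda v: v > 0.01)]),
    ToyCheck("samples mix", samples_mix, [("mix <= 10%", lambda v: v <= 0.10)]),
]

results = run_suite(toy_suite, train, test)
for r in results:
    print(r["check"], r["value"], r["conditions"])
```

In this toy data one of the two test rows also appears in train, so the second condition fails while the first passes. A real suite result carries much richer displays, but the pass/fail summary follows the same shape.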
.. GENERATED FROM PYTHON SOURCE LINES 88-92

Run the integrity suite
-----------------------

If you haven't started modeling yet and have only a single dataset, you
can use the ``data_integrity`` suite:

.. GENERATED FROM PYTHON SOURCE LINES 92-98

.. code-block:: default

    from deepchecks.tabular.suites import data_integrity

    integ_suite = data_integrity()
    integ_suite.run(ds_train)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Data Integrity Suite: | | 0/12 [Time: 00:00]
    Data Integrity Suite: |########## | 10/12 [Time: 00:00, Check=Feature Label Correlation]

.. raw:: html

    Data Integrity Suite

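Integrity-style checks need nothing more than a single table. As a rough illustration of the kind of question they ask, the sketch below computes two simple properties on a plain dict of columns: which columns hold a single value, and what fraction of rows are duplicates. The function names and the sample table are hypothetical, and the definitions are simplified assumptions rather than the deepchecks implementations.

```python
from typing import Dict, List


def single_value_columns(columns: Dict[str, List]) -> List[str]:
    """Names of columns whose values are all identical."""
    return [name for name, values in columns.items() if len(set(values)) == 1]


def duplicate_ratio(columns: Dict[str, List]) -> float:
    """Fraction of rows that repeat an earlier row (0.0 for an empty table)."""
    rows = list(zip(*columns.values()))
    if not rows:
        return 0.0
    return 1 - len(set(rows)) / len(rows)


# Made-up table: the first and third rows are identical, and "source" never varies.
table = {
    "sepal_length": [5.1, 4.9, 5.1, 6.2],
    "sepal_width": [3.5, 3.0, 3.5, 2.9],
    "source": ["lab", "lab", "lab", "lab"],
}

print(single_value_columns(table))
print(duplicate_ratio(table))
```

Neither function needs a model or a second dataset, which is why this family of checks is usable before any modeling has started.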
.. GENERATED FROM PYTHON SOURCE LINES 99-106

Run a Deepchecks Check
======================

If you want to run a specific check, you can just import it and run it directly.

Check out the :ref:`Check Gallery ` or the
:doc:`API Reference ` for more info about the
existing checks and their parameters.

.. GENERATED FROM PYTHON SOURCE LINES 106-109

.. code-block:: default

    from deepchecks.tabular.checks import LabelDrift

.. GENERATED FROM PYTHON SOURCE LINES 110-115

.. code-block:: default

    check = LabelDrift()
    result = check.run(ds_train, ds_test)
    result

.. raw:: html

    Label Drift

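A label drift check asks how differently the label is distributed in train and in test. One common statistic for comparing two categorical distributions is Cramér's V, which is 0 when the class proportions match and approaches 1 as they diverge. The pure-Python sketch below computes it on the 2 x K table of (dataset, class) counts. ``cramers_v`` is a hypothetical helper written for illustration; the computation inside deepchecks may differ, for example in how it treats small samples.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Sequence


def cramers_v(train_labels: Sequence[Hashable], test_labels: Sequence[Hashable]) -> float:
    """Cramér's V for the 2 x K contingency table of (dataset, label) counts."""
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)
    classes = sorted(set(train_counts) | set(test_counts))
    n_train, n_test = len(train_labels), len(test_labels)
    n = n_train + n_test
    # Degenerate tables carry no information about drift.
    if n_train == 0 or n_test == 0 or len(classes) < 2:
        return 0.0

    chi2 = 0.0
    for cls in classes:
        col_total = train_counts[cls] + test_counts[cls]
        for observed, row_total in ((train_counts[cls], n_train), (test_counts[cls], n_test)):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected

    # For a 2 x K table with K >= 2, min(rows - 1, cols - 1) equals 1.
    return sqrt(chi2 / n)


# Identical class proportions in both sets: no drift.
same = cramers_v([0] * 30 + [1] * 30 + [2] * 30, [0] * 10 + [1] * 10 + [2] * 10)

# Completely disjoint labels: maximal drift.
disjoint = cramers_v([0] * 10, [1] * 10)

print(same, disjoint)
```

A stratified train-test split keeps class proportions nearly equal in both sets, so a statistic of this kind is expected to stay close to zero for data split that way.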
.. GENERATED FROM PYTHON SOURCE LINES 116-117

The result value, whose structure depends on the check, can also be inspected:

.. GENERATED FROM PYTHON SOURCE LINES 117-120

.. code-block:: default

    result.value

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'Drift score': 0.0, 'Method': "Cramer's V"}

.. GENERATED FROM PYTHON SOURCE LINES 121-128

Edit an Existing Suite
======================

Inspect suite and remove condition
----------------------------------

We can see that the Feature Label Correlation check failed, both for test and for train.
Since this is a very simple dataset with few features, this behavior is not necessarily
problematic, so we will remove the existing conditions for the PPS.

.. GENERATED FROM PYTHON SOURCE LINES 128-133

.. code-block:: default

    # Let's first print the suite to find the conditions that we want to change:

    suite

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Full Suite: [
        0: TrainTestPerformance
            Conditions:
                0: Train-Test scores relative degradation is less than 0.1
        1: RocReport
            Conditions:
                0: AUC score for all the classes is greater than 0.7
        2: ConfusionMatrixReport
        3: PredictionDrift
            Conditions:
                0: Prediction drift score < 0.15
        4: SimpleModelComparison
            Conditions:
                0: Model performance gain over simple model is greater than 10%
        5: WeakSegmentsPerformance(n_to_show=5)
            Conditions:
                0: The relative performance of weakest segment is greater than 80% of average model performance.
        6: CalibrationScore
        7: RegressionErrorDistribution
            Conditions:
                0: Kurtosis value higher than -0.1
                1: Systematic error ratio lower than 0.01
        8: UnusedFeatures
            Conditions:
                0: Number of high variance unused features is less or equal to 5
        9: BoostingOverfit
            Conditions:
                0: Test score over iterations is less than 5% from the best score
        10: ModelInferenceTime
            Conditions:
                0: Average model inference time for one sample is less than 0.001
        11: DatasetsSizeComparison
            Conditions:
                0: Test-Train size ratio is greater than 0.01
        12: NewLabelTrainTest
            Conditions:
                0: Number of new label values is less or equal to 0
        13: NewCategoryTrainTest
            Conditions:
                0: Ratio of samples with a new category is less or equal to 0%
        14: StringMismatchComparison
            Conditions:
                0: No new variants allowed in test data
        15: DateTrainTestLeakageDuplicates
            Conditions:
                0: Date leakage ratio is less or equal to 0%
        16: DateTrainTestLeakageOverlap
            Conditions:
                0: Date leakage ratio is less or equal to 0%
        17: IndexTrainTestLeakage
            Conditions:
                0: Ratio of leaking indices is less or equal to 0%
        18: TrainTestSamplesMix(n_to_show=5)
            Conditions:
                0: Percentage of test data samples that appear in train data is less or equal to 10%
        19: FeatureLabelCorrelationChange(ppscore_params={}, random_state=42)
            Conditions:
                0: Train-Test features' Predictive Power Score difference is less than 0.2
                1: Train features' Predictive Power Score is less than 0.7
        20: FeatureDrift
            Conditions:
                0: categorical drift score < 0.2 and numerical drift score < 0.2
        21: LabelDrift
            Conditions:
                0: Label drift score < 0.15
        22: MultivariateDrift
            Conditions:
                0: Drift value is less than 0.25
        23: IsSingleValue
            Conditions:
                0: Does not contain only a single value
        24: SpecialCharacters
            Conditions:
                0: Ratio of samples containing solely special character is less or equal to 0.1%
        25: MixedNulls
            Conditions:
                0: Number of different null types is less or equal to 1
        26: MixedDataTypes
            Conditions:
                0: Rare data types in column are either more than 10% or less than 1% of the data
        27: StringMismatch
            Conditions:
                0: No string variants
        28: DataDuplicates
            Conditions:
                0: Duplicate data ratio is less or equal to 0%
        29: StringLengthOutOfBounds
            Conditions:
                0: Ratio of string length outliers is less or equal to 0%
        30: ConflictingLabels
            Conditions:
                0: Ambiguous sample ratio is less or equal to 0%
        31: OutlierSampleDetection
        32: FeatureLabelCorrelation(ppscore_params={}, random_state=42)
            Conditions:
                0: Features' Predictive Power Score is less than 0.8
        33: FeatureFeatureCorrelation
            Conditions:
                0: Not more than 0 pairs are correlated above 0.9
        34: IdentifierLabelCorrelation(ppscore_params={})
            Conditions:
                0: Identifier columns PPS is less or equal to 0
    ]

.. GENERATED FROM PYTHON SOURCE LINES 134-139

.. code-block:: default

    # Now we can use the check's index and the condition's number to remove it.
    # Note: in the printout above, index 5 is WeakSegmentsPerformance and
    # FeatureLabelCorrelation is at index 32; use the index of the check you want to edit.
    print(suite[5])
    suite[5].remove_condition(0)

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    WeakSegmentsPerformance(n_to_show=5)
        Conditions:
            0: The relative performance of weakest segment is greater than 80% of average model performance.

.. GENERATED FROM PYTHON SOURCE LINES 140-144

.. code-block:: default

    # Print the check again to see that the condition was removed
    suite[5]

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    WeakSegmentsPerformance(n_to_show=5)

.. GENERATED FROM PYTHON SOURCE LINES 145-154

If we now re-run the suite, all of the existing conditions will pass.

*Note: the check we manipulated will still run as part of the suite, but
it won't appear in the Conditions Summary since it no longer has any
conditions defined on it. You can still see its display results in the
Additional Outputs section.*

**For more info about working with conditions, see the detailed
configuring conditions guide.**

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 9.827 seconds)

.. _sphx_glr_download_tabular_auto_tutorials_quickstarts_plot_quickstart_in_5_minutes.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_quickstart_in_5_minutes.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_quickstart_in_5_minutes.ipynb `

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_