Train Test Validation Suite

.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "tabular/auto_tutorials/quickstarts/plot_quick_train_test_validation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_tabular_auto_tutorials_quickstarts_plot_quick_train_test_validation.py:

.. _quick_train_test_validation:

Train-Test Validation Suite Quickstart
****************************************

The deepchecks train-test validation suite is relevant any time you wish to
validate two data subsets. For example:

- Comparing distributions across different train-test splits (e.g. before
  training a model or when splitting data for cross-validation)
- Comparing a new data batch to previous data batches

Here we'll use a loans dataset
(:mod:`deepchecks.tabular.datasets.classification.lending_club`) to demonstrate
how you can run the suite with only a few lines of code, and to see what kinds
of insights it can find.

.. code-block:: bash

    # Before we start, if you don't have deepchecks installed yet,
    # run the following in a notebook cell:
    import sys
    !{sys.executable} -m pip install deepchecks -U --quiet

    # or install using pip from your python environment

.. GENERATED FROM PYTHON SOURCE LINES 30-35

Load Data and Prepare Data
====================================================

Load Data
-----------

.. GENERATED FROM PYTHON SOURCE LINES 35-45

.. code-block:: default


    from deepchecks.tabular.datasets.classification import lending_club
    import pandas as pd

    data = lending_club.load_data(data_format='Dataframe', as_train_test=False)
    data.head(2)

.. csv-table::
   :header-rows: 1

   ,issue_d,sub_grade,term,home_ownership,fico_range_low,total_acc,pub_rec,revol_util,annual_inc,int_rate,dti,purpose,mort_acc,loan_amnt,application_type,installment,verification_status,pub_rec_bankruptcies,addr_state,initial_list_status,fico_range_high,revol_bal,id,open_acc,emp_length,loan_status,time_to_earliest_cr_line
   0,2017-06-01,D1,36 months,MORTGAGE,665.0,29.0,0.0,85.0,112600.0,17.09,22.71,debt_consolidation,4.0,24000.0,Individual,856.75,Verified,0.0,CO,w,669.0,25779.0,110680237,13.0,2.0,0,794188.8
   1,2017-06-01,C2,36 months,RENT,670.0,14.0,0.0,34.8,35000.0,13.59,9.95,debt_consolidation,1.0,4200.0,Individual,142.72,Source Verified,NaN,FL,f,674.0,3798.0,109936186,7.0,2.0,1,470793.6
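
Because the split in the next step filters on the ``issue_d`` column, it can
help to check which years and months a date column spans before filtering on
it. The following is a minimal sketch on a small hypothetical frame (the
``toy`` data and its values are illustrative assumptions, not taken from the
Lending Club dataset). It shows that a month-only filter keeps rows from every
year that contains that month.

.. code-block:: python

    import pandas as pd

    # Hypothetical toy data: the same calendar months appear in two years.
    toy = pd.DataFrame({
        "issue_d": pd.to_datetime(
            ["2017-06-01", "2017-08-01", "2018-06-01", "2018-08-01"]
        ),
        "loan_amnt": [1000.0, 2000.0, 3000.0, 4000.0],
    })

    # Count rows per (year, month) to see the full span of the date column.
    year = toy["issue_d"].dt.year.rename("year")
    month = toy["issue_d"].dt.month.rename("month")
    span = toy.groupby([year, month]).size()

    # A filter on the month alone does not restrict the year.
    june_only = toy[toy["issue_d"].dt.month.isin([6])]
    years_in_june = sorted(june_only["issue_d"].dt.year.unique().tolist())

Running the same ``groupby`` on the real ``data`` frame (after converting
``issue_d`` with ``pd.to_datetime``) gives the equivalent overview for the
tutorial dataset.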

.. GENERATED FROM PYTHON SOURCE LINES 46-48

Split Data to Train and Test
-----------------------------

.. GENERATED FROM PYTHON SOURCE LINES 48-57

.. code-block:: default


    # Convert the date column (`issue_d`) to datetime
    data['issue_d'] = pd.to_datetime(data['issue_d'])

    # Use data from June and July for train and August for test:
    train_df = data[data['issue_d'].dt.month.isin([6, 7])]
    test_df = data[data['issue_d'].dt.month.isin([8])]

.. GENERATED FROM PYTHON SOURCE LINES 58-67

Run Deepchecks for Train Test Validation
===========================================

Define a Dataset Object
-------------------------

Create a deepchecks Dataset, including the relevant metadata (label, date,
index, etc.). Check out :class:`deepchecks.tabular.Dataset` to see all of the
columns and types that can be declared.

.. GENERATED FROM PYTHON SOURCE LINES 70-72

Define Lending Club Metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 72-80

.. code-block:: default


    categorical_features = ['addr_state', 'application_type', 'home_ownership',
                            'initial_list_status', 'purpose', 'term',
                            'verification_status', 'sub_grade']
    index_name = 'id'
    label = 'loan_status'  # 0 is DEFAULT, 1 is OK
    datetime_name = 'issue_d'

.. GENERATED FROM PYTHON SOURCE LINES 81-83

Create Dataset
^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 83-96

.. code-block:: default


    from deepchecks.tabular import Dataset

    # Categorical features can be heuristically inferred, however we
    # recommend stating them explicitly to avoid misclassification.

    # Metadata attributes are optional. Some checks will run only if
    # specific attributes are declared.

    train_ds = Dataset(train_df, label=label, cat_features=categorical_features,
                       index_name=index_name, datetime_name=datetime_name)
    test_ds = Dataset(test_df, label=label, cat_features=categorical_features,
                      index_name=index_name, datetime_name=datetime_name)

.. GENERATED FROM PYTHON SOURCE LINES 97-102

.. code-block:: default


    # For convenience, let's save the metadata in a dictionary so we can
    # reuse it for future Dataset initializations
    columns_metadata = {'cat_features': categorical_features,
                        'index_name': index_name,
                        'label': label,
                        'datetime_name': datetime_name}

.. GENERATED FROM PYTHON SOURCE LINES 103-112

Run the Deepchecks Suite
--------------------------

Validate your data with the
:class:`deepchecks.tabular.suites.train_test_validation` suite. It runs on two
datasets, so you can use it to compare any two batches of data (e.g. train
data, test data, a new batch of data that recently arrived).

Check out the "when you should use" guide for more information about the
existing suites and when to use them.

.. GENERATED FROM PYTHON SOURCE LINES 112-121

.. code-block:: default


    from deepchecks.tabular.suites import train_test_validation

    validation_suite = train_test_validation()
    suite_result = validation_suite.run(train_ds, test_ds)
    # Note: the result can be saved as html using suite_result.save_as_html()
    # or exported to json using suite_result.to_json()
    suite_result

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Train Test Validation Suite: |            | 0/12 [Time: 00:00]
    Train Test Validation Suite: |████        | 4/12 [Time: 00:00, Check=String Mismatch Comparison]
    Train Test Validation Suite: |███████     | 7/12 [Time: 00:00, Check=Index Train Test Leakage]
    Train Test Validation Suite: |██████████  | 10/12 [Time: 00:05, Check=Feature Drift]
    Train Test Validation Suite: |████████████| 12/12 [Time: 00:07, Check=Multivariate Drift]

.. GENERATED FROM PYTHON SOURCE LINES 122-130

As you can see in the suite's results: the Date Train-Test Leakage check
failed, indicating that we may have a problem in the way we've split our data!
We've mixed up data from two years, causing a leakage of future data
in the training dataset. Let's fix this.

Fix Data
^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 130-135

.. code-block:: default


    dt_col = data[datetime_name]
    train_df = data[dt_col.dt.year.isin([2017]) & dt_col.dt.month.isin([6, 7, 8])]
    test_df = data[dt_col.dt.year.isin([2018]) & dt_col.dt.month.isin([6, 7, 8])]

.. GENERATED FROM PYTHON SOURCE LINES 136-143

.. code-block:: default


    from deepchecks.tabular import Dataset

    # Create the new Datasets
    train_ds = Dataset(train_df, **columns_metadata)
    test_ds = Dataset(test_df, **columns_metadata)

.. GENERATED FROM PYTHON SOURCE LINES 144-147

Re-run Validation Suite
^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 148-152

.. code-block:: default


    suite_result = validation_suite.run(train_ds, test_ds)
    suite_result.show()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Train Test Validation Suite: |            | 0/12 [Time: 00:00]
    Train Test Validation Suite: |████        | 4/12 [Time: 00:00, Check=String Mismatch Comparison]
    Train Test Validation Suite: |███████     | 7/12 [Time: 00:00, Check=Index Train Test Leakage]
    Train Test Validation Suite: |██████████  | 10/12 [Time: 00:06, Check=Feature Drift]
    Train Test Validation Suite: |████████████| 12/12 [Time: 00:07, Check=Multivariate Drift]


.. GENERATED FROM PYTHON SOURCE LINES 153-172

OK, the date leakage no longer occurs!

However, in the current split after the fix, we can see that we have a
multivariate drift, detected by the :ref:`tabular__multivariate_drift` check.
The drift is caused mainly by a combination of features representing the
loan's interest rate (``int_rate``) and its grade (``sub_grade``). In order to
proceed, we should think about the two options we have: split the data in a
different manner, or stay with the current split.

For working with different data splits: we can consider examining other
sampling techniques (e.g. using only data from the same year), ideally
achieving one in which the training data's univariate and multivariate
distribution is similar to the data on which the model will run (test /
production data). Of course, we can use deepchecks to validate the new splits.

If the current split is representative and we are planning on training a model
with it, it is worth understanding this drift (do we expect this kind of drift
in the model's production environment? Can we do something about it?).

For more details about drift, see the :ref:`drift_user_guide`.

.. GENERATED FROM PYTHON SOURCE LINES 177-181

Run a Single Check
-------------------

We can run a single check on a dataset, and see the results.

.. GENERATED FROM PYTHON SOURCE LINES 181-190

.. code-block:: default


    # If we want to run only that check (possible with or without condition)
    from deepchecks.tabular.checks import MultivariateDrift

    check_with_condition = MultivariateDrift().add_condition_overall_drift_value_less_than(0.4)
    # or just the check without the condition:
    # check = MultivariateDrift()
    dataset_drift_result = check_with_condition.run(train_ds, test_ds)

.. GENERATED FROM PYTHON SOURCE LINES 191-192

We can also inspect and use the result's value:

.. GENERATED FROM PYTHON SOURCE LINES 192-195

.. code-block:: default


    dataset_drift_result.value

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'domain_classifier_auc': 0.7328426082404159,
     'domain_classifier_drift_score': 0.46568521648083183,
     'domain_classifier_feature_importance': {'int_rate': 0.5092728878422134,
                                              'sub_grade': 0.38754783632616996,
                                              'dti': 0.031056814836620684,
                                              'revol_bal': 0.030762437444804377,
                                              'initial_list_status': 0.02458051221666191,
                                              'emp_length': 0.012952605239917654,
                                              'pub_rec': 0.0038269060936120774,
                                              'fico_range_low': -0.0,
                                              'mort_acc': 0.0,
                                              'annual_inc': 0.0,
                                              'total_acc': 0.0,
                                              'revol_util': 0.0,
                                              'fico_range_high': 0.0,
                                              'pub_rec_bankruptcies': 0.0,
                                              'loan_amnt': 0.0,
                                              'installment': 0.0,
                                              'time_to_earliest_cr_line': 0.0,
                                              'open_acc': 0.0,
                                              'application_type': 0.0,
                                              'addr_state': 0.0,
                                              'home_ownership': 0.0,
                                              'purpose': 0.0,
                                              'term': 0.0,
                                              'verification_status': 0.0}}

.. GENERATED FROM PYTHON SOURCE LINES 196-197

And see whether the conditions have passed:

.. GENERATED FROM PYTHON SOURCE LINES 197-199

.. code-block:: default


    dataset_drift_result.passed_conditions()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    False

.. GENERATED FROM PYTHON SOURCE LINES 200-204

Create a Custom Suite
----------------------

To create our own suite, we can simply list the checks we want and optionally
add conditions to them.

.. GENERATED FROM PYTHON SOURCE LINES 204-217

.. code-block:: default


    from deepchecks.tabular import Suite
    from deepchecks.tabular.checks import (
        FeatureDrift, MultivariateDrift, PredictionDrift, LabelDrift
    )

    drift_suite = Suite(
        'drift suite',
        FeatureDrift().add_condition_drift_score_less_than(
            max_allowed_categorical_score=0.2,
            max_allowed_numeric_score=0.1),
        MultivariateDrift().add_condition_overall_drift_value_less_than(0.4),
        LabelDrift(),
        PredictionDrift()
    )

.. GENERATED FROM PYTHON SOURCE LINES 218-219

We can run our new suite using:

.. GENERATED FROM PYTHON SOURCE LINES 220-224

.. code-block:: default


    result = drift_suite.run(train_ds, test_ds)
    result.show()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    drift suite: |     | 0/4 [Time: 00:00]
    drift suite: |█▎   | 1/4 [Time: 00:02, Check=Feature Drift]
    drift suite: |██▌  | 2/4 [Time: 00:03, Check=Multivariate Drift]
    drift suite: |███▊ | 3/4 [Time: 00:03, Check=Label Drift]

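
Going back to the single-check result in *Run a Single Check*: the two
headline numbers printed there appear to be linked. The reported
``domain_classifier_drift_score`` matches ``2 * domain_classifier_auc - 1``,
so an AUC of 0.5 (a classifier that cannot tell train from test) would
correspond to a score of 0. This relation is inferred from the printed values
and is an assumption, not a statement of how the library computes the score.
The sketch below redoes the arithmetic with plain Python and shows why the
``0.4`` condition did not pass.

.. code-block:: python

    import math

    # Values copied from the ``dataset_drift_result.value`` output above.
    auc = 0.7328426082404159
    reported_drift_score = 0.46568521648083183

    # Assumed relation: AUC 0.5 maps to a score of 0, AUC 1.0 to a score of 1.
    drift_score = max(2 * auc - 1, 0)

    # Threshold used in add_condition_overall_drift_value_less_than(0.4).
    threshold = 0.4
    condition_passes = drift_score < threshold

    # The recomputed score agrees with the reported one up to float rounding.
    scores_agree = math.isclose(drift_score, reported_drift_score, rel_tol=1e-12)

Under this reading, the condition would start passing once the domain
classifier's AUC drops below 0.7, since ``2 * 0.7 - 1 = 0.4``.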
