.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "user-guide/tabular/auto_quickstarts/plot_quick_train_test_validation.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        Click :ref:`here ` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_user-guide_tabular_auto_quickstarts_plot_quick_train_test_validation.py:


.. _quick_train_test_validation:

Quickstart - Train-Test Validation Suite
****************************************

The deepchecks train-test validation suite is relevant any time you wish to
validate two data subsets. For example:

- Comparing distributions across different train-test splits (e.g. before
  training a model or when splitting data for cross-validation)
- Comparing a new data batch to previous data batches

Here we'll use a loans dataset
(:mod:`deepchecks.tabular.datasets.classification.lending_club`)
to demonstrate how you can run the suite with only a few simple lines of code,
and see what kinds of insights it can find.

.. code-block:: bash

    # Before we start, if you don't have deepchecks installed yet, run:
    import sys
    !{sys.executable} -m pip install deepchecks -U --quiet

    # or install using pip from your python environment

.. GENERATED FROM PYTHON SOURCE LINES 30-35

Load Data and Prepare Data
====================================================

Load Data
-----------

.. GENERATED FROM PYTHON SOURCE LINES 35-45

.. code-block:: default

    from deepchecks.tabular.datasets.classification import lending_club
    import pandas as pd

    data = lending_club.load_data(data_format='Dataframe', as_train_test=False)
    data.head(2)

.. raw:: html
    <!-- output omitted: first two rows of the loans dataframe (27 columns:
         issue_d, sub_grade, term, home_ownership, fico_range_low, total_acc,
         pub_rec, revol_util, annual_inc, int_rate, dti, purpose, mort_acc,
         loan_amnt, application_type, installment, verification_status,
         pub_rec_bankruptcies, addr_state, initial_list_status,
         fico_range_high, revol_bal, id, open_acc, emp_length, loan_status,
         time_to_earliest_cr_line) -->
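
Before splitting the data by date (as done next), it can help to confirm what period the data actually spans, since that determines whether a month-based filter is safe. A minimal pandas sketch; the dates below are synthetic stand-ins for illustration, not rows from the loans data:

```python
import pandas as pd

# Synthetic stand-in for the loans dataframe; the real `data` is loaded above.
# These dates are made up for illustration.
df = pd.DataFrame({'issue_d': ['2017-06-01', '2017-07-15', '2018-08-01']})
df['issue_d'] = pd.to_datetime(df['issue_d'])

# If the span covers more than one year, filtering on month alone will mix years.
print(df['issue_d'].min().year, df['issue_d'].max().year)  # 2017 2018
```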


.. GENERATED FROM PYTHON SOURCE LINES 46-48

Split Data to Train and Test
-----------------------------

.. GENERATED FROM PYTHON SOURCE LINES 48-57

.. code-block:: default

    # convert the date column to datetime; `issue_d` is the date column
    data['issue_d'] = pd.to_datetime(data['issue_d'])

    # Use data from June and July for train and August for test:
    train_df = data[data['issue_d'].dt.month.isin([6, 7])]
    test_df = data[data['issue_d'].dt.month.isin([8])]

.. GENERATED FROM PYTHON SOURCE LINES 58-67

Run Deepchecks for Train Test Validation
===========================================

Define a Dataset Object
-------------------------

Create a deepchecks Dataset, including the relevant metadata (label, date,
index, etc.). Check out :class:`deepchecks.tabular.Dataset` to see all of the
columns and types that can be declared.

.. GENERATED FROM PYTHON SOURCE LINES 70-72

Define Lending Club Metadata
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 72-80

.. code-block:: default

    categorical_features = ['addr_state', 'application_type', 'home_ownership',
                            'initial_list_status', 'purpose', 'term',
                            'verification_status', 'sub_grade']
    index_name = 'id'
    label = 'loan_status'  # 0 is DEFAULT, 1 is OK
    datetime_name = 'issue_d'

.. GENERATED FROM PYTHON SOURCE LINES 81-83

Create Dataset
^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 83-96

.. code-block:: default

    from deepchecks.tabular import Dataset

    # Categorical features can be heuristically inferred, however we
    # recommend stating them explicitly to avoid misclassification.

    # Metadata attributes are optional. Some checks will run only if specific attributes are declared.

    train_ds = Dataset(train_df, label=label, cat_features=categorical_features,
                       index_name=index_name, datetime_name=datetime_name)
    test_ds = Dataset(test_df, label=label, cat_features=categorical_features,
                      index_name=index_name, datetime_name=datetime_name)

.. GENERATED FROM PYTHON SOURCE LINES 97-102

.. code-block:: default

    # For convenience, let's save the metadata in a dictionary so we can
    # reuse it for future Dataset initializations
    columns_metadata = {'cat_features': categorical_features, 'index_name': index_name,
                        'label': label, 'datetime_name': datetime_name}

.. GENERATED FROM PYTHON SOURCE LINES 103-112

Run the Deepchecks Suite
--------------------------

Validate your data with the :class:`deepchecks.tabular.suites.train_test_validation` suite.
It runs on two datasets, so you can use it to compare any two batches of data
(e.g. train data, test data, or a new batch of data that recently arrived).

Check out the :doc:`"when should you use deepchecks guide" ` for some more
info about the existing suites and when to use them.

.. GENERATED FROM PYTHON SOURCE LINES 112-121

.. code-block:: default

    from deepchecks.tabular.suites import train_test_validation

    validation_suite = train_test_validation()
    suite_result = validation_suite.run(train_ds, test_ds)
    # Note: the result can be saved as html using suite_result.save_as_html()
    # or exported to json using suite_result.to_json()
    suite_result

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Train Test Validation Suite:   |          | 0/12 [Time: 00:00]
    Train Test Validation Suite:   |#######   | 7/12 [Time: 00:00, Check=Index Train Test Leakage]

.. raw:: html
    <!-- interactive output omitted: "Train Test Validation Suite" result widget -->
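
One way to see what the suite's leakage check is reacting to: the split above filters on month alone, which ignores the year. A small sketch with made-up dates spanning two years (illustrative only, not the loans data):

```python
import pandas as pd

# Made-up dates spanning two years
df = pd.DataFrame({'issue_d': pd.to_datetime(
    ['2017-06-01', '2017-08-01', '2018-06-01', '2018-08-01'])})

train = df[df['issue_d'].dt.month.isin([6, 7])]  # June/July, any year
test = df[df['issue_d'].dt.month.isin([8])]      # August, any year

# Train picks up rows from both years, so some training rows postdate test rows:
print(sorted(train['issue_d'].dt.year.unique().tolist()))  # [2017, 2018]
print(train['issue_d'].max() > test['issue_d'].min())      # True -> future data in train
```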


.. GENERATED FROM PYTHON SOURCE LINES 122-130

As you can see in the suite's results: the Date Train-Test Leakage check
failed, indicating that we may have a problem in the way we've split our data!
We've mixed up data from two years, causing a leakage of future data into the
training dataset. Let's fix this.

Fix Data
^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 130-135

.. code-block:: default

    dt_col = data[datetime_name]
    train_df = data[dt_col.dt.year.isin([2017]) & dt_col.dt.month.isin([6, 7, 8])]
    test_df = data[dt_col.dt.year.isin([2018]) & dt_col.dt.month.isin([6, 7, 8])]

.. GENERATED FROM PYTHON SOURCE LINES 136-143

.. code-block:: default

    from deepchecks.tabular import Dataset

    # Create the new Datasets
    train_ds = Dataset(train_df, **columns_metadata)
    test_ds = Dataset(test_df, **columns_metadata)

.. GENERATED FROM PYTHON SOURCE LINES 144-147

Re-run Validation Suite
^^^^^^^^^^^^^^^^^^^^^^^^^

.. GENERATED FROM PYTHON SOURCE LINES 148-152

.. code-block:: default

    suite_result = validation_suite.run(train_ds, test_ds)
    suite_result.show()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Train Test Validation Suite:   |          | 0/12 [Time: 00:00]
    Train Test Validation Suite:   |#######   | 7/12 [Time: 00:00, Check=Index Train Test Leakage]

.. raw:: html
    <!-- interactive output omitted: "Train Test Validation Suite" result widget -->
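
A quick way to sanity-check a fix like this outside deepchecks is to assert that the two periods are disjoint. A sketch with synthetic dates standing in for the two splits (the suite's leakage check above remains the more thorough validation):

```python
import pandas as pd

# Synthetic stand-ins: summer 2017 for train, summer 2018 for test
train_dates = pd.to_datetime(pd.Series(['2017-06-10', '2017-07-01', '2017-08-20']))
test_dates = pd.to_datetime(pd.Series(['2018-06-05', '2018-08-25']))

# Every training row should predate every test row:
no_date_leakage = train_dates.max() < test_dates.min()
print(no_date_leakage)  # True
```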


.. GENERATED FROM PYTHON SOURCE LINES 153-172

Ok, the date leakage doesn't happen anymore!

However, in the current split after the fix, we can see that we have
multivariate drift, detected by the :doc:`` check. The drift is caused mainly
by a combination of the features representing the loan's interest rate
(``int_rate``) and its grade (``sub_grade``).

In order to proceed, we should consider the two options we have: split the
data in a different manner, or stay with the current split.

For working with different data splits: we can examine other sampling
techniques (e.g. using only data from the same year), ideally achieving a
split in which the training data's univariate and multivariate distributions
are similar to the data on which the model will run (test / production data).
Of course, we can use deepchecks to validate the new splits.

If the current split is representative and we are planning on training a model
with it, it is worth understanding this drift (do we expect this kind of drift
in the model's production environment? can we do something about it?).

For more details about drift, see the :doc:``.

.. GENERATED FROM PYTHON SOURCE LINES 177-181

Run a Single Check
-------------------

We can run a single check on a dataset, and see the results.

.. GENERATED FROM PYTHON SOURCE LINES 181-190

.. code-block:: default

    # If we want to run only that check (possible with or without a condition)
    from deepchecks.tabular.checks import WholeDatasetDrift

    check_with_condition = WholeDatasetDrift().add_condition_overall_drift_value_less_than(0.4)

    # or just the check without the condition:
    # check = WholeDatasetDrift()

    dataset_drift_result = check_with_condition.run(train_ds, test_ds)

.. GENERATED FROM PYTHON SOURCE LINES 191-192

We can also inspect and use the result's value:

.. GENERATED FROM PYTHON SOURCE LINES 192-195

.. code-block:: default

    dataset_drift_result.value

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    {'domain_classifier_auc': 0.7328426082404159,
     'domain_classifier_drift_score': 0.46568521648083183,
     'domain_classifier_feature_importance': {'int_rate': 0.5092728878422134,
                                              'sub_grade': 0.38754783632616996,
                                              'dti': 0.031056814836620684,
                                              'revol_bal': 0.030762437444804377,
                                              'initial_list_status': 0.02458051221666191,
                                              'emp_length': 0.012952605239917654,
                                              'pub_rec': 0.0038269060936120774,
                                              'time_to_earliest_cr_line': 0.0,
                                              'verification_status': 0.0,
                                              'term': 0.0,
                                              'purpose': 0.0,
                                              'home_ownership': 0.0,
                                              'application_type': 0.0,
                                              'addr_state': 0.0,
                                              'annual_inc': 0.0,
                                              'revol_util': 0.0,
                                              'open_acc': 0.0,
                                              'total_acc': 0.0,
                                              'fico_range_high': 0.0,
                                              'pub_rec_bankruptcies': 0.0,
                                              'installment': 0.0,
                                              'loan_amnt': 0.0,
                                              'mort_acc': 0.0,
                                              'fico_range_low': -0.0}}

.. GENERATED FROM PYTHON SOURCE LINES 196-197

and see if the conditions have passed:

.. GENERATED FROM PYTHON SOURCE LINES 197-199

.. code-block:: default

    dataset_drift_result.passed_conditions()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    False

.. GENERATED FROM PYTHON SOURCE LINES 200-204

Create a Custom Suite
----------------------

To create our own suite, we can simply write all of the checks, and add
optional conditions.

.. GENERATED FROM PYTHON SOURCE LINES 204-217

.. code-block:: default

    from deepchecks.tabular import Suite
    from deepchecks.tabular.checks import TrainTestFeatureDrift, WholeDatasetDrift, \
        TrainTestPredictionDrift, TrainTestLabelDrift

    drift_suite = Suite('drift suite',
                        TrainTestFeatureDrift().add_condition_drift_score_less_than(
                            max_allowed_categorical_score=0.2,
                            max_allowed_numeric_score=0.1),
                        WholeDatasetDrift().add_condition_overall_drift_value_less_than(0.4),
                        TrainTestLabelDrift(),
                        TrainTestPredictionDrift()
                        )

.. GENERATED FROM PYTHON SOURCE LINES 218-219

We can run our new suite using:

.. GENERATED FROM PYTHON SOURCE LINES 220-223

.. code-block:: default

    result = drift_suite.run(train_ds, test_ds)
    result.show()

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    drift suite:   |     | 0/4 [Time: 00:00]
    drift suite:   |#2   | 1/4 [Time: 00:01, Check=Train Test Feature Drift]
    drift suite:   |##5  | 2/4 [Time: 00:02, Check=Whole Dataset Drift]

.. raw:: html
    <!-- interactive output omitted: "drift suite" result widget -->
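
The failed ``WholeDatasetDrift`` condition earlier is easy to reason about by hand: the numbers in the output are consistent with the condition comparing the result's ``domain_classifier_drift_score`` against the threshold passed to ``add_condition_overall_drift_value_less_than(0.4)``. Reproducing that comparison with values copied from the check output shown above:

```python
# Values copied from the earlier dataset_drift_result.value output
drift_score = 0.46568521648083183
threshold = 0.4  # from add_condition_overall_drift_value_less_than(0.4)

condition_passes = drift_score < threshold
print(condition_passes)  # False, matching passed_conditions() above
```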


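Since check ``value`` results like the ``WholeDatasetDrift`` output above are plain dictionaries, the features driving the drift can be extracted with ordinary Python. A sketch using an abbreviated copy of the feature-importance mapping from that output:

```python
# Abbreviated from the dataset_drift_result.value output shown earlier
feature_importance = {
    'int_rate': 0.5092728878422134,
    'sub_grade': 0.38754783632616996,
    'dti': 0.031056814836620684,
    'revol_bal': 0.030762437444804377,
    'addr_state': 0.0,
}

# Rank features by their contribution to the domain classifier
ranked = sorted(feature_importance.items(), key=lambda kv: kv[1], reverse=True)
top_drifters = [name for name, score in ranked[:2]]
print(top_drifters)  # ['int_rate', 'sub_grade']
```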
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 17.131 seconds)


.. _sphx_glr_download_user-guide_tabular_auto_quickstarts_plot_quick_train_test_validation.py:

.. only:: html

    .. container:: sphx-glr-footer sphx-glr-footer-example

        .. container:: sphx-glr-download sphx-glr-download-python

            :download:`Download Python source code: plot_quick_train_test_validation.py `

        .. container:: sphx-glr-download sphx-glr-download-jupyter

            :download:`Download Jupyter notebook: plot_quick_train_test_validation.ipynb `

.. only:: html

    .. rst-class:: sphx-glr-signature

        `Gallery generated by Sphinx-Gallery `_