NLP Property Drift#

This notebook provides an overview of how to use and understand the NLP property drift check.

Calculating Drift for Text Data#

What is Drift?#

Drift is simply a change in the distribution of data over time, and it is one of the top reasons why a machine learning model’s performance degrades over time.

For more information on drift, please visit our Drift User Guide.

How Deepchecks Detects Drift in NLP Data#

This check detects drift in NLP data by calculating univariate drift measures for each of the text properties (such as text length, language, etc.) that are present in the train and test datasets.
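
To make the idea of a univariate drift measure concrete, here is a minimal sketch of the two-sample Kolmogorov-Smirnov statistic, which is reported as the method for the numeric properties in the output further down this page. The function name `ks_statistic` and the sample text lengths are illustrative assumptions, and this is not the Deepchecks implementation.

```python
from bisect import bisect_right


def ks_statistic(train_values, test_values):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    train_sorted = sorted(train_values)
    test_sorted = sorted(test_values)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(train_sorted) | set(test_sorted):
        train_cdf = bisect_right(train_sorted, x) / len(train_sorted)
        test_cdf = bisect_right(test_sorted, x) / len(test_sorted)
        max_gap = max(max_gap, abs(train_cdf - test_cdf))
    return max_gap


# Hypothetical text lengths (in characters) for a handful of samples.
train_lengths = [40, 55, 62, 70, 81, 90]
test_lengths = [42, 58, 95, 110, 130, 140]

same_score = ks_statistic(train_lengths, train_lengths)
shifted_score = ks_statistic(train_lengths, test_lengths)
print(same_score, shifted_score)
```

A score of 0 means the two empirical distributions coincide, and the score grows toward 1 as they separate.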

This check is easy to run (once the properties have been calculated for each dataset) and is useful for detecting easily explainable changes in the data. For example, if you have started to use new data sources that contain samples in a new language, this check will detect it and show you a high drift score for the language property.
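
The language scenario can be sketched with Cramér's V, the method reported for the categorical Language property in the output further down this page. The code below is the plain textbook formula applied to a dataset-by-category contingency table. Deepchecks may apply corrections that this sketch omits, and the function name and sample data are made up for illustration.

```python
from collections import Counter
from math import sqrt


def cramers_v(train_categories, test_categories):
    """Uncorrected Cramér's V between dataset membership (train/test) and category."""
    train_counts = Counter(train_categories)
    test_counts = Counter(test_categories)
    categories = sorted(set(train_counts) | set(test_counts))
    n_train, n_test = len(train_categories), len(test_categories)
    n_total = n_train + n_test
    if len(categories) < 2:
        return 0.0  # a single category cannot show any association

    chi2 = 0.0
    for category in categories:
        column_total = train_counts[category] + test_counts[category]
        for observed, row_total in (
            (train_counts[category], n_train),
            (test_counts[category], n_test),
        ):
            expected = row_total * column_total / n_total
            chi2 += (observed - expected) ** 2 / expected
    # The table has 2 rows, so min(rows - 1, columns - 1) is 1.
    return sqrt(chi2 / n_total)


# Hypothetical language labels: the second test set adds a new data source in German.
train_languages = ["en"] * 95 + ["es"] * 5
test_same = ["en"] * 95 + ["es"] * 5
test_new_source = ["en"] * 60 + ["es"] * 5 + ["de"] * 35

v_same = cramers_v(train_languages, test_same)
v_new = cramers_v(train_languages, test_new_source)
print(v_same, v_new)
```

The first score is zero because the language mix is unchanged, while the second is clearly higher because a language absent from the train data makes up a large share of the test data.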

Which NLP Properties Are Used?#

By default, the check uses the properties that were calculated for the train and test datasets, which by default are the built-in text properties. It’s also possible to replace the default properties with custom ones. For the list of built-in text properties and an explanation of custom properties, refer to NLP properties.

Note

If a property was not calculated for a sample (for example, if it applies only to English samples and the sample is in another language), it will contain a nan value and will be ignored when calculating the drift.
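
As a rough illustration of this note, the sketch below drops nan entries from a property's values before the two datasets are compared. The helper name `drop_missing` and the sample values are hypothetical, and the exact handling inside Deepchecks may differ.

```python
import math


def drop_missing(values):
    """Keep only entries for which the property was actually calculated."""
    return [v for v in values if not (isinstance(v, float) and math.isnan(v))]


# Hypothetical property values; nan marks samples the property does not apply to.
train_values = [0.91, float("nan"), 0.85, 0.78]
test_values = [float("nan"), 0.88, float("nan"), 0.80]

train_clean = drop_missing(train_values)
test_clean = drop_missing(test_values)
print(train_clean)  # [0.91, 0.85, 0.78]
print(test_clean)   # [0.88, 0.8]
```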

Prepare data#

from deepchecks.nlp.datasets.classification.tweet_emotion import load_data

train_dataset, test_dataset = load_data()

# # Calculate properties, commented out because it takes a short while to run
# train_dataset.calculate_builtin_properties(include_long_calculation_properties=True)
# test_dataset.calculate_builtin_properties(include_long_calculation_properties=True)

Run the check#

from deepchecks.nlp.checks import PropertyDrift
check_result = PropertyDrift().run(train_dataset, test_dataset)
check_result

The data shows no significant drift. The one visible change is a slight increase in the formality of the text samples in the test dataset.

To display the results in an IDE like PyCharm, you can use the following code:

#  check_result.show_in_window()

The result will be displayed in a new window.

Observe the check’s output#

The result value is a dict that contains the drift score and the method used for each text property.

check_result.value
{'Average Word Length': {'Drift score': 0.05351275242622111, 'Method': 'Kolmogorov-Smirnov', 'Importance': None},
 'Formality': {'Drift score': 0.08043676705442104, 'Method': 'Kolmogorov-Smirnov', 'Importance': None},
 '% Special Characters': {'Drift score': 0.02332838796858905, 'Method': 'Kolmogorov-Smirnov', 'Importance': None},
 'Sentiment': {'Drift score': 0.04037496574468685, 'Method': 'Kolmogorov-Smirnov', 'Importance': None},
 'Language': {'Drift score': 0.009166684961611582, 'Method': "Cramer's V", 'Importance': None},
 'Max Word Length': {'Drift score': 0.04743959252714447, 'Method': 'Kolmogorov-Smirnov', 'Importance': None},
 'Fluency': {'Drift score': 0.054627254944577264, 'Method': 'Kolmogorov-Smirnov', 'Importance': None},
 'Text Length': {'Drift score': 0.029349196299481184, 'Method': 'Kolmogorov-Smirnov', 'Importance': None},
 'Toxicity': {'Drift score': 0.023840752955406663, 'Method': 'Kolmogorov-Smirnov', 'Importance': None},
 'Subjectivity': {'Drift score': 0.034508944180376644, 'Method': 'Kolmogorov-Smirnov', 'Importance': None}}
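
Because the value is a plain dict, it is straightforward to post-process. The sketch below ranks properties by drift score. It uses a hand-copied subset of the output above with rounded scores, so that it runs without Deepchecks installed; with a real result you would pass `check_result.value` instead.

```python
# Subset of the check_result.value output shown above, with scores rounded for brevity.
value = {
    "Average Word Length": {"Drift score": 0.0535, "Method": "Kolmogorov-Smirnov"},
    "Formality": {"Drift score": 0.0804, "Method": "Kolmogorov-Smirnov"},
    "Language": {"Drift score": 0.0092, "Method": "Cramer's V"},
    "Text Length": {"Drift score": 0.0293, "Method": "Kolmogorov-Smirnov"},
}

# Sort property names from the most drifted to the least drifted.
ranked = sorted(value.items(), key=lambda item: item[1]["Drift score"], reverse=True)
for name, details in ranked:
    print(f"{name}: {details['Drift score']:.4f} ({details['Method']})")

most_drifted = ranked[0][0]
```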

Define a condition#

We can define a condition that makes sure that the NLP property drift scores do not exceed an allowed threshold.

check_result = (
    PropertyDrift()
    .add_condition_drift_score_less_than(0.001)
    .run(train_dataset, test_dataset)
)
check_result.show(show_additional_outputs=False)
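
Conceptually, a condition like this compares each property's drift score to a threshold and fails if any score is too high. The sketch below is a simplified illustration with a single threshold and rounded scores taken from the output earlier on this page. The helper `failing_properties` is hypothetical and does not reproduce how Deepchecks evaluates conditions.

```python
def failing_properties(drift_scores, threshold):
    """Return the properties whose drift score is not below the threshold."""
    return {name: score for name, score in drift_scores.items() if score >= threshold}


# Rounded drift scores for three of the properties reported earlier.
drift_scores = {"Formality": 0.0804, "Sentiment": 0.0404, "Language": 0.0092}

strict = failing_properties(drift_scores, threshold=0.001)
lenient = failing_properties(drift_scores, threshold=0.1)
print(strict)
print(lenient)
```

In this sketch, a threshold as strict as 0.001 flags every listed property, while a threshold of 0.1 flags none of them.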

Check Parameters#

The Property Drift check can be given a list of properties to use for the drift calculation, or a list of properties to exclude, through the properties and ignore_properties parameters respectively.
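
The effect of these two parameters can be pictured as a simple filter over property names. The following is a hypothetical sketch of that selection logic, not the Deepchecks source: `select_properties` is an illustrative helper, and how the library behaves when both parameters are passed together is not covered here.

```python
def select_properties(available, properties=None, ignore_properties=None):
    """Keep the requested properties, then drop any that should be ignored."""
    selected = [name for name in available if properties is None or name in properties]
    ignored = set(ignore_properties or [])
    return [name for name in selected if name not in ignored]


# Hypothetical set of properties calculated for a dataset.
available = ["Text Length", "Language", "Sentiment", "Toxicity"]

only_two = select_properties(available, properties=["Language", "Sentiment"])
without_toxicity = select_properties(available, ignore_properties=["Toxicity"])
print(only_two)          # ['Language', 'Sentiment']
print(without_toxicity)  # ['Text Length', 'Language', 'Sentiment']
```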

In addition, the Property Drift check supports several parameters that control how drift is calculated and displayed. The most relevant ones are described in the Drift User Guide.
