
.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/data_integrity/plot_unknown_tokens.py"

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_data_integrity_plot_unknown_tokens.py:

.. _nlp__unknown_tokens:

Unknown Tokens
**************

This notebook provides an overview for using and understanding the Unknown Tokens check.

**Structure:**

* `What is the purpose of the check? <#what-is-the-purpose-of-the-check>`__
* `Generate data & model <#generate-data-model>`__
* `Run the check <#run-the-check>`__
* `Using the Check Value <#using-the-check-value>`__
* `Define a condition <#define-a-condition>`__

What is the purpose of the check?
==================================

The Unknown Tokens check is designed to help you identify samples that contain
tokens not supported by your tokenizer. Such unsupported tokens can lead to poor
model performance, as the model may not be able to understand their meaning. By
identifying these unknown tokens, you can take appropriate action, such as
updating your tokenizer or preprocessing your data to handle them.

Generate data & model
=====================

In this example, we'll use the ``tweet_emotion`` Twitter dataset.

.. code-block:: default

    from deepchecks.nlp.datasets.classification import tweet_emotion

    # load_data() returns two datasets; only the first is needed here.
    dataset, _ = tweet_emotion.load_data()

Run the check
=============

The check has several key parameters that affect its behavior and output:

* ``tokenizer``: Tokenizer from the HuggingFace transformers library to use for
  tokenization. If ``None``, ``AutoTokenizer.from_pretrained('bert-base-uncased')``
  will be used. It's highly recommended to use a fast tokenizer.
* ``group_singleton_words``: If ``True``, group all words that appear only once in
  the data into the "Other" category in the display.

.. code-block:: default

    from deepchecks.nlp.checks import UnknownTokens

    # With no arguments, the check uses the default tokenizer described above.
    check = UnknownTokens()
    result = check.run(dataset)
    result.show()
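
To make the effect of ``group_singleton_words`` more concrete, here is a minimal
sketch in plain Python. It is not Deepchecks' implementation: the helper
``group_singletons`` and the word counts are made up for illustration, and the
bucket label simply reuses the "Other Unknown Words" name shown in the pie chart.

.. code-block:: python

    from collections import Counter


    def group_singletons(word_counts, other_label="Other Unknown Words"):
        """Fold every word that was seen exactly once into a single bucket."""
        grouped = Counter()
        for word, count in word_counts.items():
            key = other_label if count == 1 else word
            grouped[key] += count
        return dict(grouped)


    # Hypothetical counts of unknown words, for illustration only.
    counts = {"🙄": 8, "😅": 3, "한국어": 1, "✌🏾": 1}
    print(group_singletons(counts))

Words that occur more than once keep their own entry, while the two singletons
end up together in the shared bucket.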

Observe the check's output
--------------------------

We see in the results that the check found many emojis and some foreign words
(Korean, which can be seen by hovering over the "Other Unknown Words" slice of
the pie chart) that are not supported by the tokenizer. We can also see that the
check grouped all words that appear only once in the data into the "Other"
category.

Use a Different Tokenizer
-------------------------

We can also use a different tokenizer, such as the GPT2 tokenizer, to see how
the results change.

.. code-block:: default

    from transformers import AutoTokenizer

    # Pass the tokenizer explicitly instead of relying on the default.
    tokenizer = AutoTokenizer.from_pretrained('gpt2')
    UnknownTokens(tokenizer=tokenizer).run(dataset)
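
One reason results can differ between tokenizers is how each one handles
characters outside its vocabulary. The sketch below is a toy model, not the real
BERT or GPT2 tokenizers: ``toy_vocab_tokenize`` and ``byte_level_tokenize`` are
hypothetical helpers and the vocabulary is invented. A fixed-vocabulary scheme
has to fall back to an unknown token, whereas a byte-level scheme (the approach
GPT2's tokenizer is built on) can always represent a word through its UTF-8
bytes.

.. code-block:: python

    UNK = "[UNK]"


    def toy_vocab_tokenize(word, vocab):
        """Return the word if every character is in vocab, otherwise UNK."""
        return word if all(ch in vocab for ch in word) else UNK


    def byte_level_tokenize(word):
        """Return the word's UTF-8 bytes; each value lies in the range 0-255."""
        return list(word.encode("utf-8"))


    # Invented character vocabulary, for illustration only.
    vocab = set("abcdefghijklmnopqrstuvwxyz#")

    for word in ["glee", "🙄"]:
        print(word, toy_vocab_tokenize(word, vocab), byte_level_tokenize(word))

In the toy vocabulary the emoji collapses to ``[UNK]``, while the byte-level
encoding keeps enough information to recover it.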

Using the Check Value
=====================

On top of observing the check's display, we can use the check's returned value
to get more information about the words containing unknown tokens in our
dataset. The check's value is a nested dictionary with the following keys:

1. ``unknown_word_ratio``: The ratio of unknown words out of all words in the
   dataset.
2. ``unknown_word_details``: A dictionary with a key for each unknown word. The
   value for each key is a dictionary containing ``'ratio'`` (the ratio of the
   unknown word out of all words in the dataset) and ``'indexes'`` (the indexes
   of the samples containing the unknown word).

We'll show here how you can use this value to get the individual samples
containing unknown tokens.

.. code-block:: default

    from pprint import pprint

    # Pick the first unknown word reported by the check.
    unknown_word_details = result.value['unknown_word_details']
    first_unknown_word = list(unknown_word_details.keys())[0]
    print(f"Unknown word: {first_unknown_word}")

    # Look up the samples that contain that word.
    word_indexes = unknown_word_details[first_unknown_word]['indexes']
    pprint(dataset.text[word_indexes].tolist())

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Unknown word: 🙄
    ['Why have I only just started watching glee this week I am now addicted 🙄 '
     '#glee #GLEEK',
     "Just had to reverse half way up the woods to collect the dog n I've never "
     'even reverse parked in my life 🙄 #nightmare',
     'I was literally shaking getting the EKG done lol 🙄',
     'Being shy is the biggest struggle of my life. 🙄',
     "@user did you not learn from @user 's viral insult to ballet? Stop trying to "
     'wrongfully stick models into pointe shoes 🙄',
     "Can't believe I've only got 2 days off left 🙄 #backtoreality",
     "These people irritate tf out of me I swear 🙄 I'm goin to sleep ✌🏾️",
     "@user I don't even remember that part 😅 the movie wasn't terrible, it just "
     "wasn't very scary and I expected a better ending 🙄"]

Note that in the earlier run with the GPT2 tokenizer, the check did not find any
unknown tokens, because that tokenizer supports emojis.

Define a condition
==================

We can add a condition that validates that the ratio of unknown words in the
dataset does not exceed a given threshold. This can be useful to ensure that
your dataset does not have a high percentage of unknown tokens, which might
negatively impact the performance of your model.

.. code-block:: default

    check.add_condition_ratio_of_unknown_words_less_or_equal(0.005)
    result = check.run(dataset)
    result.show(show_additional_outputs=False)


In this example, the condition checks whether the ratio of unknown words is less
than or equal to 0.005 (0.5%). If the ratio is higher than the threshold, the
condition will fail, indicating a potential issue with the dataset.
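
The pass/fail logic described here comes down to comparing
``unknown_word_ratio`` with the threshold. The sketch below applies that
comparison to a hand-written dictionary in the shape described under "Using the
Check Value". The numbers are invented, and ``ratio_within_threshold`` is a
hypothetical helper rather than part of Deepchecks. It also shows one way the
per-word details could be ranked by ratio.

.. code-block:: python

    # Hand-written stand-in for result.value; all numbers are made up.
    value = {
        "unknown_word_ratio": 0.008,
        "unknown_word_details": {
            "😅": {"ratio": 0.002, "indexes": [17]},
            "🙄": {"ratio": 0.005, "indexes": [3, 17, 42]},
            "✌🏾": {"ratio": 0.001, "indexes": [58]},
        },
    }


    def ratio_within_threshold(check_value, max_ratio=0.005):
        """Mirror the 'less than or equal' comparison the condition describes."""
        return check_value["unknown_word_ratio"] <= max_ratio


    # Rank unknown words from the highest ratio to the lowest.
    ranked = sorted(
        value["unknown_word_details"].items(),
        key=lambda item: item[1]["ratio"],
        reverse=True,
    )

    print(ratio_within_threshold(value))
    print([word for word, _ in ranked])

With these made-up numbers the overall ratio exceeds the threshold, so the
comparison fails, and the ranking points to the word that contributes most.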