Unknown Tokens#

This notebook provides an overview for using and understanding the Unknown Tokens check.

Structure:

  • What is the purpose of the check?
  • Generate data & model
  • Run the check
  • Observe the check’s output
  • Use a Different Tokenizer
  • Using the Check Value
  • Define a condition

What is the purpose of the check?#

The Unknown Tokens check is designed to help you identify samples that contain tokens not supported by your tokenizer. These unsupported tokens can lead to poor model performance, as the model may not be able to understand their meaning. By identifying these unknown tokens, you can take appropriate action, such as updating your tokenizer or preprocessing your data to handle them.
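To make this concrete, here is a minimal sketch (assuming the HuggingFace transformers library is installed) of how an unsupported character ends up as an unknown token:

from transformers import AutoTokenizer

# The default tokenizer the check falls back to when none is given.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# The emoji has no entry in the WordPiece vocabulary, so it is mapped to the [UNK] token.
print(tokenizer.tokenize("so happy 🙄"))  # the emoji comes back as '[UNK]'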

Generate data & model#

In this example, we’ll use the tweet emotion dataset, a collection of tweets labeled by the emotion they express.

from deepchecks.nlp.datasets.classification import tweet_emotion

dataset, _ = tweet_emotion.load_data()
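
Before running the check, a quick sketch for getting a feel for the data (the text property is the same one used later in this notebook):

print(len(dataset.text))  # number of samples
print(dataset.text[:3])   # first few raw tweets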

Run the check#

The check has several key parameters that affect its behavior and output:

  • tokenizer: Tokenizer from the HuggingFace transformers library to use for tokenization. If None, AutoTokenizer.from_pretrained('bert-base-uncased') will be used. It’s highly recommended to use a fast tokenizer.

  • group_singleton_words: If True, group all words that appear only once in the data into the “Other” category in the display. Both parameters are demonstrated in the short sketch after this list.
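
A short sketch of passing both parameters explicitly (the tokenizer shown is simply the default made explicit, and use_fast asks transformers for the fast implementation); the run below sticks to the defaults:

from transformers import AutoTokenizer
from deepchecks.nlp.checks import UnknownTokens

# An explicitly constructed fast tokenizer, plus grouping of singleton words in the display.
fast_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
check_with_params = UnknownTokens(tokenizer=fast_tokenizer, group_singleton_words=True)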

from deepchecks.nlp.checks import UnknownTokens

check = UnknownTokens()
result = check.run(dataset)
result.show()
Unknown Tokens


Observe the check’s output#

We see in the results that the check found many emojis and some foreign words (Korean, which can be seen by hovering over the “Other Unknown Words” slice of the pie chart) that are not supported by the tokenizer. We can also see that the check grouped all words that appear only once in the data into the “Other” category.

Use a Different Tokenizer#

We can also use a different tokenizer, such as the GPT2 tokenizer, to see how the results change.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')

UnknownTokens(tokenizer=tokenizer).run(dataset)
Unknown Tokens


Using the Check Value#

On top of observing the check’s display, we can use the check’s returned value to get more information about the words containing unknown tokens in our dataset. The check’s value is a nested dictionary with the following keys:

  1. unknown_word_ratio: The ratio of unknown words out of all words in the dataset (a short usage sketch follows this list).

  2. unknown_word_details: This is in turn also a dict, containing a key for each unknown word. The value for each key is a dict containing ‘ratio’ (the ratio of the unknown word out of all words in the dataset) and ‘indexes’ (the indexes of the samples containing the unknown word).
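
For example, the overall ratio can be read directly from the value; a small sketch reusing the result object from the run above:

# Fraction of unknown words out of all words in the dataset.
print(result.value['unknown_word_ratio'])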

We’ll show here how you can use this value to get the individual samples containing unknown tokens.

from pprint import pprint

unknown_word_details = result.value['unknown_word_details']
first_unknown_word = list(unknown_word_details.keys())[0]
print(f"Unknown word: {first_unknown_word}")

word_indexes = unknown_word_details[first_unknown_word]['indexes']
pprint(dataset.text[word_indexes].tolist())
Unknown word: 🙄
['Why have I only just started watching glee this week I am now addicted 🙄 '
 '#glee #GLEEK',
 "Just had to reverse half way up the woods to collect the dog n I've never "
 'even reverse parked in my life 🙄 #nightmare',
 'I was literally shaking getting the EKG done lol 🙄',
 'Being shy is the biggest struggle of my life. 🙄',
 "@user did you not learn from @user 's viral insult to ballet? Stop trying to "
 'wrongfully stick models into pointe shoes 🙄',
 "Can't believe I've only got 2 days off left 🙄 #backtoreality",
 "These people irritate tf out of me I swear 🙄 I'm goin to sleep ✌🏾️",
 "@user I don't even remember that part 😅 the movie wasn't terrible, it just "
 "wasn't very scary and I expected a better ending 🙄"]

Note that these samples come from the earlier run with the default BERT tokenizer. As we saw above, the GPT2 tokenizer supports emojis, so the run that used it did not find any unknown tokens.
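
To sanity-check this, we can tokenize one of the flagged emojis with both tokenizers; a small sketch (assuming the transformers library):

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
gpt2_tokenizer = AutoTokenizer.from_pretrained('gpt2')

# WordPiece has no entry for the emoji, so it falls back to [UNK].
print(bert_tokenizer.tokenize("🙄"))
# GPT2's byte-level BPE encodes the emoji as byte pieces, so nothing becomes unknown.
print(gpt2_tokenizer.tokenize("🙄"))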

Define a condition#

We can add a condition that validates the ratio of unknown words in the dataset is below a certain threshold. This can be useful to ensure that your dataset does not have a high percentage of unknown tokens, which might negatively impact the performance of your model.

check.add_condition_ratio_of_unknown_words_less_or_equal(0.005)
result = check.run(dataset)
result.show(show_additional_outputs=False)
Unknown Tokens


In this example, the condition checks if the ratio of unknown words is less than or equal to 0.005 (0.5%). If the ratio is higher than the threshold, the condition will fail, indicating a potential issue with the dataset.
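
If you want to act on the condition programmatically (for example, to fail a data-validation job), the returned result can be inspected in code. A sketch, assuming the result object exposes the passed_conditions() method:

# result was produced by check.run(dataset) above; raise if any attached condition failed.
if not result.passed_conditions():
    raise ValueError("Unknown-word ratio exceeds the allowed threshold")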
