UnknownTokens#

class UnknownTokens[source]#

Find samples that contain tokens unsupported by your tokenizer.

Parameters

tokenizer: t.Any , default: None: Tokenizer from the HuggingFace transformers library to use for tokenization. If None, AutoTokenizer.from_pretrained(‘bert-base-uncased’) will be used.
group_singleton_words: bool, default: False: If True, group all words that appear only once in the data into the “Other” category in the display.
n_most_commonint , default: 5: Number of most common words with unknown tokens to show in the display.
n_samples: int, default: 1_000_000: number of samples to use for this check.
random_stateint, default: 42: random seed for all check internals.
max_text_length_for_displayint, default 30: truncate text samples to given length before display.

__init__(tokenizer: Optional[Any] = None, group_singleton_words: bool = False, n_most_common: int = 5, n_samples: int = 1000000, random_state: int = 42, max_text_length_for_display: int = 30, **kwargs)[source]#

Methods

`UnknownTokens.add_condition`(name, ...)	Add new condition function to the check.
`UnknownTokens.add_condition_ratio_of_unknown_words_less_or_equal`([ratio])	Add condition that checks if the ratio of unknown words is less than a given ratio.
`UnknownTokens.clean_conditions`()	Remove all conditions from this check instance.
`UnknownTokens.conditions_decision`(result)	Run conditions on given result.
`UnknownTokens.config`([include_version, ...])	Return check configuration (conditions' configuration not yet supported).
`UnknownTokens.create_pie_chart`(...)	Create pie chart with most common unknown words.
`UnknownTokens.find_unknown_words`(samples, ...)	Find words with unknown tokens in samples.
`UnknownTokens.from_config`(conf[, ...])	Return check object from a CheckConfig object.
`UnknownTokens.from_json`(conf[, version_unmatch])	Deserialize check instance from JSON string.
`UnknownTokens.metadata`([with_doc_link])	Return check metadata.
`UnknownTokens.name`()	Name of class in split camel case.
`UnknownTokens.params`([show_defaults])	Return parameters to show when printing the check.
`UnknownTokens.remove_condition`(index)	Remove given condition by index.
`UnknownTokens.run`(dataset[, model, ...])	Run check.
`UnknownTokens.run_logic`(context, dataset_kind)	Run check.
`UnknownTokens.to_json`([indent, ...])	Serialize check instance to JSON string.

Examples#

SpecialCharacters.to_json

UnknownTokens.add_condition