FrequentSubstrings

class FrequentSubstrings

Checks for frequent substrings in the dataset.

The check extracts substrings of varying lengths (n-grams) from the dataset's text samples and calculates how often each one occurs. Only substrings that meet the minimum length (min_ngram_length) are retained. The retained substrings are sorted by frequency, and the check displays the most frequent ones together with any that surpass the significance level (significant_substring_ratio).
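
The extraction and counting steps described above can be sketched in plain Python. This is an illustrative sketch only, not the check's actual implementation: the helper name ngram_sample_ratios, the whitespace tokenization, the max_ngram_length cap, and measuring frequency as the share of samples that contain an n-gram are all assumptions made for the example.

```python
from collections import Counter


def ngram_sample_ratios(samples, min_ngram_length=4, max_ngram_length=6):
    """Return {ngram: share of samples containing it}, most frequent first.

    Hypothetical helper: whitespace tokenization and the max_ngram_length
    cap are assumptions of this sketch, not documented behavior.
    """
    counts = Counter()
    for text in samples:
        words = text.split()
        seen = set()
        for n in range(min_ngram_length, max_ngram_length + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)  # count each n-gram at most once per sample
    total = len(samples)
    return {ngram: count / total for ngram, count in counts.most_common()}


samples = [
    "thank you for your order number 1",
    "thank you for your patience today",
    "we hope to see you again soon",
]
ratios = ngram_sample_ratios(samples)
top_ngram, top_ratio = next(iter(ratios.items()))
print(top_ngram, round(top_ratio, 2))
```

Counting each n-gram once per sample keeps a single repetitive sample from dominating the ranking; whether the check itself counts this way is not stated here.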

Parameters
n_to_show: int, default: 5

Number of most frequent substrings to show.

n_samples: int, default: 10_000

Number of samples to use for this check.

random_state: int, default: 42

Random seed for all check internals.

n_sentences: int, default: 5

The number of sentences to extract from the beginning and end of the text content.
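
As a rough illustration of what n_sentences controls, the sketch below keeps only the first and last few sentences of a text. The helper name head_and_tail_sentences and the naive punctuation-based sentence splitting are assumptions for this example; the check may split sentences differently.

```python
import re


def head_and_tail_sentences(text, n_sentences=5):
    """Keep the first and last n_sentences of text (hypothetical helper).

    Splitting after '.', '!' or '?' followed by whitespace is a
    simplifying assumption of this sketch.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if len(sentences) <= 2 * n_sentences:
        return sentences  # short text: nothing to trim
    return sentences[:n_sentences] + sentences[-n_sentences:]


text = "One. Two. Three. Four. Five. Six."
trimmed = head_and_tail_sentences(text, n_sentences=2)
print(trimmed)
```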

min_ngram_length: int, default: 4

Minimum number of words for a substring to be considered a frequent substring.

min_substring_ratio: float, default: 0.05

Minimum frequency required for a substring to be considered “frequent”.

significant_substring_ratio: float, default: 0.3

Frequency at or above which substrings are considered significant. Substrings meeting or exceeding this ratio will always be returned, regardless of other parameters and conditions.
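
One way to picture how n_to_show, min_substring_ratio and significant_substring_ratio interact is the sketch below. It is an assumption-laden illustration, not the check's code: the function name select_substrings and the input format (a mapping from substring to frequency ratio) are hypothetical.

```python
def select_substrings(ratios, n_to_show=5, min_substring_ratio=0.05,
                      significant_substring_ratio=0.3):
    """Pick substrings to return from a {substring: ratio} mapping.

    Hypothetical helper: returns the top n_to_show frequent substrings,
    plus every substring at or above the significance ratio.
    """
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    frequent = [s for s in ranked if ratios[s] >= min_substring_ratio]
    shown = frequent[:n_to_show]
    # Significant substrings are kept even if they fall outside the top n.
    extra = [s for s in ranked
             if ratios[s] >= significant_substring_ratio and s not in shown]
    return shown + extra


ratios = {"a b c d": 0.5, "e f g h": 0.4, "i j k l": 0.35,
          "m n o p": 0.1, "q r s t": 0.01}
selected = select_substrings(ratios, n_to_show=2)
print(selected)
```

Here "m n o p" is frequent but falls outside the top two, and "q r s t" is below min_substring_ratio, so neither is returned; "i j k l" is returned because it meets the significance ratio.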

frequency_margin: float, default: 0.02

Defines the tolerance level for selecting longer overlapping substrings. If a longer substring has a frequency that’s less than a shorter overlapping substring but the difference is within the specified frequency_margin, the longer substring is still preferred.
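
The preference rule can be sketched as a small comparison. The function name prefer_longer, the (substring, frequency) pair format, and treating the margin as inclusive are assumptions made for this illustration.

```python
def prefer_longer(shorter, longer, frequency_margin=0.02):
    """Choose between two overlapping substrings (hypothetical helper).

    Each argument is a (substring, frequency) pair. The longer substring
    wins when its frequency is lower by no more than frequency_margin;
    treating the boundary as inclusive is an assumption.
    """
    short_text, short_freq = shorter
    long_text, long_freq = longer
    if short_freq - long_freq <= frequency_margin:
        return long_text
    return short_text


close = prefer_longer(("best regards from", 0.20),
                      ("best regards from the team", 0.19))
far = prefer_longer(("best regards from", 0.20),
                    ("best regards from the team", 0.10))
print(close)
print(far)
```

With the default margin of 0.02, a drop from 0.20 to 0.19 still favors the longer substring, while a drop to 0.10 does not.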

min_relative_change: float, optional, default: 0.05

Defines the threshold for relative change. If the computed relative change falls below this specified threshold, it is considered insignificant and is thus set to zero.
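
The thresholding behavior can be sketched as follows. The text above does not say which two quantities the relative change is computed between, so the formula (new - old) / old, the use of the absolute value, and the function name thresholded_relative_change are all assumptions for illustration.

```python
def thresholded_relative_change(old, new, min_relative_change=0.05):
    """Relative change from old to new, zeroed when insignificant.

    Hypothetical helper: the (new - old) / old formula and comparing the
    absolute value against the threshold are assumptions of this sketch.
    """
    change = (new - old) / old
    return change if abs(change) >= min_relative_change else 0.0


small = thresholded_relative_change(0.20, 0.205)  # about 2.5%, below threshold
large = thresholded_relative_change(0.20, 0.30)   # about 50%, kept
print(small, large)
```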

__init__(n_to_show: int = 5, n_samples: int = 10000, random_state: int = 42, n_sentences: int = 5, min_ngram_length: int = 4, min_substring_ratio: float = 0.05, significant_substring_ratio: float = 0.3, frequency_margin: float = 0.02, min_relative_change: float = 0.05, **kwargs)
__new__(*args, **kwargs)

Methods

FrequentSubstrings.add_condition(name, ...)

Add new condition function to the check.

FrequentSubstrings.add_condition_zero_result([...])

Add condition - check that the number of frequent substrings is below the minimum.

FrequentSubstrings.clean_conditions()

Remove all conditions from this check instance.

FrequentSubstrings.conditions_decision(result)

Run conditions on given result.

FrequentSubstrings.config([include_version, ...])

Return check configuration (conditions' configuration not yet supported).

FrequentSubstrings.from_config(conf[, ...])

Return check object from a CheckConfig object.

FrequentSubstrings.from_json(conf[, ...])

Deserialize check instance from JSON string.

FrequentSubstrings.metadata([with_doc_link])

Return check metadata.

FrequentSubstrings.name()

Name of class in split camel case.

FrequentSubstrings.params([show_defaults])

Return parameters to show when printing the check.

FrequentSubstrings.remove_condition(index)

Remove given condition by index.

FrequentSubstrings.run(dataset[, model, ...])

Run check.

FrequentSubstrings.run_logic(context, ...)

Run check.

FrequentSubstrings.to_json([indent, ...])

Serialize check instance to JSON string.

Examples