FrequentSubstrings

class FrequentSubstrings

Checks for frequent substrings in the dataset.

Substrings of varying lengths (n-grams) are extracted from the dataset text samples. The frequencies of these n-grams are calculated and only substrings exceeding a defined minimum length are retained. The substrings are then sorted by their frequencies and the most frequent substrings are identified. Finally, the substrings with the highest frequency and those surpassing a significance level are displayed.
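The counting step described above can be sketched in plain Python. This is an illustrative approximation, not the library's implementation: the helper names, the `max_ngram_length` cap, whitespace tokenization, and counting each n-gram at most once per sample are all assumptions made for this example.

```python
from collections import Counter


def ngram_sample_ratios(samples, min_ngram_length=4, max_ngram_length=6):
    """Return {ngram: fraction of samples that contain it} for word n-grams.

    Hypothetical helper: max_ngram_length is an assumed cap, not a check parameter.
    """
    counts = Counter()
    for text in samples:
        words = text.split()
        seen = set()  # count each n-gram at most once per sample (assumption)
        for n in range(min_ngram_length, max_ngram_length + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)
    total = len(samples)
    return {gram: count / total for gram, count in counts.items()}


def frequent_substrings(samples, min_substring_ratio=0.05, n_to_show=5, **kwargs):
    """Keep n-grams at or above the ratio, most frequent (then longest) first."""
    ratios = ngram_sample_ratios(samples, **kwargs)
    kept = [(gram, ratio) for gram, ratio in ratios.items()
            if ratio >= min_substring_ratio]
    kept.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return kept[:n_to_show]


samples = [
    "please rate this product today thanks",
    "please rate this product today",
    "the weather is nice",
    "nothing to see here at all",
]
top = frequent_substrings(samples, min_substring_ratio=0.5)
for gram, ratio in top:
    print(f"{ratio:.2f}  {gram}")
```

In this toy dataset only the n-grams shared by the first two samples reach a ratio of 0.5, so they are the only ones returned.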

Parameters

n_to_show: int, default: 5: Number of most frequent substrings to show.
n_samples: int, default: 10_000: Number of samples to use for this check.
random_state: int, default: 42: Random seed for all check internals.
n_sentences: int, default: 5: The number of sentences to extract from the beginning and end of the text content.
min_ngram_length: int, default: 4: Minimum number of words for a substring to be considered a frequent substring.
min_substring_ratio: float, default: 0.05: Minimum frequency required for a substring to be considered “frequent”.
significant_substring_ratio: float, default: 0.3: Frequency at or above which substrings are considered significant. Substrings meeting or exceeding this ratio will always be returned, regardless of other parameters and conditions.
frequency_margin: float, default: 0.02: Defines the tolerance level for selecting longer overlapping substrings. If a longer substring has a frequency that’s less than a shorter overlapping substring but the difference is within the specified frequency_margin, the longer substring is still preferred.
min_relative_change: float, optional, default: 0.05: Defines the threshold for relative change. If the computed relative change falls below this specified threshold, it is considered insignificant and is thus set to zero.
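To make the frequency_margin rule concrete, here is a minimal sketch of one way such a preference could work. It is an assumption-laden illustration rather than the check's actual logic: `prefer_longer` is a hypothetical helper, "overlapping" is simplified to "the shorter substring appears word-for-word inside the longer one", and the frequencies are made-up inputs.

```python
def prefer_longer(candidates, frequency_margin=0.02):
    """Drop a shorter substring when a longer substring containing it
    is at most `frequency_margin` less frequent.

    candidates: dict mapping substring -> frequency ratio (hypothetical input).
    """
    kept = {}
    for text, freq in candidates.items():
        covered = any(
            other != text
            # pad with spaces so the match respects word boundaries
            and f" {text} " in f" {other} "
            and freq - other_freq <= frequency_margin
            for other, other_freq in candidates.items()
        )
        if not covered:
            kept[text] = freq
    return kept


candidates = {
    "thank you for your": 0.40,
    "thank you for your order": 0.39,
    "for your order today": 0.10,
}
with_margin = prefer_longer(candidates, frequency_margin=0.02)
without_margin = prefer_longer(candidates, frequency_margin=0.0)
print(sorted(with_margin))
print(sorted(without_margin))
```

With the default margin, the four-word substring is replaced by its five-word extension because their frequencies differ by only 0.01. With a margin of zero, both are kept.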

__init__(n_to_show: int = 5, n_samples: int = 10000, random_state: int = 42, n_sentences: int = 5, min_ngram_length: int = 4, min_substring_ratio: float = 0.05, significant_substring_ratio: float = 0.3, frequency_margin: float = 0.02, min_relative_change: float = 0.05, **kwargs)

__new__(*args, **kwargs)

Methods

`FrequentSubstrings.add_condition`(name, ...)	Add new condition function to the check.
`FrequentSubstrings.add_condition_zero_result`([...])	Add condition - check that the number of frequent substrings is below the minimum.
`FrequentSubstrings.clean_conditions`()	Remove all conditions from this check instance.
`FrequentSubstrings.conditions_decision`(result)	Run conditions on given result.
`FrequentSubstrings.config`([include_version, ...])	Return check configuration (conditions' configuration not yet supported).
`FrequentSubstrings.from_config`(conf[, ...])	Return check object from a CheckConfig object.
`FrequentSubstrings.from_json`(conf[, ...])	Deserialize check instance from JSON string.
`FrequentSubstrings.metadata`([with_doc_link])	Return check metadata.
`FrequentSubstrings.name`()	Name of class in split camel case.
`FrequentSubstrings.params`([show_defaults])	Return parameters to show when printing the check.
`FrequentSubstrings.remove_condition`(index)	Remove given condition by index.
`FrequentSubstrings.run`(dataset[, model, ...])	Run check.
`FrequentSubstrings.run_logic`(context, ...)	Run check.
`FrequentSubstrings.to_json`([indent, ...])	Serialize check instance to JSON string.
