FrequentSubstrings
- class FrequentSubstrings
Checks for frequent substrings in the dataset.
Substrings of varying lengths (n-grams) are extracted from the dataset text samples. The frequencies of these n-grams are calculated and only substrings exceeding a defined minimum length are retained. The substrings are then sorted by their frequencies and the most frequent substrings are identified. Finally, the substrings with the highest frequency and those surpassing a significance level are displayed.
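The extraction and ranking steps described above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: it assumes that "frequency" means the fraction of samples containing an n-gram, that words are split on whitespace, and it introduces a hypothetical `max_ngram_length` cap that is not a parameter of the check.

```python
from collections import Counter


def ngram_ratios(samples, min_ngram_length=4, max_ngram_length=6):
    """Return {ngram: fraction of samples that contain it}.

    Assumption: an n-gram is counted once per sample, however often it repeats.
    max_ngram_length is a hypothetical cap used only to keep the sketch small.
    """
    counts = Counter()
    for text in samples:
        words = text.split()
        seen = set()  # n-grams found in this sample
        for n in range(min_ngram_length, max_ngram_length + 1):
            for i in range(len(words) - n + 1):
                seen.add(" ".join(words[i:i + n]))
        counts.update(seen)
    total = len(samples)
    return {gram: c / total for gram, c in counts.items()} if total else {}


def frequent_ngrams(samples, min_substring_ratio=0.05, n_to_show=5, **kwargs):
    """Keep n-grams at or above min_substring_ratio, most frequent first."""
    ratios = ngram_ratios(samples, **kwargs)
    kept = [(g, r) for g, r in ratios.items() if r >= min_substring_ratio]
    # Sort by frequency, then prefer longer strings, then alphabetically.
    kept.sort(key=lambda item: (-item[1], -len(item[0]), item[0]))
    return kept[:n_to_show]


samples = [
    "please contact our support team today",
    "please contact our support team now",
    "something else entirely here ok",
]
top = frequent_ngrams(samples, min_substring_ratio=0.5)
```

With these three samples, only the n-grams shared by the first two texts pass the 0.5 threshold; everything else appears in a single sample and is dropped.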
- Parameters
- n_to_show: int, default: 5
Number of most frequent substrings to show.
- n_samples: int, default: 10_000
Number of samples to use for this check.
- random_state: int, default: 42
Random seed for all check internals.
- n_sentences: int, default: 5
The number of sentences to extract from the beginning and end of the text content.
- min_ngram_length: int, default: 4
Minimum number of words for a substring to be considered a frequent substring.
- min_substring_ratio: float, default: 0.05
Minimum frequency required for a substring to be considered “frequent”.
- significant_substring_ratio: float, default: 0.3
Frequency at or above which substrings are considered significant. Substrings meeting or exceeding this ratio will always be returned, regardless of other parameters and conditions.
- frequency_margin: float, default: 0.02
Defines the tolerance level for selecting longer overlapping substrings. If a longer substring has a frequency that’s less than a shorter overlapping substring but the difference is within the specified frequency_margin, the longer substring is still preferred.
- min_relative_change: float, optional, default: 0.05
Defines the threshold for relative change. If the computed relative change falls below this specified threshold, it is considered insignificant and is thus set to zero.
- __init__(n_to_show: int = 5, n_samples: int = 10000, random_state: int = 42, n_sentences: int = 5, min_ngram_length: int = 4, min_substring_ratio: float = 0.05, significant_substring_ratio: float = 0.3, frequency_margin: float = 0.02, min_relative_change: float = 0.05, **kwargs)
- __new__(*args, **kwargs)
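To make the three threshold parameters more concrete, here is a minimal sketch of how each rule might be written, following only the parameter descriptions above. The helper names (`prefer_longer`, `is_significant`, `damp_relative_change`) are hypothetical and not part of the library, and the quantity that "relative change" is computed from is not specified here, so the last helper simply takes it as an input.

```python
def prefer_longer(short_ratio, long_ratio, frequency_margin=0.02):
    """Return True when the longer overlapping substring should be preferred.

    The longer substring may be less frequent than the shorter one,
    as long as the gap stays within frequency_margin.
    """
    return short_ratio - long_ratio <= frequency_margin


def is_significant(ratio, significant_substring_ratio=0.3):
    """Return True when a substring meets or exceeds the significance ratio."""
    return ratio >= significant_substring_ratio


def damp_relative_change(relative_change, min_relative_change=0.05):
    """Zero out a relative change that falls below the threshold."""
    return relative_change if relative_change >= min_relative_change else 0.0


# A longer substring at 19% is preferred over a shorter one at 20%,
# but not when it drops to 15%.
keeps_longer = prefer_longer(0.20, 0.19)
keeps_shorter = not prefer_longer(0.20, 0.15)
```

Under these assumptions, a substring with a ratio of 0.3 counts as significant, and a relative change of 0.01 is treated as zero.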
Methods
|
Add new condition function to the check. |
Add condition - check that the number of frequent substrings is below the minimum. |
|
Remove all conditions from this check instance. |
|
Run conditions on given result. |
|
|
Return check configuration (conditions' configuration not yet supported). |
|
Return check object from a CheckConfig object. |
|
Deserialize check instance from JSON string. |
|
Return check metadata. |
Name of class in split camel case. |
|
|
Return parameters to show when printing the check. |
Remove given condition by index. |
|
|
Run check. |
|
Run check. |
|
Serialize check instance to JSON string. |