String Length Out Of Bounds

import pandas as pd

from deepchecks.tabular.checks.integrity.string_length_out_of_bounds import \
    StringLengthOutOfBounds
col1 = ["aaaaa33", "aaaaaaa33"]*40
col1.append("a")
col1.append("aaaaaadsfasdfasdf")

col2 = ["b", "abc"]*41

col3 = ["a"]*80
col3.append("a"*100)
col3.append("a"*200)
# col1 and col3 contain outliers, col2 does not
df = pd.DataFrame({"col1":col1, "col2": col2, "col3": col3 })
StringLengthOutOfBounds(min_unique_value_ratio=0.01).run(df)

String Length Out Of Bounds

Detect strings with length that is much longer/shorter than the identified "normal" string lengths.

Additional Outputs
* showing only the top 10 columns, you can change it using n_top_columns param
Column Name  Range of Detected Normal String Lengths  Range of Detected Outlier String Lengths  Number of Outlier Samples  Example Samples
col1         7 - 9                                    1 - 1                                     1                          ['a']
col1         7 - 9                                    17 - 17                                   1                          ['aaaaaadsfasdfasdf']
col3         1 - 1                                    100 - 200                                 2                          ['aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa...']
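
To build intuition for what "much longer/shorter than normal" can mean, here is a minimal standard-library sketch that flags string lengths falling outside Tukey fences (1.5 times the interquartile range). This is an illustrative stand-in, not the algorithm deepchecks uses internally; the function name `outlier_lengths` and its parameter `k` are hypothetical and exist only for this example.

```python
from statistics import quantiles


def outlier_lengths(values, k=1.5):
    """Return the sorted distinct string lengths outside the Tukey fences.

    Hypothetical helper for illustration only; not part of deepchecks.
    """
    lengths = sorted(len(str(v)) for v in values)
    # Quartiles of the length distribution (linear interpolation, inclusive).
    q1, _, q3 = quantiles(lengths, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Any length outside [lower, upper] is treated as an outlier.
    return sorted({n for n in lengths if n < lower or n > upper})


# Same data as the first example above.
col1 = ["aaaaa33", "aaaaaaa33"] * 40 + ["a", "aaaaaadsfasdfasdf"]
col2 = ["b", "abc"] * 41
col3 = ["a"] * 80 + ["a" * 100, "a" * 200]

for name, col in {"col1": col1, "col2": col2, "col3": col3}.items():
    print(name, outlier_lengths(col))
```

On this data the sketch flags lengths 1 and 17 in col1, nothing in col2, and lengths 100 and 200 in col3, which agrees with the check's output for these columns. Agreement on other data is not guaranteed, since the two approaches differ.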


col = ["a","a","a","a","a","a","a","a","a","a","a","a","a","ab","ab","ab","ab","ab","ab", "ab"]*1000
col.append("basdbadsbaaaaaaaaaa")
col.append("basdbadsbaaaaaaaaaaa")
df = pd.DataFrame({"col1":col})
StringLengthOutOfBounds(num_percentiles=1000, min_unique_values=3).run(df)

String Length Out Of Bounds

Detect strings with length that is much longer/shorter than the identified "normal" string lengths.

Additional Outputs
* showing only the top 10 columns, you can change it using n_top_columns param
Column Name  Range of Detected Normal String Lengths  Range of Detected Outlier String Lengths  Number of Outlier Samples  Example Samples
col1         1 - 2                                    19 - 20                                   2                          ['basdbadsbaaaaaaaaaa', 'basdbadsbaaaaaaaaaaa']


