Special Characters
This notebook provides an overview for using and understanding the special characters check.
Why check for special characters?
The SpecialCharacters check looks for text samples in which special characters make up a significant
percentage of all characters. Such samples can indicate a problem in the data pipeline that
requires attention. Additionally, such samples may be difficult for the model to predict on.
For example, a text sample with many emojis may be hard to predict on, and a common approach is
to replace each emoji with a textual representation.
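To make the idea concrete, here is a minimal plain-Python sketch of the two notions above: the share of special characters in a sample, and replacing emojis with a textual representation. It is not the deepchecks implementation. The helper names (`special_char_ratio`, `replace_emojis_with_names`) are hypothetical, and the definition of "special" (anything outside ASCII letters, digits, whitespace and ASCII punctuation) is an assumption that may differ from the library's.

```python
import string
import unicodedata

# Assumption: a character is "special" if it is not an ASCII letter, digit,
# whitespace character, or ASCII punctuation mark. The check's own definition
# may differ.
ORDINARY = set(
    string.ascii_letters + string.digits + string.whitespace + string.punctuation
)


def special_char_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that are special."""
    if not text:
        return 0.0
    special = sum(1 for ch in text if ch not in ORDINARY)
    return special / len(text)


def replace_emojis_with_names(text: str) -> str:
    """Replace characters in Unicode category 'So' (most emojis) with :name:."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "UNKNOWN SYMBOL")
            parts.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            parts.append(ch)
    return "".join(parts)


sample = "so happy \U0001F600\U0001F600"
print(round(special_char_ratio(sample), 2))
print(replace_emojis_with_names(sample))
```

Using Unicode character names keeps the sketch dependency-free. A real pipeline might prefer a dedicated emoji library with friendlier aliases.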
Generate data & model
Let's load the tweet_emotion dataset, whose text samples include emojis and other special characters.
from deepchecks.nlp.datasets.classification import tweet_emotion
text_data = tweet_emotion.load_data(as_train_test=False)
text_data.head(3)
Run the Check
from deepchecks.nlp.checks import SpecialCharacters
check = SpecialCharacters()
result = check.run(text_data)
result.show()
We can see in the check display that ~17% of the samples contain at least one special character and that the samples with the highest percentage of special characters contain many emojis.
In addition to the check display, we can also retrieve a summary of the most common special characters and the samples that contain them. This helps us confirm that the majority of the special characters in this dataset are indeed emojis.
result.value['samples_per_special_char']
(Output abridged: the full dictionary maps each special character, mostly emojis, to the indices of the samples that contain it. Entries with non-emoji keys include:)
{..., '£': [495, 933, 1258, 1748, 1866, 2687, 3787, 3903, 4368, 4383], ..., '\u200d': [3291, 3763, 3786, 3918, 4169, 4559], '\xa0': [0, 392, 2106, 2896, 4328], ..., '»': [1309, 4469], ..., '¿': [3282, 4105], ..., '´': [205], ..., '©': [502], ..., '°': [1919], ..., '\ufeff': [3627], ..., '\u200b': [4195], ..., '«': [4469], ...}
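A mapping of this shape can be reproduced on toy data with a few lines of plain Python. The sketch below is only an illustration of the structure, not how the check computes it. The function name `samples_per_special_char` and the definition of "special" are assumptions made for this example.

```python
import string
from collections import defaultdict

# Assumption: "special" means anything outside ASCII letters, digits,
# whitespace and ASCII punctuation. The check's own definition may differ.
ORDINARY = set(
    string.ascii_letters + string.digits + string.whitespace + string.punctuation
)


def samples_per_special_char(samples):
    """Map each special character to the indices of the samples containing it."""
    index = defaultdict(list)
    for i, text in enumerate(samples):
        # Each character is counted once per sample; sorting keeps the
        # result deterministic.
        for ch in sorted(set(text) - ORDINARY):
            index[ch].append(i)
    # Order the characters by how many samples contain them, most common first.
    return dict(sorted(index.items(), key=lambda kv: len(kv[1]), reverse=True))


toy_samples = [
    "good morning",
    "price is \u00a35",
    "\u00a3\u00a3\u00a3 \u2764",
    "I \u2764 it \u00a3",
]
print(samples_per_special_char(toy_samples))
```

Looking up a key in the returned dictionary gives the sample indices to inspect, which is how the summary above can be used to confirm which characters dominate.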
Define a condition
We can add a condition that will validate that the percentage of samples with a significant ratio of special characters is below a certain threshold. Let's add a condition and re-run the check.
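As a rough illustration of what such a condition evaluates, the following plain-Python sketch flags samples whose special-character ratio exceeds a per-sample threshold and then checks the share of flagged samples against a maximum. It does not use the deepchecks API. The function `evaluate_condition`, both threshold values, and the definition of "special" are hypothetical choices for this example.

```python
import string

# Assumption: "special" means anything outside ASCII letters, digits,
# whitespace and ASCII punctuation. The check's own definition may differ.
ORDINARY = set(
    string.ascii_letters + string.digits + string.whitespace + string.punctuation
)


def special_char_ratio(text):
    """Return the fraction of characters in `text` that are special."""
    if not text:
        return 0.0
    return sum(ch not in ORDINARY for ch in text) / len(text)


def evaluate_condition(samples, max_samples_ratio=0.05, per_sample_threshold=0.2):
    """Return (passed, ratio_of_flagged_samples).

    A sample is flagged when more than `per_sample_threshold` of its
    characters are special. The condition passes when the share of flagged
    samples is at most `max_samples_ratio`. Both defaults are illustrative.
    """
    if not samples:
        return True, 0.0
    flagged = sum(special_char_ratio(t) > per_sample_threshold for t in samples)
    ratio = flagged / len(samples)
    return ratio <= max_samples_ratio, ratio


toy_samples = ["hello world", "ok", "\u2764\u2764\u2764 yes", "fine"]
passed, ratio = evaluate_condition(toy_samples)
print(passed, ratio)
```

Separating the per-sample threshold from the dataset-level maximum lets the two be tuned independently: the first decides what counts as a "significant" ratio within one sample, the second how many such samples are tolerated.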