Special Characters#

This notebook provides an overview for using and understanding the special characters check.

Structure:

Why check for special characters?
Generate data & model
Run the Check
Define a condition

Why check for special characters?#

The SpecialCharacters check looks for text samples in which special characters make up a significant percentage of all characters. Such samples can indicate a problem in the data pipeline that requires attention. They may also be difficult for the model to predict on. For example, a text sample with many emojis may be hard to predict on, and a common approach is to replace each emoji with a textual representation of it.
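To make the idea concrete, here is a minimal sketch of a per-sample special character ratio and of replacing emojis with a textual representation. It uses only the Python standard library. The definition of "special" used here (anything that is not alphanumeric, whitespace, or ASCII punctuation) is an assumption for illustration and may differ from the definition the check uses internally. The helper names `special_char_ratio` and `describe_emojis` are hypothetical.

```python
import string
import unicodedata


def special_char_ratio(text: str) -> float:
    """Return the fraction of characters in `text` that count as special.

    Assumption: a character is special when it is not alphanumeric,
    not whitespace, and not ASCII punctuation. The real check may use
    a different definition.
    """
    if not text:
        return 0.0
    special = sum(
        1
        for ch in text
        if not (ch.isalnum() or ch.isspace() or ch in string.punctuation)
    )
    return special / len(text)


def describe_emojis(text: str) -> str:
    """Replace symbol characters (Unicode category 'So') with ':name:' tokens."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            # Fall back to a generic token when the character has no Unicode name.
            name = unicodedata.name(ch, "unknown symbol")
            parts.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            parts.append(ch)
    return "".join(parts)


print(special_char_ratio("hi 😂"))       # 1 special character out of 4 -> 0.25
print(describe_emojis("so funny 😂"))    # so funny :face_with_tears_of_joy:
```

The replacement keeps the information an emoji carries in a form that a tokenizer trained on plain text can handle, which is the motivation described above.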

Generate data & model#

Let's load a simple dataset of tweets, some of which contain special characters such as emojis.

from deepchecks.nlp.datasets.classification import tweet_emotion

text_data = tweet_emotion.load_data(as_train_test=False)
text_data.head(3)
   text                                               label      user_age  gender  days_on_platform  user_region
0  “Worry is a down payment on a problem you may ...  optimism   30.73     Male    5614              Americas
1  My roommate: it's okay that we can't spell bec...  anger      42.29     Female  4308              Europe
2  No but that's so cute. Atsu was probably shy a...  happiness  24.97     Male    2729              Middle East/Africa


Run the Check#

from deepchecks.nlp.checks import SpecialCharacters

check = SpecialCharacters()
result = check.run(text_data)
result.show()
Special Characters


We can see in the check display that ~17% of the samples contain at least one special character and that the samples with the highest percentage of special characters contain many emojis.

In addition to the check display, we can also retrieve a summary of the most common special characters and the samples that contain them. This can help us confirm that the majority of the special characters in this dataset are indeed emojis.

result.value['samples_per_special_char']
{'😂': [58, 78, 200, 204, 354, 413, 469, 494, 525, 754, 810, 873, 916, 936, 1033, 1037, 1101, 1167, 1250, 1323, 1352, 1378, 1469, 1492, 1564, 1687, 1715, 1820, 1887, 1934, 2030, 2049, 2153, 2173, 2327, 2376, 2408, 2533, 2546, 2567, 2729, 2744, 2759, 2765, 2798, 2861, 2908, 2973, 3044, 3099, 3128, 3133, 3277, 3295, 3323, 3328, 3403, 3421, 3546, 3599, 3680, 3693, 3706, 3708, 3713, 3719, 3720, 3764, 3772, 3815, 3817, 3862, 3878, 3885, 3891, 3906, 3929, 3964, 4010, 4031, 4037, 4057, 4111, 4112, 4190, 4191, 4240, 4241, 4256, 4267, 4309, 4316, 4322, 4336, 4341, 4361, 4387, 4402, 4495, 4546, 4559, 4578],
 '\ufe0f': [181, 184, 232, 296, 423, 747, 830, 889, 950, 1016, 1399, 1418, 1468, 1474, 1855, 1965, 2005, 2057, 2091, 2485, 2576, 2709, 2730, 2748, 2773, 2870, 3071, 3078, 3291, 3318, 3346, 3440, 3463, 3569, 3763, 3786, 3797, 3825, 3885, 3918, 3959, 4103, 4161, 4169, 4231, 4283, 4495, 4509, 4559, 4573],
 '😭': [78, 139, 478, 606, 754, 1275, 1492, 1637, 1687, 1721, 1781, 1918, 2008, 2016, 2081, 2178, 2533, 2620, 2744, 2971, 2973, 3308, 3420, 3456, 3483, 3554, 3615, 3640, 3692, 3696, 3725, 3772, 3792, 3815, 3883, 3887, 3898, 4008, 4051, 4119, 4127, 4157, 4231, 4384, 4420, 4460, 4525, 4540, 4563],
 '’': [21, 23, 39, 82, 361, 394, 557, 856, 980, 1086, 1272, 1296, 1397, 1420, 1670, 1714, 2117, 2166, 2267, 2406, 2434, 2569, 2578, 2596, 2719, 2775, 2819, 2887, 3020, 3052, 3693, 3727, 3805, 3962, 4002, 4063, 4186, 4381, 4453, 4497],
 '😡': [169, 171, 272, 327, 495, 786, 807, 854, 1030, 1093, 1161, 1235, 1326, 1327, 2127, 2212, 2900, 3375, 3393, 3468, 3606, 3755, 3774, 3787, 4045, 4180, 4205, 4209, 4224],
 '🙄': [30, 167, 250, 478, 709, 714, 1297, 1331, 1352, 1418, 1497, 1678, 2153, 2312, 2525, 2748, 2756, 2759, 2765, 2854, 2973, 3099, 3204, 3343, 4597],
 '❤': [184, 423, 668, 889, 1016, 1399, 1468, 2005, 2057, 2091, 2514, 2730, 3071, 3199, 3476, 3587, 3825, 3959, 4103, 4109, 4161, 4231, 4471, 4509, 4573],
 '“': [0, 43, 349, 508, 598, 994, 1677, 1890, 2276, 2406, 2751, 2769, 2774, 2775, 2934, 3113, 3290, 3310, 3897, 3962, 4026, 4392],
 ...}
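A mapping like the one above is easier to read once it is reduced to one row per character. The following is a minimal sketch of such a summary, using only the standard library. The small `samples_per_special_char` dictionary is a hypothetical stand-in with the same shape as the check's output, and `summarize` is an illustrative helper, not part of deepchecks. It assumes every key is a single character.

```python
import unicodedata

# Hypothetical stand-in shaped like result.value['samples_per_special_char']:
# each key is one special character, each value lists the sample indices containing it.
samples_per_special_char = {
    "😂": [58, 78, 200],
    "’": [21, 23],
    "\u200d": [3291],
}


def summarize(mapping, top_n=10):
    """Return (unicode_name, category, sample_count) rows, most frequent first."""
    rows = []
    for char, sample_ids in mapping.items():
        # Characters without a Unicode name (e.g. private-use code points)
        # fall back to their code point notation.
        name = unicodedata.name(char, f"U+{ord(char):04X}")
        rows.append((name, unicodedata.category(char), len(sample_ids)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows[:top_n]


for name, category, count in summarize(samples_per_special_char):
    print(f"{count:>5}  {category}  {name}")
```

Applied to the real check output, the Unicode category column gives a quick way to tell emojis and other symbols (category `So`) apart from typographic punctuation such as curly quotes and dashes.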

Define a condition#

We can add a condition that will validate that the percentage of samples with a significant ratio of special characters is below a certain threshold. Let's add a condition and re-run the check:

check.add_condition_samples_ratio_w_special_characters_less_or_equal(0.01)
result = check.run(text_data)
result.show()
Special Characters
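The logic behind a condition of this kind can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: a sample is treated as flagged when its special character ratio exceeds a per-sample threshold, and the condition passes when the share of flagged samples is at most `max_ratio`. The function name, the `threshold_ratio_per_sample` parameter, and its value of 0.2 are hypothetical choices for this sketch. Only the 0.01 limit comes from the call above.

```python
from typing import Sequence, Tuple


def samples_ratio_condition(
    per_sample_ratios: Sequence[float],
    max_ratio: float = 0.01,
    threshold_ratio_per_sample: float = 0.2,  # illustrative value, an assumption
) -> Tuple[bool, float]:
    """Return (passed, observed_ratio) for a list of per-sample special character ratios."""
    if not per_sample_ratios:
        return True, 0.0
    # Count samples whose special character ratio is above the per-sample threshold.
    flagged = sum(1 for ratio in per_sample_ratios if ratio > threshold_ratio_per_sample)
    observed = flagged / len(per_sample_ratios)
    return observed <= max_ratio, observed


# One of four samples is dominated by special characters, so 25% are flagged.
passed, observed = samples_ratio_condition([0.0, 0.05, 0.5, 0.0])
print(passed, observed)
```

Separating the per-sample threshold from the dataset-level limit lets a few emoji-heavy samples pass unnoticed while still failing when they become a meaningful share of the data.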


