.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "nlp/auto_checks/data_integrity/plot_special_characters.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end `
        to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_nlp_auto_checks_data_integrity_plot_special_characters.py:

.. _nlp__special_characters:

Special Characters
******************

This notebook provides an overview for using and understanding the special characters check.

**Structure:**

* `Why check for special characters? <#why-check-for-special-characters>`__
* `Generate data & model <#generate-data-model>`__
* `Run the Check <#run-the-check>`__
* `Define a Condition <#define-a-condition>`__

Why check for special characters?
===================================

The ``SpecialCharacters`` check looks for text samples in which special characters
make up a significant percentage of all characters. Such samples can indicate
a problem in the data pipeline that requires attention. They may also be
difficult for the model to predict on. For example, a text sample with many
emojis may be hard to predict on, and a common approach is to replace each emoji
with a textual representation of it.

Generate data & model
=====================

Let's load the tweet emotion dataset, which includes text samples that contain
special characters such as emojis.

.. GENERATED FROM PYTHON SOURCE LINES 31-37

.. code-block:: default


    from deepchecks.nlp.datasets.classification import tweet_emotion

    text_data = tweet_emotion.load_data(as_train_test=False)
    text_data.head(3)

.. raw:: html

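
    <table>
      <thead>
        <tr><th></th><th>text</th><th>label</th><th>user_age</th><th>gender</th><th>days_on_platform</th><th>user_region</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>“Worry is a down payment on a problem you may ...</td><td>optimism</td><td>30.73</td><td>Male</td><td>5614</td><td>Americas</td></tr>
        <tr><th>1</th><td>My roommate: it's okay that we can't spell bec...</td><td>anger</td><td>42.29</td><td>Female</td><td>4308</td><td>Europe</td></tr>
        <tr><th>2</th><td>No but that's so cute. Atsu was probably shy a...</td><td>happiness</td><td>24.97</td><td>Male</td><td>2729</td><td>Middle East/Africa</td></tr>
      </tbody>
    </table>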


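To make the idea of a "percentage of special characters out of all characters" concrete, here is a minimal sketch in plain Python. It does not use the check's internals: the ``is_special`` rule below (anything that is not alphanumeric, whitespace, or ASCII punctuation) is an assumption made for illustration, and the check's own definition may differ.

```python
import string

# Assumption for illustration only: a character counts as "special" when it is
# not alphanumeric, not whitespace, and not ASCII punctuation. The
# SpecialCharacters check may use a different definition.
ORDINARY_PUNCTUATION = set(string.punctuation)


def is_special(ch: str) -> bool:
    """Return True if the character is 'special' under the assumed rule."""
    return not (ch.isalnum() or ch.isspace() or ch in ORDINARY_PUNCTUATION)


def special_char_ratio(text: str) -> float:
    """Share of characters in ``text`` that are special (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(is_special(ch) for ch in text) / len(text)


# Two hypothetical samples: plain text, and text ending with three emojis.
samples = ["plain text, nothing odd!", "so happy \U0001F602\U0001F602\U0001F602"]
ratios = [special_char_ratio(s) for s in samples]
print(ratios)
```

A sample whose ratio is high under a rule like this is the kind of sample the check is meant to surface.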
.. GENERATED FROM PYTHON SOURCE LINES 38-40

Run the Check
=============

.. GENERATED FROM PYTHON SOURCE LINES 40-47

.. code-block:: default


    from deepchecks.nlp.checks import SpecialCharacters

    check = SpecialCharacters()
    result = check.run(text_data)
    result.show()

.. raw:: html

Special Characters


.. GENERATED FROM PYTHON SOURCE LINES 48-54

We can see in the check display that ~17% of the samples contain at least one
special character and that the samples with the highest percentage of special
characters contain many emojis.

In addition to the check display, we can also retrieve a summary of the most
common special characters and the samples that contain them. This helps us
confirm that the majority of the special characters in this dataset are indeed
emojis.

.. GENERATED FROM PYTHON SOURCE LINES 54-57

.. code-block:: default


    result.value['samples_per_special_char']

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    {'๐Ÿ˜‚': [58, 78, 200, 204, 354, 413, 469, 494, 525, 754, 810, 873, 916, 936, 1033, 1037, 1101, 1167, 1250, 1323, 1352, 1378, 1469, 1492, 1564, 1687, 1715, 1820, 1887, 1934, 2030, 2049, 2153, 2173, 2327, 2376, 2408, 2533, 2546, 2567, 2729, 2744, 2759, 2765, 2798, 2861, 2908, 2973, 3044, 3099, 3128, 3133, 3277, 3295, 3323, 3328, 3403, 3421, 3546, 3599, 3680, 3693, 3706, 3708, 3713, 3719, 3720, 3764, 3772, 3815, 3817, 3862, 3878, 3885, 3891, 3906, 3929, 3964, 4010, 4031, 4037, 4057, 4111, 4112, 4190, 4191, 4240, 4241, 4256, 4267, 4309, 4316, 4322, 4336, 4341, 4361, 4387, 4402, 4495, 4546, 4559, 4578], '๏ธ': [181, 184, 232, 296, 423, 747, 830, 889, 950, 1016, 1399, 1418, 1468, 1474, 1855, 1965, 2005, 2057, 2091, 2485, 2576, 2709, 2730, 2748, 2773, 2870, 3071, 3078, 3291, 3318, 3346, 3440, 3463, 3569, 3763, 3786, 3797, 3825, 3885, 3918, 3959, 4103, 4161, 4169, 4231, 4283, 4495, 4509, 4559, 4573], '๐Ÿ˜ญ': [78, 139, 478, 606, 754, 1275, 1492, 1637, 1687, 1721, 1781, 1918, 2008, 2016, 2081, 2178, 2533, 2620, 2744, 2971, 2973, 3308, 3420, 3456, 3483, 3554, 3615, 3640, 3692, 3696, 3725, 3772, 3792, 3815, 3883, 3887, 3898, 4008, 4051, 4119, 4127, 4157, 4231, 4384, 4420, 4460, 4525, 4540, 4563], 'โ€™': [21, 23, 39, 82, 361, 394, 557, 856, 980, 1086, 1272, 1296, 1397, 1420, 1670, 1714, 2117, 2166, 2267, 2406, 2434, 2569, 2578, 2596, 2719, 2775,
2819, 2887, 3020, 3052, 3693, 3727, 3805, 3962, 4002, 4063, 4186, 4381, 4453, 4497], '๐Ÿ˜ก': [169, 171, 272, 327, 495, 786, 807, 854, 1030, 1093, 1161, 1235, 1326, 1327, 2127, 2212, 2900, 3375, 3393, 3468, 3606, 3755, 3774, 3787, 4045, 4180, 4205, 4209, 4224], '๐Ÿ™„': [30, 167, 250, 478, 709, 714, 1297, 1331, 1352, 1418, 1497, 1678, 2153, 2312, 2525, 2748, 2756, 2759, 2765, 2854, 2973, 3099, 3204, 3343, 4597], 'โค': [184, 423, 668, 889, 1016, 1399, 1468, 2005, 2057, 2091, 2514, 2730, 3071, 3199, 3476, 3587, 3825, 3959, 4103, 4109, 4161, 4231, 4471, 4509, 4573], 'โ€œ': [0, 43, 349, 508, 598, 994, 1677, 1890, 2276, 2406, 2751, 2769, 2774, 2775, 2934, 3113, 3290, 3310, 3897, 3962, 4026, 4392], 'โ€': [43, 349, 508, 598, 994, 1677, 1879, 1890, 2276, 2751, 2769, 2775, 2934, 3113, 3290, 3310, 3897, 3962, 4026, 4360, 4392], '๐Ÿ˜ฉ': [116, 494, 1117, 1143, 1542, 1687, 1809, 1880, 2229, 2682, 2792, 3029, 3172, 3215, 3519, 3617, 3986, 4127, 4157, 4381, 4399], '๐Ÿ˜ข': [659, 1617, 1948, 2008, 2052, 2250, 2316, 2812, 3416, 3480, 3584, 3747, 3838, 3898, 3915, 4020, 4303, 4304, 4562, 4608], '๐Ÿ˜': [155, 764, 908, 1236, 1315, 1468, 2360, 3041, 3055, 3320, 3529, 3796, 3825, 3862, 4248, 4294, 4471, 4603], '๐Ÿ™ƒ': [200, 424, 549, 589, 625, 868, 1290, 1793, 2567, 3030, 3260, 3371, 3467, 3532, 3669, 3703, 3876, 4361], '๐Ÿ˜Š': [1037, 1080, 1319, 1468, 1696, 2005, 2285, 2992, 3098, 3202, 3577, 3591, 3769, 4006, 4116, 4207, 4251, 4335], '๐Ÿ˜ ': [250, 2053, 2211, 2212, 3316, 3375, 3393, 3468, 3532, 3611, 3632, 3670, 3705, 4027, 4054, 4418, 4450], 'โ€”': [39, 241, 437, 845, 1560, 1565, 1612, 1762, 2124, 2184, 2561, 2747, 2769, 2810, 3969, 4360], '๐Ÿ˜…': [311, 838, 2430, 2591, 2622, 2917, 3007, 3204, 3337, 3869, 3923, 4072, 4272, 4327, 4463, 4596], '๐Ÿ˜ค': [917, 1480, 2206, 2294, 2445, 2629, 2822, 2863, 3087, 3190, 3245, 3393, 3468, 4586, 4595], '๐Ÿ˜˜': [281, 423, 599, 946, 964, 1880, 2950, 3540, 3821, 3988, 4103, 4192, 4242, 4543], '๐Ÿ™ˆ': [300, 390, 554, 1107, 1394, 1478, 1570, 2570, 
3078, 3182, 3409, 3680, 3986, 4164], '๐Ÿคฃ': [3453, 3471, 3481, 3713, 3722, 3791, 3822, 3834, 3863, 4227, 4232, 4392, 4651], '๐Ÿผ': [300, 1161, 1478, 2057, 2379, 3277, 3334, 3446, 3569, 3644, 4349, 4592], '๐Ÿป': [462, 1327, 1914, 2246, 2989, 3275, 3291, 3763, 3918, 3933, 4256, 4559], '๐Ÿ˜„': [890, 3476, 3578, 3714, 3797, 3816, 3829, 3980, 4087, 4101, 4400, 4491], 'โ€ฆ': [102, 418, 2106, 2387, 2839, 2990, 3130, 3144, 3332, 3377, 3733], '๐Ÿ˜ฐ': [205, 1124, 1838, 2023, 3348, 3372, 3898, 4065, 4385, 4505, 4646], '๐Ÿ˜ž': [694, 1071, 1701, 2981, 3295, 3683, 3698, 3803, 3898, 4167, 4381], '๐Ÿ’”': [139, 891, 2034, 2229, 2316, 3008, 3393, 3715, 4332, 4583], '๐Ÿ˜ณ': [171, 457, 844, 1341, 1478, 1907, 2481, 2640, 4048, 4321], 'ยฃ': [495, 933, 1258, 1748, 1866, 2687, 3787, 3903, 4368, 4383], 'โ˜น': [232, 296, 830, 3282, 3346, 3910, 4244, 4384, 4591], '๐Ÿ˜•': [488, 762, 1406, 1709, 2410, 2651, 3205, 4399, 4430], 'โ€•': [124, 1072, 1194, 2391, 2751, 3386, 4026, 4536], '๐Ÿ”ฅ': [169, 528, 570, 1175, 1347, 2825, 4326, 4634], '๐Ÿ™': [280, 511, 3370, 3512, 3544, 3694, 3723, 4588], '๐Ÿค”': [354, 589, 917, 990, 1071, 2445, 4296, 4449], '๐Ÿ™': [385, 462, 1914, 1999, 3314, 3495, 4300, 4329], '๐Ÿ’•': [764, 1830, 2223, 3202, 3446, 4256, 4299, 4645], '๐Ÿ˜': [1129, 1152, 2993, 3855, 4060, 4183, 4198, 4574], '๐Ÿ˜’': [1632, 2263, 2281, 3059, 3393, 4287, 4399, 4616], '๐Ÿ˜ฑ': [456, 650, 2066, 3630, 3654, 4396, 4621], '๐Ÿ‘Œ': [1053, 1431, 2989, 3644, 3933, 4349, 4592], '๐Ÿ˜': [1075, 1980, 2073, 2910, 3437, 3679, 4495], '๐Ÿ˜ƒ': [1732, 3945, 4121, 4225, 4314, 4344, 4365], '๐Ÿ˜Ÿ': [3446, 3524, 3777, 3798, 4139, 4174, 4234], '๐Ÿ˜ท': [117, 169, 675, 1167, 1910, 4127], '๐Ÿ˜ฅ': [240, 2167, 3479, 3581, 3737, 4298], '๐Ÿ˜‘': [343, 2327, 2413, 2565, 3188, 3698], '๐Ÿ˜ฌ': [444, 2086, 2547, 3185, 4386, 4601], '๐Ÿ˜†': [449, 1060, 2074, 3619, 3743, 3871], 'โ€˜': [856, 1086, 1272, 2887, 3727, 4002], '๐Ÿพ': [1175, 1418, 1999, 2748, 3405, 3495], '๐Ÿ˜ง': [1541, 3563, 3671, 3766, 3949, 4271], '๐ŸŽ‰': [1585, 
2240, 3275, 4041, 4103, 4299], 'โ€“': [1890, 2596, 2719, 2771, 2978, 3736], '๐Ÿฝ': [3118, 3786, 4169, 4300, 4329, 4495], '\u200d': [3291, 3763, 3786, 3918, 4169, 4559], '\xa0': [0, 392, 2106, 2896, 4328], '๐Ÿ™Œ': [54, 1175, 3405, 3913, 4256], '๐Ÿ˜‹': [118, 217, 2729, 2732, 4575], '๐Ÿ’ฆ': [217, 1965, 2773, 2943, 4634], '๐Ÿ‘€': [300, 2718, 3393, 3885, 4000], '๐Ÿ˜“': [532, 901, 1603, 3450, 4020], '๐Ÿ’–': [587, 1585, 1696, 2240, 3641], '๐Ÿ˜ด': [642, 2481, 3090, 4175, 4184], 'โœŒ': [1053, 2091, 2748, 2870, 3569], '๐Ÿ˜': [1492, 2165, 2521, 3393, 4325], 'โ˜บ': [1855, 1965, 2576, 2773, 3463], '๐Ÿ‘': [2164, 3275, 3334, 4341, 4495], 'โ™€': [3291, 3763, 3918, 4169, 4559], '๐Ÿ™‚': [442, 1056, 2750, 3474], '๐Ÿ˜€': [969, 4212, 4571, 4632], '๐Ÿค—': [1061, 1101, 1487, 2989], '๐Ÿ˜‰': [1259, 3488, 3540, 3946], '๐Ÿ˜ช': [1815, 1831, 2682, 2947], '๐Ÿ˜ฆ': [2226, 3243, 3276, 3661], '๐Ÿ˜”': [2703, 3841, 3900, 4616], 'โœจ': [54, 1005, 2384], '๐Ÿ’˜': [296, 587, 2123], '๐Ÿ˜–': [557, 3032, 3381], '๐Ÿ™…': [557, 1418, 1478], '๐Ÿ˜‡': [645, 3015, 3442], '๐Ÿ’€': [690, 727, 1687], '๐Ÿ”ช': [691, 2454, 4082], '๐Ÿ˜ซ': [975, 1344, 4300], '๐Ÿ˜ถ': [1031, 3185, 4622], '๐Ÿ‘Š': [1161, 2379, 3393], '๐Ÿ’ช': [1327, 2057, 3405], '๐ŸŽŠ': [1585, 2240, 3275], '๐Ÿ‘Ž': [1672, 3393, 4056], '๐Ÿ‘ป': [1875, 2431, 3495], '๐Ÿค“': [2537, 2697, 3983], '๐Ÿ˜': [2537, 3570, 3960], '๐Ÿคท': [3291, 3786, 4559], '๐Ÿ‡ฌ': [3405, 4267, 4432], '๐Ÿ’ฏ': [3405, 4341, 4509], '๐Ÿ‘': [3579, 4311, 4495], '๐Ÿคฆ': [3763, 3918, 4169], '๐ŸŽพ': [3933, 4038, 4601], '๐Ÿ˜Œ': [4329, 4332, 4553], '๐ŸŒš': [32, 4410], '๐ŸŒ': [32, 4410], '๐Ÿ': [181, 2433], '๐Ÿ‘': [297, 4207], '๐Ÿ™‹': [300, 3446], '๐Ÿ’™': [359, 2903], '๐Ÿค': [427, 4573], 'โ™ช': [502, 4264], '๐Ÿ˜™': [668, 2366], 'โœ': [694, 2485], '๐Ÿ’': [708, 3421], 'โ‚ฌ': [895, 2816], 'โ€ข': [953, 1879], '๐Ÿ‘‹': [958, 2398], '๐Ÿ‘': [1155, 2968], '๐Ÿ•': [1216, 2379], '๐ŸŽฉ': [1292, 3440], 'ยป': [1309, 4469], '๐ŸŒ': [1448, 3966], '๐Ÿ™Š': [1478, 2786], '๐Ÿ’ธ': [1632, 3059], '๐Ÿ’ค': [1809, 
2562], '๏ผ‰': [1919, 3100], 'โ˜€': [2019, 4495], '๐Ÿ–•': [2455, 3277], '๐ŸŽธ': [2620, 2734], '๐ŸŽˆ': [2709, 3275], '๐Ÿ˜œ': [2943, 3600], 'โšฝ': [3078, 3885], '๐Ÿ˜ฒ': [3279, 4433], 'ยฟ': [3282, 4105], '๐Ÿ‡ฆ': [3325, 4613], '๐Ÿ”': [3333, 4256], 'โ˜”': [3390, 3797], '๐ŸŒธ': [3446, 4103], 'โœ…': [3782, 4543], '๐Ÿ’š': [3793, 4543], '๐Ÿฟ': [3913, 4341], '๐ŸŒž': [4103, 4517], '๐Ÿ‡ง': [4267, 4432], '๐Ÿ’ฉ': [19], '๐ŸŽถ': [66], '๐Ÿก': [169], '๐Ÿž': [169], '๐ŸŒ“': [181], 'โ˜ฏ': [181], '๐ŸŒน': [181], 'ยด': [205], '๐Ÿ”ซ': [300], '๐Ÿ‘ˆ': [300], '๐Ÿ˜ฟ': [360], '๐Ÿฅ': [382], '๐Ÿค–': [427], '๐ŸŒซ': [484], '๐ŸŒŠ': [484], '๐ŸŽฌ': [499], '๐Ÿ“ฝ': [499], 'ใ€Œ': [502], 'ใƒป': [502], 'ใ€': [502], 'ยฉ': [502], '๐Ÿ‘บ': [557], 'โ™จ': [599], 'โšซ': [694], '\uf645': [696], '\uf64a': [696], '\uf3fc': [696], '\uf648': [696], '\uf633': [696], 'โ€ผ': [747], 'โ—': [950], '๐Ÿ˜ˆ': [954], '๐Ÿณ': [1168], '๐ŸŽ': [1204], '๐ŸŒฒ': [1204], '๐Ÿบ': [1216], '๐Ÿน': [1216], 'โ˜‰': [1216], 'โ†': [1355], 'โ†’': [1355], '๐Ÿ‘ฝ': [1400], 'โฃ': [1417], '๐Ÿ…ฟ': [1418], '๐Ÿ›‚': [1448], 'โœˆ': [1474], '๐Ÿ’ฅ': [1508], '๐Ÿ•ฏ': [1521], '๐Ÿƒ': [1597], '๐ŸŽญ': [1662], '๐Ÿ˜ฎ': [1672], '๐Ÿ': [1678], 'เบด': [1704], 'อซ': [1704], 'ีŸ': [1704], '๐Ÿน': [1729], '๐Ÿ”จ': [1753], '๐Ÿ”ฉ': [1753], '๐Ÿข': [1753], 'โ™ฅ': [1806], '๐Ÿ‘ฟ': [1880], '\uf629': [1885], 'โ”': [1919], 'โ–ก': [1919], 'โ•ฏ': [1919], 'โ”ป': [1919], 'ยฐ': [1919], '๏ธต': [1919], '๐Ÿด': [1941], '๐Ÿฆ„': [1941], '๐Ÿ‚': [2019], '๐ŸŽต': [2019], '๐Ÿ†': [2048], '๐Ÿ‘‚': [2073], '๐Ÿฉ': [2091], 'โœ‹': [2246], '๐Ÿป': [2250], 'โ˜Š': [2267], '๐Ÿง€': [2379], '๐Ÿ': [2379], '๐Ÿ™': [2379], '๐Ÿฑ': [2379], '๐Ÿพ': [2433], '\U000fe334': [2486], 'โ„ข': [2513], '๐ŸŽผ': [2620], 'โฌ…': [2709], '๐Ÿ˜Ž': [2734], '๐Ÿพ': [2933], '๐Ÿถ': [3015], 'โ‰ฆ': [3100], 'โ‰ง': [3100], 'โˆ‡': [3100], '๏ผˆ': [3100], 'โœŠ': [3118], '๏ธŽ': [3199], '๐Ÿ˜ธ': [3314], 'โญ': [3318], '๐Ÿ‡จ': [3325], '๐Ÿ‘„': [3393], '๐Ÿ™': [3393], '๐Ÿ‡ณ': [3405], 'โœ‰': [3440], '๐ŸŒˆ': [3442], '๐ŸŒผ': 
[3446], '๐ŸŒป': [3446], '๐Ÿ˜จ': [3468], '๐Ÿ˜ต': [3468], '๐Ÿฐ': [3488], '๐Ÿซ': [3585], '๐Ÿ˜›': [3585], '\ufeff': [3627], '๐Ÿ’ฐ': [3751], '๐Ÿ’ต': [3751], '๐Ÿคข': [3763], '๐ŸŽƒ': [3782], 'โ™‚': [3786], '๐Ÿ’œ': [3844], 'โ™ซ': [3870], '๐Ÿ–ค': [3887], '๐Ÿคฌ': [3887], 'โ– ': [3997], '๐ŸŒฎ': [4041], '๐Ÿ’›': [4041], '๐ŸŠ': [4074], '๐Ÿคก': [4074], '๐Ÿ—ก': [4082], '๐Ÿœ': [4082], 'โ–ฝ': [4097], 'ใ€’': [4097], '๐Ÿ’ž': [4103], '๐Ÿ’—': [4103], '\u200b': [4195], 'โ˜•': [4283], '๐ŸŽ': [4299], 'โญ': [4360], 'โฌ': [4360], '๐Ÿฆ‡': [4404], 'ยซ': [4469], '๐Ÿ”ต': [4487], '๐ŸŒฅ': [4495], '๐ŸŒ‘': [4517], '๐ŸŒฑ': [4565], '๐Ÿ˜ฃ': [4571], 'ู': [4578], '๐Ÿ‡น': [4613], '๐Ÿ‘…': [4634]}

.. GENERATED FROM PYTHON SOURCE LINES 58-65

Define a condition
==================

We can add a condition that validates that the ratio of samples with a
significant share of special characters does not exceed a given threshold.

Let's add a condition with a maximum ratio of 0.01 and re-run the check:

.. GENERATED FROM PYTHON SOURCE LINES 65-69

.. code-block:: default


    check.add_condition_samples_ratio_w_special_characters_less_or_equal(0.01)
    result = check.run(text_data)
    result.show()

.. raw:: html

Special Characters

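The ``samples_per_special_char`` mapping retrieved earlier can be summarised with a few lines of plain Python. The sketch below is not part of the check itself: it uses a small excerpt of the output shown above (only the four keys that appear as escape sequences) and ranks characters by how many samples contain them. The variable names are ours.

```python
# A small excerpt of the mapping shown above: special character -> indices of
# the samples that contain it. Only escape-sequence keys are copied here.
samples_per_special_char = {
    "\u200d": [3291, 3763, 3786, 3918, 4169, 4559],  # zero-width joiner
    "\xa0": [0, 392, 2106, 2896, 4328],              # non-breaking space
    "\ufeff": [3627],                                # byte order mark
    "\u200b": [4195],                                # zero-width space
}

# Rank characters by the number of samples they appear in (most common first).
counts = sorted(
    ((ch, len(indices)) for ch, indices in samples_per_special_char.items()),
    key=lambda item: item[1],
    reverse=True,
)

# Distinct samples that contain at least one of the listed characters.
affected_samples = set()
for indices in samples_per_special_char.values():
    affected_samples.update(indices)

for ch, n in counts:
    # repr() keeps invisible characters readable in the printed summary.
    print(f"{ch!r}: {n} sample(s)")
print(f"{len(affected_samples)} distinct samples affected")
```

Running the same loop over the full ``result.value['samples_per_special_char']`` mapping gives a quick view of which characters dominate, and invisible characters such as ``'\u200b'`` stand out because ``repr`` prints them as escapes.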

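If the condition fails mostly because of emojis, one remediation mentioned at the top of this page is to replace each emoji with a textual representation. Below is a minimal sketch using only the standard library. The helper names and the ``:name:`` output format are hypothetical, and treating the Unicode category ``So`` (Symbol, other) as "emoji-like" is a simplifying assumption rather than a full emoji definition.

```python
import unicodedata


def is_emoji_like(ch: str) -> bool:
    # Simplifying assumption: characters in Unicode category "So"
    # (Symbol, other) are treated as emoji-like. Skin-tone modifiers,
    # variation selectors and zero-width joiners are NOT covered.
    return unicodedata.category(ch) == "So"


def replace_emojis_with_names(text: str) -> str:
    """Replace emoji-like characters with a ':unicode_name:' token."""
    parts = []
    for ch in text:
        name = unicodedata.name(ch, "") if is_emoji_like(ch) else ""
        if name:
            parts.append(":" + name.lower().replace(" ", "_") + ":")
        else:
            # Keep ordinary characters (and unnamed symbols) unchanged.
            parts.append(ch)
    return "".join(parts)


converted = replace_emojis_with_names("so funny \U0001F602")
print(converted)
```

After a transformation like this, the text contains only ordinary characters for the replaced emojis, so re-running the check on the cleaned text should report fewer special characters.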
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 0.459 seconds)


.. _sphx_glr_download_nlp_auto_checks_data_integrity_plot_special_characters.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_special_characters.py `

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_special_characters.ipynb `


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_