MySQL8默认字符集utf8mb4

MySQL中utf8字符集使用三个字节编码一个字符,自2004(mysql4.1)年被引入,能够支持绝大多数语言,但依然有些字符不能正确编码,如表情字符。为此mysql5.5引入了utf8mb4字符集,提供了另一种选择。在mysql5.7对utf8mb4进行了大幅优化,并丰富了校验字符集。直到今日mysql8, utf8mb4成为了默认字符集,并极大地扩充了collation set。 增加了accent sensitive(as) , accent_insensitive(ai)(重读敏感,对于某些语言需要重读的字符可以是用专用字符标识,如:ã) ,可以支持更加丰富的字符校验。

登陆MySQL8 输入status命令

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
mysql8[(none)]>status
--------------
bin/mysql Ver 8.0.11 for linux-glibc2.12 on x86_64 (MySQL Community Server - GPL)

Connection id: 8
Current database:
Current user: [email protected]
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server version: 8.0.11 MySQL Community Server - GPL
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: utf8mb4
Db characterset: utf8mb4
Client characterset: utf8mb4
Conn. characterset: utf8mb4
UNIX socket: /tmp/mysql8.sock
Uptime: 13 min 7 sec

Threads: 3 Questions: 24 Slow queries: 0 Opens: 127 Flush tables: 2 Open tables: 103 Queries per second avg: 0.030
--------------

可以看到其所有字符集都是utf8mb4. 那么其默认的校验字符集是什么呢?执行如下SQL, 该SQL查出了MySQL8支持的所有字符集校验,还是蛮丰富的,针对不少特定语言都有专用的校验字符集:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
mysql8[(none)]>select * from information_schema.COLLATIONS where CHARACTER_SET_NAME = 'utf8mb4';
+----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| COLLATION_NAME | CHARACTER_SET_NAME | ID | IS_DEFAULT | IS_COMPILED | SORTLEN | PAD_ATTRIBUTE |
+----------------------------+--------------------+-----+------------+-------------+---------+---------------+
| utf8mb4_general_ci | utf8mb4 | 45 | | Yes | 1 | PAD SPACE |
| utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | PAD SPACE |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE |
| utf8mb4_icelandic_ci | utf8mb4 | 225 | | Yes | 8 | PAD SPACE |
| utf8mb4_latvian_ci | utf8mb4 | 226 | | Yes | 8 | PAD SPACE |
| utf8mb4_romanian_ci | utf8mb4 | 227 | | Yes | 8 | PAD SPACE |
| utf8mb4_slovenian_ci | utf8mb4 | 228 | | Yes | 8 | PAD SPACE |
| utf8mb4_polish_ci | utf8mb4 | 229 | | Yes | 8 | PAD SPACE |
| utf8mb4_estonian_ci | utf8mb4 | 230 | | Yes | 8 | PAD SPACE |
| utf8mb4_spanish_ci | utf8mb4 | 231 | | Yes | 8 | PAD SPACE |
| utf8mb4_swedish_ci | utf8mb4 | 232 | | Yes | 8 | PAD SPACE |
| utf8mb4_turkish_ci | utf8mb4 | 233 | | Yes | 8 | PAD SPACE |
| utf8mb4_czech_ci | utf8mb4 | 234 | | Yes | 8 | PAD SPACE |
| utf8mb4_danish_ci | utf8mb4 | 235 | | Yes | 8 | PAD SPACE |
| utf8mb4_lithuanian_ci | utf8mb4 | 236 | | Yes | 8 | PAD SPACE |
| utf8mb4_slovak_ci | utf8mb4 | 237 | | Yes | 8 | PAD SPACE |
| utf8mb4_spanish2_ci | utf8mb4 | 238 | | Yes | 8 | PAD SPACE |
| utf8mb4_roman_ci | utf8mb4 | 239 | | Yes | 8 | PAD SPACE |
| utf8mb4_persian_ci | utf8mb4 | 240 | | Yes | 8 | PAD SPACE |
| utf8mb4_esperanto_ci | utf8mb4 | 241 | | Yes | 8 | PAD SPACE |
| utf8mb4_hungarian_ci | utf8mb4 | 242 | | Yes | 8 | PAD SPACE |
| utf8mb4_sinhala_ci | utf8mb4 | 243 | | Yes | 8 | PAD SPACE |
| utf8mb4_german2_ci | utf8mb4 | 244 | | Yes | 8 | PAD SPACE |
| utf8mb4_croatian_ci | utf8mb4 | 245 | | Yes | 8 | PAD SPACE |
| utf8mb4_unicode_520_ci | utf8mb4 | 246 | | Yes | 8 | PAD SPACE |
| utf8mb4_vietnamese_ci | utf8mb4 | 247 | | Yes | 8 | PAD SPACE |
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
| utf8mb4_de_pb_0900_ai_ci | utf8mb4 | 256 | | Yes | 0 | NO PAD |
| utf8mb4_is_0900_ai_ci | utf8mb4 | 257 | | Yes | 0 | NO PAD |
| utf8mb4_lv_0900_ai_ci | utf8mb4 | 258 | | Yes | 0 | NO PAD |
| utf8mb4_ro_0900_ai_ci | utf8mb4 | 259 | | Yes | 0 | NO PAD |
| utf8mb4_sl_0900_ai_ci | utf8mb4 | 260 | | Yes | 0 | NO PAD |
| utf8mb4_pl_0900_ai_ci | utf8mb4 | 261 | | Yes | 0 | NO PAD |
| utf8mb4_et_0900_ai_ci | utf8mb4 | 262 | | Yes | 0 | NO PAD |
| utf8mb4_es_0900_ai_ci | utf8mb4 | 263 | | Yes | 0 | NO PAD |
| utf8mb4_sv_0900_ai_ci | utf8mb4 | 264 | | Yes | 0 | NO PAD |
| utf8mb4_tr_0900_ai_ci | utf8mb4 | 265 | | Yes | 0 | NO PAD |
| utf8mb4_cs_0900_ai_ci | utf8mb4 | 266 | | Yes | 0 | NO PAD |
| utf8mb4_da_0900_ai_ci | utf8mb4 | 267 | | Yes | 0 | NO PAD |
| utf8mb4_lt_0900_ai_ci | utf8mb4 | 268 | | Yes | 0 | NO PAD |
| utf8mb4_sk_0900_ai_ci | utf8mb4 | 269 | | Yes | 0 | NO PAD |
| utf8mb4_es_trad_0900_ai_ci | utf8mb4 | 270 | | Yes | 0 | NO PAD |
| utf8mb4_la_0900_ai_ci | utf8mb4 | 271 | | Yes | 0 | NO PAD |
| utf8mb4_eo_0900_ai_ci | utf8mb4 | 273 | | Yes | 0 | NO PAD |
| utf8mb4_hu_0900_ai_ci | utf8mb4 | 274 | | Yes | 0 | NO PAD |
| utf8mb4_hr_0900_ai_ci | utf8mb4 | 275 | | Yes | 0 | NO PAD |
| utf8mb4_vi_0900_ai_ci | utf8mb4 | 277 | | Yes | 0 | NO PAD |
| utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD |
| utf8mb4_de_pb_0900_as_cs | utf8mb4 | 279 | | Yes | 0 | NO PAD |
| utf8mb4_is_0900_as_cs | utf8mb4 | 280 | | Yes | 0 | NO PAD |
| utf8mb4_lv_0900_as_cs | utf8mb4 | 281 | | Yes | 0 | NO PAD |
| utf8mb4_ro_0900_as_cs | utf8mb4 | 282 | | Yes | 0 | NO PAD |
| utf8mb4_sl_0900_as_cs | utf8mb4 | 283 | | Yes | 0 | NO PAD |
| utf8mb4_pl_0900_as_cs | utf8mb4 | 284 | | Yes | 0 | NO PAD |
| utf8mb4_et_0900_as_cs | utf8mb4 | 285 | | Yes | 0 | NO PAD |
| utf8mb4_es_0900_as_cs | utf8mb4 | 286 | | Yes | 0 | NO PAD |
| utf8mb4_sv_0900_as_cs | utf8mb4 | 287 | | Yes | 0 | NO PAD |
| utf8mb4_tr_0900_as_cs | utf8mb4 | 288 | | Yes | 0 | NO PAD |
| utf8mb4_cs_0900_as_cs | utf8mb4 | 289 | | Yes | 0 | NO PAD |
| utf8mb4_da_0900_as_cs | utf8mb4 | 290 | | Yes | 0 | NO PAD |
| utf8mb4_lt_0900_as_cs | utf8mb4 | 291 | | Yes | 0 | NO PAD |
| utf8mb4_sk_0900_as_cs | utf8mb4 | 292 | | Yes | 0 | NO PAD |
| utf8mb4_es_trad_0900_as_cs | utf8mb4 | 293 | | Yes | 0 | NO PAD |
| utf8mb4_la_0900_as_cs | utf8mb4 | 294 | | Yes | 0 | NO PAD |
| utf8mb4_eo_0900_as_cs | utf8mb4 | 296 | | Yes | 0 | NO PAD |
| utf8mb4_hu_0900_as_cs | utf8mb4 | 297 | | Yes | 0 | NO PAD |
| utf8mb4_hr_0900_as_cs | utf8mb4 | 298 | | Yes | 0 | NO PAD |
| utf8mb4_vi_0900_as_cs | utf8mb4 | 300 | | Yes | 0 | NO PAD |
| utf8mb4_ja_0900_as_cs | utf8mb4 | 303 | | Yes | 0 | NO PAD |
| utf8mb4_ja_0900_as_cs_ks | utf8mb4 | 304 | | Yes | 24 | NO PAD |
| utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD |
| utf8mb4_ru_0900_ai_ci | utf8mb4 | 306 | | Yes | 0 | NO PAD |
| utf8mb4_ru_0900_as_cs | utf8mb4 | 307 | | Yes | 0 | NO PAD |
+----------------------------+--------------------+-----+------------+-------------+---------+---------------+
73 rows in set (0.03 sec)

在IS_DEFAULT 标为YES的那一行,可以看到其默认collation set 是utf8mb4_0900_ai_ci。 注意后面的ai_ci, i是指insensitive,ai是指accent insensitive,cs s是指case insensitive. 也就是说默认的字符集校验不区分大小写,如果有区分大小写的校验需求可以使用utf8mb4_0900_as_cs。

下面做一些测试

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
mysql8[test]>select 'a' < 'A' collate utf8mb4_0900_as_cs;
+--------------------------------------+
| 'a' < 'A' collate utf8mb4_0900_as_cs |
+--------------------------------------+
| 1 |
+--------------------------------------+
1 row in set (0.00 sec)

mysql8[test]>select 'a' < 'A' collate utf8mb4_0900_ai_ci;
+--------------------------------------+
| 'a' < 'A' collate utf8mb4_0900_ai_ci |
+--------------------------------------+
| 0 |
+--------------------------------------+
1 row in set (0.00 sec)

mysql8[test]>select 'a' = 'A' collate utf8mb4_0900_ai_ci;
+--------------------------------------+
| 'a' = 'A' collate utf8mb4_0900_ai_ci |
+--------------------------------------+
| 1 |
+--------------------------------------+

可以看到使用collate utf8mb4_0900_ai_ci校验只有’a’=’A’是true,其余都是Fase,而使用utf8mb4_0900_as_cs校验’a’<’A’则是True . 其对比并非按ASCII码,因为按照ASCII, ‘a’是大于’A’的。 只有使用二进制字符集校验才使用ASCII,如下:

1
2
3
4
5
6
7
mysql8[(none)]>select 'a' < 'A' collate utf8mb4_bin;
+-------------------------------+
| 'a' < 'A' collate utf8mb4_bin |
+-------------------------------+
| 0 |
+-------------------------------+
1 row in set (0.00 sec)

这种情况需要了解,有时见到不一样的排序结果要知道是怎么回事。要深入理解其内部实现,请点击这里,内容太多,这里不详细赘述。

还记得以前登陆MySQL为各种编码而头疼吗? 什么数据库编码,连接编码,客户端编码,搞的晕头转向,一会应用进程读取正常,命令行不正常,一会儿命令行正常,应用进程读取的是乱码,真是无所适从啊。现在好了,MySQL 8 的到来使这些痛苦不再。utf8将会被载入历史,utf8mb4将会成为主流。当然,也有一个弊端,就是对每一个字符多用了一个字节来存储,这意味着你的数据库要增大不少,你是不是该扩容磁盘啦?!

参考资料

https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-utf8mb4/

http://mysqlserverteam.com/mysql-8-0-1-accent-and-case-sensitive-collations-for-utf8mb4/