軟件一般采用三種方式來決定文本的字符集和編碼:
檢測文件頭標識,提示用戶選擇,根據一定的規則猜測
最標準的途徑是檢測文本最開頭的幾個字節,開頭字節Charset/encoding,如下表:
EF BB BF UTF-8
FE FF UTF-16/UCS-2, little endian
FF FE UTF-16/UCS-2, big endian
FF FE 00 00 UTF-32/UCS-4, little endian.
00 00 FE FF UTF-32/UCS-4, big-endian.