The Byte Order Mark (BOM) is the Unicode character U+FEFF. (It can also represent a Zero Width No-Break Space.) The byte-swapped value, U+FFFE, is a noncharacter and should never appear in a Unicode character stream. The BOM can therefore be used as the first character of a file (or, more generally, a string) as an indicator of endian-ness. With UTF-16, if the first 16-bit word is read as 0xFEFF, then the text has the same endian-ness as the machine reading it. If it is read as 0xFFFE, then the endian-ness is reversed and all 16-bit words should be byte-swapped as they are read in. In terms of raw bytes, FE FF at the start of the text marks big-endian order and FF FE marks little-endian order. In the same way, the BOM indicates the endian-ness of text encoded with UTF-32.
Note, however, that not all files start with a BOM. In fact, the Unicode Standard says that UTF-16 text that does not begin with a BOM, and whose byte order is not specified by a higher-level protocol, must be interpreted as big-endian.
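The rule can be sketched in code. The following is only a minimal illustration, not a complete decoder: the function name decodeUtf16Units is hypothetical, and it assumes the raw bytes are already in memory and that no higher-level protocol specifies the byte order. It reads the first two bytes, picks the byte order (falling back to big-endian when there is no BOM), and assembles the remaining bytes into 16-bit code units.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: turn raw UTF-16 bytes into 16-bit code units.
// Assumption: no higher-level protocol specifies the byte order.
std::vector<std::uint16_t> decodeUtf16Units(const std::vector<unsigned char>& bytes)
{
    bool bigEndian = true;   // default when the text has no BOM
    std::size_t pos = 0;

    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            pos = 2;             // big-endian BOM: skip it
        } else if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            bigEndian = false;   // little-endian BOM: skip it
            pos = 2;
        }
    }

    std::vector<std::uint16_t> units;
    // A trailing odd byte, if any, is ignored in this sketch.
    for (; pos + 1 < bytes.size(); pos += 2) {
        const unsigned hi = bigEndian ? bytes[pos] : bytes[pos + 1];
        const unsigned lo = bigEndian ? bytes[pos + 1] : bytes[pos];
        units.push_back(static_cast<std::uint16_t>((hi << 8) | lo));
    }
    return units;
}
```

Because the code assembles each unit from individual bytes, it gives the same result on big-endian and little-endian machines and never needs an explicit byte swap.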
The character U+FEFF also serves as an encoding signature for the Unicode Encoding Forms. The table below shows the encoding of U+FEFF in each of the Unicode encoding forms. Note that, by definition, text labeled as UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE should not have a BOM, because the endian-ness is already indicated in the label.
For text that is compressed with the SCSU (Standard Compression Scheme for Unicode) algorithm, there is also a recommended signature.
Encoding Form | BOM Encoding |
UTF-8 | EF BB BF |
UTF-16 (big-endian) | FE FF |
UTF-16 (little-endian) | FF FE |
UTF-16BE, UTF-32BE | No BOM! |
UTF-16LE, UTF-32LE | No BOM! |
UTF-32 (big-endian) | 00 00 FE FF |
UTF-32 (little-endian) | FF FE 00 00 |
SCSU | 0E FE FF |
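The table can be turned into a small signature sniffer. This is a rough sketch rather than a definitive detector: the function name detectSignature and the returned labels are made up for illustration. The one subtle point is ordering: the UTF-32 little-endian signature FF FE 00 00 begins with the UTF-16 little-endian signature FF FE, so the 4-byte patterns have to be tested first.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: name the encoding signature at the start of a buffer.
// Returns an empty string when no known signature is present.
std::string detectSignature(const unsigned char* data, std::size_t size)
{
    // Test 4-byte signatures first: FF FE 00 00 (UTF-32) starts with FF FE (UTF-16).
    if (size >= 4 && data[0] == 0x00 && data[1] == 0x00 && data[2] == 0xFE && data[3] == 0xFF)
        return "UTF-32 (big-endian)";
    if (size >= 4 && data[0] == 0xFF && data[1] == 0xFE && data[2] == 0x00 && data[3] == 0x00)
        return "UTF-32 (little-endian)";
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return "UTF-8";
    if (size >= 3 && data[0] == 0x0E && data[1] == 0xFE && data[2] == 0xFF)
        return "SCSU";
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return "UTF-16 (big-endian)";
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return "UTF-16 (little-endian)";
    return "";
}
```

A signature is only a hint. For example, FF FE 00 00 could also be little-endian UTF-16 text that begins with a BOM followed by U+0000, so a caller that knows the expected encoding should prefer that knowledge over the sniffed result.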
Solutions:
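Case 1: using ANSI std::ofstream:
#include <fstream>
// wchar_t is 16 bits on Windows but 32 bits on most Unix-like systems,
// so sizeof(wchar_t) bytes are written, in the machine's native byte order.
wchar_t BOM = 0xFEFF;
std::ofstream outFile("filename.dat", std::ios::out | std::ios::binary);
outFile.write((char *) &BOM, sizeof(wchar_t));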
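Case 2: using ANSI std::wofstream:
const wchar_t BOM = 0xFEFF;
const char *fname = "abc.txt";
std::wofstream wfout;
wfout.open(fname, std::ios_base::binary);
// Note: a wide stream converts each wchar_t through its locale's codecvt facet.
// With the default "C" locale, U+FEFF may fail to convert and set failbit,
// so imbue a suitable locale before writing.
//S1:
wfout << BOM;
//S2:
//wfout.put(BOM);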
1 comment:
It seems the values for UTF-16LE and UTF-16BE should be exchanged. I mean:
0xFEFF (UTF-16LE)
0xFFFE (UTF-16BE)