Add GBK encoding support to expat
expat is a good XML parser, light and quick. But it only supports US-ASCII, ISO-8859-1 (latin1), UTF-8 and UTF-16 natively. If you want to use it to parse XML in any other encoding, you need to set an unknown encoding handler.
I use libiconv to convert the GBK string to Unicode for expat.
First, implement a function to pass to XML_SetUnknownEncodingHandler:
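#include <cstdio>     // fprintf
#include <strings.h>  // strncasecmp
#include <expat.h>
#include <iconv.h>

// Assumed to be declared static in the class, so it can be passed to expat as a plain callback.
int
XML_Stream_Parser::xml_unknown_encoding(void* data, const char* name, XML_Encoding* info) {
    // Only accept encoding names starting with "GB" that iconv can convert to UCS-2BE.
    iconv_t cd;
    if(strncasecmp(name, "GB", 2) != 0 || (cd = iconv_open("UCS-2BE", name)) == (iconv_t)-1) {
        fprintf(stderr, "can't convert %s\n", name);
        return 0;   // tell expat the encoding is unsupported
    }
    for(int i=0; i<128; i++) info->map[i] = i;     // single byte, maps to itself
    for(int i=128; i<256; i++) info->map[i] = -2;  // first byte of a 2-byte sequence
    info->convert = XML_Stream_Parser::xml_convert_gb;
    info->release = XML_Stream_Parser::xml_convert_release;
    info->data = cd;   // handed back to convert and release
    return 1;
}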
In this function, I tell expat that for GBK encoding, bytes 0~127 (ASCII) are left as is, and each byte in 128~255 is the first byte of a 2-byte sequence, which has to be handled together with the next byte by the "convert" function.
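As a rough sketch of that table on its own, here is a stricter variant. It assumes GBK lead bytes are 0x81~0xFE and marks 0x80 and 0xFF with -1, which expat treats as a malformed byte. The helper name make_gbk_map is made up for illustration, and the handler above does not do this extra check.

```cpp
#include <array>

// Hypothetical helper: builds the 256-entry table in the layout of XML_Encoding::map.
std::array<int, 256> make_gbk_map() {
    std::array<int, 256> map{};
    for (int b = 0; b < 256; ++b) {
        if (b < 0x80)
            map[b] = b;    // ASCII: the byte value is the code point
        else if (b >= 0x81 && b <= 0xFE)
            map[b] = -2;   // lead byte of a 2-byte sequence
        else
            map[b] = -1;   // 0x80 and 0xFF: not a valid lead byte (assumption: strict GBK)
    }
    return map;
}
```

The entries could then be copied into info->map inside the handler instead of the two loops.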
Then implement the "convert" and "release" functions:
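int
XML_Stream_Parser::xml_convert_gb(void* data, const char* s) {
    // expat calls this with s pointing at a 2-byte sequence (map value -2).
    iconv_t cd = (iconv_t)data;
    char out[4];                      // one UCS-2BE unit needs 2 bytes; 4 leaves headroom
    char *in = const_cast<char*>(s);  // iconv only reads the input; some iconv headers want const char** instead
    char *outnew = out;
    size_t inbytesleft = 2, outbytesleft = sizeof(out);
    size_t res = iconv(cd, &in, &inbytesleft, &outnew, &outbytesleft);
    if(res == (size_t)-1) {
        fprintf(stderr, "error in conversion\n");
        return '?';   // substitute '?'; return -1 instead to make expat reject the sequence as malformed
    }
    // Assemble the big-endian output bytes into one code point.
    int ret = 0;
    for(size_t i = 0; i < sizeof(out) - outbytesleft; i++)
        ret = (ret<<8) + (unsigned char)out[i];
    return ret;
}
void
XML_Stream_Parser::xml_convert_release(void* data) {
    iconv_close((iconv_t)data);   // expat calls this when it is done with the encoding
}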
In "convert", I use iconv to convert the 2-byte sequence to Unicode (UCS-2BE), and return the resulting code point to expat.
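To try the conversion step without expat, something like the following should work. It is only a sketch: the helper name gbk_pair_to_codepoint is made up, it assumes your iconv knows the names "GBK" and "UCS-2BE" and declares the input as char**, and it opens a new descriptor on every call, which is fine for a test but too slow for real parsing. On systems where iconv is a separate library you may need to link with -liconv.

```cpp
#include <iconv.h>
#include <cstddef>

// Hypothetical helper: converts one 2-byte GBK sequence to a Unicode code point.
// Returns -1 if iconv does not know "GBK" or the bytes are not a valid sequence.
int gbk_pair_to_codepoint(const char* pair) {
    iconv_t cd = iconv_open("UCS-2BE", "GBK");
    if (cd == (iconv_t)-1) return -1;

    char in[2] = { pair[0], pair[1] };  // local copy, so a non-const char* can be passed
    char out[2];
    char* inp = in;
    char* outp = out;
    size_t inleft = sizeof(in), outleft = sizeof(out);

    size_t res = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (res == (size_t)-1 || outleft != 0) return -1;

    // UCS-2BE is big-endian: the high byte comes first.
    return ((unsigned char)out[0] << 8) | (unsigned char)out[1];
}
```

For example, the bytes 0xD6 0xD0 are the GBK encoding of 中, so gbk_pair_to_codepoint("\xD6\xD0") is expected to give 0x4E2D when the GBK converter is available.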
The limitation of this interface is that it can't deal with 4-byte GB18030 codes: the map is indexed by the first byte only, and I can't tell whether a sequence is 2 or 4 bytes long just from its first byte.
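To make the problem concrete, here is a small sketch of how the length of a GB18030 sequence is decided. The helper name gb18030_sequence_length is made up for illustration. It relies on the byte ranges of GB18030: a second byte in 0x30~0x39 means a 4-byte code, a second byte in 0x40~0xFE (except 0x7F) means a 2-byte code.

```cpp
// Hypothetical helper: length in bytes of the GB18030 sequence that starts with b1, b2.
// Returns 0 if the two bytes cannot start a valid sequence.
int gb18030_sequence_length(unsigned char b1, unsigned char b2) {
    if (b1 < 0x80) return 1;                  // ASCII, the second byte is not needed
    if (b1 < 0x81 || b1 > 0xFE) return 0;     // 0x80 and 0xFF are not lead bytes
    if (b2 >= 0x30 && b2 <= 0x39) return 4;   // digit-range second byte: 4-byte code
    if (b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F) return 2;  // GBK-style 2-byte code
    return 0;
}
```

The same first byte 0x81 gives 2 with a second byte of 0x40 and 4 with a second byte of 0x30, but the map has only one entry per first byte, so it can't express both.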
Anyway, I suggest that all XML should be encoded in UTF-8, so that none of this is needed :P