La Vita è Bella

2007-06-18

Add GBK encoding support to expat

expat is a good XML parser, light and quick. But natively it only supports US-ASCII, Latin-1, UTF-8 and UTF-16; if you want to use it to parse XML in any other encoding, you need to set an unknown encoding handler.

I use libiconv to convert the GBK string to Unicode for expat.

First, implement a function to pass to XML_SetUnknownEncodingHandler:

int
XML_Stream_Parser::xml_unknown_encoding(void* data, const char* name, XML_Encoding* info) {
        iconv_t cd;
        // only GB* encodings are handled here, and iconv has to know the name as well
        if(strncasecmp(name, "GB", 2) != 0 || (cd = iconv_open("UCS-2BE", name)) == (iconv_t)-1) {
                fprintf(stderr, "can't convert %s\n", name);
                return 0;
        }
        for(int i=0; i<128; i++) info->map[i] = i;      // bytes 0~127 map to themselves
        for(int i=128; i<256; i++) info->map[i] = -2;   // first byte of a 2-byte sequence
        info->convert = XML_Stream_Parser::xml_convert_gb;
        info->release = XML_Stream_Parser::xml_convert_release;
        info->data = cd;
        return 1;
}

In this function, I tell expat that for GBK encoding, bytes 0~127 are plain ASCII and are left as is, while each byte in 128~255 is the first byte of a 2-byte sequence and has to be dealt with together with the next byte.
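The map above treats every byte from 128 to 255 as a lead byte. As a sketch of a slightly stricter table (not what my code does): in GBK only 0x81~0xFE are lead bytes, so 0x80 and 0xFF could be marked invalid with -1. The helper name fill_gbk_map is made up for illustration, and rejecting 0x80 is an assumption, since some code pages (Windows cp936, for example) assign it to the euro sign.

```cpp
// Hypothetical helper: fills a 256-entry table laid out like XML_Encoding::map.
//   value >= 0 : the byte is a complete character with that code point
//   -1         : the byte can't start a character
//   -2         : the byte starts a 2-byte sequence
void fill_gbk_map(int (&map)[256]) {
    for (int i = 0; i < 128; i++) map[i] = i;        // ASCII maps to itself
    map[0x80] = -1;                                  // assumption: not a GBK lead byte
    for (int i = 0x81; i <= 0xFE; i++) map[i] = -2;  // GBK lead bytes
    map[0xFF] = -1;                                  // never valid in GBK
}
```

Inside the handler it would be called as fill_gbk_map(info->map), replacing the two loops.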

Then implement the "convert" and "release" functions:

int
XML_Stream_Parser::xml_convert_gb(void* data, const char* s) {
        iconv_t cd = static_cast<iconv_t>(data);
        char out[4];                            // UCS-2BE output only needs 2 bytes
        // POSIX iconv takes char**; older libiconv builds declare const char** and accept &s directly
        char *in = const_cast<char*>(s), *outnew = out;
        size_t inbytesleft = 2, outbytesleft = sizeof(out);
        if(iconv(cd, &in, &inbytesleft, &outnew, &outbytesleft) == (size_t)-1) {
                fprintf(stderr, "error in conversion\n");
                return '?';                     // substitute '?'; return -1 instead to make expat reject the input
        }
        // fold the big-endian output bytes into one code point
        int ret = 0;
        for(size_t i = 0; i < sizeof(out) - outbytesleft; i++)
                ret = (ret<<8) + (unsigned char)out[i];
        return ret;
}

void
XML_Stream_Parser::xml_convert_release(void* data) {
        iconv_close(static_cast<iconv_t>(data));
}

In "convert", I use iconv to convert the 2-byte sequence to Unicode (big-endian UCS-2), and return the resulting code point to expat.
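The reason for asking iconv for "UCS-2BE" rather than plain "UCS-2" is the loop at the end of "convert": it assumes the most significant byte comes first. A minimal sketch of that folding step on its own (fold_big_endian is a made-up name, not part of expat or iconv):

```cpp
#include <cstddef>

// Hypothetical helper: folds big-endian bytes into one integer,
// most significant byte first, like the loop at the end of "convert".
int fold_big_endian(const unsigned char* bytes, std::size_t len) {
    int value = 0;
    for (std::size_t i = 0; i < len; i++)
        value = (value << 8) + bytes[i];
    return value;
}
```

For example, the character U+4E2D is D6 D0 in GBK; iconv writes the two bytes 4E 2D for it, and folding them gives 0x4E2D.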

The limitation of this interface is that it can't deal with 4-byte GB18030 codes: expat has to know the length of a sequence from its first byte alone, and the first byte of a GB18030 code doesn't tell whether it starts a 2-byte or a 4-byte code.
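To show why the first byte isn't enough, here is a rough sketch of how the length of a GB18030 sequence could be worked out when the second byte is also available. The function name is made up, and it only looks at the first two bytes (the third and fourth bytes of a 4-byte code are not validated).

```cpp
// Hypothetical helper: length of the GB18030 sequence that starts with b1, b2.
// Returns 1, 2 or 4, or -1 if the pair can't start a valid sequence.
int gb18030_sequence_length(unsigned char b1, unsigned char b2) {
    if (b1 < 0x80) return 1;                                // ASCII
    if (b1 == 0x80 || b1 == 0xFF) return -1;                // not a lead byte
    if (b2 >= 0x30 && b2 <= 0x39) return 4;                 // 4-byte code
    if (b2 >= 0x40 && b2 != 0x7F && b2 != 0xFF) return 2;   // 2-byte code
    return -1;                                              // invalid second byte
}
```

The same lead byte 0x81 gives 2 when followed by 0x40 and 4 when followed by 0x30, and expat's map has room for only one answer per lead byte.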

Anyway, I suggest that all XML be encoded in UTF-8, so that none of this is needed :P

18:46:30 by fishy