La Vita è Bella

Monday, June 18, 2007

Add GBK encoding support to expat

expat is a good XML parser, light and quick. But it natively supports only US-ASCII, Latin-1 (ISO-8859-1), UTF-8 and UTF-16. If you want it to parse XML in any other encoding, you need to set an unknown encoding handler.

I use libiconv to convert the GBK string to Unicode for expat.

First, implement a function to pass to XML_SetUnknownEncodingHandler:

#include <expat.h>
#include <iconv.h>
#include <strings.h>    // strncasecmp
#include <cstdio>

int XML_Stream_Parser::xml_unknown_encoding(void* data, const char* name, XML_Encoding* info) {
        iconv_t cd;
        if(strncasecmp(name, "GB", 2) != 0 || (cd = iconv_open("UCS-2BE", name)) == (iconv_t)-1) {     // not GB, unsupported
                fprintf(stderr, "can't convert %s\n", name);
                return 0;
        }
        for(size_t i=0; i<128; i++) info->map[i] = i;        // single byte, maps to itself
        for(size_t i=128; i<256; i++) info->map[i] = -2;     // lead byte of a 2-byte sequence
        info->convert = XML_Stream_Parser::xml_convert_gb;
        info->release = XML_Stream_Parser::xml_convert_release;
        info->data = cd;
        return 1;
}

In this function, I tell expat that for GBK encoding, bytes 0~127 (ASCII) are left as is, and bytes 128~255 are each the first byte of a two-byte sequence (a map value of -2), so they need to be dealt with together with the next byte.
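
To make the map values concrete, here is a small sketch that fills the same table outside of expat and reads it back. fill_gbk_map and sequence_length are hypothetical helpers for illustration, not part of expat. The only expat convention relied on is that an entry >= 0 is a single-byte code point, -1 marks a malformed byte, and -n (n >= 2) marks the first byte of an n-byte sequence.

```cpp
#include <cstddef>

// Fill a 256-entry table the same way the handler above fills info->map.
// (Hypothetical helper; expat itself only ever sees the finished table.)
void fill_gbk_map(int map[256]) {
    for (std::size_t i = 0; i < 128; i++)
        map[i] = static_cast<int>(i);   // ASCII: the byte is its own code point
    for (std::size_t i = 128; i < 256; i++)
        map[i] = -2;                    // lead byte of a 2-byte sequence
}

// Number of bytes in the sequence that starts with `lead`:
// entry >= 0 -> 1 byte, entry == -n (n >= 2) -> n bytes,
// entry == -1 -> malformed, reported as 0 here.
int sequence_length(const int map[256], unsigned char lead) {
    int v = map[lead];
    if (v >= 0) return 1;
    if (v == -1) return 0;
    return -v;
}
```

With this table, 'A' is a one-byte sequence, while a byte such as 0xD6 tells the parser to hand that byte and the next one to the "convert" callback.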

Then implement the "convert" and "release" functions:

int XML_Stream_Parser::xml_convert_gb(void* data, const char* s) {
        const size_t out_initial = 4;
        size_t inbytesleft = 2, outbytesleft = out_initial;
        char out[out_initial], *outnew = out;
        // POSIX iconv takes char** for the input (older libiconv declared const char**)
        char *in = const_cast<char*>(s);
        size_t res = iconv((iconv_t)data, &in, &inbytesleft, &outnew, &outbytesleft);
        int ret = 0;
        if(res == (size_t)-1) {
                fprintf(stderr, "error in conversion\n");
                return '?';
        }
        // UCS-2BE output is big-endian: fold the bytes into one code point
        for(size_t i = 0; i < out_initial - outbytesleft; i++)
                ret = (ret<<8) + (unsigned char)out[i];
        return ret;
}

void XML_Stream_Parser::xml_convert_release(void* data) {
        iconv_close((iconv_t)data);     // close the descriptor opened in xml_unknown_encoding
}

In "convert", I use iconv to convert the two-byte GBK sequence to UCS-2BE, and return the resulting Unicode code point to expat.
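
The conversion step can also be tried on its own, without expat. The following is a minimal sketch, assuming an iconv implementation that knows both "GBK" and "UCS-2BE" (glibc and GNU libiconv do); gbk_pair_to_codepoint is a hypothetical name, and unlike the handler above it opens and closes its own descriptor on every call, purely for brevity.

```cpp
#include <iconv.h>
#include <cstddef>

// Convert one two-byte GBK sequence to a Unicode code point via iconv.
// Returns -1 if the converter cannot be opened or the bytes are invalid.
// (Hypothetical standalone helper, not part of expat or libiconv.)
int gbk_pair_to_codepoint(const char* s) {
    iconv_t cd = iconv_open("UCS-2BE", "GBK");
    if (cd == (iconv_t)-1) return -1;

    char in[2] = { s[0], s[1] };    // private copy, so no const_cast is needed
    char out[4];
    char* inp = in;
    char* outp = out;
    size_t inleft = sizeof in, outleft = sizeof out;

    size_t res = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (res == (size_t)-1) return -1;

    // UCS-2BE is big-endian: fold the produced bytes, high byte first.
    int cp = 0;
    for (size_t i = 0; i < sizeof out - outleft; i++)
        cp = (cp << 8) + (unsigned char)out[i];
    return cp;
}
```

For example, where GBK is supported, the bytes 0xD6 0xD0 (the character 中) should come back as 0x4E2D.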

The limitation of this interface is that it can't deal with 4-byte GB18030 codes: expat decides the sequence length from the first byte alone, and in GB18030 the first byte doesn't tell you whether a 2-byte or a 4-byte code follows.
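
The problem shows up as soon as you write down how a GB18030 decoder picks the sequence length. In the sketch below (gb18030_length is a hypothetical helper), a lead byte in 0x81~0xFE starts a 2-byte code when the second byte is 0x40~0x7E or 0x80~0xFE, and a 4-byte code when the second byte is 0x30~0x39. The decision needs the second byte, which a table indexed by the first byte cannot express.

```cpp
// Length in bytes of the GB18030 sequence that starts with b1, b2.
// Returns 0 for combinations that are not a valid start.
// (Hypothetical helper for illustration only.)
int gb18030_length(unsigned char b1, unsigned char b2) {
    if (b1 < 0x80) return 1;                  // single byte (ASCII range)
    if (b1 == 0x80 || b1 == 0xFF) return 0;   // treated as invalid lead bytes in this sketch
    if (b2 >= 0x30 && b2 <= 0x39) return 4;   // four-byte form
    if ((b2 >= 0x40 && b2 <= 0x7E) || (b2 >= 0x80 && b2 <= 0xFE))
        return 2;                             // two-byte form (the GBK range)
    return 0;
}
```

The same lead byte 0x81 gives length 2 with a second byte of 0x40 and length 4 with a second byte of 0x30.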

Anyway, I suggest that all XML should be encoded in UTF-8, so that none of this is needed :P

18:46:30 by fishy - dev

Revision: 1.2/1.2, last modified on 2007-07-03 @ 22:40.
