GDAL 1.9 and Unicode issues

29.09.2011 08:25 ·  GIS  ·  gdal

GDAL is gradually moving towards full Unicode support: RFC 23 has already been implemented, and there are ongoing efforts to implement RFC 5.

Another step was recoding Shapefile attributes to UTF-8 on read and from UTF-8 on write. There is a catch, though: the encoding is determined by reading the LDID (Language Driver ID) from the DBF header. In general this is the right approach, but I can’t remember the last time I saw a shapefile with the encoding specified correctly. Most files have the LDID set to 87, which corresponds to the default value.
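To see what a particular file actually declares, the LDID can be read directly: in the usual dBASE-style header it is a single byte at offset 29. Below is a minimal sketch, assuming that standard header layout. The names `read_ldid` and `KNOWN_LDIDS` are made up for illustration and are not part of GDAL, and the lookup table covers only a handful of common values.

```python
# Hypothetical helper, not part of GDAL: inspect the LDID byte of a .dbf file.

# A few well-known LDID values mapped to Python codec names (deliberately incomplete).
# 87 (0x57) means "current ANSI code page", so the real encoding is unknown.
KNOWN_LDIDS = {
    0x01: "cp437",   # US MS-DOS
    0x02: "cp850",   # International MS-DOS
    0x03: "cp1252",  # Windows ANSI
    0x57: None,      # ANSI default: depends on the machine that wrote the file
    0x65: "cp866",   # Russian MS-DOS
    0xC8: "cp1250",  # Eastern European Windows
    0xC9: "cp1251",  # Russian Windows
}


def read_ldid(dbf_path):
    """Return the Language Driver ID stored at byte 29 of the DBF header."""
    with open(dbf_path, "rb") as f:
        header = f.read(32)  # the fixed part of the DBF header is 32 bytes long
    if len(header) < 32:
        raise ValueError("file is too short to contain a DBF header")
    return header[29]
```

A call such as `KNOWN_LDIDS.get(read_ldid("roads.dbf"))` (the file name is just a placeholder) returns a codec name, or `None` when the header gives no usable answer, which is exactly the LDID 87 case described above.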

This is where it gets interesting. The default is obviously different for everyone, since it depends on the system that produced the file. Yet in the current implementation the LDID/87 value is interpreted as ISO8859_1 (Latin-1). The problem with this approach is, I think, clear to everyone: attributes written in any other encoding are recoded incorrectly. The proposed solution is either to edit the existing files and set the correct encoding in the DBF header, or to override the interpretation of the LDID value by setting an environment variable. The first method requires more work (you have to find out the encoding of each shapefile and write it into the DBF header), but it is also the most correct one. The second is really an ugly workaround: only files in the single overridden encoding will be read correctly, and all others will still show up as unreadable garbage, which is unacceptable when working with data in different encodings.
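The first method amounts to patching one byte per file. Here is a rough sketch, assuming you have already worked out the real encoding of the file and that the LDID sits at offset 29 of the DBF header. The function name `write_ldid` is hypothetical, and it is worth keeping a backup before modifying anything in place.

```python
# Hypothetical helper, not part of GDAL: set the LDID byte of a .dbf file in place.

def write_ldid(dbf_path, ldid):
    """Overwrite the LDID at byte 29 of the DBF header and return the previous value."""
    if not 0 <= ldid <= 255:
        raise ValueError("LDID must fit in a single byte")
    with open(dbf_path, "r+b") as f:
        header = f.read(32)  # fixed-size part of the DBF header
        if len(header) < 32:
            raise ValueError("file is too short to contain a DBF header")
        previous = header[29]
        f.seek(29)                # jump to the language driver byte
        f.write(bytes([ldid]))    # e.g. 0xC9 (201) for Windows-1251
    return previous
```

For example, `write_ldid("roads.dbf", 0xC9)` would mark a (placeholder) file as Windows-1251. For comparison, the override route is a single process-wide setting; as far as I know, in GDAL this is the `SHAPE_ENCODING` configuration option (for example `SHAPE_ENCODING=CP1251`), but check the Shapefile driver documentation for your version before relying on that name.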
