hgwebdir, Apache, CGI and character encodings (and the Wiki)

Paul Boddie paul.boddie at biotek.uio.no
Thu Nov 12 09:13:27 CST 2009


Apologies for following up to myself here, but the problem has been 
solved. The nasty details are below.

Paul Boddie wrote:
> This makes sense: the metadata will always be transmitted in the form 
> in which it was entered. However, this doesn't explain why metadata 
> gets sent as UTF-8 by hgwebdir even when I specify HGENCODING as 
> ISO-8859-1.

Here, it appears that hgwebdir was doing the right thing all along. I 
had been under the impression that files were being written as 
ISO-8859-1, and creating a new file in vim does exactly this. However, 
my commit messages were not being saved as ISO-8859-1: I verified this 
by saving a copy of the message file and inspecting it after the commit, 
and vim had saved it as UTF-8. Mercurial then treated those UTF-8 bytes 
as ISO-8859-1 text, so the metadata ended up with each non-ASCII 
character UTF-8 encoded twice. When displaying the metadata, hg (and 
hgwebdir) encoded the characters to ISO-8859-1, but this merely 
reproduced the original UTF-8 byte stream (not the anticipated 
ISO-8859-1 stream).
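
For anyone wanting to reproduce the effect without touching a 
repository, here is a minimal Python sketch of the round trip described 
above. It is not Mercurial's actual code; the sample message and the 
variable names are made up for illustration, and it assumes only that 
the metadata is stored internally as UTF-8 and converted to the local 
encoding for output.

```python
# Illustrative only: mimics the encoding round trip, not Mercurial's code.
message = u"bl\u00e5b\u00e6r"  # hypothetical commit message with non-ASCII text

# The editor writes the message as UTF-8.
editor_bytes = message.encode("utf-8")

# The bytes are then interpreted as ISO-8859-1, so every byte
# becomes one character.
misread = editor_bytes.decode("iso-8859-1")

# Stored as UTF-8: the non-ASCII characters are now encoded twice.
stored = misread.encode("utf-8")

# On output, the stored text is converted back to the local encoding.
displayed = stored.decode("utf-8").encode("iso-8859-1")

# The "ISO-8859-1" output is byte-for-byte the original UTF-8 stream.
print(displayed == editor_bytes)
print(len(editor_bytes), len(stored))
```

The final comparison shows why the output looked like UTF-8 even with 
HGENCODING set to ISO-8859-1: the conversion was applied correctly, but 
to text that had been misread on the way in.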

[...]

> It's quite possible that I'm exposing all kinds of weird behaviour 
> here - things which aren't likely to arise in the UTF-8-only example 
> given above - and I suppose I should try this out at home where my 
> locale is also an ISO-8859-1 variant, but I'm not yet convinced that 
> hgwebdir is doing the right thing, and when I first wrote to the list 
> I was more confident that hg was doing the right thing, but now I'm 
> not so sure about that, either.

I am now convinced that hg and hgwebdir are doing the right thing. At 
home, all my locale definitions use ISO-8859-15 as the character set and 
everything works fine. At work, where I have the problem, LC_CTYPE is 
set to "no_NO" (which is an ISO-8859-1 encoding, as reported by Python), 
but LANG and the other LC_ variables (except LC_ALL) are set to 
"en_US.UTF-8". Here's where it gets really nasty:

1. vim appears to look at the LANG variable and sees UTF-8 as the encoding.

2. vim reads the /etc/vimrc file which, on this Red Hat Enterprise Linux 
installation (possibly modified in my organisation), has the following 
logic:

  if v:lang =~ "utf8$" || v:lang =~ "UTF-8$"
     set fileencodings=utf-8,latin1
  endif

3. Thinking that UTF-8 is the encoding, vim reads the commit message 
template and chooses UTF-8 as the output encoding. Note that a different 
setting (fileencoding) determines the actual output encoding for new 
files (not fileencodings), which is why I can save new files using the 
ISO-8859-1 encoding.

4. Mercurial uses Python to work out the encoding (presumably using 
LC_CTYPE) and sees ISO-8859-1 as the encoding.

5. Mercurial has now read in bytes from a UTF-8 byte sequence, treating 
each of them as an ISO-8859-1 character value.
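
The disagreement in the steps above can be sketched in Python. This is 
only a model of the usual POSIX precedence (LC_ALL, then the category's 
own variable, then LANG), not what vim or Mercurial literally execute. 
The helper name effective_locale is made up, and it is my assumption 
that vim's v:lang follows the messages locale (which falls back to LANG 
here) while Python's encoding detection follows LC_CTYPE.

```python
import re


def effective_locale(env, category):
    """Return the locale name for a category using POSIX precedence:
    LC_ALL first, then the category variable, then LANG."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# The environment described above: LC_ALL unset, LC_CTYPE differs from LANG.
env = {"LANG": "en_US.UTF-8", "LC_CTYPE": "no_NO"}

ctype = effective_locale(env, "LC_CTYPE")        # what Python would consult
messages = effective_locale(env, "LC_MESSAGES")  # assumed source of v:lang

# Same test as the vimrc fragment: does the name end in utf8 or UTF-8?
vimrc_sees_utf8 = bool(re.search(r"(utf8|UTF-8)$", messages))

print(ctype, messages, vimrc_sees_utf8)
```

With LC_CTYPE and LANG naming different character sets, the two tools 
reach different conclusions from the same environment.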

So, this is really some kind of configuration issue on more than one 
level, unfortunately, and I now have to find out why my system is set up 
in this way.

Once again, my apologies for wasting people's time with this matter. 
I've learned some new things about how the different tools deduce 
encodings, and I'll probably write up a summary of this somewhere, 
for anyone else saddled with such bizarre locale configurations.

Paul

