hgwebdir, Apache, CGI and character encodings (and the Wiki)
Paul Boddie
paul.boddie at biotek.uio.no
Thu Nov 12 09:13:27 CST 2009
Apologies for following up to myself here, but the problem has been
solved. The nasty details are below.
Paul Boddie wrote:
> This makes sense: the metadata will always be transmitted in the form
> in which it was entered. However, this doesn't explain why metadata
> gets sent as UTF-8 by hgwebdir even when I specify HGENCODING as
> ISO-8859-1.
Here, it appears that hgwebdir was doing the right thing all along. I
had been under the impression that files were being written as
ISO-8859-1, and creating a new file in vim does exactly this. However,
my commit messages were not being saved as ISO-8859-1, and I verified
this by saving a copy of the file and looking at it after the commit:
vim was saving the file as UTF-8. So, the metadata ended up having each
character UTF-8 encoded twice, and when displaying the metadata, hg (and
hgwebdir) were encoding the characters to ISO-8859-1, but this merely
reproduced a UTF-8 stream (not the anticipated ISO-8859-1 stream).
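
To make the double encoding concrete, here is a rough Python sketch of the round trip as I understand it. It is only an illustration, not Mercurial's actual code: the sample text and variable names are made up, and it assumes that metadata is stored internally as UTF-8 and converted to the local encoding for display.

```python
# Sample commit message text: the Norwegian letters ae, o-slash, a-ring.
original = "\u00e6\u00f8\u00e5"

# 1. The editor saves the commit message as UTF-8 (two bytes per letter).
saved_bytes = original.encode("utf-8")

# 2. The tool believes the local encoding is ISO-8859-1, so every byte is
#    taken to be one ISO-8859-1 character.
misread = saved_bytes.decode("iso-8859-1")

# 3. The misread text is stored as UTF-8: each character is now encoded twice.
stored = misread.encode("utf-8")

# 4. For display, the stored text is converted back to the local encoding.
displayed = stored.decode("utf-8").encode("iso-8859-1")

# The output is the editor's UTF-8 byte stream, not ISO-8859-1.
print(displayed == saved_bytes)
print(displayed == original.encode("iso-8859-1"))
```

The conversions are each correct in isolation; the only wrong input is the assumption in step 2 about how the bytes were written.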
[...]
> It's quite possible that I'm exposing all kinds of weird behaviour
> here - things which aren't likely to arise in the UTF-8-only example
> given above - and I suppose I should try this out at home where my
> locale is also an ISO-8859-1 variant, but I'm not yet convinced that
> hgwebdir is doing the right thing, and when I first wrote to the list
> I was more confident that hg was doing the right thing, but now I'm
> not so sure about that, either.
I am now convinced that hg and hgwebdir are doing the right thing. At
home, all my locale definitions use ISO-8859-15 as the character set and
everything works fine. At work, where I have the problem, LC_CTYPE was
set to "no_NO" (which is an ISO-8859-1 encoding, as reported by Python),
but LANG and other LC_ variables (except LC_ALL) are set to
"en_US.UTF-8". Here's where it gets really nasty:
1. vim appears to look at the LANG variable and sees UTF-8 as the encoding.
2. vim reads the /etc/vimrc file which, on this Red Hat Enterprise Linux
installation (possibly modified in my organisation), has the following
logic:
    if v:lang =~ "utf8$" || v:lang =~ "UTF-8$"
        set fileencodings=utf-8,latin1
    endif
3. Thinking that UTF-8 is the encoding, vim reads the commit message
template and chooses UTF-8 as the output encoding. Note that a different
setting (fileencoding) determines the actual output encoding for new
files (not fileencodings), which is why I can save new files using the
ISO-8859-1 encoding.
4. Mercurial uses Python to work out the encoding (presumably using
LC_CTYPE) and sees ISO-8859-1 as the encoding.
5. Mercurial has now read in bytes from a UTF-8 byte sequence, treating
each of them as an ISO-8859-1 character value.
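
The disagreement in the steps above can be sketched in Python. This is a simplified model and not what either tool literally runs: effective_locale is a hypothetical helper that applies the usual POSIX precedence (LC_ALL, then the category variable, then LANG), the environment dictionary just mirrors my settings at work, and it assumes that vim's v:lang follows the messages locale whereas Python consults LC_CTYPE.

```python
import re

def effective_locale(env, category):
    """Return the locale name in effect for a category (hypothetical helper).

    Follows the usual POSIX precedence: LC_ALL, then the category's own
    variable, then LANG, falling back to "C" if none is set.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"

# Mirrors the environment described above: LC_ALL unset, LC_CTYPE odd one out.
work_env = {
    "LANG": "en_US.UTF-8",
    "LC_MESSAGES": "en_US.UTF-8",
    "LC_CTYPE": "no_NO",
}

# What the vimrc test sees (assumption: v:lang reflects the messages locale).
vim_lang = effective_locale(work_env, "LC_MESSAGES")
vim_thinks_utf8 = bool(re.search(r"(utf8|UTF-8)$", vim_lang))

# What a tool consulting LC_CTYPE sees.
ctype_locale = effective_locale(work_env, "LC_CTYPE")

print(vim_lang, vim_thinks_utf8)
print(ctype_locale)
```

With LC_ALL unset, the two lookups give different answers for the same environment, which is all it takes for the editor and Mercurial to disagree about the encoding.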
So, this is really some kind of configuration issue on more than one
level, unfortunately, and I now have to find out why my system is set up
in this way.
Once again, my apologies for wasting people's time with this matter.
I've learned some new things about how encodings are deduced by the
different tools, and I'll probably write up a summary of this somewhere
for anyone else saddled with such bizarre locale configurations.
Paul