hgwebdir, Apache, CGI and character encodings (and the Wiki)
Paul Boddie
paul.boddie at biotek.uio.no
Wed Nov 11 06:26:00 CST 2009
Matt Mackall wrote:
> Mercurial has two categories of data: metadata (commit message, user,
> etc), and data (filenames and contents). Metadata is stored in UTF-8 and
> sent/received in the local encoding while data is always
> stored/transmitted exactly as is.
>
This is more or less what I thought.
> With hgwebdir, we should be picking up the system's local encoding by
> default and using that for declaring page encoding and transcoding
> metadata. You should choose your webserver encoding so that it matches
> the encoding of as-is data for best results.
>
This makes sense: the metadata will always be transmitted in the form in
which it was entered. However, this doesn't explain why metadata gets
sent as UTF-8 by hgwebdir even when I specify HGENCODING as ISO-8859-1.
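
For illustration, here is a minimal Python sketch of the selection logic as I understand it from the description above: an explicit HGENCODING wins, otherwise the locale's encoding is used, and the result is what gets declared in the page header. The function names are hypothetical, and the fallback via locale.getpreferredencoding() is only an approximation of mine, not Mercurial's actual detection code.

```python
import locale

def pick_encoding(environ):
    """Return the charset to declare: HGENCODING if set, else the locale's.

    Assumption: this mirrors the behaviour described in the thread,
    not Mercurial's real implementation.
    """
    override = environ.get("HGENCODING")
    if override:
        return override
    # Fall back to whatever the process locale reports, or ASCII.
    return locale.getpreferredencoding(False) or "ascii"

def content_type_header(encoding):
    """Build the header line a CGI script would emit for an HTML page."""
    return "Content-Type: text/html; charset=%s" % encoding

# With the override set, I would expect the declared charset to follow it.
print(content_type_header(pick_encoding({"HGENCODING": "ISO-8859-1"})))
```

If hgwebdir behaved like this sketch, setting HGENCODING to ISO-8859-1 would yield a charset=ISO-8859-1 header, which is not what I am seeing.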
> Here's a data/metadata example:
>
> http://www.selenic.com/hg/rev/f1ed441ab8e9
>
> The server sets LC_CTYPE to en_US.UTF-8, hgweb picks this up and sends
>
> Content-Type: text/html; charset=UTF-8
>
> And properly presents the metadata in the header. By convention, the
> data in this repo (such as translations) is also stored in UTF-8, so
> that also presents nicely. If I change the LC_CTYPE passed to hgweb to
> ASCII, I get "B?ckman" in the metadata and nonsense characters in the
> body. If I set HGENCODING to Latin1, I get "Bäckman" and nonsense in the
> body.
>
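
The cases in the quoted example can be reproduced with plain Python codecs. This is only a sketch: to_local is a hypothetical helper, and the "replace" error policy is my assumption about where the "?" comes from. The name "Bäckman" is taken from the example above.

```python
# Metadata as stored: UTF-8 bytes (per the quoted description).
stored = "Bäckman".encode("utf-8")

def to_local(stored_bytes, local_encoding):
    """Transcode UTF-8 metadata to the local encoding, replacing
    characters the target cannot represent (assumed error policy)."""
    return stored_bytes.decode("utf-8").encode(local_encoding, "replace")

print(to_local(stored, "ascii"))    # b'B?ckman'
print(to_local(stored, "latin-1"))  # b'B\xe4ckman', i.e. a Latin-1 "ä"

# Data is sent exactly as-is. If the page is declared as Latin-1, the
# browser decodes the untouched UTF-8 bytes with the wrong codec and
# shows two nonsense characters in place of "ä".
garbled = stored.decode("latin-1")
print(ascii(garbled))               # 'B\xc3\xa4ckman'
```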
If I set LC_CTYPE to "en_US.ISO-8859-1" (or "no_NO" - my actual locale)
in the hgwebdir.cgi script even before any imports of mercurial
packages, it doesn't seem to have any effect (I see "??" instead of "é",
indicating that Mercurial has read the UTF-8-encoded metadata using an
ASCII codec), although I'm willing to believe that I may have to work
harder to persuade the locale to change. Currently, I'm struggling to
persuade hg to produce readable metadata in my terminal window (Konsole,
KDE 3.5.4) without switching the window's encoding to UTF-8, even though
the locale uses ISO-8859-1 and data in that encoding gets displayed
without problems. I've just tried this with the tip of the latest stable
branch (1.3.1+401-a40ec11795c3).
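
My reading of "??" can be checked with a small Python sketch. It only illustrates the symptom: the function names are hypothetical and none of this is Mercurial's code. The point is that one "?" per character suggests the input was decoded correctly but the output encoding was ASCII, whereas two "?" for "é" suggest each UTF-8 byte was treated as a separate character.

```python
# "é" is stored as two bytes in UTF-8.
stored = "é".encode("utf-8")  # b'\xc3\xa9'

def via_ascii_codec(raw):
    """Wrong input codec: every non-ASCII byte is replaced separately."""
    return raw.decode("ascii", "replace").encode("ascii", "replace")

def via_utf8_codec(raw, local_encoding):
    """Right input codec; output limited by the local encoding."""
    return raw.decode("utf-8").encode(local_encoding, "replace")

print(via_ascii_codec(stored))               # b'??'
print(via_utf8_codec(stored, "ascii"))       # b'?'
print(via_utf8_codec(stored, "iso-8859-1"))  # b'\xe9'
```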
It's quite possible that I'm exposing all kinds of weird behaviour here
- things which aren't likely to arise in the UTF-8-only example given
above - and I suppose I should try this out at home, where my locale is
also an ISO-8859-1 variant. Still, I'm not yet convinced that hgwebdir
is doing the right thing. When I first wrote to the list I was more
confident that hg itself was doing the right thing, but now I'm not so
sure about that, either.
Paul