hgwebdir, Apache, CGI and character encodings (and the Wiki)

Paul Boddie paul.boddie at biotek.uio.no
Wed Nov 11 06:26:00 CST 2009


Matt Mackall wrote:
> Mercurial has two categories of data: metadata (commit message, user,
> etc), and data (filenames and contents). Metadata is stored in UTF-8 and
> sent/received in the local encoding while data is always
> stored/transmitted exactly as is.
>   

This is what I more or less thought.

> With hgwebdir, we should be picking up the system's local encoding by
> default and using that for declaring page encoding and transcoding
> metadata. You should choose your webserver encoding so that it matches
> the encoding of as-is data for best results.
>   

This makes sense: the metadata will always be transmitted in the form in 
which it was entered. However, this doesn't explain why metadata gets 
sent as UTF-8 by hgwebdir even when I specify HGENCODING as ISO-8859-1.
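
To make the expected behaviour concrete, here is a minimal sketch in 
Python of the selection described above. It is not Mercurial's actual 
code: choose_encoding and content_type_header are hypothetical names, 
and the assumption that HGENCODING takes precedence over the locale is 
only inferred from this thread.

```python
import locale


def choose_encoding(environ):
    # Assumption: HGENCODING, when set, overrides the locale-derived encoding.
    # Otherwise fall back to whatever the locale reports, then to ASCII.
    return (environ.get("HGENCODING")
            or locale.getpreferredencoding(False)
            or "ascii")


def content_type_header(encoding):
    # The page encoding declared to the browser should match the encoding
    # that the metadata is transcoded into.
    return "Content-Type: text/html; charset=%s" % encoding


if __name__ == "__main__":
    # A CGI script would normally pass os.environ here.
    print(content_type_header(choose_encoding({"HGENCODING": "ISO-8859-1"})))
```

Under these assumptions, setting HGENCODING to ISO-8859-1 ought to yield 
a page declared as ISO-8859-1, which is not what I am seeing.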

> Here's a data/metadata example:
>
> http://www.selenic.com/hg/rev/f1ed441ab8e9
>
> The server sets LC_CTYPE to en_US.UTF-8, hgweb picks this up and sends
>
> Content-Type: text/html; charset=UTF-8
>
> And properly presents the metadata in the header. By convention, the
> data in this repo (such as translations) is also stored in UTF-8, so
> that also presents nicely. If I change the LC_CTYPE passed to hgweb to
> ASCII, I get "B?ckman" in the metadata and nonsense characters in the
> body. If I set HGENCODING to Latin1, I get "Bäckman" and nonsense in the
> body.
>   
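
The "B?ckman" and "Bäckman" results can be reproduced with a short 
Python sketch. It assumes, as described at the start of this message, 
that metadata is stored as UTF-8 and transcoded to the local encoding on 
output; metadata_to_local is a hypothetical helper, not a Mercurial 
function.

```python
def metadata_to_local(stored, encoding):
    # Decode the stored UTF-8 bytes, then encode for the target charset.
    # Characters the target charset cannot represent become "?".
    return stored.decode("utf-8").encode(encoding, "replace")


# "Bäckman" as it would be stored, i.e. as UTF-8 bytes.
stored_name = "B\u00e4ckman".encode("utf-8")

if __name__ == "__main__":
    print(metadata_to_local(stored_name, "ascii"))       # b'B?ckman'
    print(metadata_to_local(stored_name, "iso-8859-1"))  # b'B\xe4ckman'
```

Note that a correct transcoding to ASCII replaces each unrepresentable 
character with a single "?".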

If I set LC_CTYPE to "en_US.ISO-8859-1" (or "no_NO", my actual locale) 
in the hgwebdir.cgi script, even before any imports of mercurial 
packages, it doesn't seem to have any effect: I see "??" instead of "é", 
which suggests that Mercurial has read the UTF-8-encoded metadata using 
an ASCII codec. That said, I'm willing to believe that I may have to 
work harder to persuade the locale to change.

Currently, I'm also struggling to persuade hg to produce readable 
metadata in my terminal window (Konsole, KDE 3.5.4) without switching 
the window's encoding to UTF-8, even though the locale uses ISO-8859-1 
and data in that encoding gets displayed without problems. I've just 
tried this with the tip of the latest stable branch 
(1.3.1+401-a40ec11795c3).
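
The reason I read "??" as a sign of an ASCII codec can be shown with a 
small Python sketch. This is only one explanation consistent with the 
symptom, not a confirmed account of what Mercurial does internally; both 
function names are hypothetical.

```python
# "é" stored as UTF-8 occupies two bytes: 0xC3 0xA9.
E_ACUTE_UTF8 = "\u00e9".encode("utf-8")


def misread_as_ascii(stored):
    # Treat the UTF-8 bytes as if they were ASCII: each non-ASCII byte is
    # replaced individually, so one character turns into two "?" marks.
    return stored.decode("ascii", "replace").encode("ascii", "replace")


def transcode_to_ascii(stored):
    # Decode as UTF-8 first: the single character "é" becomes a single "?".
    return stored.decode("utf-8").encode("ascii", "replace")


if __name__ == "__main__":
    print(misread_as_ascii(E_ACUTE_UTF8))    # b'??'
    print(transcode_to_ascii(E_ACUTE_UTF8))  # b'?'
```

So two question marks per accented character point to the bytes being 
interpreted as ASCII, whereas a single one would point to a proper 
transcoding into an encoding that lacks the character.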

It's quite possible that I'm exposing all kinds of weird behaviour here 
- things which aren't likely to arise in the UTF-8-only example given 
above - and I suppose I should try this out at home, where my locale is 
also an ISO-8859-1 variant. Still, I'm not yet convinced that hgwebdir 
is doing the right thing. When I first wrote to the list I was more 
confident that hg itself was doing the right thing, but now I'm not so 
sure about that either.

Paul

