hgwebdir, Apache, CGI and character encodings (and the Wiki)
Paul Boddie
paul.boddie at biotek.uio.no
Tue Nov 10 12:44:08 CST 2009
Dirkjan Ochtman wrote:
> On Tue, Nov 10, 2009 at 18:34, Paul Boddie <paul.boddie at biotek.uio.no> wrote:
>
>> What appears to happen is that hgwebdir (I'm using 1.0.2, but it appears to
>> be the case in 1.3.1 as well) appears to emit the "Content-type" header in
>> the mercurial.hgweb.request.wsgirequest.httphdr method, but omits any
>> "charset" qualifier. Apache, I presume, then decides to embellish the
>> header, adding locale-related information to it, indicating that, at least
>> for my system, the page uses ASCII as its encoding.
>>
>
> I think that the culprit might just be Mercurial itself.
>
Actually, I think I wrote too soon. My system is apparently using
ISO-8859-1 for character storage according to the locale, vim, Python,
and so on when invoking hg. This matters for what I write below.

>> However, it does seem to be the case that the commit messages are emitted as
>> UTF-8 by hgwebdir, even without setting the HGENCODING environment variable.
>> A simple fix for this appears to be a modification to the method mentioned
>> above, as follows:
>>
>> - headers.append(('Content-Type', type))
>> + headers.append(('Content-Type', "%s; charset=UTF-8" % type))
>>
>> I'm uncertain that this is a proper fix, given that I don't really know
>> enough about what hgwebdir or Mercurial are doing internally, but this fixed
>> my problem. (Maybe the HGENCODING gets propagated onto a "ctype" variable
>> somewhere which then has its value sent to the above method, but I can't
>> really tell after a couple of minutes looking.)
>>
>
> The latter should be the case IIRC. Where exactly did you hook in? In
> crew, the ctype variable gets instantiated from a template, passing in
> the encoding (mercurial/hgweb/hgweb_mod.py:170).
>
I was premature in writing down what I've done, since I also changed one
other thing: the HGENCODING environment variable. This is actually set
to "ISO-8859-1" in hgwebdir.cgi, presumably resulting in Mercurial using
that encoding to interpret its own textual data, but sending UTF-8
(declared by the above modification) to the browser.
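
To make the combination concrete, here is a minimal sketch of the two
changes together. It is not the actual hgweb code: the helper name
with_charset is made up for illustration, and I am assuming the
environment variable is set near the top of hgwebdir.cgi, before
Mercurial is imported.

```python
import os

# Assumption: this runs early in hgwebdir.cgi, before Mercurial is imported.
os.environ["HGENCODING"] = "ISO-8859-1"


def with_charset(ctype, charset="UTF-8"):
    """Return a Content-Type header tuple that always names a charset.

    Hypothetical helper, standing in for the one-line change to httphdr.
    A type that already carries a charset qualifier is left alone.
    """
    if "charset=" not in ctype.lower():
        ctype = "%s; charset=%s" % (ctype, charset)
    return ("Content-Type", ctype)


headers = []
headers.append(with_charset("text/html"))
```

The check for an existing qualifier is my own addition; the change I
actually made appends "charset=UTF-8" unconditionally.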

In the default case with an unmodified hgweb and no environment
variables, non-ASCII characters such as "é" appear as "??" and a charset
of "ANSI_X3.4-1968" is declared. The Web server seems to have a locale
defined as "POSIX", so I expect that ASCII is the preferred encoding.
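
For anyone wanting to check what a given locale reports, something like
the following should do. This is only a sketch for a Unix-like system
(locale.nl_langinfo is not available everywhere), the function name is
made up, and the exact string returned depends on the C library.

```python
import codecs
import locale


def codeset_for_locale(name):
    """Switch LC_CTYPE to the named locale and return its codeset name."""
    locale.setlocale(locale.LC_CTYPE, name)
    return locale.nl_langinfo(locale.CODESET)


# "C" and "POSIX" name the same minimal locale.
print(codeset_for_locale("C"))

# Python treats the name seen in the header as an alias for ASCII.
print(codecs.lookup("ANSI_X3.4-1968").name)
```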

In the case where an unmodified hgweb has HGENCODING set to ISO-8859-1,
non-ASCII characters such as "é" appear as "Ã©" and a charset of
"ISO-8859-1" is declared.

In the case where an unmodified hgweb has HGENCODING set to UTF-8,
non-ASCII characters such as "é" appear as "Ã©" and a charset of "UTF-8"
is declared. (The characters are "double-encoded".)

Only by modifying hgweb and setting HGENCODING will non-ASCII characters
appear as intended.
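
The cases above can be imitated outside Mercurial. The following is a
plain Python 3 sketch, not anything taken from hgweb: it assumes the
page body is UTF-8 bytes and that a browser simply decodes it using
whatever charset the header declares. The function name is made up.

```python
text = "é"
utf8_bytes = text.encode("utf-8")  # two bytes: 0xC3 0xA9


def as_seen_by_browser(body, declared_charset):
    """Decode the body the way a client trusting the header would."""
    return body.decode(declared_charset, errors="replace")


# Declared as ASCII: neither byte is valid, so two replacement markers.
ascii_view = as_seen_by_browser(utf8_bytes, "ascii")

# Declared as ISO-8859-1: each byte becomes its own character.
latin1_view = as_seen_by_browser(utf8_bytes, "iso-8859-1")

# "Double-encoded": UTF-8 bytes misread as ISO-8859-1, then encoded as
# UTF-8 again, and finally declared (correctly) as UTF-8.
double_encoded = utf8_bytes.decode("iso-8859-1").encode("utf-8")
double_view = as_seen_by_browser(double_encoded, "utf-8")

# UTF-8 bytes declared as UTF-8: the character survives.
correct_view = as_seen_by_browser(utf8_bytes, "utf-8")
```

Whether this is really what happens inside hgweb in each case I can't
say, but it is consistent with what I see in the browser.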

Changing HGENCODING alone affects Mercurial's interpretation of text and
also causes that encoding to be declared for the output, even though the
output appears to be UTF-8-encoded. Unfortunately, I didn't manage to
make this clear in my last message because I forgot about my change to
HGENCODING, but I hope this makes a bit more sense now.

Paul