Charset of files (was: Re: [PATCH 0/4] Better HTML templates redux)

Thomas Arendsen Hein thomas at intevation.de
Sun Jul 3 04:53:27 CDT 2005


* Matt Mackall <mpm at selenic.com> [20050703 05:25]:
> On Sun, Jul 03, 2005 at 12:14:48AM +0200, Edouard Gomez wrote:
> > PS: though the w3c validator complains about the absence of a meta tag
> > specifying the charset, I don't know if it's correct to hardwire this to
> > utf-8, as hg doesn't "norm" the encoding of the changelog entries. Any
> > ideas on this? This will probably be resolved by the i18n/l10n work
> > discussed in another thread.
> 
> Good question.
> 
> File contents are all pure 8-bit. If someone commits something with a
> particular non-ASCII, non-UTF-8 encoding, we can't arbitrarily declare
> that it's UTF-8 in the web interface, nor can we attempt to convert it
> (especially given that we don't know what the encoding is!).

Possible solutions:

1. Charset guessing:
+ fire and forget, will simply work (if the guessing is good)
- may (will?) sometimes go wrong
- hard to implement or needs external dependency

2. Assuming everything that is not 7-bit clean is binary
+ very easy to implement
- some text files will not be viewable in hgweb just because they
  contain a few non-ASCII bytes

3. Same as 2., but a default charset can be specified in hgrc
   (defaulting to 7bit ASCII or UTF-8?), everything illegal in
   this charset is considered binary.
+ easy to implement
- assumption will not always be true, especially if there are files
  with different encodings in one repo

4. Same as 2., but default charset may be overridden per file, e.g.
   with charset metadata setting in the manifest.
+ can always be correct, if the files' charsets are set correctly.
+ will be at least as good as 2.
- change of manifest format needed
- don't combine this with 3.; that will yield unpredictable
  results when people with different settings pull from each
  other.
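
To make options 2. to 4. concrete, here is a minimal sketch in Python.
It is only an illustration under assumptions: render_text and its
parameters are hypothetical names, not existing Mercurial or hgweb
code, and the per-file table stands in for whatever charset metadata
the manifest might carry.

```python
def render_text(path, data, per_file_charsets=None, default_charset="ascii"):
    """Return the decoded text, or None if the file should be shown as binary.

    Hypothetical helper, not part of Mercurial.
    - Option 2: default_charset="ascii", no per-file table.
    - Option 3: default_charset comes from a (hypothetical) hgrc setting.
    - Option 4: per_file_charsets maps a path to its declared charset.
    """
    charset = (per_file_charsets or {}).get(path, default_charset)
    try:
        # Strict decoding: any byte sequence that is illegal in the
        # chosen charset raises and the file is treated as binary.
        return data.decode(charset)
    except UnicodeDecodeError:
        return None


# Option 2: anything that is not 7-bit clean is considered binary.
plain = render_text("README", b"hello")
latin1_as_ascii = render_text("notes.txt", b"caf\xe9")

# Option 3: a repository-wide default charset.
utf8 = render_text("notes.txt", b"caf\xc3\xa9", default_charset="utf-8")

# Option 4: a per-file override wins over the default.
overridden = render_text("notes.txt", b"caf\xe9", {"notes.txt": "latin-1"})
```

One caveat for option 3: a single-byte charset such as latin-1 accepts
every byte sequence, so with that default nothing would ever be
classified as binary.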

> And of
> course, we'll have a fun time displaying binaries in hgweb.

The mimetypes module can help here, or maybe 4. can be used to set a
MIME type. This way images, HTML pages or PDFs can be displayed
directly from the repository.
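
A rough sketch of the mimetypes idea, assuming hgweb only has the file
name to go on. guess_content_type and its fallback value are
hypothetical; mimetypes.guess_type is the standard library call.

```python
import mimetypes


def guess_content_type(filename, default="application/octet-stream"):
    """Guess a Content-Type from the file name alone.

    Hypothetical helper, not part of hgweb. Falls back to a generic
    binary type when the extension is unknown.
    """
    content_type, _encoding = mimetypes.guess_type(filename)
    return content_type or default


png_type = guess_content_type("logo.png")
pdf_type = guess_content_type("manual.pdf")
unknown_type = guess_content_type("data.zzzunknown")
```

The guess is based purely on the extension, so a per-file MIME type
setting as in 4. would still be needed for files whose names say
nothing about their content.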

> Everything else (filenames, usernames, commit messages) ought to be
> stored as UTF-8 eventually.

Agreed.

Thomas

-- 
Email: thomas at intevation.de
http://intevation.de/~thomas/

