Consequences for use of hg for other applications than SCM (was Re: German umlauts in file names)

Matt Mackall mpm at selenic.com
Fri Jun 20 19:22:04 CDT 2008


On Sat, 2008-06-21 at 00:54 +0200, Marko Käning wrote:
> Hi Matt,
> 
> yes, my mail client is obviously not properly configured. I am sorry for that. Did not think about that. (You see that's my crux here...) ;)
> 
> > Neither of those will have any effect: Mercurial does not encode
> > filenames. What comes out is the same as what goes in.
> > 
> > You either need to set your Windows machine to use UTF-8 or set your
> > Linux machine to use something roughly cp850-compatible like Latin1.
> 
> So, you want to tell me that TortoiseSVN sets Windows' codepage actually to UTF-8 when it pulls all the files from my server?
> Is that how TSVN is able to do what I want?

No.

One system (SVN) does name transcoding. It translates all filenames from
whatever encoding your system claims to be in (often wrong) to and from
UTF-8. This seems like the right thing to do, but ignores the fact that
the vast majority of tools (including compilers, web servers, etc.) are
not particularly smart about encoding and expect encoding of file names
to agree with encoding of file contents. It also ignores issues like
multiple encodings on a single filesystem, and the fact that for any
pair of single byte character sets, there are characters that can't be
transcoded.
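To illustrate the last point, here is a minimal Python sketch. The helper
name transcode and the choice of Latin-1 and KOI8-R as the example pair are
assumptions made for illustration; neither system discussed here is claimed
to work this way internally.

```python
def transcode(raw: bytes, src: str, dst: str) -> bytes:
    """Decode raw bytes from charset src and re-encode them in charset dst."""
    return raw.decode(src).encode(dst)

# A file name created on a Latin-1 system.
latin1_name = "ü.h".encode("latin-1")   # b'\xfc.h'

# KOI8-R is a single-byte Cyrillic charset with no slot for "ü",
# so the name cannot be carried over without loss.
try:
    transcode(latin1_name, "latin-1", "koi8_r")
    lossless = True
except UnicodeEncodeError:
    lossless = False
```

The same failure happens in the other direction for a Cyrillic name going
to Latin-1, so a transcoding system has to either refuse such names or
substitute something for them.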

Thus, if you want everything to work perfectly, there are only three
sane ways to work:

a) use only ASCII
b) force everyone to use a specific single-byte character set for file
names and contents
c) use UTF-8 everywhere

The other system (Mercurial) stores precisely the bytes that the
operating system reports and ignores encoding issues. This ensures that
dumb tools are never confused, even if two users can't agree on
encoding. They'll both get the same bytes in their filenames and those
bytes will match the contents of the files. But if you want everything
to work perfectly, there are again only three sane ways to do things:

a) use only ASCII
b) force everyone to use a specific single-byte character set for file
names and contents
c) use UTF-8 everywhere

In other words, if we're doing things sanely, transcoding isn't even an
issue!

When we're not doing things sanely and mixing encodings, we have to
choose the lesser of two evils. In your case, because you had UTF-8 on
one end, things -mostly- worked out with SVN. If you had a ü.h file, it
would have likely confused your compiler though. You'd have a file named
"\xfc" on disk but referred to as "\xc3\xbc" everywhere. And someone
without ü in their charset (Russia, Japan, etc.) using your repo would
have other issues.
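The ü.h mismatch can be sketched in Python as follows. This only models
the bytes involved; the variable names and the helper mentions are
hypothetical, and the byte-wise comparison stands in for an
encoding-unaware tool such as a simple compiler or build script.

```python
# Contents of a source file written on the UTF-8 machine. Only file names
# are transcoded, so these bytes arrive on the other machine unchanged.
include_line = '#include "ü.h"\n'.encode("utf-8")

# The file name as created on the UTF-8 machine, then transcoded for a
# cp1252 working copy.
name_in_repo = "ü.h".encode("utf-8")                          # b'\xc3\xbc.h'
name_on_disk = name_in_repo.decode("utf-8").encode("cp1252")  # b'\xfc.h'

def mentions(contents: bytes, name: bytes) -> bool:
    """Byte-wise search, the way an encoding-unaware tool would compare."""
    return name in contents

# The include still says b'\xc3\xbc.h', but the file on disk is b'\xfc.h'.
transcoded_ok = mentions(include_line, name_on_disk)   # False
```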

With Mercurial, ü.h on the UTF-8 machine would have become Ã¼.h on a
cp1252 machine. This might have confused the user a bit, but the
compiler would probably not be confused because the build instructions
or includes would now refer to Ã¼.h as well.
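A short Python sketch of the byte-preserving case, under the assumption
that the cp1252 machine simply displays whatever bytes it is given (the
variable names are illustrative only):

```python
# Bytes created on the UTF-8 machine and stored unchanged.
name_bytes = "ü.h".encode("utf-8")                 # b'\xc3\xbc.h'
include_bytes = '#include "ü.h"'.encode("utf-8")

# What a cp1252 machine shows when it interprets the same bytes.
shown_name = name_bytes.decode("cp1252")           # 'Ã¼.h'
shown_include = include_bytes.decode("cp1252")     # '#include "Ã¼.h"'

# The name looks odd to the user, but the bytes in the include still
# match the bytes of the file name, so a byte-wise tool finds the file.
still_matches = name_bytes in include_bytes        # True
```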

Ok, so which of these two cases should we prefer? My preference is to
choose the strategy that's least likely to break tools, because tools
are generally a lot stupider than people.

Again, if all your files are in the same encoding and all users are
using encodings that support all the characters in your repo, this all
works fine.

> How can I find out, what kind of coding the Windows Explorer actually uses?

Hopefully someone else can answer that.

> So, then the next question. How do I teach mercurial to use UTF-8 under Windows?

First you have to set your system to operate in UTF-8 mode. I believe
that's known as CP65001.

-- 
Mathematics is the supreme nostalgia of our time.


