Consequences for use of hg for other applications than SCM was Re: German umlauts in file names
Adrian Buehlmann
adrian at cadifra.com
Thu Jun 26 04:08:35 CDT 2008
On 21.06.2008 02:22, Matt Mackall wrote:
> On Sat, 2008-06-21 at 00:54 +0200, Marko Käning wrote:
>> Hi Matt,
>>
>> yes, my mail client is obviously not properly configured. I am sorry for that. Did not think about that. (You see that's my crux here...) ;)
>>
>>> Neither of those will have any effect: Mercurial does not encode
>>> filenames. What comes out is the same as what goes in.
>>>
>>> You either need to set your Windows machine to use UTF-8 or set your
>>> Linux machine to use something roughly cp850-compatible like Latin1.
>> So, you want to tell me that TortoiseSVN sets Windows' codepage actually to UTF-8 when it pulls all the files from my server?
>> Is that how TSVN is able to do what I want?
>
> No.
>
> One system (SVN) does name transcoding. It translates all filenames from
> whatever encoding your system claims to be in (often wrong) to and from
> UTF-8. This seems like the right thing to do, but ignores the fact that
> the vast majority of tools (including compilers, web servers, etc.) are
> not particularly smart about encoding and expect encoding of file names
> to agree with encoding of file contents. It also ignores issues like
> multiple encodings on a single filesystem, and the fact that for any
> pair of single byte character sets, there are characters that can't be
> transcoded.
>
> Thus, if you want everything to work perfectly, there are only three
> sane ways to work:
>
> a) use only ASCII
> b) force everyone to use a specific single-byte character set for file
> names and contents
> c) use UTF-8 everywhere
>
> The other system (Mercurial), stores precisely the bytes that the
> operating system reports and ignores encoding issues. This ensures that
> dumb tools are never confused, even if two users can't agree on
> encoding. They'll both get the same bytes in their filenames and those
> bytes will match the contents of the files. But if you want everything
> to work perfectly, there are again only three sane ways to do things:
>
> a) use only ASCII
> b) force everyone to use a specific single-byte character set for file
> names and contents
> c) use UTF-8 everywhere
>
> In other words, if we're doing things sanely, transcoding isn't even an
> issue!
>
> When we're not doing things sanely and mixing encodings, we have to
> choose the lesser of two evils. In your case, because you had UTF-8 on
> one end, things -mostly- worked out with SVN. If you had a ü.h file, it
> would have likely confused your compiler though. You'd have a file named
> "\xfc" on disk but referred to as "\xc3\xbc" everywhere. And someone
> without ü in their charset (Russia, Japan, etc.) using your repo would
> have other issues.
>
> With Mercurial, ü.h on the UTF-8 machine would have become Ã¼.h on a
> cp1252 machine. This might have confused the user a bit, but the
> compiler would probably not be confused because the build instructions
> or includes would now refer to Ã¼.h as well.
>
> Ok, so which of these two cases should we prefer? My preference is to
> choose the strategy that's least likely to break tools, because tools
> are generally a lot stupider than people.
>
> Again, if all your files are in the same encoding and all users are
> using encodings that support all the characters in your repo, this all
> works fine.
>
>> How can I find out, what kind of coding the Windows Explorer actually uses?
>
> Hopefully someone else can answer that.
>
>> So, then the next question. How do I teach mercurial to use UTF-8 under Windows?
>
> First you have to set your system to operate in UTF-8 mode. I believe
> that's known as CP65001.
>
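The byte-level behaviour described in the quoted message can be reproduced with a short Python sketch. It is only an illustration, written for Python 3 rather than the Python 2 that Mercurial ran on at the time; the variable names are made up for this example, and KOI8-R is used merely as one charset that has no "ü".

```python
# Illustration only; the file name and encodings mirror the quoted example.
# "\u00fc" is "ü", written as an escape so the source file's own encoding
# does not matter.
name = "\u00fc.h"

utf8_bytes = name.encode("utf-8")     # bytes a UTF-8 system stores
cp1252_bytes = name.encode("cp1252")  # bytes a cp1252 system stores

# Byte-preserving storage (the Mercurial approach described above): the
# UTF-8 bytes arrive unchanged, and a cp1252 system shows two characters.
shown_on_cp1252 = utf8_bytes.decode("cp1252")

# Transcoding (the SVN approach described above): decode with the source
# encoding, then re-encode for the target, so the stored bytes change.
transcoded = utf8_bytes.decode("utf-8").encode("cp1252")

# Transcoding fails when the target charset lacks the character.
try:
    name.encode("koi8_r")
    fits_koi8_r = True
except UnicodeEncodeError:
    fits_koi8_r = False

print(utf8_bytes, cp1252_bytes, transcoded, fits_koi8_r)
print(ascii(shown_on_cp1252))
```

With byte preservation the name looks wrong on the cp1252 side but the bytes still match whatever the file contents refer to; with transcoding the name looks right but the bytes no longer match.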
A somewhat related question:

I have a problem with the encoding of filenames in my experimental
long path patch [1].

What is the correct way to convert to and from Unicode filename strings
on Windows if I want to mimic the current behavior of Mercurial on
Windows (which is surely needed for compatibility with existing repos)?

For example, in osutil.listdir (see [2]), I get back Unicode filenames that I
need to convert to non-Unicode strings in order to be compatible with other
parts of the Mercurial implementation (e.g. the manifest class).

In my current version of the patch, where I used unicode() / encode(), I get a
traceback for the file "ü.txt":

'''
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0:
ordinal not in range(128)
'''

Thanks in advance for help on this,
Adrian
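
The traceback above is what Python 2 raises when a unicode string is encoded without naming a codec, because the implicit default codec is ASCII. The sketch below only shows the mechanics of passing an explicit codec. It is written for Python 3, the helper names encode_filename and decode_filename are hypothetical, and falling back to the locale's preferred encoding is an assumption; which codec actually matches Mercurial's existing behavior on Windows is exactly the open question here.

```python
import locale


def encode_filename(name, encoding=None):
    """Hypothetical helper: encode a Unicode file name with an explicit codec.

    Assumption: when no codec is given, the locale's preferred encoding
    stands in for the system's local code page.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return name.encode(encoding)


def decode_filename(raw, encoding=None):
    """Hypothetical inverse of encode_filename."""
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


name = "\u00fc.txt"  # "ü.txt"

# Encoding with ASCII reproduces the failure from the traceback.
try:
    encode_filename(name, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# With a codec that contains the character, the round trip is lossless.
local_bytes = encode_filename(name, "cp1252")
round_trip = decode_filename(local_bytes, "cp1252")
print(ascii_ok, local_bytes, round_trip == name)
```

On Windows, Python also provides an 'mbcs' codec that maps to the active ANSI code page; whether that is the right choice for compatibility with existing repositories would still need to be confirmed.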
[1] http://www.cadifra.com/cgi-bin/repos/hg-longpath/file/tip/longpath.patch
[2] relevant parts from [1]:
diff --git a/mercurial/osutil.py b/mercurial/osutil.py
--- a/mercurial/osutil.py
+++ b/mercurial/osutil.py
@@ -1,4 +1,4 @@
-import os, stat
+import os, stat, util
 def _mode_to_kind(mode):
     if stat.S_ISREG(mode): return stat.S_IFREG
@@ -26,10 +26,11 @@
     '''
     result = []
     prefix = path + os.sep
-    names = os.listdir(path)
+    names = os.listdir(util.longpath(path)) # returns unicode strings on Windows
     names.sort()
     for fn in names:
-        st = os.lstat(prefix + fn)
+        fn = fn.encode()
+        st = os.lstat(util.longpath(prefix + fn))
         if stat:
             result.append((fn, _mode_to_kind(st.st_mode), st))
         else:
diff --git a/mercurial/util.py b/mercurial/util.py
--- a/mercurial/util.py
+++ b/mercurial/util.py
@@ -1089,6 +1109,25 @@
     def localpath(path):
         return path.replace('/', '\\')
+
+    _longpathprefix = "\\\\?\\"
+    def longpath(path):
+        '''convert path to a Windows long path
+        needed to call Windows api with paths longer than 260'''
+        # print "longpath(%s)" % path
+        if path.startswith(_longpathprefix):
+            res = path
+        else:
+            path = path.replace('/', '\\').replace('\\.\\', '\\')
+            if path[-1] == '.':
+                path = path[:-1]
+            # print "path = %s" % path
+            if not os.path.isabs(path):
+                # print "not absolute"
+                path = os.path.abspath(path)
+                # print "path = %s" % path
+            res = unicode(_longpathprefix + path)
+        return res
     def normpath(path):
         return pconvert(os.path.normpath(path))
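
For comparison, the prefixing idea in the util.py hunk can be sketched with the standard library's ntpath module. This is not the patch's implementation: it is written for Python 3, the name longpath_sketch is hypothetical, and it swaps the hand-written string replacements for ntpath.normpath, which additionally collapses ".." components. UNC paths are not handled.

```python
import ntpath

LONG_PATH_PREFIX = "\\\\?\\"  # the four characters \\?\


def longpath_sketch(path):
    """Hypothetical sketch: return `path` in Windows long-path form.

    ntpath is used instead of os.path so the string handling behaves the
    same on any platform; resolving a relative path against the current
    directory is only meaningful when actually running on Windows.
    """
    if path.startswith(LONG_PATH_PREFIX):
        return path  # already prefixed, leave untouched
    if not ntpath.isabs(path):
        path = ntpath.abspath(path)
    # normpath converts '/' to '\\' and drops '.' components, which the
    # long-path form requires because Windows does not normalize it.
    return LONG_PATH_PREFIX + ntpath.normpath(path)


print(longpath_sketch("C:/repo/./sub/file.txt"))
```

The early return keeps the function idempotent, so it can be applied to a path that has already been converted.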