Abstracting filesystem API for UTF-8 support on Windows

Matt Mackall mpm at selenic.com
Thu Dec 15 14:30:46 CST 2011


So as discussed a few times over the last few months, I think the right
compromise for filename handling on Windows is:

        "if all files in the parent changeset are valid UTF-8, use the
        Unicode APIs on Windows for dealing with the working directory"

As a side-effect, newly created empty repos on Windows will
automatically use UTF-8 when committing. Existing repos will continue to
work as they always have, but can be made "portable" by renaming files
on Linux.
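
A rough sketch of that rule, assuming filenames are available as byte
strings (the helper names isutf8 and useunicodeapis are made up for
illustration, not existing Mercurial functions):

```python
def isutf8(name):
    """Return True if the byte string name decodes cleanly as UTF-8."""
    try:
        name.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def useunicodeapis(filenames):
    """Decide the mode from the filenames of the parent changeset.

    all() over an empty sequence is True, so a repo with no files yet
    lands in UTF-8 mode, matching the side-effect described above.
    """
    return all(isutf8(f) for f in filenames)
```

Pure-ASCII names count as valid UTF-8, so only repos that already
contain names in some other encoding stay in byte mode.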

To implement this, several pieces are needed:

- create a filesystem abstraction object in util.py
- add a unicode subclass in windows.py
- create opener classes that inherit from them in scmutil.py
- switch direct filesystem use on working directory paths to wopener
methods
- detect UTF-8 mode in dirstate
- detect UTF-8 mode in update
- switch repo.wopener between byte and unicode
I did an audit and came across the following functions that need
abstracting:

osutil.listdir -> wopener.listdir
os.lstat -> wopener.stat
os.unlink -> wopener.unlink
os.path.join -> wopener.join (implicit repo root?)
os.getcwd -> wopener.cwd 
os.path.lexists -> wopener.exists
util.unlinkpath -> wopener.unlinkpath
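
As a sketch of what a few of these might look like with an implicit
repo root, assuming callers pass repo-relative paths and that
unlinkpath means "unlink, then prune parent directories left empty"
(the class below is illustrative only, not the proposed final API):

```python
import os


class wopener(object):
    """Hypothetical working-directory opener rooted at the repo."""

    def __init__(self, root):
        self.root = root

    def join(self, path):
        # Implicit repo root: callers never build absolute paths.
        return os.path.join(self.root, path)

    def stat(self, path):
        # Uses lstat, so symlinks are reported rather than followed.
        return os.lstat(self.join(path))

    def exists(self, path):
        return os.path.lexists(self.join(path))

    def unlink(self, path):
        os.unlink(self.join(path))

    def unlinkpath(self, path):
        """Remove a file, then remove parent directories left empty."""
        self.unlink(path)
        d = os.path.dirname(path)
        while d:  # stops before touching the root itself
            try:
                os.rmdir(self.join(d))
            except OSError:
                break  # directory not empty (or otherwise not removable)
            d = os.path.dirname(d)
```

A call site like os.lstat(os.path.join(repo.root, f)) would then
shrink to repo.wopener.stat(f).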

Some extensions use functions we'll probably want to kill:
os.path.getmtime (hgeol)
shutil.rmtree (largefiles)
os.path.isfile (largefiles)
os.removedirs (largefiles)

This is a pretty severe change in terms of internal API. There are two
alternatives I see:

- monkeypatch everything (win32mbcs and fix-utf8 approach)

This is messy in that it affects all file operations and third-party
in-process code.

- use unicode() or str() objects depending on mode

This is a different kind of messy: automatic promotion to Unicode will
result in all sorts of interesting string handling exceptions and
encoding failures that will be hard to test for (which is why we
carefully avoid using unicode() objects in the bulk of the code).
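
To illustrate why this is hard to test for: when a byte string meets a
unicode object, Python 2 implicitly decodes the byte string as ASCII.
The hypothetical helper below spells that implicit step out explicitly
(so it behaves the same on any Python version); it is only a model of
the promotion, not code from Mercurial.

```python
def promotedjoin(bytepath, upath):
    """Model of Python 2's implicit str + unicode promotion.

    The .decode('ascii') here is what the interpreter does silently
    when the two types are mixed.
    """
    return bytepath.decode('ascii') + upath


# ASCII-only names work, so a test suite with ASCII filenames passes.
ok = promotedjoin(b'src/', u'main.py')

# A UTF-8 encoded name fails only when it actually shows up at runtime.
try:
    promotedjoin(b'caf\xc3\xa9/', u'main.py')
    failed = False
except UnicodeDecodeError:
    failed = True
```

Every place the two types can meet is a potential exception like this
one, and it only fires on non-ASCII data.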

-- 
Mathematics is the supreme nostalgia of our time.



