File Locking in Linux 2.5

The file locking code in Linux 2.4 has a number of problems I'd like to address during 2.5 development. Here's a list:

The code is in pretty desperate need of a rewrite. It's overly complex and has accumulated some cruft over the years.
It's too incestuous with lockd.
It doesn't provide the facilities needed by other networked or clustered filesystems.
It offers no range locks that survive close(dup(fd)).

Here's a scheme which will hopefully address the above problems. Feedback welcome.

Providing the right facilities for networked/clustered filesystems

All filesystems will fill in their ->lock method.
Local filesystems should all use local_lock() for this method, unless they have a good reason to provide their own facilities.
nfs (client) will provide a ->lock() method which performs an RPC to the remote lockd. It does not call local_lock().
lockd (on the server) calls the underlying fs' ->lock() method. Note this is potentially recursive (ie we can reexport an NFS filesystem and have locking work.)

Note that this clears the way for filesystems to provide non-POSIX semantics (eg Netware, SMB OpLocks, etc). There is no requirement for any filesystem to use the local_lock() function.
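As a rough illustration of the dispatch described above, here is a minimal userspace sketch. Everything in it is hypothetical and not taken from the kernel: the struct layouts, the whole-file `holder` field, and the names `nfs_lock`, `localfs_ops` and `do_lock`. The "RPC" is modelled as a direct call into the lower file's ->lock method, which is also what makes the re-export case recursive.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types come from the kernel. */
struct file;

struct lock_request {
	int owner;              /* who wants the lock (nonzero) */
};

struct file_ops {
	int (*lock)(struct file *f, const struct lock_request *req);
};

struct file {
	const struct file_ops *ops;
	struct file *lower;     /* file on the exporting server, if any */
	int holder;             /* 0 = unlocked; whole-file only, for brevity */
};

/* Generic helper that local filesystems plug straight into ->lock. */
int local_lock(struct file *f, const struct lock_request *req)
{
	if (f->holder != 0 && f->holder != req->owner)
		return -EWOULDBLOCK;
	f->holder = req->owner;
	return 0;
}

/*
 * Stand-in for the NFS client. A real client would send an RPC to the
 * remote lockd; here the server side is modelled by calling the lower
 * file's ->lock directly. Note that it never calls local_lock() itself.
 */
int nfs_lock(struct file *f, const struct lock_request *req)
{
	assert(f->lower != NULL);
	return f->lower->ops->lock(f->lower, req);
}

const struct file_ops localfs_ops = { .lock = local_lock };
const struct file_ops nfs_ops = { .lock = nfs_lock };

/* What fcntl() or lockd would call: always go through ->lock. */
int do_lock(struct file *f, const struct lock_request *req)
{
	assert(f != NULL && f->ops != NULL && f->ops->lock != NULL);
	return f->ops->lock(f, req);
}
```

Stacking two `nfs_ops` files on top of a `localfs_ops` file models a re-exported NFS mount: the request walks down through each ->lock until local_lock() records it on the bottom file, and the intermediate files keep no lock state of their own.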

lockd has an interesting problem. The semantics of fcntl(F_SETLKW) are that the process has to sleep until the lock is granted, or a signal interrupts the sleep. Clearly it's incredibly inefficient for lockd to spawn a new thread every time it wants to take a lock which would block. So at first glance, we need a different type of lock -- put the lock on the list of blocked locks, and return -EAGAIN (-EWOULDBLOCK?). Then, when that lock is granted, notify lockd that it now holds the lock; lockd can pass that notification on to the client, and the client process unblocks.

But what if we simply replace the blocking lock with the would-block lock? That implies that the caller of ->lock() decides what to do with the -EWOULDBLOCK return code -- if it's fcntl(), it puts the process to sleep; if it's lockd, it just carries on.
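A minimal sketch of that would-block lock, under stated assumptions: a single whole-file lock, a fixed-size FIFO of blocked requests, and a per-request callback. The names `lock_wouldblock`, `unlock_and_grant` and `lockd_notify` are invented for illustration. The point is that the locking code itself never sleeps; what happens after -EWOULDBLOCK is up to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Hypothetical model: one whole-file lock with a FIFO of blocked requests. */
struct waiter {
	int owner;
	void (*granted)(int owner);   /* called when the lock is handed over */
};

struct simple_lock {
	int holder;                   /* 0 = free */
	struct waiter waiters[MAX_WAITERS];
	size_t nwaiters;
};

/* Never sleeps: either takes the lock or queues the request. */
int lock_wouldblock(struct simple_lock *l, int owner, void (*granted)(int))
{
	assert(l != NULL && owner != 0);
	if (l->holder == 0 || l->holder == owner) {
		l->holder = owner;
		return 0;
	}
	if (l->nwaiters == MAX_WAITERS)
		return -ENOLCK;
	l->waiters[l->nwaiters].owner = owner;
	l->waiters[l->nwaiters].granted = granted;
	l->nwaiters++;
	return -EWOULDBLOCK;
}

/* Release the lock and hand it to the oldest waiter, if there is one. */
void unlock_and_grant(struct simple_lock *l, int owner)
{
	struct waiter next;
	size_t i;

	if (l->holder != owner)
		return;
	l->holder = 0;
	if (l->nwaiters == 0)
		return;
	next = l->waiters[0];
	for (i = 1; i < l->nwaiters; i++)
		l->waiters[i - 1] = l->waiters[i];
	l->nwaiters--;
	l->holder = next.owner;
	if (next.granted != NULL)
		next.granted(next.owner);
}

/*
 * lockd-style caller: it carries on after -EWOULDBLOCK and learns about
 * the grant through the callback. This stand-in only records the owner;
 * a real lockd would send the grant back to the client.
 */
int last_granted_owner;

void lockd_notify(int owner)
{
	last_granted_owner = owner;
}
```

An fcntl()-style caller would use the same entry point but, on -EWOULDBLOCK, put the process to sleep and pass a callback that wakes it; that part is left out because it needs a real wait queue.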

A clustered filesystem might call out to the network and say `I want to put this lock on this file'. Either some other node in the cluster says `Denied', `Blocked' or `Granted' (ie handles the request), OR no other node accepts responsibility for the lock, in which case we lock it locally by calling local_lock().
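The fallback logic might look something like the sketch below. The reply enum, the `node_fn` callback type and the two sample nodes are all made up for the example, `local_lock_stub` is a simplified stand-in for local_lock(), and mapping `Denied' to -EACCES is an assumption rather than something the proposal specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical replies a cluster node can give to a lock request. */
enum cluster_reply { CL_UNHANDLED, CL_DENIED, CL_BLOCKED, CL_GRANTED };

typedef enum cluster_reply (*node_fn)(int owner);

/* Simplified stand-in for local_lock(): one whole-file lock. */
int local_lock_stub(int *holder, int owner)
{
	if (*holder != 0 && *holder != owner)
		return -EWOULDBLOCK;
	*holder = owner;
	return 0;
}

/*
 * Ask each node in turn. The first node that handles the request decides
 * the outcome; if nobody accepts responsibility, lock locally.
 */
int cluster_lock(const node_fn *nodes, size_t n, int *local_holder, int owner)
{
	size_t i;

	assert(nodes != NULL && local_holder != NULL);
	for (i = 0; i < n; i++) {
		switch (nodes[i](owner)) {
		case CL_GRANTED:
			return 0;
		case CL_BLOCKED:
			return -EWOULDBLOCK;
		case CL_DENIED:
			return -EACCES;   /* assumed error code */
		case CL_UNHANDLED:
			break;            /* try the next node */
		}
	}
	return local_lock_stub(local_holder, owner);
}

/* Two sample nodes, for illustration only. */
enum cluster_reply node_ignores(int owner)
{
	(void)owner;
	return CL_UNHANDLED;
}

enum cluster_reply node_blocks(int owner)
{
	(void)owner;
	return CL_BLOCKED;
}
```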

lockd

lockd does the following to recover from a downed server:

for_each_lock
	if (belongs_to_my_fs)
		foo();

This requires it to have access to the global list of locks, which is a bad thing to have anyway.

I've written some replacement code which Trond approved of:

for_each_inode(sb)
	for_each_lock(inode)
		foo();

With the changes above, even this code can go away. The nfs client can keep a per-fs list of locks, and reestablish them at server restart. No need to interact with the local locking at all.
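A per-filesystem list of that kind could be as simple as the following sketch. The `nfs_mount` and `held_lock` structures, the function names and the `reclaim_fn` callback are all hypothetical; the callback stands in for whatever RPC the client would send to the restarted server to reclaim one lock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record of one lock the client holds on the server. */
struct held_lock {
	long start, len;
	struct held_lock *next;
};

/* Hypothetical per-mount state: just the list of locks we hold. */
struct nfs_mount {
	struct held_lock *locks;
};

/* Called after the server has granted a lock. */
int nfs_remember_lock(struct nfs_mount *m, long start, long len)
{
	struct held_lock *hl = malloc(sizeof(*hl));

	if (hl == NULL)
		return -1;
	hl->start = start;
	hl->len = len;
	hl->next = m->locks;
	m->locks = hl;
	return 0;
}

/* Stand-in for the RPC that reclaims one lock; returns 0 on success. */
typedef int (*reclaim_fn)(const struct held_lock *hl);

/* On server restart: walk our own list, never the local lock lists. */
size_t nfs_reclaim_locks(const struct nfs_mount *m, reclaim_fn send)
{
	const struct held_lock *hl;
	size_t reclaimed = 0;

	assert(m != NULL && send != NULL);
	for (hl = m->locks; hl != NULL; hl = hl->next)
		if (send(hl) == 0)
			reclaimed++;
	return reclaimed;
}

/* Drop all records, e.g. at unmount. */
void nfs_forget_all(struct nfs_mount *m)
{
	while (m->locks != NULL) {
		struct held_lock *hl = m->locks;

		m->locks = hl->next;
		free(hl);
	}
}

/* Fake RPC that always succeeds, for illustration only. */
int fake_reclaim_rpc(const struct held_lock *hl)
{
	(void)hl;
	return 0;
}
```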

Non-POSIX locks

We already provide five different lock types:

4.4 BSD flock() locks. These are whole-file locks which are inherited across a fork(). They are not checked for deadlock.
POSIX fcntl() locks. These are byte-range locks which are inherited across an exec(), but not a fork(). They are checked for deadlock. (A subtype of this is the LFS variant which allows for 64-bit offsets.)
Leases. These are whole-file locks which are broken when another process attempts to open the file for an operation which would conflict with the lease type. When they are broken, the owner receives a signal and must ensure the file is in a consistent state before releasing their lease.
Share modes. These are whole-file mandatory locks. No other process may open a file which would conflict with the Share Mode on the file. Use flock() with the LOCK_MAND flag to set a Share Mode.
Mandatory Locks. These are byte-range mandatory locks. To use them, mount the filesystem with the `mand' option enabled, and set the file mode to g-x g+s. POSIX locks applied to this file will now be mandatory. Mandatory locks do not prevent accesses via mmap(). You should not use Mandatory locks in new code.

The proposal mentioned above would add a sixth -- whatever the filesystem supports. Ncpfs already does this through an ioctl, but that could be supported `natively' through this new scheme.

I want to add another byte-range lock, which looks and smells like a POSIX fcntl lock except that it is not removed by closing any fd which happens to be open on this file. Samba keeps a list of open fds which are not currently in use on any locked file to work around this stupidity in the spec. I'd like the external interface to this to be fcntl(F_SETLK_NP) and fcntl(F_SETLKW_NP). Clearly F_GETLK does not need to be altered or replaced.
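The close(dup(fd)) behaviour is easy to reproduce from userspace with the existing POSIX interface. The sketch below takes a write lock, then asks a forked child whether it can see the lock, once before and once after the parent closes a duplicate of the descriptor. A child is needed because F_GETLK never reports the caller's own locks. The function names are invented for the example, and it assumes a local filesystem with working fcntl() locks.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Ask a child process whether it sees a conflicting lock on path.
 * Returns 1 if a lock is seen, 0 if not, -1 on error.
 */
static int lock_seen_by_other_process(const char *path)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		struct flock probe = {
			.l_type = F_WRLCK, .l_whence = SEEK_SET,
			.l_start = 0, .l_len = 0,   /* 0 = to end of file */
		};
		int fd = open(path, O_RDWR);

		if (fd < 0 || fcntl(fd, F_GETLK, &probe) < 0)
			_exit(2);
		/* F_GETLK sets l_type to F_UNLCK when nothing conflicts. */
		_exit(probe.l_type == F_UNLCK ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status) ||
	    WEXITSTATUS(status) == 2)
		return -1;
	return WEXITSTATUS(status);
}

/*
 * Lock the whole file, then close a dup of the descriptor.
 * *before and *after report what another process saw at each point.
 */
int demo_close_dup(const char *path, int *before, int *after)
{
	struct flock fl = {
		.l_type = F_WRLCK, .l_whence = SEEK_SET,
		.l_start = 0, .l_len = 0,
	};
	int fd, copy;

	assert(path != NULL && before != NULL && after != NULL);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (fcntl(fd, F_SETLK, &fl) < 0) {
		close(fd);
		return -1;
	}
	*before = lock_seen_by_other_process(path);

	copy = dup(fd);
	if (copy < 0) {
		close(fd);
		return -1;
	}
	close(copy);    /* drops every POSIX lock this process holds on the file */
	*after = lock_seen_by_other_process(path);

	close(fd);
	return 0;
}
```

Called on a scratch file, this should report the lock as visible before the close(dup(fd)) and gone afterwards, even though the original fd is still open.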

Restructuring

locks.c still runs almost entirely under the BKL. An earlier attempt to move it to a different locking scheme was thwarted when the code was integrated into 2.4.0-test9 while I was on holiday, and without me submitting it to Linus. Grumble. I plan to move it to _one_ spinlock to cover all lock-related structures, and I think that will be possible with the plan described above (since this code will no longer sleep).

As soon as lockd no longer needs to keep its fingers inside locks.c, I want to remove the global list of locks. It's also used by /proc/locks -- which probably needs to go away anyway. So what's useful about /proc/locks? I'd like to be able to see which locks my process has, and which processes have a lock on a given file. The former is easy -- /proc/$PID/locks can be constructed relatively easily from the fds held open by that process. The latter? I don't know. Ideas welcome.

Links

POSIX file locking
Olaf Kirch's page on NLM (warning: out of date)

Matthew Wilcox <matthew@wil.cx>

Last updated 2001-04-30