File Locking in Linux 2.5
The file locking code in Linux 2.4 has a number of problems I'd like to
address during 2.5 development. Here's a list:
- The code is in pretty desperate need of a rewrite. It's overly complex
and has accumulated some cruft over the years.
- It's too incestuous with lockd.
- It doesn't provide facilities needed by other networked / clustered
file systems.
- It has no range locks which survive close(dup(fd)).
Here's a scheme which will hopefully address the above problems.
Feedback welcome.
Providing the right facilities for networked/clustered filesystems
- All filesystems will fill in their ->lock method.
- Local filesystems should all use local_lock() for this method, unless
they have a good reason to provide their own facilities.
- The NFS client will provide a ->lock() method which performs an RPC
to the remote lockd. It does not call local_lock().
- lockd (on the server) calls the underlying fs' ->lock() method. Note
this is potentially recursive (i.e. we can re-export an NFS filesystem
and have locking work).
Note that this clears the way for filesystems to provide non-POSIX
semantics (eg Netware, SMB OpLocks, etc). There is no requirement for
any filesystem to use the local_lock() function.
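To make the dispatch concrete, here is a rough userspace sketch of the idea, not kernel code. The names struct fs_ops, struct lock_req, nfs_lock() and do_lock() are invented for illustration, and local_lock() is reduced to a stub that always grants. The "RPC" is just a function call into the exported filesystem's ->lock().

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel structures; illustration only. */
struct lock_req {
    long start, end;   /* byte range, inclusive */
    int  exclusive;    /* 1 = write lock, 0 = read lock */
    int  owner;        /* opaque owner id */
};

struct fs_ops {
    const char *name;
    int (*lock)(struct fs_ops *fs, const struct lock_req *req);
    struct fs_ops *lower;  /* filesystem being exported, if any */
    int calls;             /* how often ->lock() was invoked */
};

/* Stub for local_lock(): records the call and always grants. */
static int local_lock(struct fs_ops *fs, const struct lock_req *req)
{
    (void)req;
    fs->calls++;
    return 0;
}

/*
 * Stub for the NFS client's ->lock(). It never calls local_lock() itself;
 * it forwards the request to the server side, where lockd would call the
 * exported filesystem's ->lock(). The RPC is simulated by a direct call.
 */
static int nfs_lock(struct fs_ops *fs, const struct lock_req *req)
{
    fs->calls++;
    if (fs->lower == NULL)
        return -ENOLCK;
    return fs->lower->lock(fs->lower, req);
}

/* What a generic caller (fcntl() or lockd) would do. */
static int do_lock(struct fs_ops *fs, const struct lock_req *req)
{
    if (fs->lock == NULL)
        return -ENOLCK;
    return fs->lock(fs, req);
}
```

Stacking one nfs_lock() filesystem on top of another models the re-export case: the request walks down the chain until it reaches a filesystem that uses local_lock().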
lockd has an interesting problem. The semantics of fcntl(F_SETLKW)
are that the process has to sleep until the lock is granted, or a signal
interrupts the sleep. Clearly it's incredibly inefficient for lockd to
spawn a new thread every time it wants to take a lock which would block.
So at first glance, we need a different type of lock -- put the lock on
the list of blocked locks, and return -EAGAIN (-EWOULDBLOCK?). Then,
when that lock is granted, notify lockd that it now has the lock; lockd can
pass that notification on to the client, and the client process unblocks.
But what if we simply replace the blocking lock with the would-block lock?
That implies that the caller of ->lock() decides what to do with the
-EWOULDBLOCK return code -- if it's fcntl(), it puts the process to sleep;
if it's lockd, it just carries on.
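A small single-threaded sketch of the would-block idea follows. It is an illustration under my own assumptions, not the proposed kernel interface: struct simple_lock, try_lock(), release_lock() and the callback signature are all invented names. The point is that try_lock() never sleeps; it either grants the lock or queues the request and returns -EWOULDBLOCK, and the waiter is told later through its callback.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Hypothetical types; illustration only. */
struct waiter {
    int owner;
    void (*granted)(int owner, void *arg);  /* called once the lock is granted */
    void *arg;
};

struct simple_lock {
    int held;                          /* 0 = free, 1 = held */
    int owner;                         /* valid while held */
    struct waiter queue[MAX_WAITERS];  /* blocked requests, oldest first */
    size_t nwaiters;
};

/* Never sleeps: grants the lock, or queues the request and reports it. */
static int try_lock(struct simple_lock *l, int owner,
                    void (*granted)(int, void *), void *arg)
{
    if (!l->held) {
        l->held = 1;
        l->owner = owner;
        return 0;
    }
    if (l->nwaiters == MAX_WAITERS)
        return -ENOLCK;
    l->queue[l->nwaiters].owner = owner;
    l->queue[l->nwaiters].granted = granted;
    l->queue[l->nwaiters].arg = arg;
    l->nwaiters++;
    return -EWOULDBLOCK;
}

/* Drops the lock and hands it to the oldest waiter, notifying it. */
static void release_lock(struct simple_lock *l)
{
    struct waiter w;
    size_t i;

    if (l->nwaiters == 0) {
        l->held = 0;
        return;
    }
    w = l->queue[0];
    for (i = 1; i < l->nwaiters; i++)
        l->queue[i - 1] = l->queue[i];
    l->nwaiters--;
    l->owner = w.owner;
    if (w.granted)
        w.granted(w.owner, w.arg);
}

/* lockd-style callback: just counts grants it would forward to a client. */
static void count_grant(int owner, void *arg)
{
    (void)owner;
    (*(int *)arg)++;
}
```

In this model a lockd-like caller sees -EWOULDBLOCK, carries on, and forwards the grant when count_grant() fires. An fcntl(F_SETLKW)-like caller would instead sleep until its callback wakes it.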
A clustered filesystem might call out to the network and say `I want
to put this lock on this file'. Either some other node in the cluster
says `Denied', `Blocked' or `Granted' (ie handles the request), OR no
other node accepts responsibility for the lock, in which case we lock
it locally by calling local_lock().
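The cluster case might look roughly like this. Everything here is hypothetical: the reply enum, the function-pointer types and the errno values chosen for `Denied' and `Blocked' are my assumptions, and the network query is passed in as a callback so the sketch stays self-contained.

```c
#include <errno.h>

/* Hypothetical replies from the rest of the cluster. */
enum cluster_reply {
    CLUSTER_DENIED,
    CLUSTER_BLOCKED,
    CLUSTER_GRANTED,
    CLUSTER_UNCLAIMED   /* no node accepted responsibility */
};

typedef enum cluster_reply (*ask_fn)(long start, long end);
typedef int (*local_fn)(long start, long end);

/*
 * A clustered filesystem's ->lock(): ask the cluster first, and only
 * fall back to local locking when nobody else handles the request.
 * The errno mapping is an assumption made for this sketch.
 */
static int cluster_lock(ask_fn ask, local_fn local, long start, long end)
{
    switch (ask(start, end)) {
    case CLUSTER_GRANTED:
        return 0;
    case CLUSTER_BLOCKED:
        return -EWOULDBLOCK;
    case CLUSTER_DENIED:
        return -EACCES;
    case CLUSTER_UNCLAIMED:
    default:
        return local(start, end);
    }
}

/* Stubs used to exercise the sketch. */
static int local_calls;

static int local_lock_stub(long start, long end)
{
    (void)start; (void)end;
    local_calls++;
    return 0;
}

static enum cluster_reply nobody_answers(long start, long end)
{
    (void)start; (void)end;
    return CLUSTER_UNCLAIMED;
}

static enum cluster_reply peer_blocks(long start, long end)
{
    (void)start; (void)end;
    return CLUSTER_BLOCKED;
}
```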
lockd
lockd does the following to recover from a downed server:
    for_each_lock
        if (belongs_to_my_fs)
            foo();
This requires it to have access to the global list of locks, which is
a bad thing to have anyway.
I've written some replacement code which Trond approved of:
    for_each_inode(sb)
        for_each_lock(inode)
            foo();
With the changes above, even this code can go away. The nfs client can
keep a per-fs list of locks, and reestablish them at server restart.
No need to interact with the local locking at all.
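As a sketch of that last idea, the client could keep its own records and replay them when the server comes back. The structures and function names below are invented for illustration and are not taken from the NFS client; the actual reclaim RPC is passed in as a callback.

```c
#include <stddef.h>

/* Hypothetical per-lock record kept by the NFS client. */
struct nfs_lock_rec {
    long start, end;
    int exclusive;
    struct nfs_lock_rec *next;
};

/* Hypothetical per-mount state: just the list of locks we hold. */
struct nfs_mount {
    struct nfs_lock_rec *locks;
};

/* Called whenever the server grants us a lock. */
static void remember_lock(struct nfs_mount *m, struct nfs_lock_rec *rec)
{
    rec->next = m->locks;
    m->locks = rec;
}

/*
 * Called when the server restarts: replay every lock on this mount.
 * Returns the number of locks that could not be reestablished.
 */
static int reclaim_locks(struct nfs_mount *m,
                         int (*reclaim)(const struct nfs_lock_rec *))
{
    struct nfs_lock_rec *r;
    int failed = 0;

    for (r = m->locks; r != NULL; r = r->next)
        if (reclaim(r) != 0)
            failed++;
    return failed;
}

/* Stub reclaim RPC that always succeeds and counts its calls. */
static int reclaim_calls;

static int reclaim_always_ok(const struct nfs_lock_rec *r)
{
    (void)r;
    reclaim_calls++;
    return 0;
}
```

Nothing in this loop touches the local lock lists, which is the property the text is after.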
Non-POSIX locks
We already provide five different lock types:
- 4.4 BSD flock() locks. These are whole-file locks which are
inherited across a fork(). They are not checked for deadlock.
- POSIX fcntl() locks. These are byte-range locks which are inherited
across an exec(), but not a fork(). They are checked for deadlock.
(A subtype of this is the LFS variant which allows for 64-bit
offsets.)
- Leases. These are whole-file locks which are broken when another
process attempts to open the file for an operation which would
conflict with the lease type. When they are broken, the owner
receives a signal and must ensure the file is in a consistent state
before releasing their lease.
- Share modes. These are whole-file mandatory locks. No other process
may open a file which would conflict with the Share Mode on the file. Use
flock() with the LOCK_MAND flag to set a Share Mode.
- Mandatory Locks. These are byte-range mandatory locks. To use them,
mount the filesystem with the `mand' option enabled, and set the
file mode to g-x,g+s (setgid on, group execute off). POSIX locks
applied to this file will now
be mandatory. Mandatory locks do not prevent accesses via mmap().
You should not use Mandatory locks in new code.
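For reference, the first two types in the list above are taken from userspace like this. This is a minimal sketch using the standard flock() and fcntl() calls; the helper names and the temporary file path are my own, and it assumes a local Linux filesystem where the two lock types do not interact.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* BSD whole-file lock, non-blocking. Returns 0 on success, -1 on error. */
static int take_flock(int fd)
{
    return flock(fd, LOCK_EX | LOCK_NB);
}

/* POSIX write lock on bytes [start, start + len), non-blocking. */
static int take_posix_lock(int fd, off_t start, off_t len)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = start;
    fl.l_len = len;
    return fcntl(fd, F_SETLK, &fl);
}

/* Takes both kinds of lock on a scratch file. Returns 0 if both worked. */
static int demo_lock_types(void)
{
    char path[] = "/tmp/locktypes-XXXXXX";  /* hypothetical scratch file */
    int fd = mkstemp(path);
    int rc = 0;

    if (fd < 0)
        return -1;
    unlink(path);
    if (take_flock(fd) != 0)
        rc = -1;
    if (take_posix_lock(fd, 0, 100) != 0)
        rc = -1;
    close(fd);  /* closing the only fd drops both locks */
    return rc;
}
```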
The proposal mentioned above would add a sixth -- whatever the filesystem
supports. Ncpfs already does this through an ioctl, but that could be
supported `natively' through this new scheme.
I want to add another byte-range lock, which looks and smells like
a POSIX fcntl lock except that it is not removed by
closing any fd which happens to be open on this file. Samba keeps a
list of open fds which are not currently in use on any locked file to
work around this stupidity in the spec. I'd like the external interface
to this to be fcntl(F_SETLK_NP) and F_SETLKW_NP. Clearly F_GETLK does
not need to be altered or replaced.
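The behaviour being complained about is easy to show with the existing interface. The sketch below takes a POSIX lock, does close(dup(fd)), and asks a child process via F_GETLK whether the lock is still there. The helper names and scratch path are mine; it only uses today's fcntl() calls, since F_SETLK_NP is a proposal and does not exist.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Asks, from a separate process, whether a conflicting POSIX lock is held
 * on the file. Returns 1 if locked, 0 if not, -1 on error.
 */
static int locked_by_other(const char *path)
{
    int status, code;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock fl = {0};
        int fd = open(path, O_RDWR);

        if (fd < 0)
            _exit(2);
        fl.l_type = F_WRLCK;
        fl.l_whence = SEEK_SET;  /* l_start = l_len = 0: whole file */
        if (fcntl(fd, F_GETLK, &fl) < 0)
            _exit(2);
        _exit(fl.l_type == F_UNLCK ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}

/* Returns 1 if the lock survived close(dup(fd)), 0 if not, -1 on error. */
static int lock_survives_close_dup(void)
{
    char path[] = "/tmp/closedup-XXXXXX";  /* hypothetical scratch file */
    struct flock fl = {0};
    int fd = mkstemp(path);
    int result = -1;

    if (fd < 0)
        return -1;
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    if (fcntl(fd, F_SETLK, &fl) == 0 && locked_by_other(path) == 1) {
        int copy = dup(fd);

        if (copy >= 0) {
            close(copy);  /* closes a different fd on the same file */
            result = locked_by_other(path);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```

With POSIX semantics the lock is gone after the close(), even though the original fd is still open. A lock taken with the proposed F_SETLK_NP would be expected to survive it.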
Restructuring
locks.c still runs almost entirely under the BKL. An earlier attempt
to move it to a different locking scheme was thwarted when the code
was integrated into 2.4.0-test9 while I was on holiday, and without me
submitting it to Linus. Grumble. I plan to move it to _one_ spinlock
to cover all lock-related structures, and I think that will be possible
with the plan described above (since this code will no longer sleep).
As soon as lockd no longer needs to keep its fingers inside locks.c, I
want to remove the global list of locks. It's also used by /proc/locks
-- which probably needs to go away anyway. So what's useful about
/proc/locks? I'd like to be able to see which locks my process has,
and which processes have a lock on a given file. The former is easy --
/proc/$PID/locks can be constructed relatively easily from the fds open
by that process. The latter? I don't know. Ideas welcome.
Links
- POSIX file locking
- Olaf Kirch's page on NLM (warning: out of date)
Matthew Wilcox <matthew@wil.cx>
Last updated 2001-04-30