hugetlbfs: introduce truncation/fault mutex to avoid races
The following hugetlbfs truncate/page fault race can be recreated
with programs doing something like the following.
A hugetlbfs file is mmap(MAP_SHARED) with a size of 4 pages. At
mmap time, 4 huge pages are reserved for the file/mapping. So,
the global reserve count is 4. In addition, since this is a shared
mapping an entry for 4 pages is added to the file's reserve map.
The first 3 of the 4 pages are faulted into the file. As a result,
the global reserve count is now 1.
Task A starts to fault in the last page (routines hugetlb_fault,
hugetlb_no_page). It allocates a huge page (alloc_huge_page).
The reserve map indicates there is a reserved page, so this is
used and the global reserve count goes to 0.
Now, task B truncates the file to size 0. It starts by setting
inode size to 0 (hugetlb_vmtruncate). It then unmaps all mappings
of the file (hugetlb_vmdelete_list). Since task A's page table
lock is not held at the time, truncation is not blocked. Truncation
removes the 3 pages from the file (remove_inode_hugepages). When
cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
the reserve map was for 4 pages. However, it has only freed 3 pages.
So it assumes there is still (4 - 3) = 1 reserved page. It then
decrements the global reserve count by 1 and it goes negative.
Task A then continues the page fault process and adds its newly
acquired page to the page cache. Note that the index of this page
is beyond the size of the truncated file (0). The page fault process
then notices the file has been truncated and exits. However, the
page is left in the cache associated with the file.
Now, if the file is immediately deleted the truncate code runs again.
It will find and free the one page associated with the file. When
cleaning up reserves, it notices the reserve map is empty. Yet, one
page was freed. So, the global reserve count is decremented by (0 - 1) = -1.
This returns the global count to 0 as it should be. But, it is
possible for someone else to mmap this file/range before it is deleted.
If this happens, a reserve map entry for the allocated page is created
and the reserved page is forever leaked.
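
The reserve arithmetic in the sequence above can be replayed with a
small userspace model. This is only a sketch of the accounting, not
kernel code: the toy_* names and structures are hypothetical, and the
model assumes the unreserve step subtracts (reserve map pages - freed
pages) from the global count, as described above.

```c
#include <assert.h>

/* Toy model only: hypothetical names, not kernel symbols. */
struct toy_hstate {
	long resv_huge_pages;	/* global reserve count */
};

struct toy_file {
	long resv_map_pages;	/* pages recorded in the file's reserve map */
	long cached_pages;	/* pages currently in the file's page cache */
};

/* mmap(MAP_SHARED): reserve pages globally and in the reserve map. */
static void toy_mmap(struct toy_hstate *h, struct toy_file *f, long pages)
{
	f->resv_map_pages += pages;
	h->resv_huge_pages += pages;
}

/* First half of a fault: allocate a page, consuming one reservation. */
static void toy_alloc(struct toy_hstate *h)
{
	h->resv_huge_pages--;
}

/* Second half of a fault: add the page to the page cache. */
static void toy_add_to_cache(struct toy_file *f)
{
	f->cached_pages++;
}

/* Truncate to size 0: free cached pages, then clean up reserves. */
static void toy_truncate(struct toy_hstate *h, struct toy_file *f)
{
	long freed = f->cached_pages;
	long chg = f->resv_map_pages;

	assert(freed >= 0 && chg >= 0);
	f->cached_pages = 0;
	f->resv_map_pages = 0;
	h->resv_huge_pages -= (chg - freed);
}

/*
 * Replay the race. Returns the final global reserve count and stores
 * the count seen right after task B's truncate in *after_truncate.
 */
static long toy_replay_race(long *after_truncate)
{
	struct toy_hstate h = { 0 };
	struct toy_file f = { 0, 0 };
	int i;

	toy_mmap(&h, &f, 4);		/* global count: 4 */
	for (i = 0; i < 3; i++) {	/* fault in first 3 pages */
		toy_alloc(&h);
		toy_add_to_cache(&f);
	}				/* global count: 1 */
	toy_alloc(&h);			/* task A allocates: 0 */
	toy_truncate(&h, &f);		/* task B: 0 - (4 - 3) */
	*after_truncate = h.resv_huge_pages;
	toy_add_to_cache(&f);		/* task A adds page past EOF */
	toy_truncate(&h, &f);		/* delete: count - (0 - 1) */
	return h.resv_huge_pages;
}
```

In this model the count is transiently negative after the first
truncate and only returns to 0 if the file is deleted before anyone
else maps the range.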
To avoid all these conditions, let's simply prevent faults to a file
while it is being truncated. Add a new truncation-specific rw mutex
to hugetlbfs inode extensions. Faults take the mutex in read mode;
truncation takes it in write mode.
Orabug: 28734496
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Larry Bassel <larry.bassel@oracle.com>
Signed-off-by: Brian Maly <brian.maly@oracle.com>