As a part of the series to move xfs global stats from procfs to sysfs,
this patch consolidates the sysfs ops functions and removes redundancy.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit a27c264009cb236dc209875043a0c237af46a1af) Signed-off-by: Dan Duval <dan.duval@oracle.com>
As a part of the work to move xfs global stats from procfs to sysfs,
this patch removes the now unused procfs code that was xfs stat specific.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 50cf5b7402633265fe11430fd63af82c401fad46) Signed-off-by: Dan Duval <dan.duval@oracle.com>
As a part of the work to move xfs global stats from procfs to sysfs,
this patch creates the symlink from proc/fs/xfs/stat to sys/fs/xfs/stats.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 32f0ea05219e5212cafcbeaa255e627314f87e7e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Currently, xfs global stats are in procfs. This patch introduces
(replicates) the global stats in sysfs. Additionally a stats_clear file
is introduced in sysfs.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit bb230c124730f21eea13deab433f9f8fc96bd5f3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit 1ca1915 ("xfs: Don't use unwritten extents for DAX") enabled
the DAX allocation call to dip into the reserve pool in case it was
converting unwritten extents rather than allocating blocks. This was
a direct copy of the unwritten extent conversion code, but had an
unintended side effect of allowing normal data block allocation to
use the reserve pool. Hence normal block allocation could deplete
the reserve pool and prevent unwritten extent conversion at ENOSPC,
hence violating fallocate guarantees on preallocated space.
Fix it by checking whether the incoming map from __xfs_get_blocks()
spans an unwritten extent and only use the reserve pool if the
allocation covers an unwritten extent.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3b0fe47805802216087259b07de691ef47ff6fbc) Signed-off-by: Dan Duval <dan.duval@oracle.com>
These actions are completely managed by a block driver or can use the
badblocks api directly.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 55f5560d8c18fe33fc169f8d244a9247dcac7612) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Support badblock checking in all the pmem read paths that do not go
through the block layer. This protects info block reads (btt or pfn) as
well as data reads to a pmem namespace via a btt instance.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 710d69cc99507803ed91b4ec7368fbd66d59f014) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Check the sectors specified in a read bio to see if they hit a known bad
block, and return an error code pmem_do_bvec().
Note that the ->rw_page() is not in a position to return errors. For
now, copy the same layering violation present in zram_rw_page() to avoid
crashes of the form:
...i.e. unlock a page that was already unlocked via pmem_rw_page() =>
page_endio().
Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e10624f8c09710b3b0740ea3847627ea02f55c39) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If a device will ever have badblocks it should always have a badblocks
instance available. So, similar to md, embed a badblocks instance in
pmem_device. This reduces pointer chasing in the i/o fast path, and
simplifies the init path.
Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b95f5f4391fad65f1819c2404080b05ca95bdd92) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If the badblocks list runs out of space it simply means that software is
unable to intercept all errors. This is no different than the latent
discovery of new badblocks case and should not be an initialization
failure condition.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 87ba05dff3510f9e058b35d3c3fa222b6f406ecc) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Provide a devres interface for initializing a badblocks instance. The
pmem driver has several scenarios where it will be beneficial to have
this structure automatically freed when the device is disabled / fails
probe.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 16263ff6c72eb4cc00aa287230144dda12ccad12) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The badblocks list attached to a gendisk is allocated by the driver
which equates to the driver owning the lifetime of the object. Do not
automatically free it in del_gendisk(). This is in preparation for
expanding the use of badblocks in libnvdimm drivers and introducing
devm_init_badblocks().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 20a308f09e0d29ce6f5a4114cc476a998d569bfb) Signed-off-by: Dan Duval <dan.duval@oracle.com>
For symmetry with badblocks_init() make it clear that this path only
destroys incremental allocations of a badblocks instance, and does not
free the badblocks instance itself.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit d3b407fb3f782bd915db64e266010ea30a2d381e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
nd-core.h is private to the libnvdimm core internals and should not be
used by drivers.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ad9a8bde2cb19f6876f964fc48acc8b6a2f325ff) Signed-off-by: Dan Duval <dan.duval@oracle.com>
During region creation, perform Address Range Scrubs (ARS) for the SPA
(System Physical Address) ranges to retrieve known poison locations from
firmware. Add a new data structure 'nd_poison' which is used as a list
in nvdimm_bus to store these poison locations.
When creating a pmem namespace, if there is any known poison associated
with its physical address space, convert the poison ranges to bad sectors
that are exposed using the badblocks interface.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0caeef63e6d2f866d85bb507bf63e0ce8ec91cef) Signed-off-by: Dan Duval <dan.duval@oracle.com>
In preparation for getting a poison list using ARS DSMs, enable DSMs for
all manufactured NFITs supplied by the test framework. Also, supply
valid response data for ars_status.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit d26f73f083ed6fbea7fd3fdbacb527b7f3e75ac0) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks for all manipulation.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit fc974ee2bffdde47d1e4b220cf326952cc2c4794)
NVDIMM devices, which can behave more like DRAM rather than block
devices, may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.
The block device maintaining a runtime list of all known sectors that
have poison can directly avoid this, and also provide a path forward
to enable proper handling/recovery for DAX faults on such a device.
Use the new badblock management interfaces to add a badblocks list to
gendisks.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 99e6608c9e7414ae4f2168df8bf8fae3eb49e41f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.
The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 9e0e252a048b0ba5066f0dc15c3b2468ffe5c422) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When tearing down a block device early in its lifetime, userspace may
still be performing discovery actions like blkdev_ioctl() to re-read
partitions.
The nvdimm_revalidate_disk() implementation depends on
disk->driverfs_dev to be valid at entry. However, it is set to NULL in
del_gendisk() and fatally this is happening *before* the disk device is
deleted from userspace view.
There's no reason for del_gendisk() to clear ->driverfs_dev. That
device is the parent of the disk. It is guaranteed to not be freed
until the disk, as a child, drops its ->parent reference.
We could also fix this issue locally in nvdimm_revalidate_disk() by
using disk_to_dev(disk)->parent, but lets fix it globally since
->driverfs_dev follows the lifetime of the parent. Longer term we
should probably just add a @parent parameter to add_disk(), and stop
carrying this pointer in the gendisk.
Reported-by: Robert Hu <robert.hu@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ac34f15e0c6d2fd58480052b6985f6991fb53bcc) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Similar to the file_inode() helper, provide a helper to lookup the inode for a
raw block device itself.
Cc: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4ebb16ca9a06a54cdb2e7f2ce1e506fa4d432445) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
semantics by default. If userspace really believes it is safe to access
the memory region it can also perform the extra step of disabling an
active driver. This protects device address ranges with read side
effects and otherwise directs userspace to use the driver.
Persistent memory presents a large "mistake surface" to /dev/mem as now
accidental writes can corrupt a filesystem.
In general if a device driver is busily using a memory region it already
informs other parts of the kernel to not touch it via
request_mem_region(). /dev/mem should honor the same safety restriction
by default. Debugging a device driver from userspace becomes more
difficult with this enabled. Any application using /dev/mem or mmap of
sysfs pci resources will now need to perform the extra step of either:
2/ Rebooting with "iomem=relaxed" on the command line
3/ Recompiling with CONFIG_IO_STRICT_DEVMEM=n
Traditional users of /dev/mem like dosemu are unaffected because the
first 1MB of memory is not subject to the IO_STRICT_DEVMEM restriction.
Legacy X configurations use /dev/mem to talk to graphics hardware, but
that functionality has since moved to kernel graphics drivers.
Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 90a545e981267e917b9d698ce07affd69787db87) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When btt devices were re-worked to be child devices of regions this
routine was overlooked. It mistakenly attempts to_nd_namespace_pmem()
or to_nd_namespace_blk() conversions on btt and pfn devices. By luck to
date we have happened to be hitting valid memory leading to a uuid
miscompare, but a recent change to struct nd_namespace_common causes:
Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e07ecd76d4db7bda1e9495395b2110a3fe28845a) Signed-off-by: Dan Duval <dan.duval@oracle.com>
'Memory mode' is defined as the capability of a DAX mapping to be the
source/target of DMA and other "direct I/O" scenarios. While it
currently requires allocating 'struct page' for each page frame of
persistent memory in the namespace it will not always be the case. Work
continues on reducing the kernel's dependency on 'struct page'.
Let's not maintain a suffix that is expected to lose meaning over time.
In other words a future 'raw mode' pmem namespace may be as capable as
today's 'memory mode' namespace. Undo the encoding of the mode in the
device name and leave it to other tooling to determine the mode of the
namespace from its attributes.
Reported-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0731de0dd95b251ed6cfb5f132486e52357fce53) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The -ENODEV case indicates that the info-block needs to established.
All other return codes cause nd_pfn_init() to abort.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 3fa962686568a1617d174e3d2f5d522e963077c5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The unit test infrastructure uses CMA and real memory to emulate nvdimm
resources. The call to devm_memremap_pages() can simply be mocked in
the same manner as memremap and we mock phys_to_pfn_t() to clear PFN_MAP
since these resources are not registered with in the pgmap_radix.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 979fccfb7348dbd968daf0249aa484a0297f83df) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Similar to btt, plant a new pfn seed when the existing one is activated.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 2dc43331e34fa992a67f42ed44e5111cafafd6f3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Track and check the uuid of the namespace hosting a pfn instance. This
forces the pfn info block to be invalidated if the namespace is
re-configured with a different uuid.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit a34d5e8a6ad1a31b186019c9c351777626698863) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When setting aside capacity for struct page it must be aligned to the
largest mapping size that is to be made available via DAX. Make the
alignment configurable to enable support for 1GiB page-size mappings.
The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the
offset check in nvdimm_namespace_attach_pfn().
Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 315c562536c42aa4da9b6c5a2135dd6715a5e0b5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
In all cases __nd_pfn_create is called with default parameters which are
then overridden by values in the info block. Clean up pfn creation by
dropping the parameters and setting default values internal to
__nd_pfn_create.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f7c6ab80fa5fee3daccb83a3c1b3a9f39d7b931c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The alignment constraint isn't necessary now that devm_memremap_pages()
allows for unaligned mappings.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 9f1e8cee7742cadbe6b97f2c80b787b4ee067bae) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This simple change hides pfn_seed attribute for non pmem
regions because they don't support pfn anyway.
Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 6bb691ac089c39bb0aa73bdcc21ffd8c846e4ba5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
In order to bind namespace to the driver user must first
set all mandatory attributes in the following order:
- uuid
- size
- sector_size (for blk namespace only)
If the order is wrong, then user either won't be able to set
the attribute or bind the namespace.
This simple patch improves diagnosibility of common operations
with namespaces by printing some details about the error
instead of failing silently.
Below are examples of error messages (assuming dyndbg is
enabled for nvdimms):
[/]# echo 4194304 > /sys/bus/nd/devices/region5/namespace5.0/size
[ 288.372612] nd namespace5.0: __size_store: uuid not set
[ 288.374839] nd namespace5.0: size_store: 400000 fail (-6)
sh: write error: No such device or address
[/]#
[/]# echo namespace5.0 > /sys/bus/nd/drivers/nd_blk/bind
[ 554.671648] nd_blk namespace5.0: nvdimm_namespace_common_probe: sector size not set
[ 554.674688] ndbus1: nd_blk.probe(namespace5.0) = -19
sh: write error: No such device
[/]#
Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit bd26d0d0ce7434a86dde61a7c65c94fe3801d8f6) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Even if dev->driver is null because we are being removed,
it is safer to not leave device locked.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit d91e892825ae6f0ed4f8b07ae5d348eff86ab2ea) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When support for _FIT was added, the code presumed that the data
returned by the _FIT method is identical to the NFIT table, which
starts with an acpi_table_header. However, the _FIT is defined
to return a data in the format of a series of NFIT type structure
entries and as a method, has an acpi_object header rather tahn
an acpi_table_header.
To address the differences, explicitly save the acpi_table_header
from the NFIT, since it is accessible through /sys, and change
the nfit pointer in the acpi_desc structure to point to the
table entries rather than the headers.
Reported-by: Jeff Moyer (jmoyer@redhat.com> Signed-off-by: Linda Knippers <linda.knippers@hpe.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com>
[vishal: fix up unit test for new header assumptions] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 6b577c9d772c45448aec784ec235cea228b4d3ad) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Missed previously due to a lack of test coverage on a platform that
provided an valid response to _FIT.
Signed-off-by: Linda Knippers <linda.knippers@hpe.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ff5a55f89c6690a0b292f1a7e0cd4532961588d5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The size of NFIT tables don't necessarily match the size of the
data structures that we use for them. For example, the NVDIMM
Control Region Structure table is shorter for a device with
no block control windows than for a device with block control windows.
Other tables, such as Flush Hint Address Structure and the Interleave
Structure are variable length by definition.
Account for the size difference when comparing table entries by
using the actual table size from the table header if it's less
than the structure size.
Signed-off-by: Linda Knippers <linda.knippers@hpe.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 826c416f3c9493b69630a811832cfb7c9007f840) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If there are no persistent memory ranges present then don't bother
creating the platform device. Otherwise, it loads the full libnvdimm
sub-system only to discover no resources present.
Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit bc0d0d093b379b0b379c429e3348498287c8a9ca) Signed-off-by: Dan Duval <dan.duval@oracle.com>
To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.
Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.
While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can per-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3fbbbea34bac049c0b5938dc065f7d8ee1ef7e67) Signed-off-by: Dan Duval <dan.duval@oracle.com>
DAX handling of COW faults has wrong locking sequence:
dax_fault does i_mmap_lock_read
do_cow_fault does i_mmap_unlock_write
Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
commit[3].
Original COW locking logic was introduced by Matthew here[4].
This should be applied to v4.3 as well.
[1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
[2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 843172978bb9 dax: fix race between simultaneous faults
[4] 2e4cdab0584f mm: allow page fault handlers to perform the COW
Cc: <stable@vger.kernel.org> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0df9d41ab5d43dc5b20abc8b22a6b6d098b03994) Signed-off-by: Dan Duval <dan.duval@oracle.com>
All of these paths are interpreting a dax pmd mapping as a transparent
huge page and making the assumption that the pfn is covered by the
memmap, i.e. that the pfn has an associated struct page. PTE mappings
do not suffer the same fate since they have the _PAGE_SPECIAL flag to
cause the gup path to fault. We can do something similar for the PMD
path, or otherwise defer pmd support for cases where a struct page is
available. For now, 4.4-rc and -stable need to disable dax pmd support
by default.
For development the "depends on BROKEN" line can be removed from
CONFIG_FS_DAX_PMD.
Cc: <stable@vger.kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ee82c9ed41e896bd47e121d87e4628de0f2656a3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Similar to XFS warn when mounting DAX while it is still considered under
development. Also, aspects of the DAX implementation, for example
synchronization against multiple faults and faults causing block
allocation, depend on the correct implementation in the filesystem. The
maturity of a given DAX implementation is filesystem specific.
Cc: <stable@vger.kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: linux-ext4@vger.kernel.org Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Chinner <david@fromorbit.com> Acked-by: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ef83b6e8f40bb24b92ad73b5889732346e54a793) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Since 4.3 introduced devm_memremap_pages() the pfns handled by DAX may
optionally have a struct page backing. When a mapped pfn reaches
vmf_insert_pfn_pmd() it fails with a crash signature like the following:
Fix this by falling back to 4K mappings in the pfn_valid() case. Longer
term, vmf_insert_pfn_pmd() needs to grow support for architectures that
can provide a 'pmd_special' capability.
Cc: <stable@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 152d7bd80dca5ce77ec2d7313149a2ab990e808e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
A bunch of changes that I hope will help in understanding it
better for first-time readers.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 8de5dff8bae634497f4413bc3067389f2ed267da) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This masking prevents access to the end of the device via dax_do_io(),
and is unnecessary as arch_add_memory() would have rejected an unaligned
allocation.
Cc: <stable@vger.kernel.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 589e75d15702dc720b363a92f984876704864946) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit ca321d1ca672 "ACPICA: Update NFIT table to rename a flags field"
performed a tree-wide s/ACPI_NFIT_MEM_ARMED/ACPI_NFIT_MEM_NOT_ARMED/
operation, but missed the tools/testing/nvdimm/ directory.
Cc: Bob Moore <robert.moore@intel.com> Cc: Lv Zheng <lv.zheng@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f42957967fb435aef6fc700fbbd9df89533b9a2e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
->pfn_mkwrite support is needed so that when a page with allocated
backing store takes a write fault we can check that the fault has
not raced with a truncate and is pointing to a region beyond the
current end of file.
This also allows us to update the timestamp on the inode, too, which
fixes a generic/080 failure.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3af49285854df66260a263198cc15abb07b95287) Signed-off-by: Dan Duval <dan.duval@oracle.com>
For DAX, we are now doing block zeroing during allocation. This
means we no longer need a special DAX fault IO completion callback
to do unwritten extent conversion. Because mmap never extends the
file size (it SEGVs the process) we don't need a callback to update
the file size, either. Hence we can remove the completion callbacks
from the __dax_fault and __dax_mkwrite calls.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 01a155e6cf7db1a8ff2aa73162d7d9ec05ad298f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
DAX has a page fault serialisation problem with block allocation.
Because it allows concurrent page faults and does not have a page
lock to serialise faults to the same page, it can get two concurrent
faults to the page that race.
When two read faults race, this isn't a huge problem as the data
underlying the page is not changing and so "detect and drop" works
just fine. The issues are to do with write faults.
When two write faults occur, we serialise block allocation in
get_blocks() so only one faul will allocate the extent. It will,
however, be marked as an unwritten extent, and that is where the
problem lies - the DAX fault code cannot differentiate between a
block that was just allocated and a block that was preallocated and
needs zeroing. The result is that both write faults end up zeroing
the block and attempting to convert it back to written.
The problem is that the first fault can zero and convert before the
second fault starts zeroing, resulting in the zeroing for the second
fault overwriting the data that the first fault wrote with zeros.
The second fault then attempts to convert the unwritten extent,
which is then a no-op because it's already written. Data loss occurs
as a result of this race.
Because there is no sane locking construct in the page fault code
that we can use for serialisation across the page faults, we need to
ensure block allocation and zeroing occurs atomically in the
filesystem. This means we can still take concurrent page faults and
the only time they will serialise is in the filesystem
mapping/allocation callback. The page fault code will always see
written, initialised extents, so we will be able to remove the
unwritten extent handling from the DAX code when all filesystems are
converted.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 1ca191576fc862b4766f58e41aa362b28a7c1866) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Both direct IO and DAX pass an offset and count into get_blocks that
will overflow a s64 variable when an IO goes into the last supported
block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
from the tracing:
0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
overflows the s64 offset and we fail to detect the need for a
filesize update and an ioend is not allocated.
This is *mostly* avoided for direct IO because such extending IOs
occur with full block allocation, and so the "IS_UNWRITTEN()" check
still evaluates as true and we get an ioend that way. However, doing
single sector extending IOs to this last block will expose the fact
that file size updates will not occur after the first allocating
direct IO as the overflow will then be exposed.
There is one further complexity: the DAX page fault path also
exposes the same issue in block allocation. However, page faults
cannot extend the file size, so in this case we want to allocate the
block but do not want to allocate an ioend to enable file size
update at IO completion. Hence we now need to distinguish between
the direct IO patch allocation and dax fault path allocation to
avoid leaking ioend structures.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3e12dbbdbd8809f0455920e42fdbf9eddc002651) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Add explicit filtering for DAX mappings to FDPIC ELF coredump. This is
useful because DAX mappings have the potential to be very large.
This patch has only been compile tested.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ab27a8d04b32b6ee8c30c14c4afd1058e8addc82) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Add two new flags to the existing coredump mechanism for ELF files to
allow us to explicitly filter DAX mappings. This is desirable because
DAX mappings, like hugetlb mappings, have the potential to be very
large.
Update the coredump_filter documentation in
Documentation/filesystems/proc.txt so that it addresses the new DAX
coredump flags. Also update the documented default value of
coredump_filter to be consistent with the core(5) man page. The
documentation being updated talks about bit 4, Dump ELF headers, which
is enabled if CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is turned on in the
kernel config. This kernel config option defaults to "y" if both ELF
binaries and coredump are enabled.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 5037835c1f3eabf4f22163fc0278dd87165f8957) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Add a .notify callback to the acpi_nfit_driver that gets called on a
hotplug event. From this, evaluate the _FIT ACPI method which returns
the updated NFIT with handles for the hot-plugged NVDIMM.
Iterate over the new NFIT, and add any new tables found, and
register/enable the corresponding regions.
In the nfit test framework, after normal initialization, update the NFIT
with a new hot-plugged NVDIMM, and directly call into the driver to
update its view of the available regions.
Cc: Dan Williams <dan.j.williams@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Elliott, Robert <elliott@hpe.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: <linux-acpi@vger.kernel.org> Cc: <linux-nvdimm@lists.01.org> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 209851649dc4f7900a6bfe1de5e2640ab2c7d931) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If acpi_nfit_init is called (such as from nfit_test), with an nfit table
that has more memory allocated than it needs (and a similarly large
'size' field, add_tables would happily keep adding null SPA Range tables
filling up all available memory.
Make it friendlier by breaking out if a 0-length header is found in any
of the tables.
Cc: Dan Williams <dan.j.williams@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: <linux-acpi@vger.kernel.org> Cc: <linux-nvdimm@lists.01.org> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 564d501187317f8df79ddda173cf23735cbddd16) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Given that pmem ranges come with numa-locality hints, arrange for the
resulting driver objects to be obtained from node-local memory.
Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 538ea4aa44737127ce2b5c8511c7349d2abdcf9c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Hint to closest numa node for the placement of newly allocated pages.
As that is where the device's other allocations will originate by
default when it does not specify a NUMA node.
Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 7eff93b7c99f5d0024aee677c6c92e32af22e1d2) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Given we already have a device just use dev_to_node() to provide hint
allocations for devres. However, current devres_alloc() users will need
to explicitly opt-in with devres_alloc_node().
Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 7c683941f30a977c10ec6be174ec5f16939c7ce5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Make devm_memremap consistent with the error return scheme of
devm_memremap_pages to remove special casing in the pmem driver.
Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b36f47617f6ce7c5e8e7c264b9d9ea0654d9f20a) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Now that the pmem-api is defined as "a set of apis that enables access
to WB mapped pmem", the mapping type is implied. Remove the wrapper
and push the functionality down into the pmem driver in preparation for
adding support for direct-mapped pmem.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit a639315d6c536c806724c9328941a2517507e3e3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Switch to pr_debug() so that dynamic-debug can disable these messages by
default. This gets noisy in the presence of devm_memremap_pages().
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit c9cdaeb2027e535b956ff69f215522d79f6b54e3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Add locking to ensure that DAX faults are isolated from ext2 operations
that modify the data blocks allocation for an inode. This is intended to
be analogous to the work being done in XFS by Dave Chinner:
Compared with XFS the ext2 case is greatly simplified by the fact that ext2
already allocates and zeros new blocks before they are returned as part of
ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
unwritten buffer heads.
This means that the only work we need to do in ext2 is to isolate the DAX
faults from inode block allocation changes. I believe this just means that
we need to isolate the DAX faults from truncate operations.
The newly introduced dax_sem is intended to replicate the protection
offered by i_mmaplock in XFS. In addition to truncate the i_mmaplock also
protects XFS operations like hole punching, fallocate down, extent
manipulation IOCTLS like xfs_ioc_space() and extent swapping. Truncate is
the only one of these operations supported by ext2.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.com>
(cherry picked from commit 5726b27b09cc92452b543764899a07e7c8037edd) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The following two locking commits in the DAX code:
commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
introduced a number of deadlocks and other issues which need to be fixed
for the v4.3 kernel. The list of issues in DAX after these commits
(some newly introduced by the commits, some preexisting) can be found
here:
https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").
This undoes most of the changes introduced by those two commits,
essentially returning us to the DAX locking scheme that was used in
v4.2.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Tested-by: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f90cc6609c72b0bdf2aad0cb0456194dd896e19) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
DAX") moved some code in __dax_pmd_fault() that was responsible for
zeroing newly allocated PMD pages. The new location didn't properly set
up 'kaddr', so when run this code resulted in a NULL pointer BUG.
Fix this by getting the correct 'kaddr' via bdev_direct_access().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8346c416d17bf5b4ea1508662959bb62e73fd6a5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
vm_ops->page_mkwrite to notify abort write access. This means we want
vma->vm_page_prot to be write-protected if the VMA provides this vm_ops.
A theoretical scenario that will cause these missed events is:
On writable mapping with vm_ops->pfn_mkwrite, but without
vm_ops->page_mkwrite: read fault followed by write access to the pfn.
Writable pte will be set up on read fault and write fault will not be
generated.
I found it examining Dave's complaint on generic/080:
It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite
and page_mkwrite.
[akpm@linux-foundation.org: add local vm_ops to avoid 80-cols mess] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Yigal Korman <yigal@plexistor.com> Acked-by: Boaz Harrosh <boaz@plexistor.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8a04446ab0cf4f35d9f583cd6adcbf7c534e4995) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The various definitions of __pfn_to_phys() have been consolidated to
use a generic macro in include/asm-generic/memory_model.h. This hit
mainline in the form of 012dcef3f058 "mm: move __phys_to_pfn and
__pfn_to_phys to asm/generic/memory_model.h". When the generic macro
was implemented the type cast to phys_addr_t was dropped which caused
boot regressions on ARM platforms with more than 4GB of memory and
LPAE enabled.
It was suggested to use PFN_PHYS() defined in include/linux/pfn.h
as provides the correct logic and avoids further duplication.
Reported-by: kernelci.org bot <bot@kernelci.org> Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Tyler Baker <tyler.baker@linaro.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ae4f976968896f8f41b3a7aa21be6146492211e5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the
newly written data is durable. This flow was added to pmem_rw_bytes()
and pmem_make_request() with this commit:
commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates")
...the pmem_rw_page() path was missed.
Cc: <stable@vger.kernel.org> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ba8fe0f85e15d047686caf8a42463b592c63c98c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.
Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4ca8b57a0af145f4e791f21dbca6ad789da9ee8b) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.
Cc: <stable@vger.kernel.org> Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4be9c1fc3df9c3b03c9bde8aec5e44fc73996a3f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
commit bbab37ddc20b (block: Add support for DAX reads/writes to
block devices) caused a regression in mkfs.xfs. That utility
sets the block size of the device to the logical block size
using the BLKBSZSET ioctl, and then issues a single sector read
from the last sector of the device. This results in the dax_io
code trying to do a page-sized read from 512 bytes from the end
of the device. The result is -ERANGE being returned to userspace.
The fix is to align the block to the page size before calling
get_block.
Thanks to willy for simplifying my original patch.
Cc: <stable@vger.kernel.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Linda Knippers <linda.knippers@hp.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e94f5a2285fc94202a9efb2c687481f29b64132c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates") added a new __pmem annotation for sparse
verification. Add __pmem to the $Sparse variable so checkpatch can
appropriately ignore uses of this attribute too.
Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 54507b5183cc4f8e4f1a58a312e1f30c130658b7) Signed-off-by: Dan Duval <dan.duval@oracle.com>
As part of the v4.3 merge window the DAX code was updated by Matthew and
Kirill to handle PMD pages. Also as part of the v4.3 merge window we
updated the DAX code to do proper PMEM flushing (commit 2765cfbb342c:
"dax: update I/O path to do proper PMEM flushing").
The additional code added by the DAX PMD patches also needs to be
updated to properly use the PMEM API. This ensures that after a PMD
fault is handled the zeros written to the newly allocated pages are
durable on the DIMMs.
linux/dax.h is included to get rid of a bunch of sparse warnings.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com>, Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill Shutemov <kirill@shutemov.name> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d77e92e270edd79a2218ce0ba5b6179ad0c93175) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If the first access to a huge page was a store, there would be no existing
zero pmd in this process's page tables. There could be a zero pmd in
another process's page tables, if it had done a load. We can detect this
case by noticing that the buffer_head returned from the filesystem is New,
and ensure that other processes mapping this huge page have their page
tables flushed.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 73a6ec47f68787df1b41869def52915da2f4a6b7) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d295e3415a88ae63a37a22652808b20c7fcb970e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The original DAX code assumed that pgtable_t was a pointer, which isn't
true on all architectures. Restructure the code to not rely on that
assumption.
[willy@linux.intel.com: further fixes integrated into this patch] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit da146769004e1dd5ed06853e6d009be8ca675d5f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If two threads write-fault on the same hole at the same time, the winner
of the race will return to userspace and complete their store, only to
have the loser overwrite their store with zeroes. Fix this for now by
taking the i_mmap_sem for write instead of read, and do so outside the
call to get_block(). Now the loser of the race will see the block has
already been zeroed, and will not zero it again.
This severely limits our scalability. I have ideas for improving it, but
those can wait for a later patch.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 843172978bb92997310d2f7fbc172ece423cfc02) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock. We can avoid this by starting the transaction in ext4 before
calling into DAX. The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 01a33b4ace68bc35679a347f21d5ed6e222e30dc) Signed-off-by: Dan Duval <dan.duval@oracle.com>
DAX wants different semantics from any currently-existing ext4 get_block
callback. Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents. So introduce a new ext4_get_block_dax() which
has those semantics.
We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ed923b5776a2d2e949bd5b20f3956d68f3c826b7) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Jan Kara pointed out I should be more explicit here about the perils of
racing against truncate. The comment is mostly the same as for the PTE
case.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 84c4e5e675408b6fb7d74eec7da9a4a5698b50af) Signed-off-by: Dan Duval <dan.duval@oracle.com>
It would make more sense to have all the return values from
vmf_insert_pfn_pmd() encoded in one place instead of having to follow
the convention into insert_pfn(). Suggested by Jeff Moyer.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ae18d6dcf57b56b984ff27fd55b4e2caf5bfbd44) Signed-off-by: Dan Duval <dan.duval@oracle.com>
DAX relies on the get_block function either zeroing newly allocated
blocks before they're findable by subsequent calls to get_block, or
marking newly allocated blocks as unwritten. ext4_get_block() cannot
create unwritten extents, but ext4_get_block_write() can.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reported-by: Andy Rudoff <andy.rudoff@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e676a4c191653787c3fe851fe3b9f1f33d49dac2) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Similar to vm_insert_pfn(), but for PMDs rather than PTEs. The 'vmf_'
prefix instead of 'vm_' prefix is intended to indicate that it returns a
VMF_ value rather than an errno (which would only have to be converted
into a VMF_ value anyway).
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5cad465d7fa646bad3d677df276bfc8e2ad709e3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Allow non-anonymous VMAs to provide huge pages in response to a page fault.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b96375f74a6d4f39fc6cbdc0bce5175115c7f96f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
NOTE: Moved new member to the end of the vm_operations_struct and
surrounded their definitions with #ifdef __GENKSYMS__/#endif so as
to maintain kABI. - Dan Duval <dan.duval@oracle.com>
mm: clarify that the function operates on hugepage pte
We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add
_huge_ to pmdp_clear functions so that we are clear that they operate on
hugepage pte.
We don't bother about other functions like pmdp_set_wrprotect,
pmdp_clear_flush_young, because they operate on PTE bits and hence
indicate they are operating on hugepage ptes
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 8809aa2d28d74111ff2f1928edaa4e9845c97a7d)
Architectures like ppc64 [1] need to do special things while clearing pmd
before a collapse. For them this operation is largely different from a
normal hugepage pte clear. Hence add a separate function to clear pmd
before collapse. After this patch pmdp_* functions operate only on
hugepage pte, and not on regular pmd_t values pointing to page table.
[1] ppc64 needs to invalidate all the normal page pte mappings we already
have inserted in the hardware hash page table. But before doing that we
need to make sure there are no parallel hash page table insert going on.
So we need to do a kick_all_cpus_sync() before flushing the older hash
table entries. By moving this to a separate function we capture these
details and mention how it is different from a hugepage pte clear.
This patch is a cleanup and only does code movement for clarity. There
should not be any change in functionality.
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 15a25b2ead5f97c5a63c169186e294b41ce03f9a) Signed-off-by: Dan Duval <dan.duval@oracle.com>