www.infradead.org Git - users/jedix/linux-maple.git/log
8 years agoblock: Add badblock management for gendisks
Vishal Verma [Sat, 9 Jan 2016 16:36:51 +0000 (08:36 -0800)]
block: Add badblock management for gendisks

Orabug: 22913653

NVDIMM devices, which can behave more like DRAM than block devices,
may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.

The block device maintaining a runtime list of all known sectors that
have poison can directly avoid this, and also provide a path forward
to enable proper handling/recovery for DAX faults on such a device.

Use the new badblock management interfaces to add a badblocks list to
gendisks.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 99e6608c9e7414ae4f2168df8bf8fae3eb49e41f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agobadblocks: Add core badblock management code
Vishal Verma [Fri, 25 Dec 2015 02:20:32 +0000 (19:20 -0700)]
badblocks: Add core badblock management code

Orabug: 22913653

Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.

The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.
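The embed-anywhere structure plus accessor-function style described above can be sketched as a small user-space miniature (names and sizes are illustrative, not the kernel's badblocks API):

```c
#include <assert.h>

/* Hypothetical user-space miniature of the pattern described above:
 * a structure that can be embedded in any containing object, with
 * accessor functions to manipulate it (compare the style of kernel
 * linked lists and rb-trees).  Fixed-size array for simplicity. */
struct bb_range {
    unsigned long long sector;
    int len;
};

struct mini_badblocks {
    int count;
    struct bb_range ranges[32];
};

static void mini_bb_init(struct mini_badblocks *bb)
{
    bb->count = 0;
}

/* Record [sector, sector+len) as bad; returns 0 on success. */
static int mini_bb_set(struct mini_badblocks *bb,
                       unsigned long long sector, int len)
{
    if (bb->count >= 32)
        return -1;
    bb->ranges[bb->count].sector = sector;
    bb->ranges[bb->count].len = len;
    bb->count++;
    return 0;
}

/* Returns 1 if any part of [sector, sector+len) overlaps a bad range. */
static int mini_bb_check(const struct mini_badblocks *bb,
                         unsigned long long sector, int len)
{
    for (int i = 0; i < bb->count; i++) {
        const struct bb_range *r = &bb->ranges[i];
        if (sector < r->sector + r->len && r->sector < sector + len)
            return 1;
    }
    return 0;
}
```

The point of the pattern is that the structure carries no knowledge of its container, so the same accessors serve md, gendisks, or any future user.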

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 9e0e252a048b0ba5066f0dc15c3b2468ffe5c422)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoblock: fix del_gendisk() vs blkdev_ioctl crash
Dan Williams [Tue, 29 Dec 2015 22:02:29 +0000 (14:02 -0800)]
block: fix del_gendisk() vs blkdev_ioctl crash

Orabug: 22913653

When tearing down a block device early in its lifetime, userspace may
still be performing discovery actions like blkdev_ioctl() to re-read
partitions.

The nvdimm_revalidate_disk() implementation depends on
disk->driverfs_dev to be valid at entry.  However, it is set to NULL in
del_gendisk() and fatally this is happening *before* the disk device is
deleted from userspace view.

There's no reason for del_gendisk() to clear ->driverfs_dev.  That
device is the parent of the disk.  It is guaranteed to not be freed
until the disk, as a child, drops its ->parent reference.

We could also fix this issue locally in nvdimm_revalidate_disk() by
using disk_to_dev(disk)->parent, but let's fix it globally since
->driverfs_dev follows the lifetime of the parent.  Longer term we
should probably just add a @parent parameter to add_disk(), and stop
carrying this pointer in the gendisk.

 BUG: unable to handle kernel NULL pointer dereference at           (null)
 IP: [<ffffffffa00340a8>] nvdimm_revalidate_disk+0x18/0x90 [libnvdimm]
 CPU: 2 PID: 538 Comm: systemd-udevd Tainted: G           O    4.4.0-rc5 #2257
 [..]
 Call Trace:
  [<ffffffff8143e5c7>] rescan_partitions+0x87/0x2c0
  [<ffffffff810f37f9>] ? __lock_is_held+0x49/0x70
  [<ffffffff81438c62>] __blkdev_reread_part+0x72/0xb0
  [<ffffffff81438cc5>] blkdev_reread_part+0x25/0x40
  [<ffffffff8143982d>] blkdev_ioctl+0x4fd/0x9c0
  [<ffffffff811246c9>] ? current_kernel_time64+0x69/0xd0
  [<ffffffff812916dd>] block_ioctl+0x3d/0x50
  [<ffffffff81264c38>] do_vfs_ioctl+0x308/0x560
  [<ffffffff8115dbd1>] ? __audit_syscall_entry+0xb1/0x100
  [<ffffffff810031d6>] ? do_audit_syscall_entry+0x66/0x70
  [<ffffffff81264f09>] SyS_ioctl+0x79/0x90
  [<ffffffff81902672>] entry_SYSCALL_64_fastpath+0x12/0x76

Reported-by: Robert Hu <robert.hu@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ac34f15e0c6d2fd58480052b6985f6991fb53bcc)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoblock: introduce bdev_file_inode()
Dan Williams [Tue, 27 Oct 2015 22:48:19 +0000 (07:48 +0900)]
block: introduce bdev_file_inode()

Orabug: 22913653

Similar to the file_inode() helper, provide a helper to look up the
inode for a raw block device itself.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4ebb16ca9a06a54cdb2e7f2ce1e506fa4d432445)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agorestrict /dev/mem to idle io memory ranges
Dan Williams [Mon, 23 Nov 2015 23:49:03 +0000 (15:49 -0800)]
restrict /dev/mem to idle io memory ranges

Orabug: 22913653

This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
semantics by default.  If userspace really believes it is safe to access
the memory region it can also perform the extra step of disabling an
active driver.  This protects device address ranges with read side
effects and otherwise directs userspace to use the driver.

Persistent memory presents a large "mistake surface" to /dev/mem as now
accidental writes can corrupt a filesystem.

In general if a device driver is busily using a memory region it already
informs other parts of the kernel to not touch it via
request_mem_region().  /dev/mem should honor the same safety restriction
by default.  Debugging a device driver from userspace becomes more
difficult with this enabled.  Any application using /dev/mem or mmap of
sysfs pci resources will now need to perform the extra step of either:

1/ Disabling the driver, for example:

   echo <device id> > /dev/bus/<parent bus>/drivers/<driver name>/unbind

2/ Rebooting with "iomem=relaxed" on the command line

3/ Recompiling with CONFIG_IO_STRICT_DEVMEM=n

Traditional users of /dev/mem like dosemu are unaffected because the
first 1MB of memory is not subject to the IO_STRICT_DEVMEM restriction.
Legacy X configurations use /dev/mem to talk to graphics hardware, but
that functionality has since moved to kernel graphics drivers.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 90a545e981267e917b9d698ce07affd69787db87)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoarch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug
Dan Williams [Fri, 20 Nov 2015 02:19:29 +0000 (18:19 -0800)]
arch: consolidate CONFIG_STRICT_DEVM in lib/Kconfig.debug

Orabug: 22913653

Let all the archs that implement devmem_is_allowed() opt-in to a common
definition of CONFIG_STRICT_DEVM in lib/Kconfig.debug.
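The opt-in shape this describes looks roughly like the following Kconfig fragment (reconstructed for illustration; the exact symbol names and help text are in the upstream commit):

```
# lib/Kconfig.debug (shape only)
config STRICT_DEVM
	bool "Filter access to /dev/mem"
	depends on MMU && DEVMEM
	depends on ARCH_HAS_DEVMEM_IS_ALLOWED

# each arch that implements devmem_is_allowed() opts in, e.g.:
config ARM64
	select ARCH_HAS_DEVMEM_IS_ALLOWED
```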

Cc: Kees Cook <keescook@chromium.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com>
[heiko: drop 'default y' for s390]
Acked-by: Ingo Molnar <mingo@redhat.com>
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 21266be9ed542f13436bd9c75316d43e1e84f6ae)

Conflicts:

arch/powerpc/Kconfig
arch/x86/Kconfig

8 years agolibnvdimm: fix namespace object confusion in is_uuid_busy()
Dan Williams [Wed, 6 Jan 2016 02:37:23 +0000 (18:37 -0800)]
libnvdimm: fix namespace object confusion in is_uuid_busy()

Orabug: 22913653

When btt devices were re-worked to be child devices of regions this
routine was overlooked.  It mistakenly attempts to_nd_namespace_pmem()
or to_nd_namespace_blk() conversions on btt and pfn devices.  By luck,
to date we have happened to hit valid memory, leading only to a uuid
miscompare, but a recent change to struct nd_namespace_common causes:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
 IP: [<ffffffff814610dc>] memcmp+0xc/0x40
 [..]
 Call Trace:
  [<ffffffffa0028631>] is_uuid_busy+0xc1/0x2a0 [libnvdimm]
  [<ffffffffa0028570>] ? to_nd_blk_region+0x50/0x50 [libnvdimm]
  [<ffffffff8158c9c0>] device_for_each_child+0x50/0x90

Cc: <stable@vger.kernel.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e07ecd76d4db7bda1e9495395b2110a3fe28845a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pfn: move 'memory mode' indication to sysfs
Dan Williams [Mon, 14 Dec 2015 23:34:15 +0000 (15:34 -0800)]
libnvdimm, pfn: move 'memory mode' indication to sysfs

Orabug: 22913653

'Memory mode' is defined as the capability of a DAX mapping to be the
source/target of DMA and other "direct I/O" scenarios.  While it
currently requires allocating 'struct page' for each page frame of
persistent memory in the namespace it will not always be the case.  Work
continues on reducing the kernel's dependency on 'struct page'.

Let's not maintain a suffix that is expected to lose meaning over time.
In other words a future 'raw mode' pmem namespace may be as capable as
today's 'memory mode' namespace.  Undo the encoding of the mode in the
device name and leave it to other tooling to determine the mode of the
namespace from its attributes.

Reported-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0731de0dd95b251ed6cfb5f132486e52357fce53)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pfn: fix nd_pfn_validate() return value handling
Dan Williams [Sun, 13 Dec 2015 19:35:52 +0000 (11:35 -0800)]
libnvdimm, pfn: fix nd_pfn_validate() return value handling

Orabug: 22913653

The -ENODEV case indicates that the info-block needs to be established.
All other return codes cause nd_pfn_init() to abort.
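The calling convention these two sentences describe can be sketched as a hypothetical user-space reduction (not the driver code itself):

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reduction of the return-value handling described
 * above: -ENODEV from validation means "no info-block yet, go
 * establish one"; 0 means a valid info-block was found; any other
 * error aborts initialization. */
static int pfn_init(int validate_rc, int *create_info_block)
{
    *create_info_block = 0;
    if (validate_rc == 0)
        return 0;               /* existing info-block is valid */
    if (validate_rc == -ENODEV) {
        *create_info_block = 1; /* establish a new info-block */
        return 0;
    }
    return validate_rc;         /* all other errors abort */
}
```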

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 3fa962686568a1617d174e3d2f5d522e963077c5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pfn: enable pfn sysfs interface unit testing
Dan Williams [Tue, 15 Dec 2015 08:34:21 +0000 (00:34 -0800)]
libnvdimm, pfn: enable pfn sysfs interface unit testing

Orabug: 22913653

The unit test infrastructure uses CMA and real memory to emulate nvdimm
resources.  The call to devm_memremap_pages() can simply be mocked in
the same manner as memremap and we mock phys_to_pfn_t() to clear PFN_MAP
since these resources are not registered in the pgmap_radix.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 979fccfb7348dbd968daf0249aa484a0297f83df)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pfn: fix pfn seed creation
Dan Williams [Sun, 13 Dec 2015 19:41:36 +0000 (11:41 -0800)]
libnvdimm, pfn: fix pfn seed creation

Orabug: 22913653

Similar to btt, plant a new pfn seed when the existing one is activated.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 2dc43331e34fa992a67f42ed44e5111cafafd6f3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pfn: add parent uuid validation
Dan Williams [Sun, 13 Dec 2015 00:09:14 +0000 (16:09 -0800)]
libnvdimm, pfn: add parent uuid validation

Orabug: 22913653

Track and check the uuid of the namespace hosting a pfn instance.  This
forces the pfn info block to be invalidated if the namespace is
re-configured with a different uuid.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit a34d5e8a6ad1a31b186019c9c351777626698863)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE
Dan Williams [Thu, 10 Dec 2015 22:45:23 +0000 (14:45 -0800)]
libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE

Orabug: 22913653

The capacity set aside for struct page must be aligned to the
largest mapping size that is to be made available via DAX.  Make the
alignment configurable to enable support for 1GiB page-size mappings.

The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the
offset check in nvdimm_namespace_attach_pfn().
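The alignment arithmetic involved can be sketched as follows (hypothetical user-space reduction; the 64-byte struct-page size and helper names are illustrative):

```c
#include <assert.h>

/* Hypothetical sketch of the rule described above: the space
 * reserved for the page map is rounded up to the largest mapping
 * size offered via DAX, so the data area that follows can start on
 * a large-page boundary.  64 bytes per struct page is illustrative. */
#define SZ_2M (2ULL << 20)
#define SZ_1G (1ULL << 30)

static unsigned long long align_up(unsigned long long x,
                                   unsigned long long align)
{
    return (x + align - 1) & ~(align - 1);
}

/* Offset of the data area: page-map bytes rounded up to 'align'. */
static unsigned long long data_offset(unsigned long long nr_pfns,
                                      unsigned long long align)
{
    return align_up(nr_pfns * 64, align);
}
```

With a 1GiB alignment the reserved region can be far larger than the old SZ_8K bound, which is why the offset check mentioned above needed fixing.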

Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 315c562536c42aa4da9b6c5a2135dd6715a5e0b5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pfn: clean up pfn create parameters
Dan Williams [Fri, 11 Dec 2015 00:11:59 +0000 (16:11 -0800)]
libnvdimm, pfn: clean up pfn create parameters

Orabug: 22913653

In all cases __nd_pfn_create is called with default parameters which are
then overridden by values in the info block.  Clean up pfn creation by
dropping the parameters and setting default values internal to
__nd_pfn_create.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f7c6ab80fa5fee3daccb83a3c1b3a9f39d7b931c)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pfn: kill ND_PFN_ALIGN
Dan Williams [Thu, 10 Dec 2015 23:14:20 +0000 (15:14 -0800)]
libnvdimm, pfn: kill ND_PFN_ALIGN

Orabug: 22913653

The alignment constraint isn't necessary now that devm_memremap_pages()
allows for unaligned mappings.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 9f1e8cee7742cadbe6b97f2c80b787b4ee067bae)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agonvdimm: do not show pfn_seed for non pmem regions
Dmitry Krivenok [Wed, 2 Dec 2015 06:39:29 +0000 (09:39 +0300)]
nvdimm: do not show pfn_seed for non pmem regions

Orabug: 22913653

This simple change hides the pfn_seed attribute for non-pmem
regions because they don't support pfn anyway.

Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 6bb691ac089c39bb0aa73bdcc21ffd8c846e4ba5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agonvdimm: improve diagnosibility of namespaces
Dmitry Krivenok [Tue, 1 Dec 2015 21:48:12 +0000 (00:48 +0300)]
nvdimm: improve diagnosibility of namespaces

Orabug: 22913653

In order to bind a namespace to the driver, the user must first
set all mandatory attributes in the following order:
- uuid
- size
- sector_size (for blk namespace only)

If the order is wrong, the user either won't be able to set the
attribute or won't be able to bind the namespace.

This simple patch improves diagnosibility of common operations
with namespaces by printing some details about the error
instead of failing silently.

Below are examples of error messages (assuming dyndbg is
enabled for nvdimms):

[/]# echo 4194304 > /sys/bus/nd/devices/region5/namespace5.0/size
[  288.372612] nd namespace5.0: __size_store: uuid not set
[  288.374839] nd namespace5.0: size_store: 400000 fail (-6)
sh: write error: No such device or address
[/]#

[/]# echo namespace5.0 > /sys/bus/nd/drivers/nd_blk/bind
[  554.671648] nd_blk namespace5.0: nvdimm_namespace_common_probe: sector size not set
[  554.674688]  ndbus1: nd_blk.probe(namespace5.0) = -19
sh: write error: No such device
[/]#

Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit bd26d0d0ce7434a86dde61a7c65c94fe3801d8f6)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agonfit: acpi_nfit_notify(): Do not leave device locked
Alexey Khoroshilov [Fri, 11 Dec 2015 20:24:10 +0000 (23:24 +0300)]
nfit: acpi_nfit_notify(): Do not leave device locked

Orabug: 22913653

Even if dev->driver is NULL because we are being removed,
it is safer not to leave the device locked.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit d91e892825ae6f0ed4f8b07ae5d348eff86ab2ea)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agonfit: Adjust for different _FIT and NFIT headers
Linda Knippers [Sat, 21 Nov 2015 00:05:49 +0000 (19:05 -0500)]
nfit: Adjust for different _FIT and NFIT headers

Orabug: 22913653

When support for _FIT was added, the code presumed that the data
returned by the _FIT method is identical to the NFIT table, which
starts with an acpi_table_header.  However, the _FIT is defined
to return a data in the format of a series of NFIT type structure
entries and as a method, has an acpi_object header rather tahn
an acpi_table_header.

To address the differences, explicitly save the acpi_table_header
from the NFIT, since it is accessible through /sys, and change
the nfit pointer in the acpi_desc structure to point to the
table entries rather than the headers.

Reported-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Linda Knippers <linda.knippers@hpe.com>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
[vishal: fix up unit test for new header assumptions]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 6b577c9d772c45448aec784ec235cea228b4d3ad)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agonfit: Fix the check for a successful NFIT merge
Linda Knippers [Sat, 21 Nov 2015 00:05:48 +0000 (19:05 -0500)]
nfit: Fix the check for a successful NFIT merge

Orabug: 22913653

Missed previously due to a lack of test coverage on a platform that
provided a valid response to _FIT.

Signed-off-by: Linda Knippers <linda.knippers@hpe.com>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ff5a55f89c6690a0b292f1a7e0cd4532961588d5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agonfit: Account for table size length variation
Linda Knippers [Sat, 21 Nov 2015 00:05:47 +0000 (19:05 -0500)]
nfit: Account for table size length variation

Orabug: 22913653

The sizes of NFIT tables don't necessarily match the sizes of the
data structures that we use for them.  For example, the NVDIMM
Control Region Structure table is shorter for a device with
no block control windows than for a device with block control windows.
Other tables, such as Flush Hint Address Structure and the Interleave
Structure are variable length by definition.

Account for the size difference when comparing table entries by
using the actual table size from the table header if it's less
than the structure size.
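The comparison rule described above can be sketched like this (illustrative user-space reduction, not the driver's actual code; the header layout is hypothetical):

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch of the rule described above: when comparing
 * two table entries, use the length recorded in the table header if
 * it is smaller than the C structure the entry is parsed into. */
struct table_header {
    unsigned short type;
    unsigned short length;   /* actual size of this entry in bytes */
};

static int tables_equal(const void *a, const void *b, size_t struct_size)
{
    struct table_header h;
    size_t n;

    memcpy(&h, a, sizeof h);
    n = h.length < struct_size ? h.length : struct_size;
    return memcmp(a, b, n) == 0;
}
```

This way a short Control Region entry (no block windows) never has its nonexistent tail bytes compared.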

Signed-off-by: Linda Knippers <linda.knippers@hpe.com>
Acked-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 826c416f3c9493b69630a811832cfb7c9007f840)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, e820: skip module loading when no type-12
Dan Williams [Mon, 30 Nov 2015 17:10:33 +0000 (09:10 -0800)]
libnvdimm, e820: skip module loading when no type-12

Orabug: 22913653

If there are no persistent memory ranges present then don't bother
creating the platform device.  Otherwise, it loads the full libnvdimm
sub-system only to discover no resources present.

Reported-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit bc0d0d093b379b0b379c429e3348498287c8a9ca)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoxfs: introduce BMAPI_ZERO for allocating zeroed extents
Dave Chinner [Tue, 3 Nov 2015 01:27:22 +0000 (12:27 +1100)]
xfs: introduce BMAPI_ZERO for allocating zeroed extents

Orabug: 22913653

To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.

Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.

While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can pre-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3fbbbea34bac049c0b5938dc065f7d8ee1ef7e67)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agomm, dax: fix DAX deadlocks (COW fault)
Yigal Korman [Mon, 16 Nov 2015 12:09:15 +0000 (14:09 +0200)]
mm, dax: fix DAX deadlocks (COW fault)

Orabug: 22913653

DAX handling of COW faults has the wrong locking sequence:
dax_fault does i_mmap_lock_read
do_cow_fault does i_mmap_unlock_write

Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
commit[3].

Original COW locking logic was introduced by Matthew here[4].

This should be applied to v4.3 as well.

[1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
[2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 843172978bb9 dax: fix race between simultaneous faults
[4] 2e4cdab0584f mm: allow page fault handlers to perform the COW

Cc: <stable@vger.kernel.org>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0df9d41ab5d43dc5b20abc8b22a6b6d098b03994)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax: disable pmd mappings
Dan Williams [Mon, 16 Nov 2015 00:06:32 +0000 (16:06 -0800)]
dax: disable pmd mappings

Orabug: 22913653

While dax pmd mappings are functional in the nominal path, they trigger
kernel crashes in the following paths:

 BUG: unable to handle kernel paging request at ffffea0004098000
 IP: [<ffffffff812362f7>] follow_trans_huge_pmd+0x117/0x3b0
 [..]
 Call Trace:
  [<ffffffff811f6573>] follow_page_mask+0x2d3/0x380
  [<ffffffff811f6708>] __get_user_pages+0xe8/0x6f0
  [<ffffffff811f7045>] get_user_pages_unlocked+0x165/0x1e0
  [<ffffffff8106f5b1>] get_user_pages_fast+0xa1/0x1b0

 kernel BUG at arch/x86/mm/gup.c:131!
 [..]
 Call Trace:
  [<ffffffff8106f34c>] gup_pud_range+0x1bc/0x220
  [<ffffffff8106f634>] get_user_pages_fast+0x124/0x1b0

 BUG: unable to handle kernel paging request at ffffea0004088000
 IP: [<ffffffff81235f49>] copy_huge_pmd+0x159/0x350
 [..]
 Call Trace:
  [<ffffffff811fad3c>] copy_page_range+0x34c/0x9f0
  [<ffffffff810a0daf>] copy_process+0x1b7f/0x1e10
  [<ffffffff810a11c1>] _do_fork+0x91/0x590

All of these paths are interpreting a dax pmd mapping as a transparent
huge page and making the assumption that the pfn is covered by the
memmap, i.e. that the pfn has an associated struct page.  PTE mappings
do not suffer the same fate since they have the _PAGE_SPECIAL flag to
cause the gup path to fault.  We can do something similar for the PMD
path, or otherwise defer pmd support for cases where a struct page is
available.  For now, 4.4-rc and -stable need to disable dax pmd support
by default.

For development the "depends on BROKEN" line can be removed from
CONFIG_FS_DAX_PMD.

Cc: <stable@vger.kernel.org>
Cc: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ee82c9ed41e896bd47e121d87e4628de0f2656a3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoext2, ext4: warn when mounting with dax enabled
Dan Williams [Tue, 29 Sep 2015 19:48:11 +0000 (15:48 -0400)]
ext2, ext4: warn when mounting with dax enabled

Orabug: 22913653

Similar to XFS, warn when mounting DAX while it is still considered under
development.  Also, aspects of the DAX implementation, for example
synchronization against multiple faults and faults causing block
allocation, depend on the correct implementation in the filesystem.  The
maturity of a given DAX implementation is filesystem specific.

Cc: <stable@vger.kernel.org>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: linux-ext4@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Chinner <david@fromorbit.com>
Acked-by: Jan Kara <jack@suse.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ef83b6e8f40bb24b92ad73b5889732346e54a793)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax: fix __dax_pmd_fault crash
Dan Williams [Fri, 13 Nov 2015 02:33:54 +0000 (18:33 -0800)]
dax: fix __dax_pmd_fault crash

Orabug: 22913653

Since 4.3 introduced devm_memremap_pages() the pfns handled by DAX may
optionally have a struct page backing.  When a mapped pfn reaches
vmf_insert_pfn_pmd() it fails with a crash signature like the following:

 kernel BUG at mm/huge_memory.c:905!
 [..]
 Call Trace:
  [<ffffffff812a73ba>] __dax_pmd_fault+0x2ea/0x5b0
  [<ffffffffa01a4182>] xfs_filemap_pmd_fault+0x92/0x150 [xfs]
  [<ffffffff811fbe02>] handle_mm_fault+0x312/0x1b50

Fix this by falling back to 4K mappings in the pfn_valid() case.  Longer
term, vmf_insert_pfn_pmd() needs to grow support for architectures that
can provide a 'pmd_special' capability.

Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 152d7bd80dca5ce77ec2d7313149a2ab990e808e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm: documentation clarifications
Konrad Rzeszutek Wilk [Wed, 11 Nov 2015 00:10:45 +0000 (16:10 -0800)]
libnvdimm: documentation clarifications

Orabug: 22913653

A bunch of changes that I hope will help first-time readers
understand it better.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 8de5dff8bae634497f4413bc3067389f2ed267da)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, pmem: fix size trim in pmem_direct_access()
Dan Williams [Sun, 25 Oct 2015 02:55:58 +0000 (19:55 -0700)]
libnvdimm, pmem: fix size trim in pmem_direct_access()

Orabug: 22913653

This masking prevents access to the end of the device via dax_do_io(),
and is unnecessary as arch_add_memory() would have rejected an unaligned
allocation.

Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 589e75d15702dc720b363a92f984876704864946)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm, e820: fix numa node for e820-type-12 pmem ranges
Dan Williams [Thu, 12 Nov 2015 00:46:33 +0000 (16:46 -0800)]
libnvdimm, e820: fix numa node for e820-type-12 pmem ranges

Orabug: 22913653

Rather than punt on the numa node for these e820 ranges try to find a
better answer with memory_add_physaddr_to_nid() when it is available.

Cc: <stable@vger.kernel.org>
Reported-by: Boaz Harrosh <boaz@plexistor.com>
Tested-by: Boaz Harrosh <boaz@plexistor.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f7256dc0cdbc68903502997bde619f555a910f50)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agotools/testing/nvdimm, acpica: fix flag rename build breakage
Dan Williams [Tue, 10 Nov 2015 23:50:33 +0000 (15:50 -0800)]
tools/testing/nvdimm, acpica: fix flag rename build breakage

Orabug: 22913653

Commit ca321d1ca672 "ACPICA: Update NFIT table to rename a flags field"
performed a tree-wide s/ACPI_NFIT_MEM_ARMED/ACPI_NFIT_MEM_NOT_ARMED/
operation, but missed the tools/testing/nvdimm/ directory.

Cc: Bob Moore <robert.moore@intel.com>
Cc: Lv Zheng <lv.zheng@intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f42957967fb435aef6fc700fbbd9df89533b9a2e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax_io(): don't let non-error value escape via retval instead of EFAULT
Al Viro [Wed, 11 Nov 2015 02:42:49 +0000 (19:42 -0700)]
dax_io(): don't let non-error value escape via retval instead of EFAULT

Orabug: 22913653

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit cadfbb6ec2e55171479191046142c927a8b12d87)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoxfs: add ->pfn_mkwrite support for DAX
Dave Chinner [Tue, 3 Nov 2015 01:37:02 +0000 (12:37 +1100)]
xfs: add ->pfn_mkwrite support for DAX

Orabug: 22913653

->pfn_mkwrite support is needed so that when a page with allocated
backing store takes a write fault we can check that the fault has
not raced with a truncate and is pointing to a region beyond the
current end of file.

This also allows us to update the timestamp on the inode, which
fixes a generic/080 failure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3af49285854df66260a263198cc15abb07b95287)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoxfs: DAX does not use IO completion callbacks
Dave Chinner [Tue, 3 Nov 2015 01:37:02 +0000 (12:37 +1100)]
xfs: DAX does not use IO completion callbacks

Orabug: 22913653

For DAX, we are now doing block zeroing during allocation. This
means we no longer need a special DAX fault IO completion callback
to do unwritten extent conversion. Because mmap never extends the
file size (it SEGVs the process) we don't need a callback to update
the file size, either. Hence we can remove the completion callbacks
from the __dax_fault and __dax_mkwrite calls.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 01a155e6cf7db1a8ff2aa73162d7d9ec05ad298f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoxfs: Don't use unwritten extents for DAX
Dave Chinner [Tue, 3 Nov 2015 01:37:00 +0000 (12:37 +1100)]
xfs: Don't use unwritten extents for DAX

Orabug: 22913653

DAX has a page fault serialisation problem with block allocation.
Because it allows concurrent page faults and does not have a page
lock to serialise faults to the same page, it can get two concurrent
faults to the page that race.

When two read faults race, this isn't a huge problem as the data
underlying the page is not changing and so "detect and drop" works
just fine. The issues are to do with write faults.

When two write faults occur, we serialise block allocation in
get_blocks() so only one fault will allocate the extent. It will,
however, be marked as an unwritten extent, and that is where the
problem lies - the DAX fault code cannot differentiate between a
block that was just allocated and a block that was preallocated and
needs zeroing. The result is that both write faults end up zeroing
the block and attempting to convert it back to written.

The problem is that the first fault can zero and convert before the
second fault starts zeroing, resulting in the zeroing for the second
fault overwriting the data that the first fault wrote with zeros.
The second fault then attempts to convert the unwritten extent,
which is then a no-op because it's already written. Data loss occurs
as a result of this race.

Because there is no sane locking construct in the page fault code
that we can use for serialisation across the page faults, we need to
ensure block allocation and zeroing occurs atomically in the
filesystem. This means we can still take concurrent page faults and
the only time they will serialise is in the filesystem
mapping/allocation callback. The page fault code will always see
written, initialised extents, so we will be able to remove the
unwritten extent handling from the DAX code when all filesystems are
converted.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 1ca191576fc862b4766f58e41aa362b28a7c1866)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoxfs: fix inode size update overflow in xfs_map_direct()
Dave Chinner [Tue, 3 Nov 2015 01:27:22 +0000 (12:27 +1100)]
xfs: fix inode size update overflow in xfs_map_direct()

Orabug: 22913653

Both direct IO and DAX pass an offset and count into get_blocks that
will overflow a s64 variable when an IO goes into the last supported
block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
from the tracing:

xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096

0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
overflows the s64 offset and we fail to detect the need for a
filesize update and an ioend is not allocated.

This is *mostly* avoided for direct IO because such extending IOs
occur with full block allocation, and so the "IS_UNWRITTEN()" check
still evaluates as true and we get an ioend that way. However, doing
single sector extending IOs to this last block will expose the fact
that file size updates will not occur after the first allocating
direct IO as the overflow will then be exposed.

There is one further complexity: the DAX page fault path also
exposes the same issue in block allocation. However, page faults
cannot extend the file size, so in this case we want to allocate the
block but do not want to allocate an ioend to enable file size
update at IO completion. Hence we now need to distinguish between
the direct IO path allocation and the DAX fault path allocation to
avoid leaking ioend structures.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3e12dbbdbd8809f0455920e42fdbf9eddc002651)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
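The arithmetic failure described above is easy to reproduce in isolation. The sketch below is illustrative only (the names are not the XFS symbols); it models how `offset + count` wraps a signed 64-bit value at the last supported block, so a naive end-of-file comparison silently fails:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper modelling the failure mode from the trace above:
 * offset 0x7ffffffffffff000 plus count 4096 is exactly 2^63, which
 * wraps a signed 64-bit offset.  Detect the wrap explicitly instead
 * of trusting 'offset + count > isize'. */
static int io_extends_past_eof(int64_t offset, int64_t count, int64_t isize)
{
    int64_t end;

    if (__builtin_add_overflow(offset, count, &end))
        return 1;   /* wrapped past S64_MAX: certainly beyond EOF */
    return end > isize;
}
```

With the overflow check, the single-sector IO into the last block from the trace is correctly flagged as needing a file size update.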
8 years agocoredump: add DAX filtering for FDPIC ELF coredumps
Ross Zwisler [Mon, 5 Oct 2015 22:33:37 +0000 (16:33 -0600)]
coredump: add DAX filtering for FDPIC ELF coredumps

Orabug: 22913653

Add explicit filtering for DAX mappings to FDPIC ELF coredump.  This is
useful because DAX mappings have the potential to be very large.

This patch has only been compile tested.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ab27a8d04b32b6ee8c30c14c4afd1058e8addc82)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agocoredump: add DAX filtering for ELF coredumps
Ross Zwisler [Mon, 5 Oct 2015 22:33:36 +0000 (16:33 -0600)]
coredump: add DAX filtering for ELF coredumps

Orabug: 22913653

Add two new flags to the existing coredump mechanism for ELF files to
allow us to explicitly filter DAX mappings.  This is desirable because
DAX mappings, like hugetlb mappings, have the potential to be very
large.

Update the coredump_filter documentation in
Documentation/filesystems/proc.txt so that it addresses the new DAX
coredump flags.  Also update the documented default value of
coredump_filter to be consistent with the core(5) man page.  The
documentation being updated talks about bit 4, Dump ELF headers, which
is enabled if CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is turned on in the
kernel config.  This kernel config option defaults to "y" if both ELF
binaries and coredump are enabled.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 5037835c1f3eabf4f22163fc0278dd87165f8957)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
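The filter is a plain bitmask read and written through /proc/&lt;pid&gt;/coredump_filter. The sketch below encodes the bit layout as documented in Documentation/filesystems/proc.txt of this era (the DAX bits being the two added by this patch); treat the exact positions as illustrative of the scheme rather than authoritative:

```c
#include <assert.h>

/* coredump_filter bit layout per proc.txt; the two DAX bits are the
 * ones this patch introduces (positions shown as documented at the
 * time -- illustrative, verify against your kernel's proc.txt). */
enum {
    DUMP_ANON_PRIVATE    = 1 << 0,
    DUMP_ANON_SHARED     = 1 << 1,
    DUMP_MAPPED_PRIVATE  = 1 << 2,
    DUMP_MAPPED_SHARED   = 1 << 3,
    DUMP_ELF_HEADERS     = 1 << 4,
    DUMP_HUGETLB_PRIVATE = 1 << 5,
    DUMP_HUGETLB_SHARED  = 1 << 6,
    DUMP_DAX_PRIVATE     = 1 << 7,
    DUMP_DAX_SHARED      = 1 << 8,
};

/* Default is 0x23 (anonymous memory plus private hugetlb); bit 4 is
 * added when CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y, giving 0x33 as
 * in core(5).  DAX mappings stay excluded unless opted into. */
static unsigned int default_coredump_filter(int elf_headers_enabled)
{
    unsigned int f = DUMP_ANON_PRIVATE | DUMP_ANON_SHARED |
                     DUMP_HUGETLB_PRIVATE;

    if (elf_headers_enabled)
        f |= DUMP_ELF_HEADERS;
    return f;
}
```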
8 years agoacpi: nfit: Add support for hot-add
Vishal Verma [Tue, 27 Oct 2015 22:58:27 +0000 (16:58 -0600)]
acpi: nfit: Add support for hot-add

Orabug: 22913653

Add a .notify callback to the acpi_nfit_driver that gets called on a
hotplug event. From this, evaluate the _FIT ACPI method which returns
the updated NFIT with handles for the hot-plugged NVDIMM.

Iterate over the new NFIT, and add any new tables found, and
register/enable the corresponding regions.

In the nfit test framework, after normal initialization, update the NFIT
with a new hot-plugged NVDIMM, and directly call into the driver to
update its view of the available regions.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Elliott, Robert <elliott@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: <linux-acpi@vger.kernel.org>
Cc: <linux-nvdimm@lists.01.org>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 209851649dc4f7900a6bfe1de5e2640ab2c7d931)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agonfit: in acpi_nfit_init, break on a 0-length table
Vishal Verma [Tue, 27 Oct 2015 22:58:26 +0000 (16:58 -0600)]
nfit: in acpi_nfit_init, break on a 0-length table

Orabug: 22913653

If acpi_nfit_init is called (such as from nfit_test) with an nfit table
that has more memory allocated than it needs (and a similarly large
'size' field), add_tables would happily keep adding null SPA Range
tables, filling up all available memory.

Make it friendlier by breaking out if a 0-length header is found in any
of the tables.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: <linux-acpi@vger.kernel.org>
Cc: <linux-nvdimm@lists.01.org>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 564d501187317f8df79ddda173cf23735cbddd16)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
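The bug class is generic to any self-describing table walk: if the parser advances by a per-entry length field and that field is zero, it spins on the same bytes forever. A minimal sketch of the pattern (the structure and names are illustrative, not the ACPI NFIT layout):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative NFIT-style subtable header: each entry carries its own
 * length and the parser advances by it.  A 0-length header would
 * otherwise make the loop re-read the same offset indefinitely. */
struct sub_hdr {
    uint16_t type;
    uint16_t length;
};

static int count_tables(const uint8_t *buf, size_t size)
{
    size_t off = 0;
    int n = 0;

    while (off + sizeof(struct sub_hdr) <= size) {
        const struct sub_hdr *h = (const struct sub_hdr *)(buf + off);

        if (h->length == 0)
            break;          /* the fix: bail out on a 0-length header */
        n++;
        off += h->length;
    }
    return n;
}

/* Two valid 8-byte entries followed by zeroed slack, as when the
 * buffer is larger than the real table contents. */
static int demo_count(void)
{
    uint8_t buf[32] = { 0 };
    struct sub_hdr a = { 1, 8 }, b = { 2, 8 };

    memcpy(buf, &a, sizeof(a));
    memcpy(buf + 8, &b, sizeof(b));
    return count_tables(buf, sizeof(buf));
}
```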
8 years agopmem, memremap: convert to numa aware allocations
Dan Williams [Tue, 6 Oct 2015 00:35:56 +0000 (20:35 -0400)]
pmem, memremap: convert to numa aware allocations

Orabug: 22913653

Given that pmem ranges come with numa-locality hints, arrange for the
resulting driver objects to be obtained from node-local memory.

Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 538ea4aa44737127ce2b5c8511c7349d2abdcf9c)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodevm_memremap_pages: use numa_mem_id
Dan Williams [Tue, 6 Oct 2015 00:35:55 +0000 (20:35 -0400)]
devm_memremap_pages: use numa_mem_id

Orabug: 22913653

Hint to the closest numa node for the placement of newly allocated
pages, as that is where the device's other allocations will originate
by default when it does not specify a NUMA node.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 7eff93b7c99f5d0024aee677c6c92e32af22e1d2)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodevm: make allocations numa aware by default
Dan Williams [Tue, 6 Oct 2015 00:35:55 +0000 (20:35 -0400)]
devm: make allocations numa aware by default

Orabug: 22913653

Given we already have a device just use dev_to_node() to provide hint
allocations for devres.  However, current devres_alloc() users will need
to explicitly opt-in with devres_alloc_node().

Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 7c683941f30a977c10ec6be174ec5f16939c7ce5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodevm_memremap: convert to return ERR_PTR
Dan Williams [Tue, 15 Sep 2015 06:42:20 +0000 (02:42 -0400)]
devm_memremap: convert to return ERR_PTR

Orabug: 22913653

Make devm_memremap consistent with the error return scheme of
devm_memremap_pages to remove special casing in the pmem driver.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b36f47617f6ce7c5e8e7c264b9d9ea0654d9f20a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
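The ERR_PTR convention the patch converges on packs an errno into the top (non-canonical) range of pointer values, so a single pointer return carries either a valid mapping or an error, with no NULL special case. A userspace sketch of the idiom (the encoding mirrors the kernel's, the caller is hypothetical):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace sketch of the kernel's ERR_PTR/IS_ERR convention adopted
 * here: errnos live in the top 4095 pointer values. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long err)      { return (void *)err; }
static inline long  PTR_ERR(const void *p) { return (long)p; }
static inline int   IS_ERR(const void *p)
{
    return (unsigned long)p >= (unsigned long)-MAX_ERRNO;
}

/* hypothetical caller: one return value, no NULL-vs-pointer casing */
static void *memremap_demo(int fail)
{
    if (fail)
        return ERR_PTR(-ENXIO);
    return malloc(16);
}
```

A caller checks `IS_ERR()` first and only then dereferences, which is exactly the special casing removed from the pmem driver.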
8 years agopmem: kill memremap_pmem()
Dan Williams [Tue, 15 Sep 2015 06:14:03 +0000 (02:14 -0400)]
pmem: kill memremap_pmem()

Orabug: 22913653

Now that the pmem-api is defined as "a set of apis that enables access
to WB mapped pmem",  the mapping type is implied.  Remove the wrapper
and push the functionality down into the pmem driver in preparation for
adding support for direct-mapped pmem.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit a639315d6c536c806724c9328941a2517507e3e3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agox86, mm: quiet arch_add_memory()
Dan Williams [Thu, 17 Sep 2015 20:27:57 +0000 (16:27 -0400)]
x86, mm: quiet arch_add_memory()

Orabug: 22913653

Switch to pr_debug() so that dynamic-debug can disable these messages by
default.  This gets noisy in the presence of devm_memremap_pages().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit c9cdaeb2027e535b956ff69f215522d79f6b54e3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoext2: Add locking for DAX faults
Ross Zwisler [Tue, 13 Oct 2015 22:25:37 +0000 (16:25 -0600)]
ext2: Add locking for DAX faults

Orabug: 22913653

Add locking to ensure that DAX faults are isolated from ext2 operations
that modify the data blocks allocation for an inode.  This is intended to
be analogous to the work being done in XFS by Dave Chinner:

http://www.spinics.net/lists/linux-fsdevel/msg90260.html

Compared with XFS the ext2 case is greatly simplified by the fact that ext2
already allocates and zeros new blocks before they are returned as part of
ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
unwritten buffer heads.

This means that the only work we need to do in ext2 is to isolate the DAX
faults from inode block allocation changes.  I believe this just means that
we need to isolate the DAX faults from truncate operations.

The newly introduced dax_sem is intended to replicate the protection
offered by i_mmaplock in XFS.  In addition to truncate the i_mmaplock also
protects XFS operations like hole punching, fallocate down, extent
manipulation IOCTLS like xfs_ioc_space() and extent swapping.  Truncate is
the only one of these operations supported by ext2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.com>
(cherry picked from commit 5726b27b09cc92452b543764899a07e7c8037edd)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoACPICA: Update NFIT table to rename a flags field
Bob Moore [Mon, 19 Oct 2015 02:24:52 +0000 (10:24 +0800)]
ACPICA: Update NFIT table to rename a flags field

Orabug: 22913653

ACPICA commit 534deab97fb416a13bfede15c538e2c9eac9384a

Updated one of the memory subtable flags to clarify.

Link: https://github.com/acpica/acpica/commit/534deab9
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ca321d1ca6723ed0e04edd09de49c92b24e3648e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agomm, dax: fix DAX deadlocks
Ross Zwisler [Thu, 15 Oct 2015 22:28:32 +0000 (15:28 -0700)]
mm, dax: fix DAX deadlocks

Orabug: 22913653

The following two locking commits in the DAX code:

commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

introduced a number of deadlocks and other issues which need to be fixed
for the v4.3 kernel.  The list of issues in DAX after these commits
(some newly introduced by the commits, some preexisting) can be found
here:

  https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").

This undoes most of the changes introduced by those two commits,
essentially returning us to the DAX locking scheme that was used in
v4.2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f90cc6609c72b0bdf2aad0cb0456194dd896e19)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax: fix NULL pointer in __dax_pmd_fault()
Ross Zwisler [Thu, 1 Oct 2015 22:36:59 +0000 (15:36 -0700)]
dax: fix NULL pointer in __dax_pmd_fault()

Orabug: 22913653

Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
DAX") moved some code in __dax_pmd_fault() that was responsible for
zeroing newly allocated PMD pages.  The new location didn't properly set
up 'kaddr', so when run, this code resulted in a NULL pointer BUG.

Fix this by getting the correct 'kaddr' via bdev_direct_access().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8346c416d17bf5b4ea1508662959bb62e73fd6a5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agomm, dax: VMA with vm_ops->pfn_mkwrite wants to be write-notified
Kirill A. Shutemov [Tue, 22 Sep 2015 21:59:12 +0000 (14:59 -0700)]
mm, dax: VMA with vm_ops->pfn_mkwrite wants to be write-notified

Orabug: 22913653

For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
vm_ops->page_mkwrite to notify about write accesses.  This means we want
vma->vm_page_prot to be write-protected if the VMA provides this vm_ops.

A theoretical scenario that will cause these missed events is:

  On writable mapping with vm_ops->pfn_mkwrite, but without
  vm_ops->page_mkwrite: read fault followed by write access to the pfn.
  Writable pte will be set up on read fault and write fault will not be
  generated.

I found it examining Dave's complaint on generic/080:

http://lkml.kernel.org/g/20150831233803.GO3902@dastard

Although I don't think it's the reason.

It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite
and page_mkwrite.

[akpm@linux-foundation.org: add local vm_ops to avoid 80-cols mess]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Yigal Korman <yigal@plexistor.com>
Acked-by: Boaz Harrosh <boaz@plexistor.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8a04446ab0cf4f35d9f583cd6adcbf7c534e4995)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agomm: fix type cast in __pfn_to_phys()
Tyler Baker [Sat, 19 Sep 2015 07:58:10 +0000 (03:58 -0400)]
mm: fix type cast in __pfn_to_phys()

Orabug: 22913653

The various definitions of __pfn_to_phys() have been consolidated to
use a generic macro in include/asm-generic/memory_model.h. This hit
mainline in the form of 012dcef3f058 "mm: move __phys_to_pfn and
__pfn_to_phys to asm/generic/memory_model.h". When the generic macro
was implemented the type cast to phys_addr_t was dropped which caused
boot regressions on ARM platforms with more than 4GB of memory and
LPAE enabled.

It was suggested to use PFN_PHYS() defined in include/linux/pfn.h,
as it provides the correct logic and avoids further duplication.

Reported-by: kernelci.org bot <bot@kernelci.org>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Tyler Baker <tyler.baker@linaro.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ae4f976968896f8f41b3a7aa21be6146492211e5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
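The regression comes from shifting before widening: on 32-bit ARM with LPAE, `unsigned long` is 32 bits but `phys_addr_t` is 64, so `pfn << PAGE_SHIFT` wraps before the result is widened. A sketch modelling both versions of the macro with an explicit 32-bit type (illustrative; the real macros operate on `unsigned long`):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* phys_addr_t is 64-bit under LPAE even though unsigned long is 32. */
typedef uint64_t phys_addr_t;

/* Models the broken macro: the shift happens in 32-bit arithmetic and
 * wraps for any pfn at or above the 4GB boundary. */
static phys_addr_t pfn_to_phys_broken(uint32_t pfn)
{
    uint32_t ul = (uint32_t)((uint64_t)pfn << PAGE_SHIFT); /* truncates */
    return (phys_addr_t)ul;
}

/* Models PFN_PHYS(): cast to the wide type first, then shift. */
static phys_addr_t pfn_to_phys_fixed(uint32_t pfn)
{
    return (phys_addr_t)pfn << PAGE_SHIFT;
}
```

pfn 0x100000 is physical address 4GB exactly; the broken form folds it to 0, which is the class of corruption that broke boot on >4GB LPAE systems.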
8 years agopmem: add proper fencing to pmem_rw_page()
Ross Zwisler [Wed, 16 Sep 2015 20:52:21 +0000 (14:52 -0600)]
pmem: add proper fencing to pmem_rw_page()

Orabug: 22913653

pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the
newly written data is durable.  This flow was added to pmem_rw_bytes()
and pmem_make_request() with this commit:

commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates")

...the pmem_rw_page() path was missed.

Cc: <stable@vger.kernel.org>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ba8fe0f85e15d047686caf8a42463b592c63c98c)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm: pfn_devs: Fix locking in namespace_store
Axel Lin [Wed, 16 Sep 2015 13:25:38 +0000 (21:25 +0800)]
libnvdimm: pfn_devs: Fix locking in namespace_store

Orabug: 22913653

Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4ca8b57a0af145f4e791f21dbca6ad789da9ee8b)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agolibnvdimm: btt_devs: Fix locking in namespace_store
Axel Lin [Wed, 16 Sep 2015 13:24:47 +0000 (21:24 +0800)]
libnvdimm: btt_devs: Fix locking in namespace_store

Orabug: 22913653

Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.

Cc: <stable@vger.kernel.org>
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4be9c1fc3df9c3b03c9bde8aec5e44fc73996a3f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
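Both of these one-line fixes enforce the same rule: when two locks can be held together, every path must acquire them in one global order. A sketch of the corrected ordering with plain mutexes standing in for the kernel locks (names are stand-ins, the body is illustrative):

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins for the two kernel locks; the invariant is that
 * device_lock is always the outer lock. */
static pthread_mutex_t device_lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t nvdimm_bus_lock = PTHREAD_MUTEX_INITIALIZER;

static int namespace_store_demo(void)
{
    /* corrected order: device_lock first, nvdimm_bus_lock second.
     * The deadlock arises when another path nests them the other
     * way around -- each thread then waits on the lock the other
     * holds. */
    pthread_mutex_lock(&device_lock);
    pthread_mutex_lock(&nvdimm_bus_lock);
    /* ... update the namespace binding ... */
    pthread_mutex_unlock(&nvdimm_bus_lock);
    pthread_mutex_unlock(&device_lock);
    return 0;
}
```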
8 years agodax: fix O_DIRECT I/O to the last block of a blockdev
Jeff Moyer [Fri, 14 Aug 2015 20:15:31 +0000 (16:15 -0400)]
dax: fix O_DIRECT I/O to the last block of a blockdev

Orabug: 22913653

commit bbab37ddc20b (block: Add support for DAX reads/writes to
block devices) caused a regression in mkfs.xfs.  That utility
sets the block size of the device to the logical block size
using the BLKBSZSET ioctl, and then issues a single sector read
from the last sector of the device.  This results in the dax_io
code trying to do a page-sized read from 512 bytes from the end
of the device.  The result is -ERANGE being returned to userspace.

The fix is to align the block to the page size before calling
get_block.

Thanks to willy for simplifying my original patch.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Linda Knippers <linda.knippers@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e94f5a2285fc94202a9efb2c687481f29b64132c)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
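The alignment step itself is a single mask. A sketch of the rounding that fixes the -ERANGE case, with a hypothetical helper name (the real change lives inside dax_io):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Round a file position down to a page boundary before asking
 * get_block for a mapping.  Without this, a 512-byte read starting
 * 512 bytes before the end of the device requests a page-sized
 * region that runs past the end -- the -ERANGE failure above. */
static uint64_t dax_align_pos(uint64_t pos)
{
    return pos & ~(uint64_t)(PAGE_SIZE - 1);
}
```

For mkfs.xfs's read of the last 512-byte sector of a 1 MiB device (position 0xFFE00), the aligned position is 0xFF000 and the page it maps ends exactly at the device boundary.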
8 years agocheckpatch: add __pmem to $Sparse annotations
Joe Perches [Wed, 9 Sep 2015 22:37:55 +0000 (15:37 -0700)]
checkpatch: add __pmem to $Sparse annotations

Orabug: 22913653

commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates") added a new __pmem annotation for sparse
verification.  Add __pmem to the $Sparse variable so checkpatch can
appropriately ignore uses of this attribute too.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 54507b5183cc4f8e4f1a58a312e1f30c130658b7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax: update PMD fault handler with PMEM API
Ross Zwisler [Wed, 9 Sep 2015 16:29:40 +0000 (10:29 -0600)]
dax: update PMD fault handler with PMEM API

Orabug: 22913653

As part of the v4.3 merge window the DAX code was updated by Matthew and
Kirill to handle PMD pages.  Also as part of the v4.3 merge window we
updated the DAX code to do proper PMEM flushing (commit 2765cfbb342c:
"dax: update I/O path to do proper PMEM flushing").

The additional code added by the DAX PMD patches also needs to be
updated to properly use the PMEM API.  This ensures that after a PMD
fault is handled the zeros written to the newly allocated pages are
durable on the DIMMs.

linux/dax.h is included to get rid of a bunch of sparse warnings.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>,
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d77e92e270edd79a2218ce0ba5b6179ad0c93175)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agomm, dax: use i_mmap_unlock_write() in do_cow_fault()
Kirill A. Shutemov [Tue, 8 Sep 2015 21:59:45 +0000 (14:59 -0700)]
mm, dax: use i_mmap_unlock_write() in do_cow_fault()

Orabug: 22913653

__dax_fault() takes i_mmap_lock for write. Let's pair it with write
unlock on do_cow_fault() side.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 52a2b53ffde6d6018dfc454fbde34383351fb896)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agomm: take i_mmap_lock in unmap_mapping_range() for DAX
Kirill A. Shutemov [Tue, 8 Sep 2015 21:59:42 +0000 (14:59 -0700)]
mm: take i_mmap_lock in unmap_mapping_range() for DAX

Orabug: 22913653

DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap.

__dax_pmd_fault() uses unmap_mapping_range() to shoot out the zero page
from all mappings.  We need to drop i_mmap_lock there to avoid a lock
deadlock.

Re-acquiring the lock should be fine since we check i_size after that
point.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 46c043ede4711e8d598b9d63c5616c1fedb0605e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax: use linear_page_index()
Matthew Wilcox [Tue, 8 Sep 2015 21:59:39 +0000 (14:59 -0700)]
dax: use linear_page_index()

Orabug: 22913653

I was basically open-coding it (thanks to copying code from do_fault()
which probably also needs to be fixed).

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3fdd1b479dbc03347e98f904f54133a9cef5521f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
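linear_page_index() maps a user virtual address in a vma back to the file page offset that backs it: distance from the vma start in pages, plus the vma's starting file offset. A sketch of the non-hugepage case (the struct here is a trimmed stand-in for vm_area_struct):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Trimmed stand-in for the two vm_area_struct fields involved. */
struct vma {
    uint64_t vm_start;  /* first virtual address of the mapping */
    uint64_t vm_pgoff;  /* file offset of vm_start, in pages */
};

/* What the helper computes, simplified to the non-hugepage case:
 * the file page index backing 'address'. */
static uint64_t linear_page_index(const struct vma *vma, uint64_t address)
{
    uint64_t pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
    return pgoff + vma->vm_pgoff;
}
```

Open-coding this in each fault path invites exactly the copy-paste divergence the commit removes.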
8 years agodax: ensure that zero pages are removed from other processes
Matthew Wilcox [Tue, 8 Sep 2015 21:59:37 +0000 (14:59 -0700)]
dax: ensure that zero pages are removed from other processes

Orabug: 22913653

If the first access to a huge page was a store, there would be no existing
zero pmd in this process's page tables.  There could be a zero pmd in
another process's page tables, if it had done a load.  We can detect this
case by noticing that the buffer_head returned from the filesystem is New,
and ensure that other processes mapping this huge page have their page
tables flushed.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 73a6ec47f68787df1b41869def52915da2f4a6b7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax: don't use set_huge_zero_page()
Kirill A. Shutemov [Tue, 8 Sep 2015 21:59:34 +0000 (14:59 -0700)]
dax: don't use set_huge_zero_page()

Orabug: 22913653

This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d295e3415a88ae63a37a22652808b20c7fcb970e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agothp: fix zap_huge_pmd() for DAX
Kirill A. Shutemov [Tue, 8 Sep 2015 21:59:31 +0000 (14:59 -0700)]
thp: fix zap_huge_pmd() for DAX

Orabug: 22913653

The original DAX code assumed that pgtable_t was a pointer, which isn't
true on all architectures.  Restructure the code to not rely on that
assumption.

[willy@linux.intel.com: further fixes integrated into this patch]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit da146769004e1dd5ed06853e6d009be8ca675d5f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agothp: decrement refcount on huge zero page if it is split
Kirill A. Shutemov [Tue, 8 Sep 2015 21:59:28 +0000 (14:59 -0700)]
thp: decrement refcount on huge zero page if it is split

Orabug: 22913653

The DAX code neglected to put the refcount on the huge zero page.
Also we must notify on splits.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5b701b846aad7909d20693bcced2522d0ce8d1bc)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax: fix race between simultaneous faults
Matthew Wilcox [Tue, 8 Sep 2015 21:59:25 +0000 (14:59 -0700)]
dax: fix race between simultaneous faults

Orabug: 22913653

If two threads write-fault on the same hole at the same time, the winner
of the race will return to userspace and complete their store, only to
have the loser overwrite their store with zeroes.  Fix this for now by
taking the i_mmap_sem for write instead of read, and do so outside the
call to get_block().  Now the loser of the race will see the block has
already been zeroed, and will not zero it again.

This severely limits our scalability.  I have ideas for improving it, but
those can wait for a later patch.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 843172978bb92997310d2f7fbc172ece423cfc02)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoext4: start transaction before calling into DAX
Matthew Wilcox [Tue, 8 Sep 2015 21:59:22 +0000 (14:59 -0700)]
ext4: start transaction before calling into DAX

Orabug: 22913653

Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock.  We can avoid this by starting the transaction in ext4 before
calling into DAX.  The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 01a33b4ace68bc35679a347f21d5ed6e222e30dc)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agoext4: add ext4_get_block_dax()
Matthew Wilcox [Tue, 8 Sep 2015 21:59:20 +0000 (14:59 -0700)]
ext4: add ext4_get_block_dax()

Orabug: 22913653

DAX wants different semantics from any currently-existing ext4 get_block
callback.  Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents.  So introduce a new ext4_get_block_dax() which
has those semantics.

We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ed923b5776a2d2e949bd5b20f3956d68f3c826b7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agodax: improve comment about truncate race
Matthew Wilcox [Tue, 8 Sep 2015 21:59:17 +0000 (14:59 -0700)]
dax: improve comment about truncate race

Orabug: 22913653

Jan Kara pointed out I should be more explicit here about the perils of
racing against truncate.  The comment is mostly the same as for the PTE
case.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 84c4e5e675408b6fb7d74eec7da9a4a5698b50af)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years agothp: change insert_pfn's return type to void
Matthew Wilcox [Tue, 8 Sep 2015 21:59:14 +0000 (14:59 -0700)]
thp: change insert_pfn's return type to void

Orabug: 22913653

It would make more sense to have all the return values from
vmf_insert_pfn_pmd() encoded in one place instead of having to follow
the convention into insert_pfn().  Suggested by Jeff Moyer.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ae18d6dcf57b56b984ff27fd55b4e2caf5bfbd44)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  ext4: use ext4_get_block_write() for DAX
Matthew Wilcox [Tue, 8 Sep 2015 21:59:11 +0000 (14:59 -0700)]
ext4: use ext4_get_block_write() for DAX

Orabug: 22913653

DAX relies on the get_block function either zeroing newly allocated
blocks before they're findable by subsequent calls to get_block, or
marking newly allocated blocks as unwritten.  ext4_get_block() cannot
create unwritten extents, but ext4_get_block_write() can.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e676a4c191653787c3fe851fe3b9f1f33d49dac2)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  fs/dax.c: fix typo in #endif comment
Valentin Rothberg [Tue, 8 Sep 2015 21:59:09 +0000 (14:59 -0700)]
fs/dax.c: fix typo in #endif comment

Orabug: 22913653

Fix typo s/CONFIG_TRANSPARENT_HUGEPAGES/CONFIG_TRANSPARENT_HUGEPAGE/ in
the #endif comment added by commit 844f35db1088 ("dax: add huge page
fault support").

Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit dd8a2b6c29a3221c19ab475c8408fc2b914ccfab)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  xfs: huge page fault support
Matthew Wilcox [Tue, 8 Sep 2015 21:59:06 +0000 (14:59 -0700)]
xfs: huge page fault support

Orabug: 22913653

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit acd76e74d80f961553861d9cf49a62cbcf496d28)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  ext4: huge page fault support
Matthew Wilcox [Tue, 8 Sep 2015 21:59:03 +0000 (14:59 -0700)]
ext4: huge page fault support

Orabug: 22913653

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 11bd1a9ecdd687b8a4b9b360b7e4b74a1a5e2bd5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  ext2: huge page fault support
Matthew Wilcox [Tue, 8 Sep 2015 21:59:00 +0000 (14:59 -0700)]
ext2: huge page fault support

Orabug: 22913653

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e7b1ea2ad6581b83f63246db48aa2c2c9bf2ec8d)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  dax: add huge page fault support
Matthew Wilcox [Tue, 8 Sep 2015 21:58:57 +0000 (14:58 -0700)]
dax: add huge page fault support

Orabug: 22913653

This is the support code for DAX-enabled filesystems to allow them to
provide huge pages in response to faults.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 844f35db1088dd1a9de37b53d4d823626232bd19)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  mm: add vmf_insert_pfn_pmd()
Matthew Wilcox [Tue, 8 Sep 2015 21:58:54 +0000 (14:58 -0700)]
mm: add vmf_insert_pfn_pmd()

Orabug: 22913653

Similar to vm_insert_pfn(), but for PMDs rather than PTEs.  The 'vmf_'
prefix instead of 'vm_' prefix is intended to indicate that it returns a
VMF_ value rather than an errno (which would only have to be converted
into a VMF_ value anyway).

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5cad465d7fa646bad3d677df276bfc8e2ad709e3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  mm: export various functions for the benefit of DAX
Matthew Wilcox [Tue, 8 Sep 2015 21:58:51 +0000 (14:58 -0700)]
mm: export various functions for the benefit of DAX

Orabug: 22913653

To use the huge zero page in DAX, we need these functions exported.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit fc43704437ebe40f642ac53f7ee73661fe74e6b8)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  mm: add a pmd_fault handler
Matthew Wilcox [Tue, 8 Sep 2015 21:58:48 +0000 (14:58 -0700)]
mm: add a pmd_fault handler

Orabug: 22913653

Allow non-anonymous VMAs to provide huge pages in response to a page fault.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b96375f74a6d4f39fc6cbdc0bce5175115c7f96f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
NOTE: Moved new member to the end of the vm_operations_struct and
surrounded their definitions with #ifdef __GENKSYMS__/#endif so as
to maintain kABI.  - Dan Duval <dan.duval@oracle.com>

8 years ago  thp: prepare for DAX huge pages
Matthew Wilcox [Tue, 8 Sep 2015 21:58:45 +0000 (14:58 -0700)]
thp: prepare for DAX huge pages

Orabug: 22913653

Add a vma_is_dax() helper macro to test whether the VMA is DAX, and use it
in zap_huge_pmd() and __split_huge_page_pmd().

[akpm@linux-foundation.org: fix build]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4897c7655d9419ba7e62bac145ec6a1847134d93)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  mm: clarify that the function operates on hugepage pte
Aneesh Kumar K.V [Wed, 24 Jun 2015 23:57:44 +0000 (16:57 -0700)]
mm: clarify that the function operates on hugepage pte

We have confusing functions to clear pmd, pmd_clear_* and pmd_clear.  Add
_huge_ to the pmdp_clear functions so that it is clear that they operate
on hugepage ptes.

We don't bother about other functions like pmdp_set_wrprotect and
pmdp_clear_flush_young, because they operate on PTE bits and hence
already indicate that they are operating on hugepage ptes.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 8809aa2d28d74111ff2f1928edaa4e9845c97a7d)

Conflict:

arch/sparc/include/asm/pgtable_64.h

8 years ago  powerpc/mm: use generic version of pmdp_clear_flush()
Aneesh Kumar K.V [Wed, 24 Jun 2015 23:57:42 +0000 (16:57 -0700)]
powerpc/mm: use generic version of pmdp_clear_flush()

Orabug: 22913653

Also move the pmd_trans_huge check to generic code.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f28b6ff8c3d48af21354ef30164c8ccea69d5928)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  mm/thp: split out pmd collapse flush into separate functions
Aneesh Kumar K.V [Wed, 24 Jun 2015 23:57:39 +0000 (16:57 -0700)]
mm/thp: split out pmd collapse flush into separate functions

Orabug: 22913653

Architectures like ppc64 [1] need to do special things while clearing pmd
before a collapse.  For them this operation is largely different from a
normal hugepage pte clear.  Hence add a separate function to clear pmd
before collapse.  After this patch pmdp_* functions operate only on
hugepage pte, and not on regular pmd_t values pointing to page table.

[1] ppc64 needs to invalidate all the normal page pte mappings we already
have inserted in the hardware hash page table.  But before doing that we
need to make sure there are no parallel hash page table inserts going on.
So we need to do a kick_all_cpus_sync() before flushing the older hash
table entries.  By moving this to a separate function we capture these
details and mention how it is different from a hugepage pte clear.

This patch is a cleanup and only does code movement for clarity.  There
should not be any change in functionality.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 15a25b2ead5f97c5a63c169186e294b41ce03f9a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  dax: move DAX-related functions to a new header
Matthew Wilcox [Tue, 8 Sep 2015 21:58:40 +0000 (14:58 -0700)]
dax: move DAX-related functions to a new header

Orabug: 22913653

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h.  Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>.  We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c94c2acf84dc16cf4b989bb0bc849785b7ff52f5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  thp: vma_adjust_trans_huge(): adjust file-backed VMA too
Kirill A. Shutemov [Tue, 8 Sep 2015 21:58:37 +0000 (14:58 -0700)]
thp: vma_adjust_trans_huge(): adjust file-backed VMA too

Orabug: 22913653

This series of patches adds support for using PMD page table entries to
map DAX files.  We expect NV-DIMMs to start showing up that are many
gigabytes in size and the memory consumption of 4kB PTEs will be
astronomical.

The patch series leverages much of the Transparent Huge Pages
infrastructure, going so far as to borrow one of Kirill's patches from
his THP page cache series.

This patch (of 10):

Since we're going to have huge pages in the page cache, we need
vma_adjust_trans_huge() to adjust file-backed VMAs too, as they can
potentially contain huge pages.

For now we call it for all VMAs.

Probably later we will need to introduce a flag to indicate that the VMA
has huge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e1b9996b85ba3ff143ded04523cd015762d20f03)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  arch/*/io.h: Add ioremap_wt() to all architectures
Toshi Kani [Thu, 4 Jun 2015 16:55:16 +0000 (18:55 +0200)]
arch/*/io.h: Add ioremap_wt() to all architectures

Add ioremap_wt() to all arch-specific asm/io.h headers which
define ioremap_wc() locally. These headers do not include
<asm-generic/iomap.h>. Some of them include <asm-generic/io.h>,
but ioremap_wt() is defined for consistency since they define
all ioremap_xxx locally.

In all architectures without Write-Through support, ioremap_wt()
is defined identical to ioremap_nocache().

frv and m68k already have ioremap_writethrough(). On those we
add ioremap_wt() identical to ioremap_writethrough() and define
ARCH_HAS_IOREMAP_WT in both architectures.

The ioremap_wt() interface is exported to drivers.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-9-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 556269c138a8b2d3f5714b8105fa6119ecc505f2)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  libnvdimm, pmem: direct map legacy pmem by default
Dan Williams [Mon, 24 Aug 2015 23:20:23 +0000 (19:20 -0400)]
libnvdimm, pmem: direct map legacy pmem by default

Orabug: 22913653

The expectation is that the legacy / non-standard pmem discovery method
(e820 type-12) will only ever be used to describe small quantities of
persistent memory.  Larger capacities will be described via the ACPI
NFIT.  When "allocate struct page from pmem" support is added this default
policy can be overridden by assigning a legacy pmem namespace to a pfn
device, however this would only be necessary if a platform used the
legacy mechanism to define a very large range.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 004f1afbe199e6ab20805b95aefd83ccd24bc5c7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  libnvdimm, pmem: 'struct page' for pmem
Dan Williams [Sat, 1 Aug 2015 06:16:37 +0000 (02:16 -0400)]
libnvdimm, pmem: 'struct page' for pmem

Orabug: 22913653

Enable the pmem driver to handle PFN device instances.  Attaching a pmem
namespace to a pfn device triggers the driver to allocate and initialize
struct page entries for pmem.  Memory capacity for this allocation comes
exclusively from RAM for now which is suitable for low PMEM to RAM
ratios.  This mechanism will be expanded later for setting an "allocate
from PMEM" policy.

Note: I've renamed the parameter "size" of routine
"pmem_direct_access()" to "not_used".  This parameter was not being used
by any caller, so upstream removed it, but UEK needed to retain it to
maintain kABI compatibility.  In this commit, though, upstream added
a local variable called "size", which caused a name conflict with the
parameter.  - ldd

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 32ab0a3f51701cb37ab960635254d5f84ec3de0a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  libnvdimm, pfn: 'struct page' provider infrastructure
Dan Williams [Thu, 30 Jul 2015 21:57:47 +0000 (17:57 -0400)]
libnvdimm, pfn: 'struct page' provider infrastructure

Orabug: 22913653

Implement the base infrastructure for libnvdimm PFN devices. Similar to
BTT devices they take a namespace as a backing device and layer
functionality on top. In this case the functionality is reserving space
for an array of 'struct page' entries to be handed out through
pfn_to_page(). For now this is just the basic libnvdimm-device-model for
configuring the base PFN device.

As the namespace claiming mechanism for PFN devices is mostly identical
to BTT devices drivers/nvdimm/claim.c is created to house the common
bits.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e1455744b27c9e6115c3508a7b2902157c2c4347)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
Dan Williams [Mon, 24 Aug 2015 22:29:38 +0000 (18:29 -0400)]
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB

Orabug: 22913653

Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM.  This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available.  When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fall back for persistent memory handling remains.  Namely,
map it with the WT (write-through) cache-type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum "writes are visible
outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()".  Code
that cares whether wmb_pmem() actually flushes writes to pmem must now
call arch_has_wmb_pmem() directly.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[hch: set ARCH_HAS_PMEM_API=n on x86_32]
Reviewed-by: Christoph Hellwig <hch@lst.de>
[toshi: x86_32 compile fixes]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 96601adb745186ccbcf5b078d4756f13381ec2af)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  add devm_memremap_pages
Christoph Hellwig [Mon, 17 Aug 2015 14:00:35 +0000 (16:00 +0200)]
add devm_memremap_pages

Orabug: 22913653

This behaves like devm_memremap except that it ensures we have page
structures available that can back the region.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djbw: catch attempts to remap RAM, drop flags]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 41e94a851304f7acac840adec4004f8aeee53ad4)

Conflict:

include/linux/io.h

8 years ago  mm: ZONE_DEVICE for "device memory"
Dan Williams [Sun, 9 Aug 2015 19:29:06 +0000 (15:29 -0400)]
mm: ZONE_DEVICE for "device memory"

Orabug: 22913653

While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.

Having a separate zone prevents general-purpose allocation and otherwise
marks these pages as distinct from typical uniform memory.  Device memory has
different lifetime and performance characteristics than RAM.  However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 033fbae988fcb67e5077203512181890848b8e90)

Conflicts:

include/linux/memory_hotplug.h
mm/Kconfig

8 years ago  mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
Christoph Hellwig [Fri, 7 Aug 2015 21:41:01 +0000 (17:41 -0400)]
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h

Orabug: 22913653

Three architectures already define these, and we'll need them genericly
soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 012dcef3f058385268630c0003e9b7f8dcafbeb4)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  nd_blk: change aperture mapping from WC to WB
Ross Zwisler [Thu, 27 Aug 2015 19:14:20 +0000 (13:14 -0600)]
nd_blk: change aperture mapping from WC to WB

Orabug: 22913653

This should result in a pretty sizeable performance gain for reads.  For
rough comparison I did some simple read testing using PMEM to compare
reads of write combining (WC) mappings vs write-back (WB).  This was
done on a random lab machine.

PMEM reads from a write combining mapping:
# dd of=/dev/null if=/dev/pmem0 bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 9.2855 s, 44.1 MB/s

PMEM reads from a write-back mapping:
# dd of=/dev/null if=/dev/pmem0 bs=4096 count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 3.44034 s, 1.2 GB/s

To be able to safely support a write-back aperture I needed to add
support for the "read flush" _DSM flag, as outlined in the DSM spec:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

This flag tells the ND BLK driver that it needs to flush the cache lines
associated with the aperture after the aperture is moved but before any
new data is read.  This ensures that any stale cache lines from the
previous contents of the aperture will be discarded from the processor
cache, and the new data will be read properly from the DIMM.  We know
that the cache lines are clean and will be discarded without any
writeback because either a) the previous aperture operation was a read,
and we never modified the contents of the aperture, or b) the previous
aperture operation was a write and we must have written back the dirtied
contents of the aperture to the DIMM before the I/O was completed.

In order to add support for the "read flush" flag I needed to add a
generic routine to invalidate cache lines, mmio_flush_range().  This is
protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
only supported on x86.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 67a3e8fe90156d41cd480d3dfbb40f3bc007c262)

Conflicts:

arch/x86/Kconfig
drivers/acpi/nfit.c

8 years ago  pmem, dax: have direct_access use __pmem annotation
Ross Zwisler [Tue, 18 Aug 2015 19:55:41 +0000 (13:55 -0600)]
pmem, dax: have direct_access use __pmem annotation

Orabug: 22913653

Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer.  This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e2e05394e4a3420dab96f728df4531893494e15d)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  dax: update I/O path to do proper PMEM flushing
Ross Zwisler [Tue, 18 Aug 2015 19:55:40 +0000 (13:55 -0600)]
dax: update I/O path to do proper PMEM flushing

Orabug: 22913653

Update the DAX I/O path so that all operations that store data (I/O
writes, zeroing blocks, punching holes, etc.) properly synchronize the
stores to media using the PMEM API.  This ensures that the data DAX is
writing is durable on media before the operation completes.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 2765cfbb342c727c3fd47b165196cb16da158022)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  pmem: add copy_from_iter_pmem() and clear_pmem()
Ross Zwisler [Tue, 18 Aug 2015 19:55:39 +0000 (13:55 -0600)]
pmem: add copy_from_iter_pmem() and clear_pmem()

Orabug: 22913653

Add support for two new PMEM APIs, copy_from_iter_pmem() and
clear_pmem().  copy_from_iter_pmem() is used to copy data from an
iterator into a PMEM buffer.  clear_pmem() zeros a PMEM memory range.

Both of these new APIs must be explicitly ordered using a wmb_pmem()
function call and are implemented in such a way that the wmb_pmem()
will make the stores to PMEM durable.  Because both APIs are unordered
they can be called as needed without introducing any unwanted memory
barriers.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 5de490daec8b6354b90d5c9d3e2415b195f5adb6)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  pmem, x86: clean up conditional pmem includes
Ross Zwisler [Tue, 18 Aug 2015 19:55:38 +0000 (13:55 -0600)]
pmem, x86: clean up conditional pmem includes

Orabug: 22913653

Prior to this change x86_64 used the pmem defines in
arch/x86/include/asm/pmem.h, and UM used the default ones at the
top of include/linux/pmem.h.  The inclusion or exclusion in linux/pmem.h
was controlled by CONFIG_ARCH_HAS_PMEM_API, but the ones in asm/pmem.h
were controlled by ARCH_HAS_NOCACHE_UACCESS.

Instead, control them both with CONFIG_ARCH_HAS_PMEM_API so that it's
clear that they are related and we don't run into the possibility where
they are both included or excluded.  Also remove a bunch of stale
function prototypes meant for UM in asm/pmem.h - these just conflicted
with the inline defaults in linux/pmem.h and gave compile errors.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4a370df5534ef727cba9a9d74bf22e0609f91d6e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  pmem: remove layer when calling arch_has_wmb_pmem()
Ross Zwisler [Tue, 18 Aug 2015 19:55:37 +0000 (13:55 -0600)]
pmem: remove layer when calling arch_has_wmb_pmem()

Orabug: 22913653

Prior to this change arch_has_wmb_pmem() was only called by
arch_has_pmem_api().  Both arch_has_wmb_pmem() and arch_has_pmem_api()
checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled.

Instead, remove the old arch_has_wmb_pmem() wrapper to be rid of one
extra layer of indirection and the redundant CONFIG_ARCH_HAS_PMEM_API
check. Rename __arch_has_wmb_pmem() to arch_has_wmb_pmem() since we no
longer have a wrapper, and just have arch_has_pmem_api() call the
architecture specific arch_has_wmb_pmem() directly.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 18279b467a9d89afe44afbc19d768e834dbf4545)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
8 years ago  pmem, x86: move x86 PMEM API to new pmem.h header
Ross Zwisler [Tue, 18 Aug 2015 19:55:36 +0000 (13:55 -0600)]
pmem, x86: move x86 PMEM API to new pmem.h header

Orabug: 22913653

Move the x86 PMEM API implementation out of asm/cacheflush.h and into
its own header asm/pmem.h.  This will allow members of the PMEM API to
be more easily identified on this and other architectures.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 40603526569b304dd92f720f2f8ab11e828ea145)
Signed-off-by: Dan Duval <dan.duval@oracle.com>