We want to use the alignment as the allocation and mapping unit.
Previously this information was only useful for establishing the data
offset, but now it is important to remember the granularity for
later use.
Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 45a0dac0451136fa7ae34a6fea53ef6a136287ce)
We may want to subdivide a device-dax range into multiple devices so
that each can have separate permissions or naming. Reserve 128K of
label space by default so we have the capability of making allocation
decisions persistent. This reservation is not something we can add
later since it would result in the default size of a device-dax range
changing between kernel versions.
Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 52ac23b25eb26511f8dea2382534eeada2fa8244)
Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX). It allows persistent memory ranges to be allocated and
mapped without need of an intervening file system. This initial
infrastructure arranges for a libnvdimm pfn-device to be represented as
a different device-type so that it can be attached to a driver other
than the pmem driver.
Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit cd03412a51ac4cb3001a8cdfae4560c9602f3387)
The 'host' variable can be killed as it is always the same as the
passed-in device.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0bfb8dd3edd6e423b5053c86e10c97e92cf205ea) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Prior to this change DAX PMD mappings that were made read-only were
never able to be made writable again. This is because the code in
insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
these calls if the PMD already existed in the page table.
Instead, if we are doing a write always mark the PMD entry as dirty and
writeable. Without this code we can get into a condition where we mark
the PMD as read-only, and then on a subsequent write fault we get into
an infinite loop of PMD faults where we try unsuccessfully to make the
PMD writeable.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Jeff Moyer <jmoyer@redhat.com> Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 01871e59af5cc1cbf290ad6b4b95cd2f0cec9e8c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against the driver for the underlying
block device being disabled. Use blk_queue_enter()/blk_queue_exit() to
hold off blk_cleanup_queue() from proceeding, or otherwise fail new
mapping requests if the request_queue is being torn down.
This also introduces blk_dax_ctl to simplify the interface from fs/dax.c
through dax_map_atomic() to bdev_direct_access().
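A sketch of the resulting pattern (illustrative only; the exact
blk_queue_enter() signature and the __pmem annotation vary by kernel
version, and blk_dax_ctl carries the sector/size in and address/pfn out):

    static void *dax_map_atomic(struct block_device *bdev,
                                struct blk_dax_ctl *dax)
    {
        struct request_queue *q = bdev->bd_queue;
        long rc;

        /* hold off blk_cleanup_queue(); fails if teardown already began */
        if (blk_queue_enter(q, GFP_NOWAIT) != 0)
            return ERR_PTR(-ENXIO);

        rc = bdev_direct_access(bdev, dax);
        if (rc < 0) {
            blk_queue_exit(q);  /* drop the queue reference on error */
            return ERR_PTR(rc);
        }
        return dax->addr;       /* paired with a later dax_unmap_atomic() */
    }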
[willy@linux.intel.com: fix read() of a hole] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@fb.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit b2e0d1625e193b40cbbd45b799f82d54d34e015c)
Push the locking around get_nfit_res() into get_nfit_res().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 9bfa84969dd52bf0aa635c4c8a67c81d75ffca37) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Cc: <stable@vger.kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Reported-by: kbuild test robot <lkp@intel.com> Acked-by: Matthew Wilcox <willy@linux.intel.com>
[willy: symmetry fixups] Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 2e6edc95382cc36423aff18a237173ad62d5ab52)
Commit 3ef28e83ab15799742e55fd13243a5f678b04242 ("block: generic
request_queue reference counting") changed the name of an element of
the request_queue structure, and in so doing, broke kABI.
This patch uses __GENKSYMS__ to hide the change from the kABI
checker.
Allow pmem, and other synchronous/bio-based block drivers, to fall back
on a per-cpu reference count managed by the core for tracking queue
live/dead state.
The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios. This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request(). The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().
Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <keith.busch@intel.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Suggested-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 3ef28e83ab15799742e55fd13243a5f678b04242)
The aforementioned commit changed the type of "mq_freeze_depth" from
"int" to "atomic_t", thereby breaking kABI. The types are the same
size, and third-party module code should not be fiddling with this
field's value. This commit adds the usual "__GENKSYMS__" armor to
hide the type change from the kABI checker.
lockdep gets unhappy about irqs not being disabled when taking the
queue_lock around it. Instead of trying to fix that up just switch to an atomic_t
and get rid of the lock.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 4ecd4fef3a074c8bb43c391a57742c422469ebbd) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If a ->direct_access() implementation ever returns a map count less than
PAGE_SIZE, catch the error in bdev_direct_access(). This simplifies
error checking in upper layers.
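The core of the added check, roughly (using the era's ->direct_access()
signature; the surrounding validation is elided):

    avail = ops->direct_access(bdev, sector, addr, pfn, size);
    if (!avail)
        return -ERANGE;
    /* new: reject sub-PAGE_SIZE mappings once, centrally */
    if (avail > 0 && avail & ~PAGE_MASK)
        return -ENXIO;
    return min(avail, size);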
Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit fe683adabfe6629c0b6db32bbbc1ce6cacbf2117) Signed-off-by: Dan Duval <dan.duval@oracle.com>
dax_clear_blocks is currently performing a cond_resched() after every
PAGE_SIZE memset. We need not check so frequently, for example md-raid
only calls cond_resched() at stripe granularity. Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping, move the call to cond_resched() to the outer loop.
The worst case latency between calls to cond_resched() after this change
is 500us; the average latency is 133us. This is up from a 10us max and
4us average.
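Schematically, the reshaped loop looks like this (clear_chunk() is a
hypothetical stand-in for the page-by-page inner work, and the chunk
size is illustrative):

    while (size) {
        unsigned long chunk = min_t(unsigned long, size, SZ_1M);

        clear_chunk(addr, chunk);   /* clears chunk bytes page-by-page */
        addr += chunk;
        size -= chunk;
        cond_resched();             /* once per chunk, not once per page */
    }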
Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Jan Kara <jack@suse.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0e749e54244eec87b2a3cd0a4314e60bc6781115) Signed-off-by: Dan Duval <dan.duval@oracle.com>
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
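Schematically, in the gup fast path (error handling and the matching
put_dev_pagemap() elided):

    pte_t pte = *ptep;
    struct dev_pagemap *pgmap = NULL;
    struct page *page;

    if (pte_devmap(pte)) {
        /* pin the hosting device for as long as the page is referenced */
        pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
        if (!pgmap)
            return 0;   /* device going away: bail to the slow path */
    }
    page = pte_page(pte);
    get_page(page);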
Beyond that, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given that the memmap array
for a large pool of persistent memory may exhaust available DRAM,
introduce a mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 25):
Both __dax_pmd_fault() and clear_pmem() were taking special steps to
clear memory a page at a time to take advantage of non-temporal
clear_page() implementations. However, x86_64 does not use non-temporal
instructions for clear_page(), and arch_clear_pmem() was always
incurring the cost of __arch_wb_cache_pmem().
Clean up the assumption that doing clear_pmem() a page at a time is more
performant.
Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christoffer Dall <christoffer.dall@linaro.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: David Airlie <airlied@linux.ie> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: Jeff Dike <jdike@addtoit.com> Cc: Jens Axboe <axboe@fb.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Richard Weinberger <richard@nod.at> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 52db400fcd50216dd8511a0880b936235755836f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Doing a splice read (generic/249) generates a lockdep splat because
we recursively lock the inode iolock in this path:
SyS_sendfile64
do_sendfile
do_splice_direct
splice_direct_to_actor
do_splice_to
xfs_file_splice_read <<<<<< lock here
default_file_splice_read
vfs_readv
do_readv_writev
do_iter_readv_writev
xfs_file_read_iter <<<<<< then here
The issue here is that for DAX inodes we need to avoid the page
cache path and hence simply push it into the normal read path.
Unfortunately, we can't tell down at xfs_file_read_iter() whether we
are being called from the splice path and hence we cannot avoid the
locking at this layer. Hence we simply have to drop the inode
locking at the higher splice layer for DAX.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit a6d7636e8d0fd94fd1937db91d5b06a91fa85dde) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This patch modifies the stats counting macros and the callers
to those macros to properly increment, decrement, and add-to
the xfs stats counts. The counts for global and per-fs stats
are correctly advanced, and cleared by writing a "1" to the
corresponding clear file.
global counts: /sys/fs/xfs/stats/stats
per-fs counts: /sys/fs/xfs/sda*/stats/stats
global clear: /sys/fs/xfs/stats/stats_clear
per-fs clear: /sys/fs/xfs/sda*/stats/stats_clear
[dchinner: cleaned up macro variables, removed CONFIG_FS_PROC around
stats structures and macros. ]
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit ff6d6af2351caea7db681f4539d0d893e400557a)
This patch implements per-filesystem stats objects in sysfs. It
depends on the application of the previous patch series that
develops the infrastructure to support both xfs global stats and
xfs per-fs stats in sysfs.
Stats objects are instantiated when an xfs filesystem is mounted
and deleted on unmount. With this patch, the stats directory is
created and populated with the familiar stats and stats_clear files.
Example:
/sys/fs/xfs/sda9/stats/stats
/sys/fs/xfs/sda9/stats/stats_clear
With this patch, the individual counts within the new per-fs
stats file(s) remain at zero. Functions that use the macros
to increment, decrement, and add-to the per-fs stats counts will
be covered in a separate new patch to follow this one. Note that
the counts within the global stats file (/sys/fs/xfs/stats/stats)
advance normally and can be cleared as it was prior to this patch.
[dchinner: move setup/teardown to xfs_fs_{fill|put}_super() so
it is done before/after any path that uses the per-mount stats. ]
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 225e4635580ce9fb12f8a2dc88473161cd64dbf6) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This patch is the next step toward per-fs xfs stats. The patch makes
the show and clear routines able to handle any stats structure
associated with a kobject.
Instead of a single global xfsstats structure, add a kobject and a pointer
to a per-cpu struct xfsstats. Modify the macros that manipulate the stats
accordingly: XFS_STATS_INC, XFS_STATS_DEC, and XFS_STATS_ADD now access
xfsstats->xs_stats.
The sysfs functions need to get from the kobject back to the xfsstats
structure which contains it, and pass the pointer to the ->xs_stats
percpu structure into the show & clear routines.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 80529c45ab66188a0b3814ba31409b31f5fcb53d) Signed-off-by: Dan Duval <dan.duval@oracle.com>
As a part of the series to move xfs global stats from procfs to sysfs,
this patch consolidates the sysfs ops functions and removes redundancy.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit a27c264009cb236dc209875043a0c237af46a1af) Signed-off-by: Dan Duval <dan.duval@oracle.com>
As a part of the work to move xfs global stats from procfs to sysfs,
this patch removes the now unused procfs code that was xfs stat specific.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 50cf5b7402633265fe11430fd63af82c401fad46) Signed-off-by: Dan Duval <dan.duval@oracle.com>
As a part of the work to move xfs global stats from procfs to sysfs,
this patch creates the symlink from /proc/fs/xfs/stat to /sys/fs/xfs/stats.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 32f0ea05219e5212cafcbeaa255e627314f87e7e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Currently, xfs global stats are in procfs. This patch introduces
(replicates) the global stats in sysfs. Additionally a stats_clear file
is introduced in sysfs.
Signed-off-by: Bill O'Donnell <billodo@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit bb230c124730f21eea13deab433f9f8fc96bd5f3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit 1ca1915 ("xfs: Don't use unwritten extents for DAX") enabled
the DAX allocation call to dip into the reserve pool in case it was
converting unwritten extents rather than allocating blocks. This was
a direct copy of the unwritten extent conversion code, but had an
unintended side effect of allowing normal data block allocation to
use the reserve pool. Hence normal block allocation could deplete
the reserve pool and prevent unwritten extent conversion at ENOSPC,
hence violating fallocate guarantees on preallocated space.
Fix it by checking whether the incoming map from __xfs_get_blocks()
spans an unwritten extent and only use the reserve pool if the
allocation covers an unwritten extent.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3b0fe47805802216087259b07de691ef47ff6fbc) Signed-off-by: Dan Duval <dan.duval@oracle.com>
These actions are completely managed by a block driver or can use the
badblocks api directly.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 55f5560d8c18fe33fc169f8d244a9247dcac7612) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Support badblock checking in all the pmem read paths that do not go
through the block layer. This protects info block reads (btt or pfn) as
well as data reads to a pmem namespace via a btt instance.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 710d69cc99507803ed91b4ec7368fbd66d59f014) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Check the sectors specified in a read bio to see if they hit a known bad
block, and return an error code from pmem_do_bvec().
Note that the ->rw_page() is not in a position to return errors. For
now, copy the same layering violation present in zram_rw_page() to avoid
crashes of the form:
...i.e. unlock a page that was already unlocked via pmem_rw_page() =>
page_endio().
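The read-path check amounts to something like the following sketch,
with is_bad_pmem() as a small wrapper over badblocks_check():

    static int pmem_do_bvec(struct pmem_device *pmem, struct page *page,
            unsigned int len, unsigned int off, int rw, sector_t sector)
    {
        if (rw == READ && unlikely(is_bad_pmem(&pmem->bb, sector, len)))
            return -EIO;    /* surface the media error to the caller */
        /* ...existing memcpy_from_pmem()/memcpy_to_pmem() path... */
        return 0;
    }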
Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e10624f8c09710b3b0740ea3847627ea02f55c39) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If a device will ever have badblocks it should always have a badblocks
instance available. So, similar to md, embed a badblocks instance in
pmem_device. This reduces pointer chasing in the i/o fast path, and
simplifies the init path.
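i.e. the driver state gains the instance by value rather than by
reference (field names abbreviated):

    struct pmem_device {
        struct request_queue    *pmem_queue;
        struct gendisk          *pmem_disk;
        /* ...existing fields... */
        struct badblocks        bb;   /* embedded: no pointer chase in i/o */
    };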
Reported-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b95f5f4391fad65f1819c2404080b05ca95bdd92) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If the badblocks list runs out of space it simply means that software is
unable to intercept all errors. This is no different than the latent
discovery of new badblocks case and should not be an initialization
failure condition.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 87ba05dff3510f9e058b35d3c3fa222b6f406ecc) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Provide a devres interface for initializing a badblocks instance. The
pmem driver has several scenarios where it will be beneficial to have
this structure automatically freed when the device is disabled / fails
probe.
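Usage then reduces to a sketch like:

    /* devres releases the badblocks state when the device is disabled */
    if (devm_init_badblocks(dev, &pmem->bb))
        return -ENOMEM;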
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 16263ff6c72eb4cc00aa287230144dda12ccad12) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The badblocks list attached to a gendisk is allocated by the driver
which equates to the driver owning the lifetime of the object. Do not
automatically free it in del_gendisk(). This is in preparation for
expanding the use of badblocks in libnvdimm drivers and introducing
devm_init_badblocks().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 20a308f09e0d29ce6f5a4114cc476a998d569bfb) Signed-off-by: Dan Duval <dan.duval@oracle.com>
For symmetry with badblocks_init() make it clear that this path only
destroys incremental allocations of a badblocks instance, and does not
free the badblocks instance itself.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit d3b407fb3f782bd915db64e266010ea30a2d381e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
nd-core.h is private to the libnvdimm core internals and should not be
used by drivers.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ad9a8bde2cb19f6876f964fc48acc8b6a2f325ff) Signed-off-by: Dan Duval <dan.duval@oracle.com>
During region creation, perform Address Range Scrubs (ARS) for the SPA
(System Physical Address) ranges to retrieve known poison locations from
firmware. Add a new data structure 'nd_poison' which is used as a list
in nvdimm_bus to store these poison locations.
When creating a pmem namespace, if there is any known poison associated
with its physical address space, convert the poison ranges to bad sectors
that are exposed using the badblocks interface.
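The list node carries just the range, as described:

    struct nd_poison {
        u64                     start;  /* system physical address */
        u64                     length; /* range length in bytes */
        struct list_head        list;   /* linked off the nvdimm_bus */
    };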
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0caeef63e6d2f866d85bb507bf63e0ce8ec91cef) Signed-off-by: Dan Duval <dan.duval@oracle.com>
In preparation for getting a poison list using ARS DSMs, enable DSMs for
all manufactured NFITs supplied by the test framework. Also, supply
valid response data for ars_status.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit d26f73f083ed6fbea7fd3fdbacb527b7f3e75ac0) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Retain badblocks as part of rdev, but use the accessor functions from
include/linux/badblocks for all manipulation.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit fc974ee2bffdde47d1e4b220cf326952cc2c4794)
NVDIMM devices, which can behave more like DRAM rather than block
devices, may develop bad cache lines, or 'poison'. A block device
exposed by the pmem driver can then consume poison via a read (or
write), and cause a machine check. On platforms without machine
check recovery features, this would mean a crash.
The block device maintaining a runtime list of all known sectors that
have poison can directly avoid this, and also provide a path forward
to enable proper handling/recovery for DAX faults on such a device.
Use the new badblock management interfaces to add a badblocks list to
gendisks.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 99e6608c9e7414ae4f2168df8bf8fae3eb49e41f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Take the core badblocks implementation from md, and make it generally
available. This follows the same style as kernel implementations of
linked lists, rb-trees etc, where you can have a structure that can be
embedded anywhere, and accessor functions to manipulate the data.
The only changes in this copy of the code are ones to generalize
function/variable names from md-specific ones. Also add init and free
functions.
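The resulting embeddable API looks roughly like this in use (error
handling elided; non-zero from badblocks_check() means the range
overlaps known bad blocks):

    struct badblocks bb;
    sector_t first_bad;
    int bad_sectors;

    badblocks_init(&bb, 1);             /* 1 == enabled */
    badblocks_set(&bb, 512, 8, 1);      /* 8 sectors at 512, acknowledged */
    if (badblocks_check(&bb, 512, 8, &first_bad, &bad_sectors))
        pr_info("bad range at %llu\n", (unsigned long long)first_bad);
    badblocks_exit(&bb);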
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 9e0e252a048b0ba5066f0dc15c3b2468ffe5c422) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When tearing down a block device early in its lifetime, userspace may
still be performing discovery actions like blkdev_ioctl() to re-read
partitions.
The nvdimm_revalidate_disk() implementation depends on
disk->driverfs_dev to be valid at entry. However, it is set to NULL in
del_gendisk() and fatally this is happening *before* the disk device is
deleted from userspace view.
There's no reason for del_gendisk() to clear ->driverfs_dev. That
device is the parent of the disk. It is guaranteed to not be freed
until the disk, as a child, drops its ->parent reference.
We could also fix this issue locally in nvdimm_revalidate_disk() by
using disk_to_dev(disk)->parent, but let's fix it globally since
->driverfs_dev follows the lifetime of the parent. Longer term we
should probably just add a @parent parameter to add_disk(), and stop
carrying this pointer in the gendisk.
Reported-by: Robert Hu <robert.hu@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ac34f15e0c6d2fd58480052b6985f6991fb53bcc) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Similar to the file_inode() helper, provide a helper to look up the inode for a
raw block device itself.
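The helper itself is a one-liner: for a raw block device file the
mapping host is the block device's own inode:

    static inline struct inode *bdev_file_inode(struct file *file)
    {
        return file->f_mapping->host;
    }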
Cc: Al Viro <viro@zeniv.linux.org.uk> Suggested-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4ebb16ca9a06a54cdb2e7f2ce1e506fa4d432445) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This effectively promotes IORESOURCE_BUSY to IORESOURCE_EXCLUSIVE
semantics by default. If userspace really believes it is safe to access
the memory region it can also perform the extra step of disabling an
active driver. This protects device address ranges with read side
effects and otherwise directs userspace to use the driver.
Persistent memory presents a large "mistake surface" to /dev/mem as now
accidental writes can corrupt a filesystem.
In general if a device driver is busily using a memory region it already
informs other parts of the kernel to not touch it via
request_mem_region(). /dev/mem should honor the same safety restriction
by default. Debugging a device driver from userspace becomes more
difficult with this enabled. Any application using /dev/mem or mmap of
sysfs pci resources will now need to perform the extra step of either:
1/ Disabling the active driver of the memory region
2/ Rebooting with "iomem=relaxed" on the command line
3/ Recompiling with CONFIG_IO_STRICT_DEVMEM=n
Traditional users of /dev/mem like dosemu are unaffected because the
first 1MB of memory is not subject to the IO_STRICT_DEVMEM restriction.
Legacy X configurations use /dev/mem to talk to graphics hardware, but
that functionality has since moved to kernel graphics drivers.
Cc: Arnd Bergmann <arnd@arndb.de> Cc: Russell King <linux@arm.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Ingo Molnar <mingo@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 90a545e981267e917b9d698ce07affd69787db87) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When btt devices were re-worked to be child devices of regions this
routine was overlooked. It mistakenly attempts to_nd_namespace_pmem()
or to_nd_namespace_blk() conversions on btt and pfn devices. By luck,
to date we have happened to be hitting valid memory, leading to a uuid
miscompare, but a recent change to struct nd_namespace_common causes:
Cc: <stable@vger.kernel.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e07ecd76d4db7bda1e9495395b2110a3fe28845a) Signed-off-by: Dan Duval <dan.duval@oracle.com>
'Memory mode' is defined as the capability of a DAX mapping to be the
source/target of DMA and other "direct I/O" scenarios. While it
currently requires allocating 'struct page' for each page frame of
persistent memory in the namespace, it will not always be the case. Work
continues on reducing the kernel's dependency on 'struct page'.
Let's not maintain a suffix that is expected to lose meaning over time.
In other words a future 'raw mode' pmem namespace may be as capable as
today's 'memory mode' namespace. Undo the encoding of the mode in the
device name and leave it to other tooling to determine the mode of the
namespace from its attributes.
Reported-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0731de0dd95b251ed6cfb5f132486e52357fce53) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The -ENODEV case indicates that the info-block needs to be established.
All other return codes cause nd_pfn_init() to abort.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 3fa962686568a1617d174e3d2f5d522e963077c5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The unit test infrastructure uses CMA and real memory to emulate nvdimm
resources. The call to devm_memremap_pages() can simply be mocked in
the same manner as memremap and we mock phys_to_pfn_t() to clear PFN_MAP
since these resources are not registered in the pgmap_radix.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 979fccfb7348dbd968daf0249aa484a0297f83df) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Similar to btt, plant a new pfn seed when the existing one is activated.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 2dc43331e34fa992a67f42ed44e5111cafafd6f3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Track and check the uuid of the namespace hosting a pfn instance. This
forces the pfn info block to be invalidated if the namespace is
re-configured with a different uuid.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit a34d5e8a6ad1a31b186019c9c351777626698863) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When setting aside capacity for struct page it must be aligned to the
largest mapping size that is to be made available via DAX. Make the
alignment configurable to enable support for 1GiB page-size mappings.
The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the
offset check in nvdimm_namespace_attach_pfn().
Reported-by: Toshi Kani <toshi.kani@hpe.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 315c562536c42aa4da9b6c5a2135dd6715a5e0b5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
In all cases __nd_pfn_create is called with default parameters which are
then overridden by values in the info block. Clean up pfn creation by
dropping the parameters and setting default values internal to
__nd_pfn_create.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f7c6ab80fa5fee3daccb83a3c1b3a9f39d7b931c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The alignment constraint isn't necessary now that devm_memremap_pages()
allows for unaligned mappings.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 9f1e8cee7742cadbe6b97f2c80b787b4ee067bae) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This simple change hides the pfn_seed attribute for non-pmem
regions because they don't support pfn anyway.
Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 6bb691ac089c39bb0aa73bdcc21ffd8c846e4ba5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
In order to bind a namespace to the driver, the user must first
set all mandatory attributes in the following order:
- uuid
- size
- sector_size (for blk namespace only)
If the order is wrong, then the user either won't be able to set
the attribute or won't be able to bind the namespace.
This simple patch improves diagnosability of common operations
with namespaces by printing some details about the error
instead of failing silently.
Below are examples of error messages (assuming dyndbg is
enabled for nvdimms):
[/]# echo 4194304 > /sys/bus/nd/devices/region5/namespace5.0/size
[ 288.372612] nd namespace5.0: __size_store: uuid not set
[ 288.374839] nd namespace5.0: size_store: 400000 fail (-6)
sh: write error: No such device or address
[/]#
[/]# echo namespace5.0 > /sys/bus/nd/drivers/nd_blk/bind
[ 554.671648] nd_blk namespace5.0: nvdimm_namespace_common_probe: sector size not set
[ 554.674688] ndbus1: nd_blk.probe(namespace5.0) = -19
sh: write error: No such device
[/]#
Signed-off-by: Dmitry V. Krivenok <krivenok.dmitry@gmail.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit bd26d0d0ce7434a86dde61a7c65c94fe3801d8f6) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Even if dev->driver is null because we are being removed,
it is safer to not leave the device locked.
Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit d91e892825ae6f0ed4f8b07ae5d348eff86ab2ea) Signed-off-by: Dan Duval <dan.duval@oracle.com>
When support for _FIT was added, the code presumed that the data
returned by the _FIT method is identical to the NFIT table, which
starts with an acpi_table_header. However, the _FIT is defined
to return data in the format of a series of NFIT type structure
entries and, as a method, has an acpi_object header rather than
an acpi_table_header.
To address the differences, explicitly save the acpi_table_header
from the NFIT, since it is accessible through /sys, and change
the nfit pointer in the acpi_desc structure to point to the
table entries rather than the headers.
Reported-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Linda Knippers <linda.knippers@hpe.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com>
[vishal: fix up unit test for new header assumptions] Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 6b577c9d772c45448aec784ec235cea228b4d3ad) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Missed previously due to a lack of test coverage on a platform that
provided a valid response to _FIT.
Signed-off-by: Linda Knippers <linda.knippers@hpe.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ff5a55f89c6690a0b292f1a7e0cd4532961588d5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The sizes of NFIT tables don't necessarily match the sizes of the
data structures that we use for them. For example, the NVDIMM
Control Region Structure table is shorter for a device with
no block control windows than for a device with block control windows.
Other tables, such as Flush Hint Address Structure and the Interleave
Structure are variable length by definition.
Account for the size difference when comparing table entries by
using the actual table size from the table header if it's less
than the structure size.
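A sketch of the comparison (helper name hypothetical; header layout per
ACPICA's acpi_nfit_header, whose length field covers the whole table):

    static bool nfit_table_eq(void *a, void *b, size_t struct_size)
    {
        struct acpi_nfit_header *hdr = a;
        size_t len = min_t(size_t, hdr->length, struct_size);

        /* variable-length tables compare only their actual bytes */
        return memcmp(a, b, len) == 0;
    }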
Signed-off-by: Linda Knippers <linda.knippers@hpe.com> Acked-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 826c416f3c9493b69630a811832cfb7c9007f840) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If there are no persistent memory ranges present then don't bother
creating the platform device. Otherwise, it loads the full libnvdimm
sub-system only to discover no resources present.
Reported-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit bc0d0d093b379b0b379c429e3348498287c8a9ca) Signed-off-by: Dan Duval <dan.duval@oracle.com>
To enable DAX to do atomic allocation of zeroed extents, we need to
drive the block zeroing deep into the allocator. Because
xfs_bmapi_write() can return merged extents on allocation that were
only partially allocated (i.e. requested range spans allocated and
hole regions, allocation into the hole was contiguous), we cannot
zero the extent returned from xfs_bmapi_write() as that can
overwrite existing data with zeros.
Hence we have to drive the extent zeroing into the allocation code,
prior to where we merge the extents into the BMBT and return the
resultant map. This means we need to propagate this need down to
the xfs_alloc_vextent() and issue the block zeroing at this point.
While this functionality is being introduced for DAX, there is no
reason why it is specific to DAX - we can pre-zero blocks during the
allocation transaction on any type of device. It's just slow (and
usually slower than unwritten allocation and conversion) on
traditional block devices so doesn't tend to get used. We can,
however, hook hardware zeroing optimisations via sb_issue_zeroout()
to this operation, so it may be useful in future and hence the
"allocate zeroed blocks" API needs to be implementation neutral.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3fbbbea34bac049c0b5938dc065f7d8ee1ef7e67) Signed-off-by: Dan Duval <dan.duval@oracle.com>
DAX handling of COW faults has the wrong locking sequence:
dax_fault does i_mmap_lock_read
do_cow_fault does i_mmap_unlock_write
Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
commit[3].
Original COW locking logic was introduced by Matthew here[4].
This should be applied to v4.3 as well.
[1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
[2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
[3] 843172978bb9 dax: fix race between simultaneous faults
[4] 2e4cdab0584f mm: allow page fault handlers to perform the COW
Cc: <stable@vger.kernel.org> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Yigal Korman <yigal@plexistor.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 0df9d41ab5d43dc5b20abc8b22a6b6d098b03994) Signed-off-by: Dan Duval <dan.duval@oracle.com>
All of these paths are interpreting a dax pmd mapping as a transparent
huge page and making the assumption that the pfn is covered by the
memmap, i.e. that the pfn has an associated struct page. PTE mappings
do not suffer the same fate since they have the _PAGE_SPECIAL flag to
cause the gup path to fault. We can do something similar for the PMD
path, or otherwise defer pmd support for cases where a struct page is
available. For now, 4.4-rc and -stable need to disable dax pmd support
by default.
For development the "depends on BROKEN" line can be removed from
CONFIG_FS_DAX_PMD.
Cc: <stable@vger.kernel.org> Cc: Jan Kara <jack@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ee82c9ed41e896bd47e121d87e4628de0f2656a3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Similar to XFS warn when mounting DAX while it is still considered under
development. Also, aspects of the DAX implementation, for example
synchronization against multiple faults and faults causing block
allocation, depend on the correct implementation in the filesystem. The
maturity of a given DAX implementation is filesystem specific.
Cc: <stable@vger.kernel.org> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: linux-ext4@vger.kernel.org Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Dave Chinner <david@fromorbit.com> Acked-by: Jan Kara <jack@suse.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ef83b6e8f40bb24b92ad73b5889732346e54a793) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Since 4.3 introduced devm_memremap_pages() the pfns handled by DAX may
optionally have a struct page backing. When a mapped pfn reaches
vmf_insert_pfn_pmd() it fails with a crash signature like the following:
Fix this by falling back to 4K mappings in the pfn_valid() case. Longer
term, vmf_insert_pfn_pmd() needs to grow support for architectures that
can provide a 'pmd_special' capability.
Cc: <stable@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 152d7bd80dca5ce77ec2d7313149a2ab990e808e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
A bunch of changes that I hope will help in understanding it
better for first-time readers.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 8de5dff8bae634497f4413bc3067389f2ed267da) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This masking prevents access to the end of the device via dax_do_io(),
and is unnecessary as arch_add_memory() would have rejected an unaligned
allocation.
Cc: <stable@vger.kernel.org> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 589e75d15702dc720b363a92f984876704864946) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit ca321d1ca672 "ACPICA: Update NFIT table to rename a flags field"
performed a tree-wide s/ACPI_NFIT_MEM_ARMED/ACPI_NFIT_MEM_NOT_ARMED/
operation, but missed the tools/testing/nvdimm/ directory.
Cc: Bob Moore <robert.moore@intel.com> Cc: Lv Zheng <lv.zheng@intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f42957967fb435aef6fc700fbbd9df89533b9a2e) Signed-off-by: Dan Duval <dan.duval@oracle.com>
->pfn_mkwrite support is needed so that when a page with allocated
backing store takes a write fault we can check that the fault has
not raced with a truncate and is pointing to a region beyond the
current end of file.
This also allows us to update the timestamp on the inode, too, which
fixes a generic/080 failure.
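A simplified sketch of the handler (XFS locking and pagefault
accounting elided):

    static int xfs_filemap_pfn_mkwrite(struct vm_area_struct *vma,
            struct vm_fault *vmf)
    {
        struct inode *inode = file_inode(vma->vm_file);
        loff_t size;

        file_update_time(vma->vm_file);   /* fixes the generic/080 failure */

        /* did the fault race with truncate? */
        size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
        if (vmf->pgoff >= size)
            return VM_FAULT_SIGBUS;
        return VM_FAULT_NOPAGE;
    }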
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3af49285854df66260a263198cc15abb07b95287) Signed-off-by: Dan Duval <dan.duval@oracle.com>
For DAX, we are now doing block zeroing during allocation. This
means we no longer need a special DAX fault IO completion callback
to do unwritten extent conversion. Because mmap never extends the
file size (it SEGVs the process) we don't need a callback to update
the file size, either. Hence we can remove the completion callbacks
from the __dax_fault and __dax_mkwrite calls.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 01a155e6cf7db1a8ff2aa73162d7d9ec05ad298f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
DAX has a page fault serialisation problem with block allocation.
Because it allows concurrent page faults and does not have a page
lock to serialise faults to the same page, it can get two concurrent
faults to the page that race.
When two read faults race, this isn't a huge problem as the data
underlying the page is not changing and so "detect and drop" works
just fine. The issues are to do with write faults.
When two write faults occur, we serialise block allocation in
get_blocks() so only one fault will allocate the extent. It will,
however, be marked as an unwritten extent, and that is where the
problem lies - the DAX fault code cannot differentiate between a
block that was just allocated and a block that was preallocated and
needs zeroing. The result is that both write faults end up zeroing
the block and attempting to convert it back to written.
The problem is that the first fault can zero and convert before the
second fault starts zeroing, resulting in the zeroing for the second
fault overwriting the data that the first fault wrote with zeros.
The second fault then attempts to convert the unwritten extent,
which is then a no-op because it's already written. Data loss occurs
as a result of this race.
Because there is no sane locking construct in the page fault code
that we can use for serialisation across the page faults, we need to
ensure block allocation and zeroing occurs atomically in the
filesystem. This means we can still take concurrent page faults and
the only time they will serialise is in the filesystem
mapping/allocation callback. The page fault code will always see
written, initialised extents, so we will be able to remove the
unwritten extent handling from the DAX code when all filesystems are
converted.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 1ca191576fc862b4766f58e41aa362b28a7c1866) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Both direct IO and DAX pass an offset and count into get_blocks that
will overflow an s64 variable when an IO goes into the last supported
block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
from the tracing:
0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
overflows the s64 offset and we fail to detect the need for a
filesize update and an ioend is not allocated.
This is *mostly* avoided for direct IO because such extending IOs
occur with full block allocation, and so the "IS_UNWRITTEN()" check
still evaluates as true and we get an ioend that way. However, doing
single sector extending IOs to this last block will expose the fact
that file size updates will not occur after the first allocating
direct IO as the overflow will then be exposed.
There is one further complexity: the DAX page fault path also
exposes the same issue in block allocation. However, page faults
cannot extend the file size, so in this case we want to allocate the
block but do not want to allocate an ioend to enable file size
update at IO completion. Hence we now need to distinguish between
the direct IO path allocation and dax fault path allocation to
avoid leaking ioend structures.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 3e12dbbdbd8809f0455920e42fdbf9eddc002651) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Add explicit filtering for DAX mappings to FDPIC ELF coredump. This is
useful because DAX mappings have the potential to be very large.
This patch has only been compile tested.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ab27a8d04b32b6ee8c30c14c4afd1058e8addc82) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Add two new flags to the existing coredump mechanism for ELF files to
allow us to explicitly filter DAX mappings. This is desirable because
DAX mappings, like hugetlb mappings, have the potential to be very
large.
Update the coredump_filter documentation in
Documentation/filesystems/proc.txt so that it addresses the new DAX
coredump flags. Also update the documented default value of
coredump_filter to be consistent with the core(5) man page. The
documentation being updated talks about bit 4, Dump ELF headers, which
is enabled if CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is turned on in the
kernel config. This kernel config option defaults to "y" if both ELF
binaries and coredump are enabled.
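In the ELF coredump code the filter check then looks approximately like
this (FILTER() is the existing local macro in fs/binfmt_elf.c keyed off
the coredump_filter bits):

    if (vma_is_dax(vma)) {
        if ((vma->vm_flags & VM_SHARED) && FILTER(DAX_SHARED))
            goto whole;
        if (!(vma->vm_flags & VM_SHARED) && FILTER(DAX_PRIVATE))
            goto whole;
        return 0;   /* DAX mappings are skipped unless opted in */
    }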
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 5037835c1f3eabf4f22163fc0278dd87165f8957) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Add a .notify callback to the acpi_nfit_driver that gets called on a
hotplug event. From this, evaluate the _FIT ACPI method which returns
the updated NFIT with handles for the hot-plugged NVDIMM.
Iterate over the new NFIT, add any new tables found, and
register/enable the corresponding regions.
In the nfit test framework, after normal initialization, update the NFIT
with a new hot-plugged NVDIMM, and directly call into the driver to
update its view of the available regions.
Cc: Dan Williams <dan.j.williams@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Toshi Kani <toshi.kani@hpe.com> Cc: Elliott, Robert <elliott@hpe.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: <linux-acpi@vger.kernel.org> Cc: <linux-nvdimm@lists.01.org> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 209851649dc4f7900a6bfe1de5e2640ab2c7d931) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If acpi_nfit_init is called (such as from nfit_test) with an nfit table
that has more memory allocated than it needs (and a similarly large
'size' field), add_tables would happily keep adding null SPA Range tables,
filling up all available memory.
Make it friendlier by breaking out if a 0-length header is found in any
of the tables.
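i.e. the table walk gains a guard of the form:

    if (!hdr->length) {
        dev_dbg(dev, "zero-length table, ending NFIT parse\n");
        break;  /* don't loop over an oversized, zero-filled buffer */
    }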
Cc: Dan Williams <dan.j.williams@intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: <linux-acpi@vger.kernel.org> Cc: <linux-nvdimm@lists.01.org> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 564d501187317f8df79ddda173cf23735cbddd16) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Given that pmem ranges come with numa-locality hints, arrange for the
resulting driver objects to be obtained from node-local memory.
Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 538ea4aa44737127ce2b5c8511c7349d2abdcf9c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Hint to the closest numa node for the placement of newly allocated
pages, as that is where the device's other allocations will originate
by default when it does not specify a NUMA node.
Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 7eff93b7c99f5d0024aee677c6c92e32af22e1d2) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Given we already have a device just use dev_to_node() to provide hint
allocations for devres. However, current devres_alloc() users will need
to explicitly opt in with devres_alloc_node().
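Opting in looks like this sketch (my_release and struct my_state are
hypothetical names for a dr_release_t callback and its payload):

    void *dr;

    dr = devres_alloc_node(my_release, sizeof(struct my_state),
                           GFP_KERNEL, dev_to_node(dev));
    if (!dr)
        return -ENOMEM;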
Reviewed-by: Tejun Heo <tj@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 7c683941f30a977c10ec6be174ec5f16939c7ce5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Make devm_memremap consistent with the error return scheme of
devm_memremap_pages to remove special casing in the pmem driver.
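After the change both interfaces report errors via ERR_PTR(),
schematically:

    pmem->virt_addr = devm_memremap(dev, pmem->phys_addr, pmem->size,
            MEMREMAP_WB);
    if (IS_ERR(pmem->virt_addr))
        return PTR_ERR(pmem->virt_addr);  /* no more NULL special case */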
Cc: Christoph Hellwig <hch@lst.de> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b36f47617f6ce7c5e8e7c264b9d9ea0654d9f20a) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Now that the pmem-api is defined as "a set of apis that enables access
to WB mapped pmem", the mapping type is implied. Remove the wrapper
and push the functionality down into the pmem driver in preparation for
adding support for direct-mapped pmem.
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit a639315d6c536c806724c9328941a2517507e3e3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Switch to pr_debug() so that dynamic-debug can disable these messages by
default. This gets noisy in the presence of devm_memremap_pages().
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit c9cdaeb2027e535b956ff69f215522d79f6b54e3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Add locking to ensure that DAX faults are isolated from ext2 operations
that modify the data blocks allocation for an inode. This is intended to
be analogous to the work being done in XFS by Dave Chinner:
Compared with XFS the ext2 case is greatly simplified by the fact that ext2
already allocates and zeros new blocks before they are returned as part of
ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
unwritten buffer heads.
This means that the only work we need to do in ext2 is to isolate the DAX
faults from inode block allocation changes. I believe this just means that
we need to isolate the DAX faults from truncate operations.
The newly introduced dax_sem is intended to replicate the protection
offered by i_mmaplock in XFS. In addition to truncate the i_mmaplock also
protects XFS operations like hole punching, fallocate down, extent
manipulation IOCTLS like xfs_ioc_space() and extent swapping. Truncate is
the only one of these operations supported by ext2.
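A sketch of the pattern (ei is the ext2_inode_info; the 4.x-era
dax_fault() signature with a NULL completion callback is used
illustratively, and the truncate path takes down_write(&ei->dax_sem)
around its block-freeing work):

    static int ext2_dax_fault(struct vm_area_struct *vma,
            struct vm_fault *vmf)
    {
        struct ext2_inode_info *ei = EXT2_I(file_inode(vma->vm_file));
        int ret;

        down_read(&ei->dax_sem);    /* hold off truncate */
        ret = dax_fault(vma, vmf, ext2_get_block, NULL);
        up_read(&ei->dax_sem);
        return ret;
    }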
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Jan Kara <jack@suse.com>
(cherry picked from commit 5726b27b09cc92452b543764899a07e7c8037edd) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The following two locking commits in the DAX code:
commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")
introduced a number of deadlocks and other issues which need to be fixed
for the v4.3 kernel. The list of issues in DAX after these commits
(some newly introduced by the commits, some preexisting) can be found
here:
https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").
This undoes most of the changes introduced by those two commits,
essentially returning us to the DAX locking scheme that was used in
v4.2.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Tested-by: Dave Chinner <dchinner@redhat.com> Cc: Jan Kara <jack@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f90cc6609c72b0bdf2aad0cb0456194dd896e19) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
DAX") moved some code in __dax_pmd_fault() that was responsible for
zeroing newly allocated PMD pages. The new location didn't properly set
up 'kaddr', so when run, this code resulted in a NULL pointer BUG.
Fix this by getting the correct 'kaddr' via bdev_direct_access().
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8346c416d17bf5b4ea1508662959bb62e73fd6a5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
vm_ops->page_mkwrite to notify about write access. This means we want
vma->vm_page_prot to be write-protected if the VMA provides this vm_ops.
A theoretical scenario that will cause these missed events is:
On writable mapping with vm_ops->pfn_mkwrite, but without
vm_ops->page_mkwrite: read fault followed by write access to the pfn.
Writable pte will be set up on read fault and write fault will not be
generated.
I found it while examining Dave's complaint on generic/080.
It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite
and page_mkwrite.
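The fix is for vma_wants_writenotify() to treat pfn_mkwrite like
page_mkwrite, roughly:

    /* in vma_wants_writenotify(): */
    if (vma->vm_ops && (vma->vm_ops->page_mkwrite ||
                        vma->vm_ops->pfn_mkwrite))
        return 1;   /* keep the pte read-only so writes fault */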
[akpm@linux-foundation.org: add local vm_ops to avoid 80-cols mess] Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Yigal Korman <yigal@plexistor.com> Acked-by: Boaz Harrosh <boaz@plexistor.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8a04446ab0cf4f35d9f583cd6adcbf7c534e4995) Signed-off-by: Dan Duval <dan.duval@oracle.com>
The various definitions of __pfn_to_phys() have been consolidated to
use a generic macro in include/asm-generic/memory_model.h. This hit
mainline in the form of 012dcef3f058 "mm: move __phys_to_pfn and
__pfn_to_phys to asm/generic/memory_model.h". When the generic macro
was implemented the type cast to phys_addr_t was dropped which caused
boot regressions on ARM platforms with more than 4GB of memory and
LPAE enabled.
It was suggested to use PFN_PHYS() defined in include/linux/pfn.h
as it provides the correct logic and avoids further duplication.
Reported-by: kernelci.org bot <bot@kernelci.org> Suggested-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Tyler Baker <tyler.baker@linaro.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ae4f976968896f8f41b3a7aa21be6146492211e5) Signed-off-by: Dan Duval <dan.duval@oracle.com>
pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the
newly written data is durable. This flow was added to pmem_rw_bytes()
and pmem_make_request() with this commit:
commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates")
...the pmem_rw_page() path was missed.
Cc: <stable@vger.kernel.org> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ba8fe0f85e15d047686caf8a42463b592c63c98c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.
Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4ca8b57a0af145f4e791f21dbca6ad789da9ee8b) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.
Cc: <stable@vger.kernel.org> Signed-off-by: Axel Lin <axel.lin@ingics.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4be9c1fc3df9c3b03c9bde8aec5e44fc73996a3f) Signed-off-by: Dan Duval <dan.duval@oracle.com>
commit bbab37ddc20b (block: Add support for DAX reads/writes to
block devices) caused a regression in mkfs.xfs. That utility
sets the block size of the device to the logical block size
using the BLKBSZSET ioctl, and then issues a single sector read
from the last sector of the device. This results in the dax_io
code trying to do a page-sized read from 512 bytes from the end
of the device. The result is -ERANGE being returned to userspace.
The fix is to align the block to the page size before calling
get_block.
Thanks to willy for simplifying my original patch.
Cc: <stable@vger.kernel.org> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Tested-by: Linda Knippers <linda.knippers@hp.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e94f5a2285fc94202a9efb2c687481f29b64132c) Signed-off-by: Dan Duval <dan.duval@oracle.com>
commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates") added a new __pmem annotation for sparse
verification. Add __pmem to the $Sparse variable so checkpatch can
appropriately ignore uses of this attribute too.
Signed-off-by: Joe Perches <joe@perches.com> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Acked-by: Andy Whitcroft <apw@canonical.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 54507b5183cc4f8e4f1a58a312e1f30c130658b7) Signed-off-by: Dan Duval <dan.duval@oracle.com>
As part of the v4.3 merge window the DAX code was updated by Matthew and
Kirill to handle PMD pages. Also as part of the v4.3 merge window we
updated the DAX code to do proper PMEM flushing (commit 2765cfbb342c:
"dax: update I/O path to do proper PMEM flushing").
The additional code added by the DAX PMD patches also needs to be
updated to properly use the PMEM API. This ensures that after a PMD
fault is handled the zeros written to the newly allocated pages are
durable on the DIMMs.
linux/dax.h is included to get rid of a bunch of sparse warnings.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <willy@linux.intel.com>, Cc: Dan Williams <dan.j.williams@intel.com> Cc: Kirill Shutemov <kirill@shutemov.name> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d77e92e270edd79a2218ce0ba5b6179ad0c93175) Signed-off-by: Dan Duval <dan.duval@oracle.com>
If the first access to a huge page was a store, there would be no existing
zero pmd in this process's page tables. There could be a zero pmd in
another process's page tables, if it had done a load. We can detect this
case by noticing that the buffer_head returned from the filesystem is New,
and ensure that other processes mapping this huge page have their page
tables flushed.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 73a6ec47f68787df1b41869def52915da2f4a6b7) Signed-off-by: Dan Duval <dan.duval@oracle.com>
This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d295e3415a88ae63a37a22652808b20c7fcb970e) Signed-off-by: Dan Duval <dan.duval@oracle.com>