www.infradead.org Git - users/jedix/linux-maple.git/log

devm_memremap: convert to return ERR_PTR

Orabug: 22913653

Make devm_memremap consistent with the error return scheme of
devm_memremap_pages to remove special casing in the pmem driver.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b36f47617f6ce7c5e8e7c264b9d9ea0654d9f20a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem: kill memremap_pmem()

Orabug: 22913653

Now that the pmem-api is defined as "a set of apis that enables access
to WB mapped pmem", the mapping type is implied. Remove the wrapper
and push the functionality down into the pmem driver in preparation for
adding support for direct-mapped pmem.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit a639315d6c536c806724c9328941a2517507e3e3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

x86, mm: quiet arch_add_memory()

Orabug: 22913653

Switch to pr_debug() so that dynamic-debug can disable these messages by
default. This gets noisy in the presence of devm_memremap_pages().

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit c9cdaeb2027e535b956ff69f215522d79f6b54e3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ext2: Add locking for DAX faults

Orabug: 22913653

Add locking to ensure that DAX faults are isolated from ext2 operations
that modify the data blocks allocation for an inode.  This is intended to
be analogous to the work being done in XFS by Dave Chinner:

http://www.spinics.net/lists/linux-fsdevel/msg90260.html

Compared with XFS the ext2 case is greatly simplified by the fact that ext2
already allocates and zeros new blocks before they are returned as part of
ext2_get_block(), so DAX doesn't need to worry about getting unmapped or
unwritten buffer heads.

This means that the only work we need to do in ext2 is to isolate the DAX
faults from inode block allocation changes.  I believe this just means that
we need to isolate the DAX faults from truncate operations.

The newly introduced dax_sem is intended to replicate the protection
offered by i_mmaplock in XFS.  In addition to truncate the i_mmaplock also
protects XFS operations like hole punching, fallocate down, extent
manipulation IOCTLS like xfs_ioc_space() and extent swapping.  Truncate is
the only one of these operations supported by ext2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Jan Kara <jack@suse.com>
(cherry picked from commit 5726b27b09cc92452b543764899a07e7c8037edd)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ACPICA: Update NFIT table to rename a flags field

Orabug: 22913653

ACPICA commit 534deab97fb416a13bfede15c538e2c9eac9384a

Updated one of the memory subtable flags to clarify.

Link: https://github.com/acpica/acpica/commit/534deab9
Signed-off-by: Bob Moore <robert.moore@intel.com>
Signed-off-by: Lv Zheng <lv.zheng@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ca321d1ca6723ed0e04edd09de49c92b24e3648e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm, dax: fix DAX deadlocks

Orabug: 22913653

The following two locking commits in the DAX code:

commit 843172978bb9 ("dax: fix race between simultaneous faults")
commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for DAX")

introduced a number of deadlocks and other issues which need to be fixed
for the v4.3 kernel. The list of issues in DAX after these commits
(some newly introduced by the commits, some preexisting) can be found
here:

https://lkml.org/lkml/2015/9/25/602 (Subject: "Re: [PATCH] dax: fix deadlock in __dax_fault").

This undoes most of the changes introduced by those two commits,
essentially returning us to the DAX locking scheme that was used in
v4.2.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dan Williams <dan.j.williams@intel.com>
Tested-by: Dave Chinner <dchinner@redhat.com>
Cc: Jan Kara <jack@suse.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f90cc6609c72b0bdf2aad0cb0456194dd896e19)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: fix NULL pointer in __dax_pmd_fault()

Orabug: 22913653

Commit 46c043ede471 ("mm: take i_mmap_lock in unmap_mapping_range() for
DAX") moved some code in __dax_pmd_fault() that was responsible for
zeroing newly allocated PMD pages. The new location didn't properly set
up 'kaddr', so when run this code resulted in a NULL pointer BUG.

Fix this by getting the correct 'kaddr' via bdev_direct_access().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8346c416d17bf5b4ea1508662959bb62e73fd6a5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm, dax: VMA with vm_ops->pfn_mkwrite wants to be write-notified

Orabug: 22913653

For VM_PFNMAP and VM_MIXEDMAP we use vm_ops->pfn_mkwrite instead of
vm_ops->page_mkwrite to notify abort write access.  This means we want
vma->vm_page_prot to be write-protected if the VMA provides this vm_ops.

A theoretical scenario that will cause these missed events is:

  On writable mapping with vm_ops->pfn_mkwrite, but without
  vm_ops->page_mkwrite: read fault followed by write access to the pfn.
  Writable pte will be set up on read fault and write fault will not be
  generated.

I found it examining Dave's complaint on generic/080:

http://lkml.kernel.org/g/20150831233803.GO3902@dastard

Although I don't think it's the reason.

It shouldn't be a problem for ext2/ext4 as they provide both pfn_mkwrite
and page_mkwrite.

[akpm@linux-foundation.org: add local vm_ops to avoid 80-cols mess]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Yigal Korman <yigal@plexistor.com>
Acked-by: Boaz Harrosh <boaz@plexistor.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 8a04446ab0cf4f35d9f583cd6adcbf7c534e4995)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm: fix type cast in __pfn_to_phys()

Orabug: 22913653

The various definitions of __pfn_to_phys() have been consolidated to
use a generic macro in include/asm-generic/memory_model.h. This hit
mainline in the form of 012dcef3f058 "mm: move __phys_to_pfn and
__pfn_to_phys to asm/generic/memory_model.h". When the generic macro
was implemented the type cast to phys_addr_t was dropped which caused
boot regressions on ARM platforms with more than 4GB of memory and
LPAE enabled.

It was suggested to use PFN_PHYS() defined in include/linux/pfn.h
as provides the correct logic and avoids further duplication.

Reported-by: kernelci.org bot <bot@kernelci.org>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Tyler Baker <tyler.baker@linaro.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ae4f976968896f8f41b3a7aa21be6146492211e5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem: add proper fencing to pmem_rw_page()

Orabug: 22913653

pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the
newly written data is durable. This flow was added to pmem_rw_bytes()
and pmem_make_request() with this commit:

commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates")

...the pmem_rw_page() path was missed.

Cc: <stable@vger.kernel.org>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ba8fe0f85e15d047686caf8a42463b592c63c98c)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm: pfn_devs: Fix locking in namespace_store

Orabug: 22913653

Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4ca8b57a0af145f4e791f21dbca6ad789da9ee8b)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm: btt_devs: Fix locking in namespace_store

Orabug: 22913653

Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.

Cc: <stable@vger.kernel.org>
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4be9c1fc3df9c3b03c9bde8aec5e44fc73996a3f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: fix O_DIRECT I/O to the last block of a blockdev

Orabug: 22913653

commit bbab37ddc20b (block: Add support for DAX reads/writes to
block devices) caused a regression in mkfs.xfs.  That utility
sets the block size of the device to the logical block size
using the BLKBSZSET ioctl, and then issues a single sector read
from the last sector of the device.  This results in the dax_io
code trying to do a page-sized read from 512 bytes from the end
of the device.  The result is -ERANGE being returned to userspace.

The fix is to align the block to the page size before calling
get_block.

Thanks to willy for simplifying my original patch.

Cc: <stable@vger.kernel.org>
Signed-off-by: Jeff Moyer <jmoyer@redhat.com>
Tested-by: Linda Knippers <linda.knippers@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e94f5a2285fc94202a9efb2c687481f29b64132c)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

checkpatch: add __pmem to $Sparse annotations

Orabug: 22913653

commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates") added a new __pmem annotation for sparse
verification. Add __pmem to the $Sparse variable so checkpatch can
appropriately ignore uses of this attribute too.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 54507b5183cc4f8e4f1a58a312e1f30c130658b7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: update PMD fault handler with PMEM API

Orabug: 22913653

As part of the v4.3 merge window the DAX code was updated by Matthew and
Kirill to handle PMD pages. Also as part of the v4.3 merge window we
updated the DAX code to do proper PMEM flushing (commit 2765cfbb342c:
"dax: update I/O path to do proper PMEM flushing").

The additional code added by the DAX PMD patches also needs to be
updated to properly use the PMEM API. This ensures that after a PMD
fault is handled the zeros written to the newly allocated pages are
durable on the DIMMs.

linux/dax.h is included to get rid of a bunch of sparse warnings.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>,
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill Shutemov <kirill@shutemov.name>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d77e92e270edd79a2218ce0ba5b6179ad0c93175)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm, dax: use i_mmap_unlock_write() in do_cow_fault()

Orabug: 22913653

__dax_fault() takes i_mmap_lock for write. Let's pair it with write
unlock on do_cow_fault() side.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 52a2b53ffde6d6018dfc454fbde34383351fb896)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm: take i_mmap_lock in unmap_mapping_range() for DAX

Orabug: 22913653

DAX is not so special: we need i_mmap_lock to protect mapping->i_mmap.

__dax_pmd_fault() uses unmap_mapping_range() shoot out zero page from
all mappings. We need to drop i_mmap_lock there to avoid lock deadlock.

Re-aquiring the lock should be fine since we check i_size after the
point.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 46c043ede4711e8d598b9d63c5616c1fedb0605e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: use linear_page_index()

Orabug: 22913653

I was basically open-coding it (thanks to copying code from do_fault()
which probably also needs to be fixed).

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3fdd1b479dbc03347e98f904f54133a9cef5521f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: ensure that zero pages are removed from other processes

Orabug: 22913653

If the first access to a huge page was a store, there would be no existing
zero pmd in this process's page tables. There could be a zero pmd in
another process's page tables, if it had done a load. We can detect this
case by noticing that the buffer_head returned from the filesystem is New,
and ensure that other processes mapping this huge page have their page
tables flushed.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 73a6ec47f68787df1b41869def52915da2f4a6b7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: don't use set_huge_zero_page()

Orabug: 22913653

This is another place where DAX assumed that pgtable_t was a pointer.
Open code the important parts of set_huge_zero_page() in DAX and make
set_huge_zero_page() static again.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d295e3415a88ae63a37a22652808b20c7fcb970e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

thp: fix zap_huge_pmd() for DAX

Orabug: 22913653

The original DAX code assumed that pgtable_t was a pointer, which isn't
true on all architectures. Restructure the code to not rely on that
assumption.

[willy@linux.intel.com: further fixes integrated into this patch]
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit da146769004e1dd5ed06853e6d009be8ca675d5f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

thp: decrement refcount on huge zero page if it is split

Orabug: 22913653

The DAX code neglected to put the refcount on the huge zero page.
Also we must notify on splits.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5b701b846aad7909d20693bcced2522d0ce8d1bc)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: fix race between simultaneous faults

Orabug: 22913653

If two threads write-fault on the same hole at the same time, the winner
of the race will return to userspace and complete their store, only to
have the loser overwrite their store with zeroes.  Fix this for now by
taking the i_mmap_sem for write instead of read, and do so outside the
call to get_block().  Now the loser of the race will see the block has
already been zeroed, and will not zero it again.

This severely limits our scalability.  I have ideas for improving it, but
those can wait for a later patch.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 843172978bb92997310d2f7fbc172ece423cfc02)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ext4: start transaction before calling into DAX

Orabug: 22913653

Jan Kara pointed out that in the case where we are writing to a hole, we
can end up with a lock inversion between the page lock and the journal
lock. We can avoid this by starting the transaction in ext4 before
calling into DAX. The journal lock nests inside the superblock
pagefault lock, so we have to duplicate that code from dax_fault, like
XFS does.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 01a33b4ace68bc35679a347f21d5ed6e222e30dc)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ext4: add ext4_get_block_dax()

Orabug: 22913653

DAX wants different semantics from any currently-existing ext4 get_block
callback. Unlike ext4_get_block_write(), it needs to honour the
'create' flag, and unlike ext4_get_block(), it needs to be able to
return unwritten extents. So introduce a new ext4_get_block_dax() which
has those semantics.

We could also change ext4_get_block_write() to honour the 'create' flag,
but that might have consequences on other users that I do not currently
understand.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ed923b5776a2d2e949bd5b20f3956d68f3c826b7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: improve comment about truncate race

Orabug: 22913653

Jan Kara pointed out I should be more explicit here about the perils of
racing against truncate. The comment is mostly the same as for the PTE
case.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 84c4e5e675408b6fb7d74eec7da9a4a5698b50af)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

thp: change insert_pfn's return type to void

Orabug: 22913653

It would make more sense to have all the return values from
vmf_insert_pfn_pmd() encoded in one place instead of having to follow
the convention into insert_pfn(). Suggested by Jeff Moyer.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit ae18d6dcf57b56b984ff27fd55b4e2caf5bfbd44)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ext4: use ext4_get_block_write() for DAX

Orabug: 22913653

DAX relies on the get_block function either zeroing newly allocated
blocks before they're findable by subsequent calls to get_block, or
marking newly allocated blocks as unwritten. ext4_get_block() cannot
create unwritten extents, but ext4_get_block_write() can.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Reported-by: Andy Rudoff <andy.rudoff@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e676a4c191653787c3fe851fe3b9f1f33d49dac2)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

fs/dax.c: fix typo in #endif comment

Orabug: 22913653

Fix typo s/CONFIG_TRANSPARENT_HUGEPAGES/CONFIG_TRANSPARENT_HUGEPAGE/ in
fault support").

Signed-off-by: Valentin Rothberg <valentinrothberg@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit dd8a2b6c29a3221c19ab475c8408fc2b914ccfab)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

xfs: huge page fault support

Orabug: 22913653

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit acd76e74d80f961553861d9cf49a62cbcf496d28)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ext4: huge page fault support

Orabug: 22913653

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 11bd1a9ecdd687b8a4b9b360b7e4b74a1a5e2bd5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

ext2: huge page fault support

Orabug: 22913653

Use DAX to provide support for huge pages.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e7b1ea2ad6581b83f63246db48aa2c2c9bf2ec8d)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: add huge page fault support

Orabug: 22913653

This is the support code for DAX-enabled filesystems to allow them to
provide huge pages in response to faults.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 844f35db1088dd1a9de37b53d4d823626232bd19)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm: add vmf_insert_pfn_pmd()

Orabug: 22913653

Similar to vm_insert_pfn(), but for PMDs rather than PTEs. The 'vmf_'
prefix instead of 'vm_' prefix is intended to indicate that it returns a
VMF_ value rather than an errno (which would only have to be converted
into a VMF_ value anyway).

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5cad465d7fa646bad3d677df276bfc8e2ad709e3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm: export various functions for the benefit of DAX

Orabug: 22913653

To use the huge zero page in DAX, we need these functions exported.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit fc43704437ebe40f642ac53f7ee73661fe74e6b8)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm: add a pmd_fault handler

Orabug: 22913653

Allow non-anonymous VMAs to provide huge pages in response to a page fault.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b96375f74a6d4f39fc6cbdc0bce5175115c7f96f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>
NOTE: Moved new member to the end of the vm_operations_struct and
surrounded their definitions with #ifdef __GENKSYMS__/#endif so as
to maintain kABI. - Dan Duval <dan.duval@oracle.com>

thp: prepare for DAX huge pages

Orabug: 22913653

Add a vma_is_dax() helper macro to test whether the VMA is DAX, and use it
in zap_huge_pmd() and __split_huge_page_pmd().

[akpm@linux-foundation.org: fix build]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 4897c7655d9419ba7e62bac145ec6a1847134d93)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm: clarify that the function operates on hugepage pte

We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add
_huge_ to pmdp_clear functions so that we are clear that they operate on
hugepage pte.

We don't bother about other functions like pmdp_set_wrprotect,
pmdp_clear_flush_young, because they operate on PTE bits and hence
indicate they are operating on hugepage ptes

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 8809aa2d28d74111ff2f1928edaa4e9845c97a7d)

Conflict:

arch/sparc/include/asm/pgtable_64.h

powerpc/mm: use generic version of pmdp_clear_flush()

Orabug: 22913653

Also move the pmd_trans_huge check to generic code.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f28b6ff8c3d48af21354ef30164c8ccea69d5928)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm/thp: split out pmd collapse flush into separate functions

Orabug: 22913653

Architectures like ppc64 [1] need to do special things while clearing pmd
before a collapse.  For them this operation is largely different from a
normal hugepage pte clear.  Hence add a separate function to clear pmd
before collapse.  After this patch pmdp_* functions operate only on
hugepage pte, and not on regular pmd_t values pointing to page table.

[1] ppc64 needs to invalidate all the normal page pte mappings we already
have inserted in the hardware hash page table.  But before doing that we
need to make sure there are no parallel hash page table insert going on.
So we need to do a kick_all_cpus_sync() before flushing the older hash
table entries.  By moving this to a separate function we capture these
details and mention how it is different from a hugepage pte clear.

This patch is a cleanup and only does code movement for clarity.  There
should not be any change in functionality.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 15a25b2ead5f97c5a63c169186e294b41ce03f9a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: move DAX-related functions to a new header

Orabug: 22913653

In order to handle the !CONFIG_TRANSPARENT_HUGEPAGES case, we need to
return VM_FAULT_FALLBACK from the inlined dax_pmd_fault(), which is
defined in linux/mm.h. Given that we don't want to include <linux/mm.h>
in <linux/fs.h>, the easiest solution is to move the DAX-related
functions to a new header, <linux/dax.h>. We could also have moved
VM_FAULT_* definitions to a new header, or a different header that isn't
quite such a boil-the-ocean header as <linux/mm.h>, but this felt like
the best option.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Cc: Hillf Danton <dhillf@gmail.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c94c2acf84dc16cf4b989bb0bc849785b7ff52f5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

thp: vma_adjust_trans_huge(): adjust file-backed VMA too

Orabug: 22913653

This series of patches adds support for using PMD page table entries to
map DAX files. We expect NV-DIMMs to start showing up that are many
gigabytes in size and the memory consumption of 4kB PTEs will be
astronomical.

The patch series leverages much of the Transparant Huge Pages
infrastructure, going so far as to borrow one of Kirill's patches from
his THP page cache series.

This patch (of 10):

Since we're going to have huge pages in page cache, we need to call adjust
file-backed VMA, which potentially can contain huge pages.

For now we call it for all VMAs.

Probably later we will need to introduce a flag to indicate that the VMA
has huge pages.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Acked-by: Hillf Danton <dhillf@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e1b9996b85ba3ff143ded04523cd015762d20f03)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

arch/*/io.h: Add ioremap_wt() to all architectures

Add ioremap_wt() to all arch-specific asm/io.h headers which
define ioremap_wc() locally. These headers do not include
<asm-generic/iomap.h>. Some of them include <asm-generic/io.h>,
but ioremap_wt() is defined for consistency since they define
all ioremap_xxx locally.

In all architectures without Write-Through support, ioremap_wt()
is defined indentical to ioremap_nocache().

frv and m68k already have ioremap_writethrough(). On those we
add ioremap_wt() indetical to ioremap_writethrough() and defines
ARCH_HAS_IOREMAP_WT in both architectures.

The ioremap_wt() interface is exported to drivers.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-9-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 556269c138a8b2d3f5714b8105fa6119ecc505f2)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, pmem: direct map legacy pmem by default

Orabug: 22913653

The expectation is that the legacy / non-standard pmem discovery method
(e820 type-12) will only ever be used to describe small quantities of
persistent memory. Larger capacities will be described via the ACPI
NFIT. When "allocate struct page from pmem" support is added this default
policy can be overridden by assigning a legacy pmem namespace to a pfn
device, however this would be only be necessary if a platform used the
legacy mechanism to define a very large range.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 004f1afbe199e6ab20805b95aefd83ccd24bc5c7)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, pmem: 'struct page' for pmem

Orabug: 22913653

Enable the pmem driver to handle PFN device instances.  Attaching a pmem
namespace to a pfn device triggers the driver to allocate and initialize
struct page entries for pmem.  Memory capacity for this allocation comes
exclusively from RAM for now which is suitable for low PMEM to RAM
ratios.  This mechanism will be expanded later for setting an "allocate
from PMEM" policy.

Note: I've renamed the parameter "size" of routine
"pmem_direct_access()" to "not_used".  This parameter was not being used
by any caller, so upstream removed it, but UEK needed to retain it to
maintain kABI compatibility.  In this commit, though, upstream added
a local variable called "size", which caused a name conflict with the
parameter.  - ldd

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 32ab0a3f51701cb37ab960635254d5f84ec3de0a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, pfn: 'struct page' provider infrastructure

Orabug: 22913653

Implement the base infrastructure for libnvdimm PFN devices. Similar to
BTT devices they take a namespace as a backing device and layer
functionality on top. In this case the functionality is reserving space
for an array of 'struct page' entries to be handed out through
pfn_to_page(). For now this is just the basic libnvdimm-device-model for
configuring the base PFN device.

As the namespace claiming mechanism for PFN devices is mostly identical
to BTT devices drivers/nvdimm/claim.c is created to house the common
bits.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e1455744b27c9e6115c3508a7b2902157c2c4347)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB

Orabug: 22913653

Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM.  This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available.  When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fall back for persistent memory handling remains.  Namely,
map it with the WT (write-through) cache-type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum "writes are visible
outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()".  Code
that cares whether wmb_pmem() actually flushes writes to pmem must now
call arch_has_wmb_pmem() directly.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[hch: set ARCH_HAS_PMEM_API=n on x86_32]
Reviewed-by: Christoph Hellwig <hch@lst.de>
[toshi: x86_32 compile fixes]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 96601adb745186ccbcf5b078d4756f13381ec2af)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

add devm_memremap_pages

Orabug: 22913653

This behaves like devm_memremap except that it ensures we have page
structures available that can back the region.

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djbw: catch attempts to remap RAM, drop flags]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 41e94a851304f7acac840adec4004f8aeee53ad4)

Conflict:

include/linux/io.h

mm: ZONE_DEVICE for "device memory"

Orabug: 22913653

While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.

Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.

Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Jerome Glisse <j.glisse@gmail.com>
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 033fbae988fcb67e5077203512181890848b8e90)

Conflicts:

include/linux/memory_hotplug.h
mm/Kconfig

mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h

Orabug: 22913653

Three architectures already define these, and we'll need them genericly
soon.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 012dcef3f058385268630c0003e9b7f8dcafbeb4)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nd_blk: change aperture mapping from WC to WB

Orabug: 22913653

This should result in a pretty sizeable performance gain for reads.  For
rough comparison I did some simple read testing using PMEM to compare
reads of write combining (WC) mappings vs write-back (WB).  This was
done on a random lab machine.

PMEM reads from a write combining mapping:
# dd of=/dev/null if=/dev/pmem0 bs=4096 count=100000
100000+0 records in
100000+0 records out
409600000 bytes (410 MB) copied, 9.2855 s, 44.1 MB/s

PMEM reads from a write-back mapping:
# dd of=/dev/null if=/dev/pmem0 bs=4096 count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB) copied, 3.44034 s, 1.2 GB/s

To be able to safely support a write-back aperture I needed to add
support for the "read flush" _DSM flag, as outlined in the DSM spec:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

This flag tells the ND BLK driver that it needs to flush the cache lines
associated with the aperture after the aperture is moved but before any
new data is read.  This ensures that any stale cache lines from the
previous contents of the aperture will be discarded from the processor
cache, and the new data will be read properly from the DIMM.  We know
that the cache lines are clean and will be discarded without any
writeback because either a) the previous aperture operation was a read,
and we never modified the contents of the aperture, or b) the previous
aperture operation was a write and we must have written back the dirtied
contents of the aperture to the DIMM before the I/O was completed.

In order to add support for the "read flush" flag I needed to add a
generic routine to invalidate cache lines, mmio_flush_range().  This is
protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
only supported on x86.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 67a3e8fe90156d41cd480d3dfbb40f3bc007c262)

Conflicts:

arch/x86/Kconfig
drivers/acpi/nfit.c

pmem, dax: have direct_access use __pmem annotation

Orabug: 22913653

Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer. This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit e2e05394e4a3420dab96f728df4531893494e15d)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: update I/O path to do proper PMEM flushing

Orabug: 22913653

Update the DAX I/O path so that all operations that store data (I/O
writes, zeroing blocks, punching holes, etc.) properly synchronize the
stores to media using the PMEM API. This ensures that the data DAX is
writing is durable on media before the operation completes.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 2765cfbb342c727c3fd47b165196cb16da158022)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem: add copy_from_iter_pmem() and clear_pmem()

Orabug: 22913653

Add support for two new PMEM APIs, copy_from_iter_pmem() and
clear_pmem().  copy_from_iter_pmem() is used to copy data from an
iterator into a PMEM buffer.  clear_pmem() zeros a PMEM memory range.

Both of these new APIs must be explicitly ordered using a wmb_pmem()
function call and are implemented in such a way that the wmb_pmem()
will make the stores to PMEM durable.  Because both APIs are unordered
they can be called as needed without introducing any unwanted memory
barriers.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 5de490daec8b6354b90d5c9d3e2415b195f5adb6)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem, x86: clean up conditional pmem includes

Orabug: 22913653

Prior to this change x86_64 used the pmem defines in
arch/x86/include/asm/pmem.h, and UM used the default ones at the
top of include/linux/pmem.h. The inclusion or exclusion in linux/pmem.h
was controlled by CONFIG_ARCH_HAS_PMEM_API, but the ones in asm/pmem.h
were controlled by ARCH_HAS_NOCACHE_UACCESS.

Instead, control them both with CONFIG_ARCH_HAS_PMEM_API so that it's
clear that they are related and we don't run into the possibility where
they are both included or excluded. Also remove a bunch of stale
function prototypes meant for UM in asm/pmem.h - these just conflicted
with the inline defaults in linux/pmem.h and gave compile errors.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 4a370df5534ef727cba9a9d74bf22e0609f91d6e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem: remove layer when calling arch_has_wmb_pmem()

Orabug: 22913653

Prior to this change arch_has_wmb_pmem() was only called by
arch_has_pmem_api(). Both arch_has_wmb_pmem() and arch_has_pmem_api()
checked to make sure that CONFIG_ARCH_HAS_PMEM_API was enabled.

Instead, remove the old arch_has_wmb_pmem() wrapper to be rid of one
extra layer of indirection and the redundant CONFIG_ARCH_HAS_PMEM_API
check. Rename __arch_has_wmb_pmem() to arch_has_wmb_pmem() since we no
longer have a wrapper, and just have arch_has_pmem_api() call the
architecture specific arch_has_wmb_pmem() directly.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 18279b467a9d89afe44afbc19d768e834dbf4545)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem, x86: move x86 PMEM API to new pmem.h header

Orabug: 22913653

Move the x86 PMEM API implementation out of asm/cacheflush.h and into
its own header asm/pmem.h. This will allow members of the PMEM API to
be more easily identified on this and other architectures.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 40603526569b304dd92f720f2f8ab11e828ea145)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem: switch to devm_ allocations

Orabug: 22913653

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djbw: tools/testing/nvdimm/ and memunmap_pmem support]
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 708ab62bef1ed3a3cf065a4138bd87f5d083cfeb)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

devres: add devm_memremap

Orabug: 22913653

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 7d3dcf26a6559fa82af3f53e2c8b163cec95fdaf)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

arch: introduce memremap()

Orabug: 22913653

Existing users of ioremap_cache() are mapping memory that is known in
advance to not have i/o side effects.  These users are forced to cast
away the __iomem annotation, or otherwise neglect to fix the sparse
errors thrown when dereferencing pointers to this memory.  Provide
memremap() as a non __iomem annotated ioremap_*() in the case when
ioremap is otherwise a pointer to cacheable memory. Empirically,
ioremap_<cacheable-type>() call sites are seeking memory-like semantics
(e.g.  speculative reads, and prefetching permitted).

memremap() is a break from the ioremap implementation pattern of adding
a new memremap_<type>() for each mapping type and having silent
compatibility fall backs.  Instead, the implementation defines flags
that are passed to the central memremap() and if a mapping type is not
supported by an arch memremap returns NULL.

We introduce a memremap prototype as a trivial wrapper of
ioremap_cache() and ioremap_wt().  Later, once all ioremap_cache() and
ioremap_wt() usage has been removed from drivers we teach archs to
implement arch_memremap() with the ability to strictly enforce the
mapping type.

Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 92281dee825f6d2eb07c441437e4196a44b0861c)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem: convert to generic memremap

Orabug: 22913653

Kill arch_memremap_pmem() and just let the architecture specify the
flags to be passed to memremap(). Default to writethrough by default.

Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit e836a256e8fd579c9d7a3685f22981225a1ca451)

Conflict:

include/linux/pmem.h

mm: enhance region_is_ram() to region_intersects()

Orabug: 22913653

region_is_ram() is used to prevent the establishment of aliased mappings
to physical "System RAM" with incompatible cache settings. However, it
uses "-1" to indicate both "unknown" memory ranges (ranges not described
by platform firmware) and "mixed" ranges (where the parameters describe
a range that partially overlaps "System RAM").

Fix this up by explicitly tracking the "unknown" vs "mixed" resource
cases and returning REGION_INTERSECTS, REGION_MIXED, or REGION_DISJOINT.
This re-write also adds support for detecting when the requested region
completely eclipses all of a resource. Note, the implementation treats
overlaps between "unknown" and the requested memory type as
REGION_INTERSECTS.

Finally, other memory types can be passed in by name, for now the only
usage "System RAM".

Suggested-by: Luis R. Rodriguez <mcgrof@suse.com>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 124fe20d94630b6f173dae5eb815e6e6e350c72d)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nvdimm: change to use generic kvfree()

Orabug: 22913653

Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit a06a7576526e10a99ea7721533e7f2df3e26baad)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option

Orabug: 22913653

We currently register a platform device for e820 type-12 memory and
register a nvdimm bus beneath it.  Registering the platform device
triggers the device-core machinery to probe for a driver, but that
search currently comes up empty.  Building the nvdimm-bus registration
into the e820_pmem platform device registration in this way forces
libnvdimm to be built-in.  Instead, convert the built-in portion of
CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the
rest of the logic to the driver for e820_pmem, for the following
reasons:

1/ Letting e820_pmem support be a module allows building and testing
   libnvdimm.ko changes without rebooting

2/ All the normal policy around modules can be applied to e820_pmem
   (unbind to disable and/or blacklisting the module from loading by
   default)

3/ Moving the driver to a generic location and converting it to scan
   "iomem_resource" rather than "e820.map" means any other architecture can
   take advantage of this simple nvdimm resource discovery mechanism by
   registering a resource named "Persistent Memory (legacy)"

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 7a67832c7e44c20935c5d6f2264035a0f7bf0d8f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, btt: write and validate parent_uuid

Orabug: 22913653

When a BTT is instantiated on a namespace it must validate the namespace
uuid matches the 'parent_uuid' stored in the btt superblock. This
property enforces that changing the namespace UUID invalidates all
former BTT instances on that storage. For "IO namespaces" that don't
have a label or UUID, the parent_uuid is set to zero, and this
validation is skipped. For such cases, old BTTs have to be invalidated
by forcing the namespace to raw mode, and overwriting the BTT info
blocks.

Based on a patch by Dan Williams <dan.j.williams@intel.com>

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 6ec689542b5bc516187917d49b112847dfb75b0b)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, btt: consolidate arena validation

Orabug: 22913653

Use arena_is_valid as a common routine for checking the validity of an
info block from both discover_arenas, and nd_btt_probe.

As a result, don't check for validity of the BTT's UUID, and lbasize.
The checksum in the BTT info block guarantees self-consistency, and when
we're called from nd_btt_probe, we don't have a valid uuid or lbasize
available to check against.

Also cleanup to return a bool instead of an int.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ab45e7632717b811e0786e46ca5ad279cb731b66)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, btt: clean up internal interfaces

Orabug: 22913653

Consolidate the parameters passed to arena_is_valid into just nd_btt,
and an info block to increase re-usability.

Similarly, btt_arena_write_layout doesn't need to be passed a uuid, as
it can be obtained from arena->nd_btt.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit fbde1414acc0440024083bf0c391b259bcfc4826)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nvdimm: fix inline function return type warning

Orabug: 22913653

Fix multiple build warnings when CONFIG_BTT is not enabled:

In file included from ../drivers/nvdimm/bus.c:29:0:
../drivers/nvdimm/nd.h:169:15: warning: return type defaults to 'int' [-Wreturn-type]
static inline nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
^

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f6ef5a2a50816b58e3126206de13d0b9fdf89df5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nfit: Don't check _STA on NVDIMM devices

Orabug: 22913653

The _STA only applies to the root device, not the individual NVDIMMS,
so don't check here. NVDIMM device state flags are checked elsewhere.

Signed-off-by: Linda Knippers <linda.knippers@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 60e95f43fc8573e81f54b0c1e0bc542c2260d956)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, pmem: Change pmem physical sector size to PAGE_SIZE

Orabug: 22913653

Based on a patch: c8fa317 brd: Request from fdisk 4k alignment by Boaz
Harrosh, allow fdisk to create properly aligned partitions for DAX. This
will also cause mkfs.ext4 to emit a warning if using a file system block
size of less than PAGE_SIZE.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Elliott, Robert <Elliott@hp.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Boaz Harrosh <boaz@plexistor.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 6b47496a6fc81816e7edaf8224dfb88e402a05f5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm: Add DSM support for Address Range Scrub commands

Orabug: 22913653

Add support for the three ARS DSM commands:
- Query ARS Capabilities - Queries the firmware to check if a given
  range supports scrub, and if so, which type (persistent vs. volatile)
- Start ARS - Starts a scrub for a given range/type
- Query ARS Status - Checks status of a previously started scrub, and
  provides the error logs if any.

  The commands are described by the example DSM spec at:
  http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Also add these commands to the nfit_test test framework, and return
canned data.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 39c686b862cdb2049b90e095b6c6c727b2a7ab60)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm: Update name of the ars_status_record mask field

Orabug: 22913653

The spec suggests that this is a simple 'length' field, not a mask.
Update the name accordingly.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit ec92777f2ba93c00387b8fe53780c25adc57c744)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm, btt: sparse fix

Orabug: 22913653

Fix:
drivers/nvdimm/btt.c:635:29: warning: restricted __le64 degrades to integer

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 5e32940621eb62064d98f42c9889db71b0368bde)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nfit: Clarify memory device state flags strings

Orabug: 22913653

ACPI 6.0 NFIT Memory Device State Flags in Table 5-129 defines
NVDIMM status as follows.  These bits indicate multiple info,
such as failures, pending event, and capability.

  Bit [0] set to 1 to indicate that the previous SAVE to the
  Memory Device failed.
  Bit [1] set to 1 to indicate that the last RESTORE from the
  Memory Device failed.
  Bit [2] set to 1 to indicate that platform flush of data to
  Memory Device failed. As a result, the restored data content
  may be inconsistent even if SAVE and RESTORE do not indicate
  failure.
  Bit [3] set to 1 to indicate that the Memory Device is observed
  to be not armed prior to OSPM hand off. A Memory Device is
  considered armed if it is able to accept persistent writes.
  Bit [4] set to 1 to indicate that the Memory Device observed
  SMART and health events prior to OSPM handoff.

/sys/bus/nd/devices/nmemX/nfit/flags shows this flags info.
The output strings associated with the bits are "save", "restore",
"smart", etc., which can be confusing as they may be interpreted
as positive status, i.e. save succeeded.

Change also the dev_info() message in acpi_nfit_register_dimms()
to be consistent with the sysfs flags strings.

Reported-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
[ross: rename 'not_arm' to 'not_armed']
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[djbw: defer adding bit5, HEALTH_ENABLED, for now]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 402bae597ec68b84498432f5a0069f28bfb807d6)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nfit, nd_blk: BLK status register is only 32 bits

Orabug: 22913653

Only read 32 bits for the BLK status register in read_blk_stat().

The format and size of this register is defined in the
"NVDIMM Driver Writer's guide":

http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit de4a196c02a2a2631b516d90da6e8d052ccb07e8)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

xfs: call dax_fault on read page faults for DAX

Orabug: 22913653

When modifying the patch series to handle the XFS MMAP_LOCK nesting
of page faults, I botched the conversion of the read page fault
path, and so it is only every calling through the page cache. Re-add
the necessary __dax_fault() call for such files.

Because the get_blocks callback on read faults may not set up the
mapping buffer correctly to allow unwritten extent completion to be
run, we need to allow callers of __dax_fault() to pass a null
complete_unwritten() callback. The DAX code always zeros the
unwritten page when it is read faulted so there are no stale data
exposure issues with not doing the conversion. The only downside
will be the potential for increased CPU overhead on repeated read
faults of the same page. If this proves to be a problem, then the
filesystem needs to fix it's get_block callback and provide a
convert_unwritten() callback to the read fault path.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Matthew Wilcox <willy@linux.intel.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit b2442c5a7fe92cca08437070c8a45a7aa0d1703e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

mm: Fix bugs in region_is_ram()

Orabug: 22913653

region_is_ram() looks up the iomem_resource table to check if
a target range is in RAM. However, it always returns with -1
due to invalid range checks. It always breaks the loop at the
first entry of the table.

Another issue is that it compares p->flags and flags, but it always
fails. flags is declared as int, which makes it as a negative value
with IORESOURCE_BUSY (0x80000000) set while p->flags is unsigned long.

Fix the range check and flags so that region_is_ram() works as
advertised.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1437088996-28511-4-git-send-email-toshi.kani@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 8c38de992be9aed0b34c4fab8f972c83d3b00dc4)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

x86/mm: Remove region_is_ram() call from ioremap

Orabug: 22913653

__ioremap_caller() calls region_is_ram() to walk through the
iomem_resource table to check if a target range is in RAM, which was
added to improve the lookup performance over page_is_ram() (commit
906e36c5c717 "x86: use optimized ioresource lookup in ioremap
function"). page_is_ram() was no longer used when this change was
added, though.

__ioremap_caller() then calls walk_system_ram_range(), which had
replaced page_is_ram() to improve the lookup performance (commit
c81c8a1eeede "x86, ioremap: Speed up check for RAM pages").

Since both checks walk through the same iomem_resource table for
the same purpose, there is no need to call both functions.

Aside of that walk_system_ram_range() is the only useful check at the
moment because region_is_ram() always returns -1 due to an
implementation bug. That bug in region_is_ram() cannot be fixed
without breaking existing ioremap callers, which rely on the subtle
difference of walk_system_ram_range() versus non page aligned ranges.

Once these offending callers are fixed we can use region_is_ram() and
remove walk_system_ram_range().

[ tglx: Massaged changelog ]

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Roland Dreier <roland@purestorage.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1437088996-28511-3-git-send-email-toshi.kani@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 9a58eebe1ace609bedf8c5a65e70a097459f5696)

Conflict:

arch/x86/mm/ioremap.c

libnvdimm: fix namespace seed creation

Orabug: 22913653

A new BLK namespace "seed" device is created whenever the current seed
is successfully probed. However, if that namespace is assigned to a BTT
it may never directly experience a successful probe as it is a
subordinate device to a BTT configuration.

The effect of the current code is that no new namespaces can be
instantiated, after the seed namespace, to consume available BLK DPA
capacity. Fix this by treating a successful BTT probe event as a
successful probe event for the backing namespace.

Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 8ca243536d21ae2d08f61b1c5af4ac3d4bb697e4)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nfit: add support for NVDIMM "latch" flag

Orabug: 22913653

Add support in the NFIT BLK I/O path for the "latch" flag
defined in the "Get Block NVDIMM Flags" _DSM function:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

This flag requires the driver to read back the command register after it
is written in the block I/O path. This ensures that the hardware has
fully processed the new command and moved the aperture appropriately.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f0f2c072cf530d5b8890be5051cc8b36b0c54cce)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nfit: update block I/O path to use PMEM API

Orabug: 22913653

Update the nfit block I/O path to use the new PMEM API and to adhere to
the read/write flows outlined in the "NVDIMM Block Window Driver
Writer's Guide":

http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

This includes adding support for targeted NVDIMM flushes called "flush
hints" in the ACPI 6.0 specification:

http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf

For performance and media durability the mapping for a BLK aperture is
moved to a write-combining mapping which is consistent with
memcpy_to_pmem() and wmb_blk().

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit c2ad29540cb913bd9e526fae77c35c7fb45f24a3)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

tools/testing/nvdimm: add mock acpi_nfit_flush_address entries to nfit_test

Orabug: 22913653

In preparation for fixing the BLK path to properly use "directed
pcommit" enable the unit test infrastructure to emit mock "flush"
tables. Writes to these flush addresses trigger a memory controller to
flush its internal buffers to persistent media, similar to the x86
"pcommit" instruction.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 9d27a87ec9e1317d368b1e5e3f4808078baa8c4c)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

tools/testing/nvdimm: fix return code for unimplemented commands

Orabug: 22913653

The implementation for the new "DIMM Flags" DSM relies on the -ENOTTY
return code to indicate that the flags are unimplimented and to fall
back to a safe default. As is the -ENXIO error code erroneoously
indicates to fail enabling a BLK region.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit f7ec83684af020c961d7fab801f8e3ef7ce5de33)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

tools/testing/nvdimm: mock ioremap_wt

Orabug: 22913653

In the 4.2-rc1 merge the default_memremap_pmem() implementation switched
from ioremap_nocache() to ioremap_wt(). Add it to the list of mocked
routines to restore the ability to run the unit tests.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b1b2e6235a44174151fa3bb22657f74972635618)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

pmem: add maintainer for include/linux/pmem.h

Orabug: 22913653

The file include/linux/pmem.h was recently created to hold the PMEM API,
and is logically part of the PMEM driver. Add an entry for this file to
MAINTAINERS.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit b864bc17f1c326783f2388057e15d3e153125ab9)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nfit: fix smatch "use after null check" report

Orabug: 22913653

drivers/acpi/nfit.c:1224 acpi_nfit_blk_region_enable()
         error: we previously assumed 'nfit_mem' could be null (see line 1223)

drivers/acpi/nfit.c
  1222          nfit_mem = nvdimm_provider_data(nvdimm);
  1223          if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
                     ^^^^^^^^
Check.

  1224                  dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
  1225                                  nfit_mem ? "" : " nfit_mem",
  1226                                  nfit_mem->dcr ? "" : " dcr",
                                        ^^^^^^^^^^^^^
Unchecked dereference.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 193ccca43850d2355e7690a93ab9d7d78d38f905)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

nvdimm: Fix return value of nvdimm_bus_init() if class_create() fails

Orabug: 22913653

Return proper error if class_create() fails.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit daa1dee405d7d3d3e816b84a692e838a5647a02a)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

libnvdimm: smatch cleanups in __nd_ioctl

Orabug: 22913653

Drop use of access_ok() since we are already using copy_{to|from}_user()
which do their own access_ok().

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit af834d457d9ed69e14836b63d0da198fdd2ec706)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

sparse: fix misplaced __pmem definition

Orabug: 22913653

Move the definition of __pmem outside of CONFIG_SPARSE_RCU_POINTER to fix:

drivers/nvdimm/pmem.c:198:17: sparse: too many arguments for function __builtin_expect
drivers/nvdimm/pmem.c:36:33: sparse: expected ; at end of declaration
drivers/nvdimm/pmem.c:48:21: sparse: void declaration

...due to __pmem failing to be defined in some configurations when
CONFIG_SPARSE_RCU_POINTER=y.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
(cherry picked from commit 31f02455455d405320e2f749696bef4e02903b35)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: bdev_direct_access() may sleep

Orabug: 22913653

The brd driver is the only in-tree driver that may sleep currently.
After some discussion on linux-fsdevel, we decided that any driver
may choose to sleep in its ->direct_access method. To ensure that all
callers of bdev_direct_access() are prepared for this, add a call
to might_sleep().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 43c3dd08da890e458f670b4fc0630513fb405620)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

block: Add support for DAX reads/writes to block devices

Orabug: 22913653

If a block device supports the ->direct_access methods, bypass the normal
DIO path and use DAX to go straight to memcpy() instead of allocating
a DIO and a BIO.

Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
do_blockdev_direct_IO().

Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit bbab37ddc20bae4709bca8745c128c4f46fe63c5)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: Use copy_from_iter_nocache

Orabug: 22913653

When userspace does a write, there's no need for the written data to
pollute the CPU cache. This matches the original XIP code.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 872eb127e3a6cddcfca1410bb808d9b9bc773dc1)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

dax: Add block size note to documentation

Orabug: 22913653

For block devices which are small enough, mkfs will default to creating
a filesystem with block sizes smaller than page size.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
(cherry picked from commit 44f4c054cae646a9296da05cdfe7d6e786f73d46)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

xfs: saner xfs_trans_commit interface

The flags argument to xfs_trans_commit is not useful for most callers, as
a commit of a transaction without a permanent log reservation must pass
0 here, and all callers for a transaction with a permanent log reservation
except for xfs_trans_roll must pass XFS_TRANS_RELEASE_LOG_RES. So remove
the flags argument from the public xfs_trans_commit interfaces, and
introduce low-level __xfs_trans_commit variant just for xfs_trans_roll
that regrants a log reservation instead of releasing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 70393313dd0b26a6a79e2737b6dff1f1937b936d)

Conflict:

fs/xfs/xfs_attr_inactive.c

xfs: remove the flags argument to xfs_trans_cancel

Orabug: 22913653

xfs_trans_cancel takes two flags arguments: XFS_TRANS_RELEASE_LOG_RES and
XFS_TRANS_ABORT.  Both of them are a direct product of the transaction
state, and can be deducted:

- any dirty transaction needs XFS_TRANS_ABORT to be properly canceled,
   and XFS_TRANS_ABORT is a noop for a transaction that is not dirty.
- any transaction with a permanent log reservation needs
   XFS_TRANS_RELEASE_LOG_RES to be properly canceled, and passing
   XFS_TRANS_RELEASE_LOG_RES for a transaction without a permanent
   log reservation is invalid.

So just remove the flags argument and do the right thing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 4906e21545814e4129595118287a2f1415483c0b)

Conflict:

fs/xfs/xfs_attr_inactive.c

xfs: pass a boolean flag to xfs_trans_free_items

Orabug: 22913653

The flags value always was 0 or XFS_TRANS_ABORT. Switch to a bool
parameter to allow further cleanups.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit eacb24e73424bdae4aa139ddd459f86ec46f0ad0)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

xfs: switch remaining xfs_trans_dup users to xfs_trans_roll

Orabug: 22913653

We have three remaining callers of xfs_trans_dup:

- xfs_itruncate_extents which open codes xfs_trans_roll
- xfs_bmap_finish doesn't have an xfs_inode argument and thus leaves
   attaching them to it's callers, but otherwise is identical to
   xfs_trans_roll
- xfs_dir_ialloc looks at the log reservations in the old xfs_trans
   structure instead of the log reservation parameters, but otherwise
   is identical to xfs_trans_roll.

By allowing a NULL xfs_inode argument to xfs_trans_roll we can switch
these three remaining users over to xfs_trans_roll and mark xfs_trans_dup
static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 2e6db6c4c1f71823472651f162f0b355f5b7951e)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

xfs: add initial DAX support

Orabug: 22913653

Add initial DAX support to XFS. To do this we need a new mount
option to turn DAX on filesystem, and we need to propagate this into
the inode flags whenever an inode is instantiated so that the
per-inode checks throughout the code Do The Right Thing.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit cbe4dab119f211ff6642d617f541087894e99e4f)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

xfs: add DAX IO path support

Orabug: 22913653

DAX does not do buffered IO (can't buffer direct access!) and hence
all read/write IO is vectored through the direct IO path. Hence we
need to add the DAX IO path callouts to the direct IO
infrastructure.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 6e1ba0bcb84b3f97616feb07c27f974509ba57be)
Signed-off-by: Dan Duval <dan.duval@oracle.com>

xfs: add DAX truncate support

Orabug: 22913653

When we truncate a DAX file, we need to call through the DAX page
truncation path rather than through block_truncate_page() so that
mappings and block zeroing are all handled correctly. Otherwise,
truncate does not need to change.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
(cherry picked from commit 9969441f9f86a8a7de8c36514fa789e5f5d83145)
Signed-off-by: Dan Duval <dan.duval@oracle.com>