Hillf Danton noticed bugs in the hugetlb_vmtruncate_list routine. The
argument end is of type pgoff_t. It was being converted to a vaddr offset
and passed to unmap_hugepage_range. However, end was also being used as
an argument to the vma_interval_tree_foreach controlling loop. In
addition, the conversion of end to vaddr offset was incorrect.
hugetlb_vmtruncate_list is called as part of a file truncate or fallocate
hole punch operation.
When truncating a hugetlbfs file, this bug could prevent some pages from
being unmapped. This is possible if there are multiple vmas mapping the
file, and there is a sufficiently sized hole between the mappings. The
size of the hole between two vmas (A,B) must be such that the starting
virtual address of B is greater than (ending virtual address of A <<
PAGE_SHIFT). In this case, the pages in B would not be unmapped. If
pages are not properly unmapped during truncate, the following BUG is hit:
kernel BUG at fs/hugetlbfs/inode.c:428!
In the fallocate hole punch case, this bug could prevent pages from being
unmapped as in the truncate case. However, for hole punch the result is
that unmapped pages will not be removed during the operation. For hole
punch, it is also possible that more pages than desired will be unmapped.
This unnecessary unmapping will cause page faults to reestablish the
mappings on subsequent page access.
Fixes: 1bfad99ab (" hugetlbfs: hugetlb_vmtruncate_list() needs to take a range")Reported-by: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> [4.3] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 29bfa18ca579419db54d93250c199bcb54d80047) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Filipe Manana [Sun, 3 May 2015 00:56:00 +0000 (01:56 +0100)]
Btrfs: incremental send, fix clone operations for compressed extents
Marc reported a problem where the receiving end of an incremental send
was performing clone operations that failed with -EINVAL. This happened
because, unlike for uncompressed extents, we were not checking if the
source clone offset and length, after summing the data offset, falls
within the source file's boundaries.
So make sure we do such checks when attempting to issue clone operations
for compressed extents.
Problem reproducible with the following steps:
$ mkfs.btrfs -f /dev/sdb
$ mount -o compress /dev/sdb /mnt
$ mkfs.btrfs -f /dev/sdc
$ mount -o compress /dev/sdc /mnt2
# Create the file with a single extent of 128K. This creates a metadata file
# extent item with a data start offset of 0 and a logical length of 128K.
$ xfs_io -f -c "pwrite -S 0xaa 64K 128K" -c "fsync" /mnt/foo
# Now rewrite the range 64K to 112K of our file. This will make the inode's
# metadata continue to point to the 128K extent we created before, but now
# with an extent item that points to the extent with a data start offset of
# 112K and a logical length of 16K.
# That metadata file extent item is associated with the logical file offset
# at 176K and covers the logical file range 176K to 192K.
$ xfs_io -c "pwrite -S 0xbb 64K 112K" -c "fsync" /mnt/foo
# Now rewrite the range 180K to 12K. This will make the inode's metadata
# continue to point the the 128K extent we created earlier, with a single
# extent item that points to it with a start offset of 112K and a logical
# length of 4K.
# That metadata file extent item is associated with the logical file offset
# at 176K and covers the logical file range 176K to 180K.
$ xfs_io -c "pwrite -S 0xcc 180K 12K" -c "fsync" /mnt/foo
$ btrfs send /mnt/snap1 | btrfs receive /mnt2
At subvol /mnt/snap1
At subvol snap1
$ btrfs send -p /mnt/snap1 /mnt/snap2 | btrfs receive /mnt2
At subvol /mnt/snap2
At snapshot snap2
ERROR: failed to clone extents to bar
Invalid argument
Reported-by: Marc MERLIN <marc@merlins.org> Tested-by: Marc MERLIN <marc@merlins.org> Signed-off-by: Filipe Manana <fdmanana@suse.com> Tested-by: David Sterba <dsterba@suse.cz> Tested-by: Jan Alexander Steffens (heftig) <jan.steffens@gmail.com> Signed-off-by: Chris Mason <clm@fb.com>
(cherry picked from commit 619d8c4ef7c5dd346add55da82c9179cd2e3387e) Signed-off-by: Ashish Samant <ashish.samant@oracle.com>
Preserve any rpcrdma_req that is attached to rpc_rqst's allocated
for the backchannel. Otherwise, after all the pre-allocated
backchannel req's are consumed, incoming backward calls start
writing on freed memory.
Somehow this hunk got lost.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: John Sobecki <john.sobecki@oracle.com> Tested-by: Dai Ngo <dai.ngo@oracle.com>
A value of 2560 (1280k) will accommodate a 10-data-disk stripe
write with chunk size 128k. In the testing I've done using
iozone, fio, and aio-stress across a number of different storage
devices, a value of 1280 does not show a big performance
difference from 512, but will hopefully help software RAID
setups using SATA disks, as reported by Christoph.
NOTE: drivers/block/aoe/aoeblk.c sets its own max_hw_sectors_kb to
BLK_DEF_MAX_SECTORS. So, this patch essentially changes aeoblk to
Use a larger maximum sector size, and I did not test this.
EXT4-fs warning (device dm-X): ext4_end_bio:313: I/O error writing to ino
Below is revert text from mainline:
This reverts commit 34b48db66e08ca1c1bc07cf305d672ac940268dc.
That commit caused performance regressions for streaming I/O
workloads on a number of different storage devices, from
SATA disks to external RAID arrays. It also managed to
trip up some buggy firmware in at least one drive, causing
data corruption.
The next patch will bump the default max_sectors_kb value to
1280, which will accommodate a 10-data-disk stripe write
with chunk size 128k. In the testing I've done using iozone,
fio, and aio-stress, a value of 1280 does not show a big
performance difference from 512. This will hopefully still
help the software RAID setup that Christoph saw the original
performance gains with while still not regressing other
storage configurations.
with a usage count of 100 * the number of times the program has been run,
then the kernel is malfunctioning. If leaked-keyring has zero usages or
has been garbage collected, then the problem is fixed.
Reported-by: Yevgeny Pats <yevgeny@perception-point.io> Signed-off-by: David Howells <dhowells@redhat.com>
Orabug: 22563965
CVE: CVE-2016-0728 Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
David Howells [Thu, 15 Oct 2015 16:21:37 +0000 (17:21 +0100)]
KEYS: Fix crash when attempt to garbage collect an uninstantiated keyring
The following sequence of commands:
i=`keyctl add user a a @s`
keyctl request2 keyring foo bar @t
keyctl unlink $i @s
tries to invoke an upcall to instantiate a keyring if one doesn't already
exist by that name within the user's keyring set. However, if the upcall
fails, the code sets keyring->type_data.reject_error to -ENOKEY or some
other error code. When the key is garbage collected, the key destroy
function is called unconditionally and keyring_destroy() uses list_empty()
on keyring->type_data.link - which is in a union with reject_error.
Subsequently, the kernel tries to unlink the keyring from the keyring names
list - which oopses like this:
David Howells [Fri, 25 Sep 2015 15:30:08 +0000 (16:30 +0100)]
KEYS: Fix race between key destruction and finding a keyring by name
There appears to be a race between:
(1) key_gc_unused_keys() which frees key->security and then calls
keyring_destroy() to unlink the name from the name list
(2) find_keyring_by_name() which calls key_permission(), thus accessing
key->security, on a key before checking to see whether the key usage is 0
(ie. the key is dead and might be cleaned up).
Fix this by calling ->destroy() before cleaning up the core key data -
including key->security.
Eric Northup [Tue, 3 Nov 2015 17:03:53 +0000 (18:03 +0100)]
KVM: x86: work around infinite loop in microcode when #AC is delivered
It was found that a guest can DoS a host by triggering an infinite
stream of "alignment check" (#AC) exceptions. This causes the
microcode to enter an infinite loop where the core never receives
another interrupt. The host kernel panics pretty quickly due to the
effects (CVE-2015-5307).
Signed-off-by: Eric Northup <digitaleric@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Orabug: 22333632
CVE: CVE-2015-5307
mainline commit 54a20552e1eae07aa240fa370a0293e006b5faed Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
V2: The original version of the patch did not correctly handle placeholder
entries before the range to be deleted. The new check is more specific
and only matches placeholders at the start of range.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit fd2e0def3e0954be0453b625ce12c48e4a83bc70) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Dmitry identified a potential memory leak in the routine region_chg, where
a region descriptor is not free'ed on an error path.
However, the root cause for the above memory leak resides in region_del.
In this specific case, a "placeholder" entry is created in region_chg.
The associated page allocation fails, and the placeholder entry is left in
the reserve map. This is "by design" as the entry should be deleted when
the map is released. The bug is in the region_del routine which is used
to delete entries within a specific range (and when the map is released).
region_del did not handle the case where a placeholder entry exactly
matched the start of the range range to be deleted. In this case, the
entry would not be deleted and leaked. The fix is to take these special
placeholder entries into account in region_del.
The region_chg error path leak is also fixed.
Fixes: feba16e25a57 ("mm/hugetlb: add region_del() to delete a specific range of entries") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> [4.3+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 8ea87f6c208bd070a2701b53d2479a415f4159f4) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
port from RH. https://lkml.org/lkml/2015/7/9/724
fs] vfs: Prevent syncing frozen file system (Lukas Czerner)
[12434041241791]
port from RH. https://lkml.org/lkml/2015/7/9/487
From Lukas Czerner <>
Subject [RFC][PATCH] fs: Prevent syncing frozen file system
Date Thu, 9 Jul 2015 19:45:45 +0200
Currently we can end up in a deadlock because of broken
sb_start_write -> s_umount ordering.
The race goes like this:
- write the file
- unlink the file - final_iput will not be calles as file is opened
- freeze the file system
- Now simultaneously close the file and call sync (or syncfs on that
particular file system). Sync will get to wait_sb_inodes() where it will
grab the referece to the inode (__iget()) and later to call iput().
If we manage to close the file and drop the reference in between those
calls sync will attempt to do a iput_final() because the inode is now
unlinked and we're holding the last reference to it. This will
however block on a frozen file system (ext4_delete_inode for
example).
Note that I've not been able to reproduce the issue, I've only seen this
happen once. However with some instrumentation (like msleep() in the
wait_sb_inodes() it can be achieved.
Fix this by properly doing sb_start_write/sb_end_write to prevent us
from fsfreeze.
Note that with this patch syncfs will block on the frozen file system
which is probably ok, but sync will block if any file system happens to
be frozen - not sure if that's a problem, but it's certainly different
from what we've been used to.
V3:
Add more descriptive comments and minor improvements as suggested by
Naoya Horiguchi
v2:
Make remove_inode_hugepages simpler after verifying truncate can not
race with page faults here.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit d4d7420e86e29267ff2e20226cbcadcf7e6eab8c) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Hugh Dickins pointed out problems with the new hugetlbfs fallocate hole
punch code. These problems are in the routine remove_inode_hugepages and
mostly occur in the case where there are holes in the range of pages to be
removed. These holes could be the result of a previous hole punch or
simply sparse allocation. The current code could access pages outside the
specified range.
remove_inode_hugepages handles both hole punch and truncate operations.
Page index handling was fixed/cleaned up so that the loop index always
matches the page being processed. The code now only makes a single pass
through the range of pages as it was determined page faults could not race
with truncate. A cond_resched() was added after removing up to
PAGEVEC_SIZE pages.
Some totally unnecessary code in hugetlbfs_fallocate() that remained from
early development was also removed.
Tested with fallocate tests submitted here:
http://librelist.com/browser//libhugetlbfs/2015/6/25/patch-tests-add-tests-for-fallocate-system-call/
And, some ftruncate tests under development
Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: "Hillf Danton" <hillf.zj@alibaba-inc.com> Cc: <stable@vger.kernel.org> [4.3] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit ef9a2b7a46755b6b2d4ab522c2ffa53c6e1a0729) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats
Many commonly used functions like getifaddrs() invoke RTM_GETLINK
to dump the interface information, and do not need the
the AF_INET6 statististics that are always returned by default
from rtnl_fill_ifinfo().
Computing the statistics can be an expensive operation that impacts
scaling, so it is desirable to avoid this if the information is
not needed.
This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that
can be passed with netlink_request() to avoid statistics computation
for the ifinfo path.
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Michal Hocko [Wed, 24 Jun 2015 23:58:06 +0000 (16:58 -0700)]
mm: do not ignore mapping_gfp_mask in page cache allocation paths
page_cache_read, do_generic_file_read, __generic_file_splice_read and
__ntfs_grab_cache_pages currently ignore mapping_gfp_mask when calling
add_to_page_cache_lru which might cause recursion into fs down in the
direct reclaim path if the mapping really relies on GFP_NOFS semantic.
This doesn't seem to be the case now because page_cache_read (page fault
path) doesn't seem to suffer from the reclaim recursion issues and
do_generic_file_read and __generic_file_splice_read also shouldn't be
called under fs locks which would deadlock in the reclaim path. Anyway it
is better to obey mapping gfp mask and prevent from later breakage.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6afdb859b71019143b8eecda02b8b29b03185055)
The enumeration path should leave NumVFs set to zero. But after 4449f079722c ("PCI: Calculate maximum number of buses required for VFs"),
we call virtfn_max_buses() in the enumeration path, which changes NumVFs.
This NumVFs change is visible via lspci and sysfs until a driver enables
SR-IOV.
Set NumVFs to zero after virtfn_max_buses() computes the maximum number of
buses.
Fixes: 4449f079722c ("PCI: Calculate maximum number of buses required for VFs") Based-on-patch-by: Ethan Zhao <ethan.zhao@oracle.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
virtio declares support for NETIF_F_FRAGLIST, but assumes
that there are at most MAX_SKB_FRAGS + 2 fragments which isn't
always true with a fraglist.
A longer fraglist in the skb will make the call to skb_to_sgvec overflow
the sg array, leading to memory corruption.
Drop NETIF_F_FRAGLIST so we only get what we can handle.
Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 48900cb6af4282fa0fb6ff4d72a81aa3dadb5c39) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Jeff Moyer [Tue, 10 Nov 2015 22:22:59 +0000 (14:22 -0800)]
blk-mq: avoid excessive boot delays with large lun counts
Zhangqing Luo reported long boot times on a system with thousands of
LUNs when scsi-mq was enabled. He narrowed the problem down to
blk_mq_add_queue_tag_set, where every queue is frozen in order to set
the BLK_MQ_F_TAG_SHARED flag. Each added device will freeze all queues
added before it in sequence, which involves waiting for an RCU grace
period for each one. We don't need to do this. After the second queue
is added, only new queues need to be initialized with the shared tag.
We can do that by percolating the flag up to the blk_mq_tag_set, and
updating the newly added queue's hctxs if the flag is set.
This problem was introduced by commit 0d2602ca30e41 (blk-mq: improve
support for shared tags maps).
Reported-and-tested-by: Jason Luo <zhangqing.luo@oracle.com> Reviewed-by: Ming Lei <ming.lei@canonical.com> Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 2404e607a9ee36db361bebe32787dafa1f7d6c00)
Orabug: 21879600 Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>
Jana Iyengar found an interesting issue on CUBIC :
The epoch is only updated/reset initially and when experiencing losses.
The delta "t" of now - epoch_start can be arbitrary large after app idle
as well as the bic_target. Consequentially the slope (inverse of
ca->cnt) would be really large, and eventually ca->cnt would be
lower-bounded in the end to 2 to have delayed-ACK slow-start behavior.
This particularly shows up when slow_start_after_idle is disabled
as a dangerous cwnd inflation (1.5 x RTT) after few seconds of idle
time.
Jana initial fix was to reset epoch_start if app limited,
but Neal pointed out it would ask the CUBIC algorithm to recalculate the
curve so that we again start growing steeply upward from where cwnd is
now (as CUBIC does just after a loss). Ideally we'd want the cwnd growth
curve to be the same shape, just shifted later in time by the amount of
the idle period.
Reported-by: Jana Iyengar <jri@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Sangtae Ha <sangtae.ha@gmail.com> Cc: Lawrence Brakmo <lawrence@brakmo.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 30927520dbae297182990bb21d08762bcc35ce1d)
Linux in box NVMe driver does not handle 0 MDTS as expected
•0 MDTS - the drive can accept any request size.
•The device driver set up max hardware sector size by
BLK_SAFE_MAX_SECTORS or 124KB.
•Every IO size greater than 124KB is splitted by 124KB and remainder.
HWP previously was only enabled at driver load time, on the boot
CPU, however, HWP must be enabled per package. Move the code to
enable HWP to the cpufreq driver init path so that it will be
called per CPU.
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> Tested-by: David Zhuang <david.zhuang@oracle.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ba88d4338f226766f510e207911dde8c1875e072) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/cpufreq/intel_pstate.c Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
The problem is almost exactly the same as the one fixed by feaff8e5b2cf.
We must take a reference to the struct file when running the LOCKU
compound to prevent the final fput from running until the operation is
complete.
Reported-by: Jean Spector <jean@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Orabug: 21687670
(cherry picked from mainline commit db2efec0caba4f81a22d95a34da640b86c313c8e) Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
Commit 56bb4d795 introduced a build error if CONFIG_NUMA is not
defined. When fallocate preallocation allocates pages, it will
use the defined numa policy. However, if numa is not defined
there is no such policy and no code should reference numa policy.
Create wrappers to isolate policy manipulation code that are a
NOOP in the non-NUMA case.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0ce9057732d1dd94ef2bd32c8acb68ae68b08a2f) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
This is based on the shmem version, but it has diverged quite a bit. We
have no swap to worry about, nor the new file sealing. Add
synchronication via the fault mutex table to coordinate page faults,
fallocate allocation and fallocate hole punch.
What this allows us to do is move physical memory in and out of a
hugetlbfs file without having it mapped. This also gives us the ability
to support MADV_REMOVE since it is currently implemented using
fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of a
hugetlbfs file, which wasn't possible before.
hugetlbfs fallocate only operates on whole huge pages.
Based on code by Dave Hansen.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 79925b07fd22d7c4e4e77cdc26edb26dc4ff2701) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Currently, there is only a single place where hugetlbfs pages are added to
the page cache. The new fallocate code be adding a second one, so break
the functionality out into its own helper.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4b153581930e8c61250078efcdcce3e19bc2a45b) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Areas hole punched by fallocate will not have entries in the
region/reserve map. However, shared mappings with min_size subpool
reservations may still have reserved pages. alloc_huge_page needs to
handle this special case and do the proper accounting.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit b21aa74b8077ce5b9c5fea566fe37af866934746) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
In vma_has_reserves(), the current assumption is that reserves are always
present for shared mappings. However, this will not be the case with
fallocate hole punch. When punching a hole, the present page will be
deleted as well as the region/reserve map entry (and hence any
reservation). vma_has_reserves is passed "chg" which indicates whether or
not a region/reserve map is present. Use this to determine if reserves
are actually present or were removed via hole punch.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 320fb799e7cedc1edb96fd69c686547c731d5fc8) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Modify truncate_hugepages() to take a range of pages (start, end) instead
of simply start. If an end value of LLONG_MAX is passed, the current
"truncate" functionality is maintained. Existing callers are modified to
pass LLONG_MAX as end of range. By keying off end == LLONG_MAX, the
routine behaves differently for truncate and hole punch. Page removal is
now synchronized with page allocation via faults by using the fault mutex
table. The hole punch case can experience the rare region_del error and
must handle accordingly.
Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in the
case where region_del returns an error.
Since the routine handles more than just the truncate case, it is renamed
to remove_inode_hugepages(). To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().
Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from region_del and
pass it back to the caller.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 6a57804ccdfb77b8f333b736a3ee7cb1bf8732e1) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
fallocate hole punch will want to unmap a specific range of pages. Modify
the existing hugetlb_vmtruncate_list() routine to take a start/end range.
If end is 0, this indicates all pages after start should be unmapped.
This is the same as the existing truncate functionality. Modify existing
callers to add 0 as end of range.
Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit dea23e2a9e811e5fba895a134f701455908aa0d3) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
hugetlb page faults are currently synchronized by the table of mutexes
(htlb_fault_mutex_table). fallocate code will need to synchronize with
the page fault code when it allocates or deletes pages. Expose interfaces
so that fallocate operations can be synchronized with page faults. Minor
name changes to be more consistent with other global hugetlb symbols.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit fec73245c33b067c60f520908a93c971003664c8) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
fallocate hole punch will want to remove a specific range of pages. The
existing region_truncate() routine deletes all region/reserve map entries
after a specified offset. region_del() will provide this same
functionality if the end of region is specified as LONG_MAX. Hence,
region_del() can replace region_truncate().
Unlike region_truncate(), region_del() can return an error in the rare
case where it can not allocate memory for a region descriptor. This ONLY
happens in the case where an existing region must be split. Current
callers passing LONG_MAX as end of range will never experience this error
and do not need to deal with error handling. Future callers of
region_del() (such as fallocate hole punch) will need to handle this
error.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit c2cfad5701106f8ddb0607b9e09d524ef55ef0ec) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
hugetlbfs is used today by applications that want a high degree of control
over huge page usage. Often, large hugetlbfs files are used to map a
large number huge pages into the application processes. The applications
know when page ranges within these large files will no longer be used, and
ideally would like to release them back to the subpool or global pools for
other uses. The fallocate() system call provides an interface for
preallocation and hole punching within files. This patch set adds
fallocate functionality to hugetlbfs.
fallocate hole punch will want to remove a specific range of pages. When
pages are removed, their associated entries in the region/reserve map will
also be removed. This will break an assumption in the
region_chg/region_add calling sequence. If a new region descriptor must
be allocated, it is done as part of the region_chg processing. In this
way, region_add can not fail because it does not need to attempt an
allocation.
To prepare for fallocate hole punch, create a "cache" of descriptors that
can be used by region_add if necessary. region_chg will ensure there are
sufficient entries in the cache. It will be necessary to track the number
of in progress add operations to know a sufficient number of descriptors
reside in the cache. A new routine region_abort is added to adjust this
in progress count when add operations are aborted. vma_abort_reservation
is also added for callers creating reservations with
vma_needs_reservation/vma_commit_reservation.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 27af163113310a86b6d19bb5693c1a08eb89b0f7) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
alloc_huge_page and hugetlb_reserve_pages use region_chg to calculate the
number of pages which will be added to the reserve map. Subpool and
global reserve counts are adjusted based on the output of region_chg.
Before the pages are actually added to the reserve map, these routines
could race and add fewer pages than expected. If this happens, the
subpool and global reserve counts are not correct.
Compare the number of pages actually added (region_add) to those expected
to added (region_chg). If fewer pages are actually added, this indicates
a race and adjust counters accordingly.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 33039678c8da8133e30ea3250d10ae14701dae2b) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Modify region_add() to keep track of regions(pages) added to the reserve
map and return this value. The return value can be compared to the return
value of region_chg() to determine if the map was modified between calls.
Make vma_commit_reservation() also pass along the return value of
region_add(). In the normal case, we want vma_commit_reservation to
return the same value as the preceding call to vma_needs_reservation.
Create a common __vma_reservation_common routine to help keep the special
case return values in sync
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cf3ad20bfeadda693e408d85684790714fc29b08) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
While working on hugetlbfs fallocate support, I noticed the following race
in the existing code. It is unlikely that this race is hit very often in
the current code. However, if more functionality to add and remove pages
to hugetlbfs mappings (such as fallocate) is added the likelihood of
hitting this race will increase.
alloc_huge_page and hugetlb_reserve_pages use information from the reserve
map to determine if there are enough available huge pages to complete the
operation, as well as adjust global reserve and subpool usage counts. The
order of operations is as follows:
- call region_chg() to determine the expected change based on reserve map
- determine if enough resources are available for this operation
- adjust global counts based on the expected change
- call region_add() to update the reserve map
The issue is that reserve map could change between the call to region_chg
and region_add. In this case, the counters which were adjusted based on
the output of region_chg will not be correct.
In order to hit this race today, there must be an existing shared hugetlb
mmap created with the MAP_NORESERVE flag. A page fault to allocate a huge
page via this mapping must occur at the same another task is mapping the
same region without the MAP_NORESERVE flag.
The patch set does not prevent the race from happening. Rather, it adds
simple functionality to detect when the race has occurred. If a race is
detected, then the incorrect counts are adjusted.
Review comments pointed out the need for documentation of the existing
region/reserve map routines. This patch set also adds documentation in
this area.
This patch (of 3):
This is a documentation only patch and does not modify any code.
Descriptions of the routines used for reserve map/region tracking are
added.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David Rientjes <rientjes@google.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1dd308a7b49d4bdbc17bfa570675ecc8cf7bedb3) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
When I fixed Linux's NT flag handling, I added an optimization to
Linux 3.19 and up. A malicious 32-bit program might be able to leak
NT into an unrelated task. On a CONFIG_PREEMPT=y kernel, this is a
straightforward DoS. On a CONFIG_PREEMPT=n kernel, it's probably
still exploitable for DoS with some more care.
I believe that this could be used for privilege escalation, too, but
it won't be easy.
Keith Busch [Thu, 6 Aug 2015 18:13:50 +0000 (11:13 -0700)]
NVMe: Fix filesystem deadlock on removal
Move gendisk deletion before controller shutdown so filesystem may sync
dirty pages. Before, this would deadlock trying to allocate requests
on frozen queues that are about to be deleted.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:
Keith Busch [Thu, 6 Aug 2015 18:07:24 +0000 (11:07 -0700)]
NVMe: Failed controller initialization fixes
This fixes an infinite device reset loop that may occur on devices that
fail initialization. If the drive fails to become ready for any reason
that does not involve an admin command timeout, the probe task should
assume the drive is unavailable and remove it from the topology. In
the case an admin command times out during device probing, the driver's
existing reset action will handle removing the drive.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit de3eff2bad56f0a29d3915105223d368f2bbc94e)
Santosh Shilimkar [Thu, 6 Aug 2015 21:48:39 +0000 (14:48 -0700)]
NVMe: Unify controller probe and resume
This unifies probe and resume so they both may be scheduled in the same
way. This is necessary for error handling that may occur during device
initialization since the task to cleanup the device wouldn't be able to
run if it is blocked on device initialization.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Conflicts:
Keith Busch [Thu, 6 Aug 2015 21:43:00 +0000 (14:43 -0700)]
NVMe: Automatic namespace rescan
Namespaces may be dynamically allocated and deleted or attached and
detached. This has the driver rescan the device for namespace changes
after each device reset or namespace change asynchronous event.
There could potentially be many detached namespaces that we don't want
polluting /dev/ with unusable block handles, so this will delete disks
if the namespace is not active as indicated by the response from identify
namespace. This also skips adding the disk if no capacity is provisioned
to the namespace in the first place.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 6 Aug 2015 17:37:53 +0000 (10:37 -0700)]
NVMe: Fix device cleanup on initialization failure
Don't release block queue and tagging resoureces if the driver never
got them in the first place. This can happen if the controller fails to
become ready, if memory wasn't available to allocate a tagset or admin
queue, or if the resources were released as part of error recovery.
Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 4af0e21caf8e676e57ea27d7cce3426e473e498c)
This adds a sysfs entry that when written to will reset perform an NVMe
controller reset if the controller was successfully initialized in the
first place.
This also adds locking around resetting the device in the async probe
method so the driver can't schedule two resets.
Signed-off-by: Keith Busch <keith.busch@intel.com> Cc: Brandon Schultz <brandon.schulz@hgst.com> Cc: David Sariel <david.sariel@pmcs.com>
Updated by Jens to:
1) Merge this with the ioctl reset patch from David Sariel. The ioctl
path now shares the reset code from the sysfs path.
Keith Busch [Thu, 25 Jun 2015 15:32:38 +0000 (08:32 -0700)]
NVMe: Return busy status on suspended queue
The sync path previously scheduled itself out unconditionally, but
that shouldn't happen if the command wasn't posted to the submission
queue. This patch returns immediately when the suspended queue returns
busy status.
Linus Torvalds [Sat, 20 Jun 2015 20:54:22 +0000 (13:54 -0700)]
Merge branch 'drm-fixes' of git://people.freedesktop.org/~airlied/linux
Pull drm fixes from Dave Airlie:
"A smattering of fixes,
mgag200:
don't accept modes that aren't aligned properly as hw can't do it
i915:
two regression fixes
radeon:
one query to allow userspace fixes
one oops fixer for older hw with new options enabled"
* 'drm-fixes' of git://people.freedesktop.org/~airlied/linux:
drm/radeon: don't probe MST on hw we don't support it on
drm/radeon: Add RADEON_INFO_VA_UNMAP_WORKING query
drm/mgag200: Reject non-character-cell-aligned mode widths
Revert "drm/i915: Don't skip request retirement if the active list is empty"
drm/i915: Always reset vma->ggtt_view.pages cache on unbinding
Linus Torvalds [Fri, 19 Jun 2015 17:34:14 +0000 (07:34 -1000)]
Merge tag 'sound-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Nothing looks scary, just a few usual HD-audio regression fixes and
fixup, in addition to a minor Kconfig dependency fix for the old MIPS
drivers"
* tag 'sound-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Fix unused label skip_i915
ALSA: hda - Fix noisy outputs on Dell XPS13 (2015 model)
ALSA: mips: let SND_SGI_O2 select SND_PCM
ALSA: hda - Fix audio crackles on Dell Latitude E7x40
ALSA: hda - adding a DAC/pin preference map for a HP Envy TS machine
Boris Brezillon [Fri, 27 Mar 2015 22:53:15 +0000 (23:53 +0100)]
clk: at91: pll: fix input range validity check
The PLL impose a certain input range to work correctly, but it appears that
this input range does not apply on the input clock (or parent clock) but
on the input clock after it has passed the PLL divisor.
Fix the implementation accordingly.
Cc: <stable@vger.kernel.org> # v3.14+ Signed-off-by: Boris Brezillon <boris.brezillon@free-electrons.com> Reported-by: Jonas Andersson <jonas@microbit.se>
Linus Torvalds [Fri, 19 Jun 2015 03:02:27 +0000 (17:02 -1000)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c documentation fix from Wolfram Sang:
"Here is a small documentation fix for I2C.
We already had a user who unsuccessfully tried to get the new slave
framework running with the currently broken example. So, before this
happens again, I'd like to have this how-to-use section fixed for 4.1
already. So that no more hacking time is wasted"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: slave: fix the example how to instantiate from userspace
Dave Airlie [Fri, 19 Jun 2015 01:58:39 +0000 (11:58 +1000)]
Merge tag 'drm-intel-fixes-2015-06-18' of git://anongit.freedesktop.org/drm-intel into drm-fixes
one fix, one revert
* tag 'drm-intel-fixes-2015-06-18' of git://anongit.freedesktop.org/drm-intel:
Revert "drm/i915: Don't skip request retirement if the active list is empty"
drm/i915: Always reset vma->ggtt_view.pages cache on unbinding
Dave Airlie [Fri, 19 Jun 2015 01:55:29 +0000 (11:55 +1000)]
Merge branch 'drm-fixes-4.1' of git://people.freedesktop.org/~deathsimple/linux into drm-fixes
two radeon fixes
one MST fix,
one query addition, destined for stable, and to fix a regression
* 'drm-fixes-4.1' of git://people.freedesktop.org/~deathsimple/linux:
drm/radeon: don't probe MST on hw we don't support it on
drm/radeon: Add RADEON_INFO_VA_UNMAP_WORKING query
Linus Torvalds [Thu, 18 Jun 2015 06:56:57 +0000 (20:56 -1000)]
Merge tag 'trace-fix-filter-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing filter fix from Steven Rostedt:
"Vince Weaver reported a warning when he added perf event filters into
his fuzzer tests. There's a missing check of balanced operations when
parenthesis are used, and this triggers a WARN_ON() and when reading
the failure, the filter reports no failure occurred.
The operands were not being checked if they match, this adds that"
* tag 'trace-fix-filter-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have filter check for balanced ops
Since when we start discussions about the usage Media Controller for
complex hardware, one thing become clear: the way it is, MC fails to
map anything different than capture/output/m2m video-only streaming.
The point is that MC has entities named as devnodes, but the only
devnode used (before the DVB patches) is MEDIA_ENT_T_DEVNODE_V4L.
Due to the way MC got implemented, however, this entity actually
doesn't represent the devnode, but the hardware I/O engine that
receives data via DMA.
By coincidence, such DMA is associated with the V4L device node
on webcam hardware, but this is not true even for other V4L2
devices. For example, on USB hardware, the DMA is done via the
USB controller. The data passes though a in-kernel filter that
strips off the URB headers. Other V4L2 devices like radio may not
even have DMA. When it have, the DMA is done via ALSA, and not
via the V4L devnode.
In other words, MC is broken as a whole, but tagging it as BROKEN
right now would do more harm than good.
So, instead, let's mark, for now, the DVB part as broken and
block all new changes to MC while we fix this mess, whith
we hopefully will do for the next Kernel version.
Hugh Dickins [Sun, 14 Jun 2015 16:48:09 +0000 (09:48 -0700)]
mm: shmem_zero_setup skip security check and lockdep conflict with XFS
It appears that, at some point last year, XFS made directory handling
changes which bring it into lockdep conflict with shmem_zero_setup():
it is surprising that mmap() can clone an inode while holding mmap_sem,
but that has been so for many years.
Since those few lockdep traces that I've seen all implicated selinux,
I'm hoping that we can use the __shmem_file_setup(,,,S_PRIVATE) which
v3.13's commit c7277090927a ("security: shmem: implement kernel private
shmem inodes") introduced to avoid LSM checks on kernel-internal inodes:
the mmap("/dev/zero") cloned inode is indeed a kernel-internal detail.
This also covers the !CONFIG_SHMEM use of ramfs to support /dev/zero
(and MAP_SHARED|MAP_ANONYMOUS). I thought there were also drivers
which cloned inode in mmap(), but if so, I cannot locate them now.
Wolfram Sang [Mon, 15 Jun 2015 17:51:46 +0000 (19:51 +0200)]
i2c: slave: fix the example how to instantiate from userspace
I copied the wrong shell code into the documentation. Sorry to all who
tried to get sense out of this current example :/ Slight rewording while
we are here.
Reported-by: Tim Bakker <bakkert@mymail.vcu.edu> Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Cc: stable@kernel.org
Worse yet, reading the error message (the filter again) it says that
there was no error, when there clearly was. The issue is that the
code that checks the input does not check for balanced ops. That is,
having an op between a closed parenthesis and the next token.
This would only cause a warning, and fail out before doing any real
harm, but it should still not caues a warning, and the error reported
should work:
Takashi Iwai [Tue, 16 Jun 2015 10:23:36 +0000 (12:23 +0200)]
ALSA: hda - Fix unused label skip_i915
When CONFIG_SND_HDA_I915=n, we get a compile warning:
sound/pci/hda/hda_intel.c: In function ‘azx_probe_continue’:
sound/pci/hda/hda_intel.c:1882:2: warning: label ‘skip_i915’ defined but not used [-Wunused-label]
Fix it by putting again ifdef to it. Sigh.
Fixes: bf06848bdbe5 ('ALSA: hda - Continue probing even if i915 binding fails') Reported-by: Borislav Petkov <bp@suse.de> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Steve Cornelius [Mon, 15 Jun 2015 23:52:59 +0000 (16:52 -0700)]
crypto: caam - fix RNG buffer cache alignment
The hwrng output buffers (2) are cast inside of a a struct (caam_rng_ctx)
allocated in one DMA-tagged region. While the kernel's heap allocator
should place the overall struct on a cacheline aligned boundary, the 2
buffers contained within may not necessarily align. Consenquently, the ends
of unaligned buffers may not fully flush, and if so, stale data will be left
behind, resulting in small repeating patterns.
This fix aligns the buffers inside the struct.
Note that not all of the data inside caam_rng_ctx necessarily needs to be
DMA-tagged, only the buffers themselves require this. However, a fix would
incur the expense of error-handling bloat in the case of allocation failure.
Cc: stable@vger.kernel.org Signed-off-by: Steve Cornelius <steve.cornelius@freescale.com> Signed-off-by: Victoria Milhoan <vicki.milhoan@freescale.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Steve Cornelius [Mon, 15 Jun 2015 23:52:56 +0000 (16:52 -0700)]
crypto: caam - improve initalization for context state saves
Multiple function in asynchronous hashing use a saved-state block,
a.k.a. struct caam_hash_state, which holds a stash of information
between requests (init/update/final). Certain values in this state
block are loaded for processing using an inline-if, and when this
is done, the potential for uninitialized data can pose conflicts.
Therefore, this patch improves initialization of state data to
prevent false assignments using uninitialized data in the state block.
This patch addresses the following traceback, originating in
ahash_final_ctx(), although a problem like this could certainly
exhibit other symptoms:
Cc: stable@vger.kernel.org Signed-off-by: Steve Cornelius <steve.cornelius@freescale.com> Signed-off-by: Victoria Milhoan <vicki.milhoan@freescale.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Radim Krčmář [Fri, 5 Jun 2015 18:57:41 +0000 (20:57 +0200)]
KVM: x86: fix lapic.timer_mode on restore
lapic.timer_mode was not properly initialized after migration, which
broke few useful things, like login, by making every sleep eternal.
Fix this by calling apic_update_lvtt in kvm_apic_post_state_restore.
There are other slowpaths that update lvtt, so this patch makes sure
something similar doesn't happen again by calling apic_update_lvtt
after every modification.
Cc: stable@vger.kernel.org Fixes: f30ebc312ca9 ("KVM: x86: optimize some accesses to LVTT and SPIV") Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The new Dell XPS13 also requires the similar quirk for fixing the
noisy outputs. (But, as the codec was changed, now the fixup for
Latitude is used instead.)
Fixes: 0aedb1626566 ("drm/i915: Don't skip request retirement if the active list is empty") Cc: stable@vger.kernel.org Acked-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Takashi Iwai [Mon, 15 Jun 2015 09:59:32 +0000 (11:59 +0200)]
ALSA: hda - Fix audio crackles on Dell Latitude E7x40
We still got a report that the audio crackles and noises occur with
the recent 4.1 kernels on Dell machines. These machines seem to need
similar workarounds that have been applied to the recent Dell XPS 13
models. Since the codec of these machines (Dell Latitute E7240 and
E7440) is different from XPS 13's one, we need a new fixup entry.
Also, it was confirmed that the previous workaround to disable the
widget power-save (commit [219f47e4f964: ALSA: hda - Disable widget
power-saving for ALC292 & co]) is no longer needed after this fix.
So, this patch includes the partial revert of the commit, too.
Reported-and-tested-by: Mihai Donțu <mihai.dontu@gmail.com> Tested-by: Jonathan McDowell <noodles@earth.li> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Hui Wang [Mon, 15 Jun 2015 09:43:39 +0000 (17:43 +0800)]
ALSA: hda - adding a DAC/pin preference map for a HP Envy TS machine
On a HP Envy TouchSmart laptop, there are 2 speakers (main speaker
and subwoofer speaker), 1 headphone and 2 DACs, without this fixup,
the headphone will be assigned to a DAC and the 2 speakers will be
assigned to another DAC, this assignment makes the surround-2.1
channels invalid.
To fix it, here using a DAC/pin preference map to bind the main
speaker to 1 DAC and the subwoofer speaker will be assigned to another
DAC.
Cc: <stable@vger.kernel.org> Signed-off-by: Hui Wang <hui.wang@canonical.com> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Chris Wilson [Thu, 11 Jun 2015 07:06:08 +0000 (08:06 +0100)]
drm/i915: Always reset vma->ggtt_view.pages cache on unbinding
With the introduction of multiple views of an obj in the same vm, each
vma was taught to cache its copy of the pages (so that different views
could have different page arrangements). However, this missed decoupling
those vma->ggtt_view.pages when the vma released its reference on the
obj->pages. As we don't always free the vma, this leads to a possible
scenario (e.g. execbuffer interrupted by the shrinker) where the vma
points to a stale obj->pages, and explodes.
drm/i915: Infrastructure for supporting different GGTT views per object
Tvrtko says, if someone else will be confused how this can happen, key
is the reservation execbuffer path. That puts the VMA on the exec_list
which prevents i915_vma_unbind and i915_gem_vma_destroy from fully
destroying the VMA. So the VMA is left existing as an empty object in
the list - unbound and disassociated with the backing store. Kind of a
cached memory object. And then re-using it needs to clear the cached
pages pointer which is fixed above.
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1227892 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Michel Thierry <michel.thierry@intel.com> Cc: stable@vger.kernel.org Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
[Jani: Added Tvrtko's explanation to commit message.] Signed-off-by: Jani Nikula <jani.nikula@intel.com>
Linus Torvalds [Mon, 15 Jun 2015 01:38:57 +0000 (15:38 -1000)]
Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
Pull more MIPS fixes from Ralf Baechle:
"Another round of 4.1 MIPS fixes, one fix to a MIPS-specific #if
condition in lib/mpi, one fix to the MIPS GIC irqchip driver and one
SSB fix.
Details:
- fix handling of clock in chipco SSB driver.
- fix two MIPS-specific #if conditions to correctly work for GCC 5.1.
- fix damage to R6 pgtable bits done by XPA support.
- fix possible crash due to unloading modules that contain statically
defined platform devices.
- fix disabling of the MSA ASE on context switch to also work
correctly when a new thread/process has the CPU for the very first
time.
This is part of linux-next and has been beaten to death on
Imagination's test farm.
While things are not looking too grim this pull request also means the
rate of fixes for 4.1 remains nearly constant so I'd not be unhappy if
you'd delay the release"
* 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
MPI: MIPS: Fix compilation error with GCC 5.1
IRQCHIP: mips-gic: Don't nest calls to do_IRQ()
MIPS: MSA: bugfix - disable MSA correctly for new threads/processes.
MIPS: Loongson: Do not register 8250 platform device from module.
MIPS: Cobalt: Do not build MTD platform device registration code as module.
SSB: Fix handling of ssb_pmu_get_alp_clock()
MIPS: pgtable-bits: Fix XPA damage to R6 definitions.
Linus Torvalds [Mon, 15 Jun 2015 00:00:13 +0000 (14:00 -1000)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"A regression fix for a crash, and a Intel HSW uncore PMU driver fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization"
perf/x86/intel/uncore: Fix CBOX bit wide and UBOX reg on Haswell-EP
Linus Torvalds [Sun, 14 Jun 2015 23:55:24 +0000 (13:55 -1000)]
Merge tag 'sound-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Most of commits are regression fixes for HD-audio: a few corner case
fixes for regmap transition, and i915 binding issues.
In addition, a quirk for another USB-audio device supporting DSD"
* tag 'sound-4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: hda - Abort the probe without i915 binding for HSW/BDW
ALSA: hda - Re-add the lost fake mute support
ALSA: hda - Continue probing even if i915 binding fails
ALSA: hda - Don't actually write registers for caps overwrites
ALSA: hda - fix number of devices query on hotplug
ALSA: usb-audio: add native DSD support for JLsounds I2SoverUSB
Rabin Vincent [Fri, 12 Jun 2015 08:01:56 +0000 (10:01 +0200)]
IRQCHIP: mips-gic: Don't nest calls to do_IRQ()
The GIC chained handlers use do_IRQ() to call the subhandlers. This
means that irq_enter() calls get nested, which leads to preempt count
looking like we're in nested interrupts, which in turn leads to all
system time being accounted as IRQ time in account_system_time().
Fix it by using generic_handle_irq(). Since these same functions are
used in some systems (if cpu_has_veic) from a low-level vectored
interrupt handler which does not go throught do_IRQ(), we need to do it
conditionally.
Signed-off-by: Rabin Vincent <rabin.vincent@axis.com> Reviewed-by: Andrew Bresticker <abrestic@chromium.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mips@linux-mips.org Cc: tglx@linutronix.de Cc: jason@lakedaemon.net
Patchwork: https://patchwork.linux-mips.org/patch/10545/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
1) Fix uninitialized struct station_info in cfg80211_wireless_stats(),
from Johannes Berg.
2) Revert commit attempt to fix ipv6 protocol resubmission, it adds
regressions.
3) Endless loops can be created in bridge port lists, fix from Nikolay
Aleksandrov.
4) Don't WARN_ON() if sk->sk_forward_alloc is non-zero in
sk_clear_memalloc, it is a legal situation during swap deactivation.
Fix from Mel Gorman.
5) Fix order of disabling interrupts and unlocking NAPI in enic driver
to avoid a race. From Govindarajulu Varadarajan.
6) High and low register writes are swapped when programming the start
of periodic output in igb driver. From Richard Cochran.
7) Fix device rename handling in mpls stack, from Robert Shearman.
8) Do not trigger compaction synchronously when optimistically trying
to allocate an order 3 page in alloc_skb_with_frags() and
skb_page_frag_refill(). From Shaohua Li.
9) Authentication with COOKIE_ECHO is not handled properly in SCTP, fix
from Marcelo Ricardo Leitner.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net:
Doc: networking: Fix URL for wiki.wireshark.org in udplite.txt
sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHO
net: don't wait for order-3 page allocation
mpls: handle device renames for per-device sysctls
net: igb: fix the start time for periodic output signals
enic: fix memory leak in rq_clean
enic: check return value for stat dump
enic: unlock napi busy poll before unmasking intr
net, swap: Remove a warning and clarify why sk_mem_reclaim is required when deactivating swap
bridge: fix multicast router rlist endless loop
tipc: disconnect socket directly after probe failure
Revert "ipv6: Fix protocol resubmission"
cfg80211: wext: clear sinfo struct before calling driver
sctp: allow authenticating DATA chunks that are bundled with COOKIE_ECHO
Currently, we can ask to authenticate DATA chunks and we can send DATA
chunks on the same packet as COOKIE_ECHO, but if you try to combine
both, the DATA chunk will be sent unauthenticated and peer won't accept
it, leading to a communication failure.
This happens because even though the data was queued after it was
requested to authenticate DATA chunks, it was also queued before we
could know that remote peer can handle authenticating, so
sctp_auth_send_cid() returns false.
The fix is whenever we set up an active key, re-check send queue for
chunks that now should be authenticated. As a result, such packet will
now contain COOKIE_ECHO + AUTH + DATA chunks, in that order.
Reported-by: Liu Wei <weliu@redhat.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>