Chuck Anderson [Sun, 27 Nov 2016 22:04:53 +0000 (14:04 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
mm/hugetlb: hugetlb_no_page: rate-limit warning message
net/vxlan: Fix kernel unaligned access in __vxlan_find_mac
kexec: align crash_notes allocation to make it be inside one physical page
iommu-common: Fix error code used in iommu_tbl_range_{alloc,free}().
The warning message "killed due to inadequate hugepage pool" simply
indicates that SIGBUS was sent, not that the process was forcibly killed.
If the process has a signal handler installed does not fix the problem,
this message can rapidly spam the kernel log.
On my amd64 dev machine that does not have hugepages configured, I can
reproduce the repeated warnings easily by setting vm.nr_hugepages=2 (i.e.,
4 megabytes of huge pages) and running something that sets a signal
handler and forks, like
Backport of upstream commit 7177a3b037c7 ("net/vxlan: Fix kernel
unaligned access in __vxlan_find_mac")
__vxlan_find_mac invokes ether_addr_equal on the eth_addr field,
which triggers unaligned access messages, so rearrange vxlan_fdb
to avoid this in the most non-intrusive way.
Reviewed-by: Jiri Pirko <jiri@resnulli.us> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit aa51c7b55e350240c2ed5a8a217688d5bfd13424) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit 45cbd27626d17d485ca0718b28add12513690921) Signed-off-by: Allen Pais <allen.pais@oracle.com>
David S. Miller [Wed, 4 Nov 2015 19:30:57 +0000 (11:30 -0800)]
iommu-common: Fix error code used in iommu_tbl_range_{alloc,free}().
The value returned from iommu_tbl_range_alloc() (and the one passed
in as a fourth argument to iommu_tbl_range_free) is not a DMA address,
it is rather an index into the IOMMU page table.
Therefore using DMA_ERROR_CODE is not appropriate.
Use a more type matching error code define, IOMMU_ERROR_CODE, and
update all users of this interface.
Reported-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Chuck Anderson [Sun, 27 Nov 2016 00:51:51 +0000 (16:51 -0800)]
Merge branch topic/uek-4.1/dtrace of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/dtrace:
dtrace: eliminate need for arg counting in sdt macros
dtrace: augment SDT probes with type information
dtrace: import the sdt type information into per-sdt_probedesc state
dtrace: record SDT and perf probe types in a new ELF section
Chuck Anderson [Sun, 27 Nov 2016 00:50:36 +0000 (16:50 -0800)]
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/ofed:
mlx4: avoid multiple free on id_map_ent
xsigo: supported SGE's for LSO QP
xsigo: Hardening driver in handling remote QP failures
IB/cm: avoid query device in CM REQ/REP
IB/cm: return original rnr value when RNR WA for PSIF
IB/cm: MBIT needs to be used in network order
IB/core: Issue DREQ when receiving REQ/REP for stale QP
sif: cq: cleanup cqe once a kernel qp is destroyed/reset
sif: cq: sif_poll_cq might not drain cq completely
sif: rq: do not flush rq if it is an srq
sif: cq: use refcnt to disable/enable cq polling
sif: pt: Add support for single thread modified page tables
sif: pqp: Implement handling of PQPs in error.
sif: eps*: initialize each struct in array
sif: query_device: Return correct #SGEs for EoIB
sif: LSO not supported for EoIB queuepairs
sif: pqp: Make setup/teardown function ref sif_pqp_info directly
sif: Move the rest of the pqp setup and teardown to sif_pqp
sif: Move sif_dfs_register beyond base init
sif: Refactor PQP state out of sif_dev.
Chuck Anderson [Sun, 27 Nov 2016 00:49:48 +0000 (16:49 -0800)]
Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/drivers:
NVMe: reduce queue depth as workaround for Samsung EPIC SQ errata
bonding: "primary_reselect" with "failure" is not working properly
IB/core: uverbs: Do not wait for uverbs dev closure during forced system shutdown.
Chuck Anderson [Sun, 27 Nov 2016 00:48:32 +0000 (16:48 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
Bluetooth: Fix potential NULL dereference in RFCOMM bind callback
aacraid: Check size values after double-fetch from user
mm: migrate dirty page without clear_page_dirty_for_io etc
xen-netfront: cast grant table reference first to type int
xen-netfront: do not cast grant table reference to signed short
Chuck Anderson [Sun, 27 Nov 2016 00:47:09 +0000 (16:47 -0800)]
Merge branch topic/uek-4.1/stable-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/stable-cherry-picks: (21 commits)
ocfs2: fix not enough credit panic
ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
ocfs2/dlm: fix race between convert and migration
ocfs2: solve a problem of crossing the boundary in updating backups
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
ocfs2/dlm: do not insert a new mle when another process is already migrating
ocfs2: fix slot overwritten if storage link down during mount
ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
ocfs2/dlm: fix a race between purge and migration
ocfs2/dlm: clear migration_pending when migration target goes down
ocfs2: fix BUG when calculate new backup super
ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
ocfs2: fix race between mount and delete node/cluster
ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
ocfs2: avoid access invalid address when read o2dlm debug messages
ocfs2: fix a tiny case that inode can not removed
ocfs2: trusted xattr missing CAP_SYS_ADMIN check
ocfs2: set filesytem read-only when ocfs2_delete_entry failed.
...
Double free is found on id_map_ent.
Existing code makes another queue_delayed_work regardless if the worker was
successfully canceled or not. In case it's not, means the worker routine already
started to run, and then the later queued work will do the same work against
the same id_map_ent stucture. The worker routine actually does cleanup/free
on the structure, so the 2nd run of it is dengerous.
Fix is that we check if we successfully canceled previously queued work, if
that's not canceled, we don't queue the work again on the same structure.
Junxiao Bi [Tue, 1 Nov 2016 06:42:20 +0000 (14:42 +0800)]
ocfs2: fix not enough credit panic
The following panic was caught when run ocfs2 disconfig single test
(block size 512 and cluster size 8192). ocfs2_journal_dirty() return
-ENOSPC, that means credits were used up. The total credit should
include 3 times of "num_dx_leaves" from ocfs2_dx_dir_rebalance(),
because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and
1 time will be consumed in ocfs2_dx_dir_new_cluster()->
__ocfs2_dx_dir_new_cluster()->ocfs2_dx_dir_format_cluster(). But only
two times is included in ocfs2_dx_dir_rebalance_credits(), fix it.
Eric Ren [Fri, 30 Sep 2016 22:11:32 +0000 (15:11 -0700)]
ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally.
In this testcase, we create a 2*CLUSTER_SIZE file and mmap() on it;
there are 2 process repeatedly performing the following operations
respectively: one is doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a',
1), while the another is playing ftruncate(fd, 2*CLUSTER_SIZE) and then
ftruncate(fd, CLUSTER_SIZE) again and again.
In ocfs2_write_begin_nolock(), we first grab the pages and then allocate
disk space for this write; ocfs2_try_to_free_truncate_log() will be
called if -ENOSPC is returned; if we're lucky to get enough clusters,
which is usually the case, we start over again.
But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we
will deadlock when trying to grab the target page again.
Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write().
Another deadlock will happen in __do_page_mkwrite() if
ocfs2_page_mkwrite() returns non-VM_FAULT_LOCKED, and along with a
locked target page.
These two errors fail on the same path, so fix them by unlocking the
target page manually before ocfs2_free_write_ctxt().
Jan Kara helps me clear out the JBD2 part, and suggest the hint for root
cause.
Changes since v1:
1. Also put ENOMEM error case into consideration.
Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: He Gang <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c33f0785bf292cf1d15f4fbe42869c63e205b21c)
Joseph Qi [Mon, 19 Sep 2016 21:43:55 +0000 (14:43 -0700)]
ocfs2/dlm: fix race between convert and migration
Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks if lockres master has changed to identify whether new master has
finished recovery or not. This will introduce a race that right after
old master does umount ( means master will change), a new convert
request comes.
In this case, it will reset lockres state to DLM_RECOVERING and then
retry convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between
ocfs2 and dlm, and then finally BUG.
Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery. So fix it.
Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery") Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e6f0c6e6170fec175fe676495f29029aecdf486c)
jiangyiwen [Fri, 25 Mar 2016 21:21:35 +0000 (14:21 -0700)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a problem of crossing the boundary as
follows:
we assume that lun will be resized to 1TB(cluster_size is 32kb), it will
include 0~33554431 cluster, in update_backups func, it will backup super
block in location of 1TB which is the 33554432th cluster, so the
phenomenon of crossing the boundary happens.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 584dca3440732afa84fbca07567bb66e1453936a)
jiangyiwen [Tue, 15 Mar 2016 21:53:01 +0000 (14:53 -0700)]
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so it still exists a
deadlock scenario.
Joseph Qi [Thu, 14 Jan 2016 23:17:44 +0000 (15:17 -0800)]
ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
In ocfs2_orphan_del, currently it finds and deletes entry first, and
then access orphan dir dinode. This will have a problem once
ocfs2_journal_access_di fails. In this case, entry will be removed from
orphan dir, but in deed the inode hasn't been deleted successfully. In
other words, the file is missing but not actually deleted. So we should
access orphan dinode first like unlink and rename.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 074a6c655f6da12cb1123c8a84bfd8d781138800)
xuejiufei [Thu, 14 Jan 2016 23:17:41 +0000 (15:17 -0800)]
ocfs2/dlm: do not insert a new mle when another process is already migrating
When two processes are migrating the same lockres,
dlm_add_migration_mle() return -EEXIST, but insert a new mle in hash
list. dlm_migrate_lockres() will detach the old mle and free the new
one which is already in hash list, that will destroy the list.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32e493265b2be96404aaa478fb2913be29b06887)
jiangyiwen [Thu, 14 Jan 2016 23:17:33 +0000 (15:17 -0800)]
ocfs2: fix slot overwritten if storage link down during mount
The following case will lead to slot overwritten.
N1 N2
mount ocfs2 volume, find and
allocate slot 0, then set
osb->slot_num to 0, begin to
write slot info to disk
mount ocfs2 volume, wait for super lock
write block fail because of
storage link down, unlock
super lock
got super lock and also allocate slot 0
then unlock super lock
mount fail and then dismount,
since osb->slot_num is 0, try to
put invalid slot to disk. And it
will succeed if storage link
restores.
N2 slot info is now overwritten
Once another node say N3 mount, it will find and allocate slot 0 again,
which will lead to mount hung because journal has already been locked by
N2. so when write slot info failed, invalidate slot in advance to avoid
overwrite slot.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1247017f43a93eae3d64b7c25f3637dc545f5a47)
Xue jiufei [Thu, 14 Jan 2016 23:17:29 +0000 (15:17 -0800)]
ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
dlm_grab() may return NULL when the node is doing unmount. When doing
code review, we found that some dlm handlers may return error to caller
when dlm_grab() returns NULL and make caller BUG or other problems.
Here is an example:
Node 1 Node 2
receives migration message
from node 3, and send
migrate request to others
start unmounting
receives migrate request
from node 1 and call
dlm_migrate_request_handler()
unmount thread unregisters
domain handlers and removes
dlm_context from dlm_domains
dlm_migrate_request_handlers()
returns -EINVAL to node 1
Exit migration neither clearing the
migration state nor sending
assert master message to node 3 which
cause node 3 hung.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c372f2193a2e73d5936bf37259ae63ca388b4cbc)
jiangyiwen [Thu, 14 Jan 2016 23:17:23 +0000 (15:17 -0800)]
ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
refmap bit on lockres") still exists a race which can't ensure the
ordering is exactly correct.
Node1 Node2 Node3
umount, migrate
lockres to Node2
migrate finished,
send migrate request
to Node3
received migrate request,
create a migration_mle,
respond to Node2.
set DLM_LOCK_RES_SETREF_INPROG
and send assert master to
Node3
delete migration_mle in
assert_master_handler,
Node3 umount without response
dlm_thread purge
this lockres, send drop
deref message to Node2
found the flag of
DLM_LOCK_RES_SETREF_INPROG
is set, dispatch
dlm_deref_lockres_worker to
clear refmap, but in function of
dlm_deref_lockres_worker,
only if node in refmap it wait
DLM_LOCK_RES_SETREF_INPROG
to be cleared. So worker is
done successfully
purge lockres, send
assert master response
to Node1, and finish umount
set Node3 in refmap, and it
won't be cleared forever, thus
lead to umount hung
so wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
dlm_deref_lockres_worker.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b5560143385e18b4109ad6951c7719705e3dd995)
Xue jiufei [Thu, 14 Jan 2016 23:17:18 +0000 (15:17 -0800)]
ocfs2/dlm: fix a race between purge and migration
We found a race between purge and migration when doing code review.
Node A put lockres to purgelist before receiving the migrate message
from node B which is the master. Node A call dlm_mig_lockres_handler to
handle this message.
dlm_mig_lockres_handler
dlm_lookup_lockres
>>>>>> race window, dlm_run_purge_list may run and send
deref message to master, waiting the response
spin_lock(&res->spinlock);
res->state |= DLM_LOCK_RES_MIGRATING;
spin_unlock(&res->spinlock);
dlm_mig_lockres_handler returns
>>>>>> dlm_thread receives the response from master for the deref
message and triggers the BUG because the lockres has the state
DLM_LOCK_RES_MIGRATING with the following message:
xuejiufei [Tue, 29 Dec 2015 22:54:29 +0000 (14:54 -0800)]
ocfs2/dlm: clear migration_pending when migration target goes down
We have found a BUG on res->migration_pending when migrating lock
resources. The situation is as follows.
dlm_mark_lockres_migration
res->migration_pending = 1;
__dlm_lockres_reserve_ast
dlm_lockres_release_ast returns with res->migration_pending remains
because other threads reserve asts
wait dlm_migration_can_proceed returns 1
>>>>>>> o2hb found that target goes down and remove target
from domain_map
dlm_migration_can_proceed returns 1
dlm_mark_lockres_migrating returns -ESHOTDOWN with
res->migration_pending still remains.
When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending. So clear migration_pending when target is
down.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cc28d6d80f6ab494b10f0e2ec949eacd610f66e3)
Joseph Qi [Tue, 29 Dec 2015 22:54:06 +0000 (14:54 -0800)]
ocfs2: fix BUG when calculate new backup super
When resizing, it firstly extends the last gd. Once it should backup
super in the gd, it calculates new backup super and update the
corresponding value.
But it currently doesn't consider the situation that the backup super is
already done. And in this case, it still sets the bit in gd bitmap and
then decrease from bg_free_bits_count, which leads to a corrupted gd and
trigger the BUG in ocfs2_block_group_set_bits:
alex chen [Fri, 6 Nov 2015 02:44:10 +0000 (18:44 -0800)]
ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
In ocfs2_mknod_locked if '__ocfs2_mknod_locke d' returns an error, we
should reclaim the inode successfully claimed above, otherwise, the
inode never be reused. The case is described below:
ocfs2_mknod
ocfs2_mknod_locked
ocfs2_claim_new_inode
Successfully claim the inode
__ocfs2_mknod_locked
ocfs2_journal_access_di
Failed because of -ENOMEM or other reasons, the inode
lockres has not been initialized yet.
iput(inode)
ocfs2_evict_inode
ocfs2_delete_inode
ocfs2_inode_lock
ocfs2_inode_lock_full_nested
__ocfs2_cluster_lock
Return -EINVAL because of the inode
lockres has not been initialized.
So the following operations are not performed
ocfs2_wipe_inode
ocfs2_remove_inode
ocfs2_free_dinode
ocfs2_free_suballoc_bits
Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b1529a41f777a48f95d4af29668b70ffe3360e1b)
Joseph Qi [Fri, 6 Nov 2015 02:44:07 +0000 (18:44 -0800)]
ocfs2: fix race between mount and delete node/cluster
There is a race case between mount and delete node/cluster, which will
lead o2hb_thread to malfunctioning dead loop.
o2hb_thread
{
o2nm_depend_this_node();
<<<<<< race window, node may have already been deleted, and then
enter the loop, o2hb thread will be malfunctioning
because of no configured nodes found.
while (!kthread_should_stop() &&
!reg->hr_unclean_stop && !reg->hr_aborted_start) {
}
So check the return value of o2nm_depend_this_node() is needed. If node
has been deleted, do not enter the loop and let mount fail.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0986fe9b50f425ec81f25a1a85aaf3574b31d801)
Joseph Qi [Thu, 22 Oct 2015 20:32:29 +0000 (13:32 -0700)]
ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
dlm_lockres_put will call dlm_lockres_release if it is the last
reference, and then it may call dlm_print_one_lock_resource and
take lockres spinlock.
So unlock lockres spinlock before dlm_lockres_put to avoid deadlock.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b67de018b37a97548645a879c627d4188518e907)
ocfs2: avoid access invalid address when read o2dlm debug messages
The following case will lead to a lockres is freed but is still in use.
cat /sys/kernel/debug/o2dlm/locking_state dlm_thread
lockres_seq_start
-> lock dlm->track_lock
-> get resA
resA->refs decrease to 0,
call dlm_lockres_release,
and wait for "cat" unlock.
Although resA->refs is already set to 0,
increase resA->refs, and then unlock
lock dlm->track_lock
-> list_del_init()
-> unlock
-> free resA
In such a race case, invalid address access may occurs. So we should
delete list res->tracking before resA->refs decrease to 0.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f57a22ddecd6f26040a67e2c12880f98f88b6e00)
When running dirop_fileop_racer we found a case that inode
can not removed.
Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
two dirs /race/1/ and /race/2/ in the filesystem.
Node A Node B
rm -r /race/2/
mv /race/1/ /race/2/
call ocfs2_unlink(), get
the EX mode of /race/2/
wait for B unlock /race/2/
decrease i_nlink of /race/2/ to 0,
and add inode of /race/2/ into
orphan dir, unlock /race/2/
got EX mode of /race/2/. because
/race/1/ is dir, so inc i_nlink
of /race/2/ and update into disk,
unlock /race/2/
because i_nlink of /race/2/
is not zero, this inode will
always remain in orphan dir
This patch fixes this case by test whether i_nlink of new dir is zero.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 928dda1f9433f024ac48c3d97ae683bf83dd0e42)
The trusted extended attributes are only visible to the process which
hvae CAP_SYS_ADMIN capability but the check is missing in ocfs2
xattr_handler trusted list. The check is important because this will be
used for implementing mechanisms in the userspace for which other
ordinary processes should not have access to.
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Taesoo kim <taesoo@gatech.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f5e7b41f91814447defc34e915fc5d6e52266d9)
Xue jiufei [Wed, 24 Jun 2015 23:55:20 +0000 (16:55 -0700)]
ocfs2: fix NULL pointer dereference in function ocfs2_abort_trigger()
ocfs2_abort_trigger() use bh->b_assoc_map to get sb. But there's no
function to set bh->b_assoc_map in ocfs2, it will trigger NULL pointer
dereference while calling this function. We can get sb from
bh->b_bdev->bd_super instead of b_assoc_map.
[akpm@linux-foundation.org: update comment, per Joseph] Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 74e364ad1b13fd518a0bd4e5aec56d5e8706152f)
Handle scenario's where PSIF generates batched
transmit completions.
In case of Remote QP disconnect follow this sequence
-> If there are any pending transmit completion
explicitly transition QP state to ERROR.
-> Wait for a maximum of 10 seconds for all pending
completions
(10 seconds derived from retrycount * local ack timeout)
Destroy QP
Wait for another 10 seconds(max) if completions are
not returned by hardware.
Synchronized calls to poll_tx
Added more efficiency in handling of TX queue full condtion.
Handle scenario where uVNIC removal can come in batches
In aacraid's ioctl_send_fib() we do two fetches from userspace, one the
get the fib header's size and one for the fib itself. Later we use the
size field from the second fetch to further process the fib. If for some
reason the size from the second fetch is different than from the first
fix, we may encounter an out-of- bounds access in aac_fib_send(). We
also check the sender size to insure it is not out of bounds. This was
reported in https://bugzilla.kernel.org/show_bug.cgi?id=116751 and was
assigned CVE-2016-6480.
Reported-by: Pengfei Wang <wpengfeinudt@gmail.com> Fixes: 7c00ffa31 '[SCSI] 2.6 aacraid: Variable FIB size (updated patch)' Cc: stable@vger.kernel.org Signed-off-by: Dave Carroll <david.carroll@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit fa00c437eef8dc2e7b25f8cd868cfa405fcc2bb3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Oracle discovered that the NVMe driver gets SQ completion errors eventually
leading to the device being reset, taken out of the PCI bus tree or kernel
panics when using the default SQ size of 1024 entries (64KB) for Samsung
EPIC NVMe SSDs.
PCIe analyzer tracing by Oracle and Samsung revealed an errata in Samsung's
firmware for EPIC SSDs where these invalid completion entries can occur
when the queues straddle an 8MB DMA address boundary.
This patch works around the errata by detecting these specific devices and
limiting their descriptor queue depth to 64. This is only for the Samsung
NVMe controllers used in Oracle X-series servers.
There was no noticeable performance impact of reducing queue depths to 64
for these Samsung drives, Oracle X6-2 server, and Oracle VM Server 3.4.2.
Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Signed-off-by: Bhavesh Davda <bhavesh.davda@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Waiman Long [Tue, 6 Sep 2016 17:22:10 +0000 (13:22 -0400)]
x86/hpet: Reduce HPET counter read contention
On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
1) There is a single HPET counter shared by all the CPUs.
2) HPET counter reading is a very slow operation.
Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. Something
the performance slowdown can be so severe that the system may crash
because of a NMI watchdog soft lockup, for example.
During the TSC clock calibration process, the default clock source
will be set temporarily to HPET. For systems with many CPUs, it is
possible that NMI watchdog soft lockup may occur occasionally during
that short time period where HPET clocking is active as is shown in
the kernel log below:
This patch addresses the above issues by reducing HPET read contention
using the fact that if more than one CPUs are trying to access HPET at
the same time, it will be more efficient when only one CPU in the group
reads the HPET counter and shares it with the rest of the group instead
of each group member trying to read the HPET counter individually.
This is done by using a combination quadword that contains a 32-bit
stored HPET value and a 32-bit spinlock. The CPU that gets the lock
will be responsible for reading the HPET counter and storing it in
the quadword. The others will monitor the change in HPET value and
lock status and grab the latest stored HPET value accordingly. This
change is only enabled on 64-bit SMP configuration.
On a 4-socket Haswell-EX box with 144 threads (HT on), running the
AIM7 compute workload (1500 users) on a 4.8-rc1 kernel (HZ=1000)
with and without the patch has the following performance numbers
(with HPET or TSC as clock source):
The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 11.19% without patch to 1.24% with patch.
[ tglx: It's really sad that we need to have such hacks just to deal with
the fact that cpu vendors have not managed to fix the TSC wreckage
within 15+ years. Were They Forgetting? ]
Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Tested-by: Prarit Bhargava <prarit@redhat.com> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: Randy Wright <rwright@hpe.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1473182530-29175-1-git-send-email-Waiman.Long@hpe.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit f99fd22e4d4bc84880a8a3117311bbf0e3a6a9dc)
When "primary_reselect" is set to "failure", primary interface should
not become active until current active slave is down. But if we set first
member of bond device as a "primary" interface and "primary_reselect"
is set to "failure" then whenever primary interface's link gets back(up)
it becomes active slave even if current active slave is still up.
With this patch, "bond_find_best_slave" will not traverse members if
primary interface is not candidate for failover/reselection and current
active slave is still up.
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Mazhar Rana <mazhar.rana@cyberoam.com> Signed-off-by: Jay Vosburgh <j.vosburgh@gmail.com> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.
No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.
It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).
Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same. Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).
But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 42cb14b110a5698ccf26ce59c4441722605a3743) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/migrate.c
Dongli Zhang [Thu, 17 Nov 2016 05:55:27 +0000 (13:55 +0800)]
xen-netfront: cast grant table reference first to type int
IS_ERR_VALUE() in commit 87557efc27f6a50140fb20df06a917f368ce3c66
("xen-netfront: do not cast grant table reference to signed short") would
not return true for error code unless we cast ref first to type int.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Oracle-Bug: 25138361
upstream commit: 269ebce4531b8edc4224259a02143181a1c1d77c Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed by: Jack F. Vogel <jack.vogel@oracle.com> Acked-by: Joe Jin <joe.jin@oracle.com>
Dongli Zhang [Thu, 17 Nov 2016 05:54:19 +0000 (13:54 +0800)]
xen-netfront: do not cast grant table reference to signed short
While grant reference is of type uint32_t, xen-netfront erroneously casts
it to signed short in BUG_ON().
This would lead to the xen domU panic during boot-up or migration when it
is attached with lots of paravirtual devices.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Oracle-Bug: 25138362
upstream commit: 87557efc27f6a50140fb20df06a917f368ce3c66 Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed by: Jack F. Vogel <jack.vogel@oracle.com> Acked-by: Joe Jin <joe.jin@oracle.com>
Rama Nichanamatlu [Thu, 10 Nov 2016 08:30:07 +0000 (00:30 -0800)]
IB/core: uverbs: Do not wait for uverbs dev closure during forced system shutdown.
Waiting for uverbs dev file closure in the system forced shutdown [e.g
reboot -f] path locks up the reboot process forever preventing system
shutdown as application closure of this device file is very unlikely at
this time. This closure happens in orderly shutdown [e.g reboot] as
processes are sent SIGTERM.
Linn Crosetto [Wed, 16 Nov 2016 20:33:52 +0000 (12:33 -0800)]
acpi: Disable ACPI table override if securelevel is set
From the kernel documentation (initrd_table_override.txt):
If the ACPI_INITRD_TABLE_OVERRIDE compile option is true, it is possible
to override nearly any ACPI table provided by the BIOS with an
instrumented, modified one.
When securelevel is set, the kernel should disallow any unauthenticated
changes to kernel space. ACPI tables contain code invoked by the kernel, so
do not allow ACPI tables to be overridden if securelevel is set.
Signed-off-by: Linn Crosetto <linn@hpe.com>
Orabug: 25058372
CVE: CVE-2016-3699 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: Guru Anbalagane <guru.anbalagane@oracle.com>
Francisco Triviño [Mon, 14 Nov 2016 16:46:14 +0000 (08:46 -0800)]
IB/cm: avoid query device in CM REQ/REP
The query device needed in CM REQ/REP is a bit expensive since
it involves a MAD query and also it is not saved in the cache.
When the driver that holds the local device is different from
sif then there is no need to go through the query device. If
sif driver is identified, then we still need to go through the
query device in order to get the specific vendor id. This last
is to make sure the software workaround is applied only to the
PSIF revisions that are affected.
This patch filters those cases and avoids unnecessary MAD queries
when driver is different from sif.
Francisco Triviño [Mon, 14 Nov 2016 16:46:14 +0000 (08:46 -0800)]
IB/cm: return original rnr value when RNR WA for PSIF
With this patch, the original min_rnr_value set by the user is saved
in case it is later queried. The ib_qp flag has been re-purposed to
store the value in addition. This patch makes the RNR WA implementation
total transparent for the user.
Hans Westgaard Ry [Tue, 4 Oct 2016 12:09:17 +0000 (14:09 +0200)]
IB/core: Issue DREQ when receiving REQ/REP for stale QP
from "InfiBand Architecture Specifications Volume 1":
A QP is said to have a stale connection when only one side has
connection information. A stale connection may result if the remote CM
had dropped the connection and sent a DREQ but the DREQ was never
received by the local CM. Alternatively the remote CM may have lost
all record of past connections because its node crashed and rebooted,
while the local CM did not become aware of the remote node's reboot
and therefore did not clean up stale connections.
and:
A local CM may receive a REQ/REP for a stale connection. It shall
abort the connection issuing REJ to the REQ/REP. It shall then issue
DREQ with "DREQ:remote QPNâ\80\9d set to the remote QPN from the REQ/REP.
This patch solves a problem with reuse of QPN. Current codebase, that
is IPoIB, relies on a REAP-mechanism to do cleanup of the structures
in CM. A problem with this is the timeconstants governing this
mechanism; they are up to 768 seconds and the interface may look
inresponsive in that period. Issuing a DREQ (and receiving a DREP)
does the necessary cleanup and the interface comes up.
Wei Lin Guay [Mon, 31 Oct 2016 18:06:28 +0000 (19:06 +0100)]
sif: cq: cleanup cqe once a kernel qp is destroyed/reset
To ease the sqflush/rqflush workaround, sifdrv removes
all the associated cqes once a kernel qp is destroy/reset.
Even though IB specification 10.2.4.4 mentioned that
"Destroying a QP does not guarantee that CQEs of that
QP are deallocated from the CQ upon destruction.",
it also stated that "Even if the CQEs are already on
the CQ, it might not be possible to retrieve them"
Thus, IB spec is indicating that it is vendor specific
implementation and the ULP should not assume that the
cqes are in the cq once a qp is destroyed/reset.
Wei Lin Guay [Thu, 3 Nov 2016 09:46:22 +0000 (10:46 +0100)]
sif: cq: sif_poll_cq might not drain cq completely
In a scenario where a duplicate completion is detected, the sif_poll_cq
skips the remaining cqes in the cq. This code bug is introduced in
commit "sif: sqflush: Handle duplicate completions in poll_cq".
Wei Lin Guay [Mon, 31 Oct 2016 20:25:26 +0000 (21:25 +0100)]
sif: rq: do not flush rq if it is an srq
sifdrv needs to flush a regular rq (non-srq) once it
detects that a qp is transitioned into ERR state.
Nevertheless, if a qp is created with srq and
with no event handler, the srq might be accidentally
flushed once the qp is transitioned into ERR sate.
Wei Lin Guay [Mon, 31 Oct 2016 19:38:25 +0000 (20:38 +0100)]
sif: cq: use refcnt to disable/enable cq polling
Due to a hardware bug, sifdrv must clean up the cq
before it is being polled by the user. Thus, sifdrv
uses CQ_POLLING_NOT_ALLOWED bitmask to disable/enable
cq polling.
Nevertheless, the bit mask operation is not sufficient in
a shared cq scenario (many qps to one cq). The cq clean up
is performed by each qp and it might be performed concurrently.
As a result, the cq polling might be enabled before all qps
have clean up the cq.
Thus, this patch uses refcnt to disable/enable cq polling.
The CQ_POLLING_NOT_ALLOWED bitmask is kept in order to have
backward compatibility in the user library.
Knut Omang [Fri, 14 Oct 2016 11:27:26 +0000 (13:27 +0200)]
sif: pt: Add support for single thread modified page tables
Modifications to the page tables in sif_pt is protected by
a lock to allow multiple threads to add and subtract regions
to/from the page table in parallel. This functionality is
currently only needed/used by the special sq_cmpl page table
handling. In the future we might however need this also for
other cases, for instance to optimize further on page table
memory usage.
The kernel documentation for infiniband midlayer locking
requires that map_phys_fmr should be callable from any context.
This prevents us from blocking on a lock, something that happens
if there are contention for the lock (eg. more than one thread
involved in modifying the page table)
Implement another flag: thread_safe in a pt that determines
if a page table is going to need to be modified from multiple
threads simultaneously. For now keep a BUG_ON if the code
is attempted accessed in parallel for memory types
that should not ever see parallel access.
Knut Omang [Wed, 26 Oct 2016 09:44:00 +0000 (11:44 +0200)]
sif: pqp: Implement handling of PQPs in error.
The assumption is that any such situation that can arise in
production is due to an application that causes it's CQ to go
to error and where the PQP subsequently tries to post a CQ
operation that affects the CQ that is in error. In these cases,
the PQP itself goes to error and an event is generated.
This commit refactors the modify_qp logic slightly, as well as
implementing a modification cycle to bring a privileged QP
back up again. It also adds a new pqp debugfs file and some statistics
to help monitoring the new PQP specific state as well.
The resurrect operation is queued on the sif workqueue
by the new handle_pqp_event function, which is now properly
wired up to accept all PQP events. When a PQP is detected as
being in error, its last_set_state is updated, and in addition
the write_only flag is set, which causes new send reqs not to
touch any collect buffer as part of the operation.
This flag was introduced to allow the resurrect to set the PQP in
RTS again while still not triggering any sends.
This way the implementation allows clients to continue to
post requests to the PQP while it is in error or in transition
back to RTS again by just accepting these requests into the PQP
send queue without any writes to the collect buffer.
When in the INIT state, the resurrect worker updates
the SQ pointers to skip the request that triggered
the PQP error.
Once back in RTS, the resurrect worker can take the single SQ lock
which serializes posts, check the size of the send queue
and if >= 0, trigger the send queue scheduler to start processing these.
Once the QP is in SQS mode, or just idle if the queue was empty,
it is safe for ordering purposes to let normal posting with
collect buffer writes commence.
Hakon Bugge [Tue, 1 Nov 2016 13:16:30 +0000 (14:16 +0100)]
sif: query_device: Return correct #SGEs for EoIB
ULPs that use LSO, can only create QPs with #SGE entries being one
less than what is supported in HW. This because PSIF uses one entry
for the LSO stencil. The driver attempts to detect the ULP and from
that derive if LSO will be used.
This commit adds support for XVE (Xsigo Virtual Ethernet), and a
query_device() from the XVE driver (not the connected mode part), will
return one less #SGEs.
Further, since the XVE ULP is detected, we amend the sysfs listing of
QPs with EoIB (for XVE datagram mode QPs) and EoIB_CM (for XVE
connected mode QPs).
Knut Omang [Sun, 23 Oct 2016 16:22:35 +0000 (18:22 +0200)]
sif: pqp: Make setup/teardown function ref sif_pqp_info directly
Take advantage of the new, cleaner and more separate PQP data structures:
Simplify/abstract pqp setup/teardown by pointing into sdev
for a direct ref to the sif_pqp_info data structure.
Knut Omang [Sun, 23 Oct 2016 15:35:14 +0000 (17:35 +0200)]
sif: Move the rest of the pqp setup and teardown to sif_pqp
This finishes the restructure started in the previous commit,
to consolidate PQP handling logic to make it simpler
to extend without spreading the complexity to other parts
of the code.
Knut Omang [Fri, 28 Oct 2016 03:29:28 +0000 (05:29 +0200)]
sif: Move sif_dfs_register beyond base init
The debugfs setup must be initialized prior to PQP operation,
but must also be deinitialized before base table takedown,
otherwise we are exposed to faults due to a race condition between
a user accessing debugfs tables and driver unload.
This commit moves the dfs init/deinit from sif_probe to
sif_hw_init to achieve this order.
Knut Omang [Sun, 23 Oct 2016 07:31:26 +0000 (09:31 +0200)]
sif: Refactor PQP state out of sif_dev.
Prepare handling of PQPs for the additional complexity
of resurrect upon errors and query for undetected
error conditions by moving some of the state
now directly in sif_dev into a new sif_pqp_info
struct.
Already complex logic for PQP handling is going to
increase in complexity. We need to consolidate
to be able to keep clean and easily maintainable
interfaces.
Chuck Anderson [Thu, 10 Nov 2016 14:27:04 +0000 (06:27 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
ecryptfs: don't allow mmap when the lower fs doesn't support it
Revert "ecryptfs: forbid opening files without mmap handler"
Jeff Mahoney [Tue, 5 Jul 2016 21:32:30 +0000 (17:32 -0400)]
ecryptfs: don't allow mmap when the lower fs doesn't support it
There are legitimate reasons to disallow mmap on certain files, notably
in sysfs or procfs. We shouldn't emulate mmap support on file systems
that don't offer support natively.
Chuck Anderson [Wed, 9 Nov 2016 22:19:53 +0000 (14:19 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
percpu: fix synchronization between synchronous map extension and chunk destruction
percpu: fix synchronization between chunk->map_extend_work and chunk destruction
ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt
ALSA: timer: Fix leak in events via snd_timer_user_ccallback
ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS
Tejun Heo [Wed, 25 May 2016 15:48:25 +0000 (11:48 -0400)]
percpu: fix synchronization between synchronous map extension and chunk destruction
For non-atomic allocations, pcpu_alloc() can try to extend the area
map synchronously after dropping pcpu_lock; however, the extension
wasn't synchronized against chunk destruction and the chunk might get
freed while extension is in progress.
This patch fixes the bug by putting most of non-atomic allocations
under pcpu_alloc_mutex to synchronize against pcpu_balance_work which
is responsible for async chunk management including destruction.
Tejun Heo [Wed, 25 May 2016 15:48:25 +0000 (11:48 -0400)]
percpu: fix synchronization between chunk->map_extend_work and chunk destruction
Atomic allocations can trigger async map extensions which is serviced
by chunk->map_extend_work. pcpu_balance_work which is responsible for
destroying idle chunks wasn't synchronizing properly against
chunk->map_extend_work and may end up freeing the chunk while the work
item is still in flight.
This patch fixes the bug by rolling async map extension operations
into pcpu_balance_work.
Kangjie Lu [Tue, 3 May 2016 20:44:32 +0000 (16:44 -0400)]
ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt
The stack object “r1” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059885
CVE: CVE-2016-4578
Mainline v4.7 commit e4ec8cc8039a7063e24204299b462bd1383184a5 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Kangjie Lu [Tue, 3 May 2016 20:44:20 +0000 (16:44 -0400)]
ALSA: timer: Fix leak in events via snd_timer_user_ccallback
The stack object “r1” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059885
CVE: CVE-2016-4578
Mainline v4.7 commit 9a47e9cff994f37f7f0dbd9ae23740d0f64f9fe6 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Kangjie Lu [Tue, 3 May 2016 20:44:07 +0000 (16:44 -0400)]
ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS
The stack object “tread” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059408
CVE: CVE-2016-4569
Mainline v4.7 commit cec8f96e49d9be372fdb0c3836dcf31ec71e457e Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Chuck Anderson [Fri, 4 Nov 2016 12:33:13 +0000 (05:33 -0700)]
uek-rpm ol7: change uek-rpm/ol7/update-el release value from 7.1 to 7.3
Change release value in uek-rpm/ol7/update-el to 7.3 so that manual builds
will pick up the new OL7.3 secure boot key.
uek-rpm/ol6/update-el is not affected.
Chuck Anderson [Thu, 3 Nov 2016 17:42:10 +0000 (10:42 -0700)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks: (23 commits)
NFS: Fix an LOCK/OPEN race when unlinking an open file
intel_idle: correct BXT support
intel_idle: re-work bxt_idle_state_table_update() and its helper
x86/intel_idle: Use Intel family macros for intel_idle
x86/cpu/intel: Introduce macros for Intel family numbers
intel_idle: add BXT support
intel_idle: Add KBL support
intel_idle: Add SKX support
intel_idle: Clean up all registered devices on exit.
intel_idle: Propagate hot plug errors.
intel_idle: Don't overreact to a cpuidle registration failure.
intel_idle: Setup the timer broadcast only on successful driver load.
intel_idle: Avoid a double free of the per-CPU data.
intel_idle: Fix dangling registration on error path.
intel_idle: Fix deallocation order on the driver exit path.
intel_idle: Remove redundant initialization calls.
intel_idle: Fix a helper function's return value.
intel_idle: remove useless return from void function.
intel_idle: Support for Intel Xeon Phi Processor x200 Product Family
intel_idle: prevent SKL-H boot failure when C8+C9+C10 enabled
...
Sometime uVNIC removal on OFOS won't trigger a actual removal
of Vstar interface, in that case uVNIC driver has to send NACK
code so that XCM will start cleaning its database.
When path->users becomes zero uVNIC driver starts
cleaning up the Forwarding table entries.
In some corner cases the call is invoked from transmit
function which is in interrupt context and that results
in a hard LOCKUP.
With new changes path->users is decremented in transmit
function to allow cleanup to happen from other thread.
Proper care is taken to avoid race between these
two contexts.
At Connectathon 2016, we found that recent upstream Linux clients
would occasionally send a LOCK operation with a zero stateid. This
appeared to happen in close proximity to another thread returning
a delegation before unlinking the same file while it remained open.
Earlier, the client received a write delegation on this file and
returned the open stateid. Now, as it is getting ready to unlink the
file, it returns the write delegation. But there is still an open
file descriptor on that file, so the client must OPEN the file
again before it returns the delegation.
Since commit 24311f884189 ('NFSv4: Recovery of recalled read
delegations is broken'), nfs_open_delegation_recall() clears the
NFS_DELEGATED_STATE flag _before_ it sends the OPEN. This allows a
racing LOCK on the same inode to be put on the wire before the OPEN
operation has returned a valid open stateid.
To eliminate this race, serialize delegation return with the
acquisition of a file lock on the same file. Adopt the same approach
as is used in the unlock path.
This patch also eliminates a similar race seen when sending a LOCK
operation at the same time as returning a delegation on the same file.
Fixes: 24311f884189 ('NFSv4: Recovery of recalled read ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Anna: Add sentence about LOCK / delegation race] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
(cherry picked from commit 11476e9dec39d90fe1e9bf12abc6f3efe35a073d) Signed-off-by: Todd Vierling <todd.vierling@oracle.com>
Commit 5dcef69486 ("intel_idle: add BXT support") added an 8-element
lookup array with just a 2-bit value used for lookups. As per the SDM
that bit field is really 3 bits wide. While this is supposedly benign
here, future re-use of the code for other CPUs might expose the issue.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit bef450962597ff39a7f9d53a30523aae9eb55843) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Since irtl_ns_units[] has itself zero entries, make sure the caller
recognized those cases along with the MSR read returning zero, as zero
is not a valid value for exit_latency and target_residency.
Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 3451ab3ebf92b12801878d8b5c94845afd4219f0) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Use the new INTEL_FAM6_* macros for intel_idle.c. Also fix up
some of the macros to be consistent with how some of the
intel_idle code refers to the model.
There's on oddity here: model 0x1F is uniquely referred to here
and nowhere else that I could find. 0x1E/0x1F are just spelled
out as "Intel Core i7 and i5 Processors" in the SDM or as "Intel
processors based on the Nehalem, Westmere microarchitectures" in
the RDPMC section. Comments between tables 19-19 and 19-20 in
the SDM seem to point to 0x1F being some kind of Westmere, so
let's call it "WESTMERE2".
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: jacob.jun.pan@intel.com Cc: linux-pm@vger.kernel.org Link: http://lkml.kernel.org/r/20160603001932.EE978EB9@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit db73c5a8c80decbb6ddf208e58f3865b4df5384d) Signed-off-by: Brian Maly <brian.maly@oracle.com>
We have a boatload of open-coded family-6 model numbers. Half of
them have these model numbers in hex and the other half in
decimal. This makes grepping for them tons of fun, if you were
to try.
Solution:
Consolidate all the magic numbers. Put all the definitions in
one header.
The names here are closely derived from the comments describing
the models from arch/x86/events/intel/core.c. We could easily
make them shorter by doing things like s/SANDYBRIDGE/SNB/, but
they seemed fine even with the longer versions to me.
Do not take any of these names too literally, like "DESKTOP"
or "MOBILE". These are all colloquial names and not precise
descriptions of everywhere a given model will show up.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Darren Hart <dvhart@infradead.org> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Doug Thompson <dougthompson@xmission.com> Cc: Eduardo Valentin <edubezval@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jacob Pan <jacob.jun.pan@linux.intel.com> Cc: Kan Liang <kan.liang@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Rajneesh Bhardwaj <rajneesh.bhardwaj@intel.com> Cc: Souvik Kumar Chakravarty <souvik.k.chakravarty@intel.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Ulf Hansson <ulf.hansson@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: Vishwanath Somayaji <vishwanath.somayaji@intel.com> Cc: Zhang Rui <rui.zhang@intel.com> Cc: jacob.jun.pan@intel.com Cc: linux-acpi@vger.kernel.org Cc: linux-edac@vger.kernel.org Cc: linux-mmc@vger.kernel.org Cc: linux-pm@vger.kernel.org Cc: platform-driver-x86@vger.kernel.org Link: http://lkml.kernel.org/r/20160603001927.F2A7D828@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 970442c599b22ccd644ebfe94d1d303bf6f87c05) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Broxton has all the HSW C-states, except C3.
BXT C-state timing is slightly different.
Here we trust the IRTL MSRs as authority
on maximum C-state latency, and override the driver's tables
with the values found in the associated IRTL MSRs.
Further we set the target_residency to 1x maximum latency,
trusting the hardware demotion logic.
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 5dcef694860100fd16885f052591b1268b764d21) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
arch/x86/include/asm/msr-index.h
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 3ce093d4de753d6c92cc09366e29d0618a62f542) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit f9e71657c2c0a8f1c50884ab45794be2854e158e) Signed-off-by: Brian Maly <brian.maly@oracle.com>
This driver registers cpuidle devices when a CPU comes online, but it
leaves the registrations in place when a CPU goes offline. The module
exit code only unregisters the currently online CPUs, leaving the
devices for offline CPUs dangling.
This patch changes the driver to clean up all registrations on exit,
even those from CPUs that are offline.
Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 3e66a9ab53641a0f7a440e56f7b35bf5d77494b3) Signed-off-by: Brian Maly <brian.maly@oracle.com>
If a cpuidle registration error occurs during the hot plug notifier
callback, we should really inform the hot plug machinery instead of
just ignoring the error. This patch changes the callback to properly
return on error.
Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 08820546e4c30c84d0a1f1a49df055e1719c07ea) Signed-off-by: Brian Maly <brian.maly@oracle.com>
The helper function, intel_idle_cpu_init, registers one new device
with the cpuidle layer. If the registration should fail, that
function immediately calls intel_idle_cpuidle_devices_uninit() to
unregister every last CPU's device. However, it makes no sense to do
so, when called from the hot plug notifier callback.
This patch moves the call to intel_idle_cpuidle_devices_uninit()
outside of the helper function to the one call site that actually
needs to perform the de-registrations.
Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit b69ef2c099c3e5f11bd5c33a9530d6522f72c9aa) Signed-off-by: Brian Maly <brian.maly@oracle.com>
This driver sets the broadcast tick quite early on during probe and does
not clean up again in cast of failure. This patch moves the setup call
after the registration, placing the on_each_cpu() calls within the global
CPU lock region.
Signed-off-by: Richard Cochran <rcochran@linutronix.de> Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 2259a819a8d37e472f08c88bc0dd22194754adb4) Signed-off-by: Brian Maly <brian.maly@oracle.com>