Chuck Anderson [Sun, 27 Nov 2016 22:06:13 +0000 (14:06 -0800)]
Merge branch topic/uek-4.1/sparc of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/sparc: (21 commits)
SPARC64: PORT LDMVSW DRIVER TO UEK4
SPARC64: Fix bad FP register calculation
SPARC64: Respect no-fault ASI for floating exceptions
sparc64: Fixes NUMA node cpulist sysfs file in single NUMA node case.
sparc64: Cleans up PRIQ error and debugging messages.
sparc: Remove console spam during kdump
sparc64: kdump: set crashing_cpu for panic
sparc: kexec: Don't mess with the tl register
sparc64: VDS should try indefinitely to allocate IO pages
sparc64: Use block layer BIO-based interface for VDC IO requests
sparc64: Enable virtual disk protocol out of order execution
ipmi: Fix NULL pointer access and double free panic.
ipmi: Update ipmi driver as per new vldc interface
ipmi: Fix ipmi driver for ilom reset scenario
sparc64: vcc fixes
sparc64: Fix kernel panic due to erroneous #ifdef surrounding pmd_write()
sparc64: Initialize xl_hugepage_shift to 0
sparc64:mm/hugetlb: Set correct huge_pte_count index for 8M hugepages
sparc64: Fix accounting issues used to size TSBs
sparc64: Fix irq stack bootmem allocation.
...
Chuck Anderson [Sun, 27 Nov 2016 22:04:53 +0000 (14:04 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
mm/hugetlb: hugetlb_no_page: rate-limit warning message
net/vxlan: Fix kernel unaligned access in __vxlan_find_mac
kexec: align crash_notes allocation to make it be inside one physical page
iommu-common: Fix error code used in iommu_tbl_range_{alloc,free}().
The warning message "killed due to inadequate hugepage pool" simply
indicates that SIGBUS was sent, not that the process was forcibly killed.
If the process has a signal handler installed that does not fix the
problem, this message can rapidly spam the kernel log.
On my amd64 dev machine that does not have hugepages configured, I can
reproduce the repeated warnings easily by setting vm.nr_hugepages=2 (i.e.,
4 megabytes of huge pages) and running something that sets a signal
handler and forks, like
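For illustration, a reproducer of that shape might look like the sketch below. This is an assumption-laden example rather than the program from the upstream commit; it assumes 2MB huge pages (so vm.nr_hugepages=2 gives the 4MB pool mentioned above) and will spin generating SIGBUS, and hence the warning, until killed:

    #include <signal.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LEN (8UL << 20)                 /* more than the 4MB pool */

    static void handler(int sig) { }        /* swallow SIGBUS so the faulting access retries */

    int main(void)
    {
            char *p;

            signal(SIGBUS, handler);
            p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_NORESERVE,
                     -1, 0);
            if (p == MAP_FAILED)
                    return 1;
            if (fork() == 0) {
                    memset(p, 'c', LEN);    /* child faults past the pool: SIGBUS, warning, retry */
                    _exit(0);
            }
            memset(p, 'p', LEN);            /* parent does the same */
            return 0;
    }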
Backport of upstream commit 7177a3b037c7 ("net/vxlan: Fix kernel
unaligned access in __vxlan_find_mac")
__vxlan_find_mac invokes ether_addr_equal on the eth_addr field,
which triggers unaligned access messages, so rearrange vxlan_fdb
to avoid this in the most non-intrusive way.
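The rearrangement is essentially of the shape sketched below (field set taken from the upstream driver; the exact layout in this tree may differ): moving eth_addr ahead of the u16/u8 tail fields leaves it at an aligned offset, so the 16-bit loads done by ether_addr_equal() are aligned.

    struct vxlan_fdb {
            struct hlist_node hlist;        /* linked list of entries */
            struct rcu_head   rcu;
            unsigned long     updated;      /* jiffies */
            unsigned long     used;
            struct list_head  remotes;
            u8                eth_addr[ETH_ALEN];  /* moved up: now starts on an aligned offset */
            u16               state;        /* see ndm_state */
            u8                flags;        /* see ndm_flags */
    };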
Reviewed-by: Jiri Pirko <jiri@resnulli.us> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit aa51c7b55e350240c2ed5a8a217688d5bfd13424) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit 45cbd27626d17d485ca0718b28add12513690921) Signed-off-by: Allen Pais <allen.pais@oracle.com>
David S. Miller [Wed, 4 Nov 2015 19:30:57 +0000 (11:30 -0800)]
iommu-common: Fix error code used in iommu_tbl_range_{alloc,free}().
The value returned from iommu_tbl_range_alloc() (and the one passed
in as a fourth argument to iommu_tbl_range_free) is not a DMA address,
it is rather an index into the IOMMU page table.
Therefore using DMA_ERROR_CODE is not appropriate.
Use a more type matching error code define, IOMMU_ERROR_CODE, and
update all users of this interface.
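After the change, callers test against the table-index sentinel rather than DMA_ERROR_CODE; a rough sketch (variable names here are illustrative):

    entry = iommu_tbl_range_alloc(dev, iommu, npages, NULL, mask, order);
    if (unlikely(entry == IOMMU_ERROR_CODE))   /* a table-index error, not a DMA address */
            goto range_alloc_fail;

    /* and on the free side, the sentinel is what gets passed as the 4th argument */
    iommu_tbl_range_free(iommu, dma_addr, npages, IOMMU_ERROR_CODE);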
Reported-by: Andre Przywara <andre.przywara@arm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Allen Pais <allen.pais@oracle.com>
ixgbevf: Change the relaxed order settings in VF driver for sparc
We noticed performance issues with the VF interface on sparc compared
to the PF. Setting the RX control to IXGBE_DCA_RXCTRL_DATA_WRO_EN brings
it on par with the PF. This also matches the default sparc settings in
the PF driver.
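Roughly, the change amounts to a fragment like the one below in the VF RX ring setup path. The IXGBE_VFDCA_RXCTRL register macro and the hook point are assumptions here; only the IXGBE_DCA_RXCTRL_DATA_WRO_EN bit comes from the description above.

    #ifdef CONFIG_SPARC
            /* sketch: enable relaxed ordering for RX data write-back, matching the PF */
            u32 rxctrl = IXGBE_READ_REG(hw, IXGBE_VFDCA_RXCTRL(reg_idx));

            rxctrl |= IXGBE_DCA_RXCTRL_DATA_WRO_EN;
            IXGBE_WRITE_REG(hw, IXGBE_VFDCA_RXCTRL(reg_idx), rxctrl);
    #endif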
Chuck Anderson [Sun, 27 Nov 2016 00:51:51 +0000 (16:51 -0800)]
Merge branch topic/uek-4.1/dtrace of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/dtrace:
dtrace: eliminate need for arg counting in sdt macros
dtrace: augment SDT probes with type information
dtrace: import the sdt type information into per-sdt_probedesc state
dtrace: record SDT and perf probe types in a new ELF section
Chuck Anderson [Sun, 27 Nov 2016 00:50:36 +0000 (16:50 -0800)]
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/ofed:
mlx4: avoid multiple free on id_map_ent
xsigo: supported SGE's for LSO QP
xsigo: Hardening driver in handling remote QP failures
IB/cm: avoid query device in CM REQ/REP
IB/cm: return original rnr value when RNR WA for PSIF
IB/cm: MBIT needs to be used in network order
IB/core: Issue DREQ when receiving REQ/REP for stale QP
sif: cq: cleanup cqe once a kernel qp is destroyed/reset
sif: cq: sif_poll_cq might not drain cq completely
sif: rq: do not flush rq if it is an srq
sif: cq: use refcnt to disable/enable cq polling
sif: pt: Add support for single thread modified page tables
sif: pqp: Implement handling of PQPs in error.
sif: eps*: initialize each struct in array
sif: query_device: Return correct #SGEs for EoIB
sif: LSO not supported for EoIB queuepairs
sif: pqp: Make setup/teardown function ref sif_pqp_info directly
sif: Move the rest of the pqp setup and teardown to sif_pqp
sif: Move sif_dfs_register beyond base init
sif: Refactor PQP state out of sif_dev.
Chuck Anderson [Sun, 27 Nov 2016 00:49:48 +0000 (16:49 -0800)]
Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/drivers:
NVMe: reduce queue depth as workaround for Samsung EPIC SQ errata
bonding: "primary_reselect" with "failure" is not working properly
IB/core: uverbs: Do not wait for uverbs dev closure during forced system shutdown.
Chuck Anderson [Sun, 27 Nov 2016 00:48:32 +0000 (16:48 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
Bluetooth: Fix potential NULL dereference in RFCOMM bind callback
aacraid: Check size values after double-fetch from user
mm: migrate dirty page without clear_page_dirty_for_io etc
xen-netfront: cast grant table reference first to type int
xen-netfront: do not cast grant table reference to signed short
Chuck Anderson [Sun, 27 Nov 2016 00:47:09 +0000 (16:47 -0800)]
Merge branch topic/uek-4.1/stable-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/stable-cherry-picks: (21 commits)
ocfs2: fix not enough credit panic
ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
ocfs2/dlm: fix race between convert and migration
ocfs2: solve a problem of crossing the boundary in updating backups
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
ocfs2/dlm: do not insert a new mle when another process is already migrating
ocfs2: fix slot overwritten if storage link down during mount
ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
ocfs2/dlm: fix a race between purge and migration
ocfs2/dlm: clear migration_pending when migration target goes down
ocfs2: fix BUG when calculate new backup super
ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
ocfs2: fix race between mount and delete node/cluster
ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
ocfs2: avoid access invalid address when read o2dlm debug messages
ocfs2: fix a tiny case that inode can not removed
ocfs2: trusted xattr missing CAP_SYS_ADMIN check
ocfs2: set filesytem read-only when ocfs2_delete_entry failed.
...
Double free is found on id_map_ent.
Existing code makes another queue_delayed_work regardless if the worker was
successfully canceled or not. In case it's not, means the worker routine already
started to run, and then the later queued work will do the same work against
the same id_map_ent stucture. The worker routine actually does cleanup/free
on the structure, so the 2nd run of it is dengerous.
Fix is that we check if we successfully canceled previously queued work, if
that's not canceled, we don't queue the work again on the same structure.
Junxiao Bi [Tue, 1 Nov 2016 06:42:20 +0000 (14:42 +0800)]
ocfs2: fix not enough credit panic
The following panic was caught when running the ocfs2 disconfig single
test (block size 512 and cluster size 8192). ocfs2_journal_dirty()
returned -ENOSPC, which means the credits were used up. The total credit
should include 3 times "num_dx_leaves" from ocfs2_dx_dir_rebalance(),
because 2 times will be consumed in ocfs2_dx_dir_transfer_leaf() and
1 time will be consumed in ocfs2_dx_dir_new_cluster()->
__ocfs2_dx_dir_new_cluster()->ocfs2_dx_dir_format_cluster(). But only
two times are included in ocfs2_dx_dir_rebalance_credits(), so fix it.
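Expressed as a fragment of the credit calculation (the surrounding function body is assumed; only the counts come from the text above):

    /* ocfs2_dx_dir_rebalance_credits(): reserve num_dx_leaves three times,
     * not two: 2 * num_dx_leaves for ocfs2_dx_dir_transfer_leaf() plus
     * 1 * num_dx_leaves for ocfs2_dx_dir_format_cluster().
     */
    credits += 3 * num_dx_leaves;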
Eric Ren [Fri, 30 Sep 2016 22:11:32 +0000 (15:11 -0700)]
ocfs2: fix deadlock on mmapped page in ocfs2_write_begin_nolock()
The testcase "mmaptruncate" of ocfs2-test deadlocks occasionally.
In this testcase, we create a 2*CLUSTER_SIZE file and mmap() it;
there are 2 processes repeatedly performing the following operations
respectively: one is doing memset(mmaped_addr + 2*CLUSTER_SIZE - 1, 'a',
1), while the other is doing ftruncate(fd, 2*CLUSTER_SIZE) and then
ftruncate(fd, CLUSTER_SIZE) again and again.
In ocfs2_write_begin_nolock(), we first grab the pages and then allocate
disk space for this write; ocfs2_try_to_free_truncate_log() will be
called if -ENOSPC is returned; if we're lucky to get enough clusters,
which is usually the case, we start over again.
But in ocfs2_free_write_ctxt() the target page isn't unlocked, so we
will deadlock when trying to grab the target page again.
Also, -ENOMEM might be returned in ocfs2_grab_pages_for_write().
Another deadlock will happen in __do_page_mkwrite() if
ocfs2_page_mkwrite() returns non-VM_FAULT_LOCKED, and along with a
locked target page.
These two errors fail on the same path, so fix them by unlocking the
target page manually before ocfs2_free_write_ctxt().
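A hedged fragment of what that manual unlock looks like (the write-context field name used here is an assumption):

    /* sketch: ocfs2_free_write_ctxt() does not unlock the target page, so
     * drop the page lock first, both before retrying on -ENOSPC and before
     * returning a non-VM_FAULT_LOCKED error to the page_mkwrite path.
     */
    if (wc->w_target_page)
            unlock_page(wc->w_target_page);
    ocfs2_free_write_ctxt(wc);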
Jan Kara helped me clear up the JBD2 part and suggested the hint for the
root cause.
Changes since v1:
1. Also put ENOMEM error case into consideration.
Link: http://lkml.kernel.org/r/1474173902-32075-1-git-send-email-zren@suse.com Signed-off-by: Eric Ren <zren@suse.com> Reviewed-by: He Gang <ghe@suse.com> Acked-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c33f0785bf292cf1d15f4fbe42869c63e205b21c)
Joseph Qi [Mon, 19 Sep 2016 21:43:55 +0000 (14:43 -0700)]
ocfs2/dlm: fix race between convert and migration
Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks whether the lockres master has changed to identify whether the new
master has finished recovery or not. This introduces a race: right after
the old master does umount (meaning the master will change), a new convert
request comes in.
In this case, it will reset the lockres state to DLM_RECOVERING and then
retry the convert, which then fails with lockres->l_action set to
OCFS2_AST_INVALID. This causes an inconsistent lock level between ocfs2
and dlm, and finally a BUG.
Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery. So fix it.
Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery") Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Jun Piao <piaojun@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e6f0c6e6170fec175fe676495f29029aecdf486c)
jiangyiwen [Fri, 25 Mar 2016 21:21:35 +0000 (14:21 -0700)]
ocfs2: solve a problem of crossing the boundary in updating backups
In update_backups() there exists a boundary-crossing problem, as
follows:
assume the lun will be resized to 1TB (cluster_size is 32kb); it will then
contain clusters 0~33554431. In update_backups(), the super block is backed
up at the 1TB location, which is the 33554432th cluster, so the boundary
gets crossed.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 584dca3440732afa84fbca07567bb66e1453936a)
jiangyiwen [Tue, 15 Mar 2016 21:53:01 +0000 (14:53 -0700)]
ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump()
Commit a75e9ccabd92 ("ocfs2: use spinlock irqsave for downconvert lock")
missed an unmodified place in ocfs2_osb_dump(), so a deadlock scenario
still exists.
Joseph Qi [Thu, 14 Jan 2016 23:17:44 +0000 (15:17 -0800)]
ocfs2: access orphan dinode before delete entry in ocfs2_orphan_del
In ocfs2_orphan_del, currently it finds and deletes the entry first, and
then accesses the orphan dir dinode. This is a problem once
ocfs2_journal_access_di fails: the entry will have been removed from the
orphan dir, but the inode has not actually been deleted. In other words,
the file is missing but not actually deleted. So we should access the
orphan dinode first, as unlink and rename do.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Jiufei Xue <xuejiufei@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 074a6c655f6da12cb1123c8a84bfd8d781138800)
xuejiufei [Thu, 14 Jan 2016 23:17:41 +0000 (15:17 -0800)]
ocfs2/dlm: do not insert a new mle when another process is already migrating
When two processes are migrating the same lockres,
dlm_add_migration_mle() returns -EEXIST but still inserts a new mle into
the hash list. dlm_migrate_lockres() will then detach the old mle and free
the new one which is already in the hash list, corrupting the list.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 32e493265b2be96404aaa478fb2913be29b06887)
jiangyiwen [Thu, 14 Jan 2016 23:17:33 +0000 (15:17 -0800)]
ocfs2: fix slot overwritten if storage link down during mount
The following case will lead to a slot being overwritten.
N1: mount ocfs2 volume, find and allocate slot 0, then set
    osb->slot_num to 0, begin to write slot info to disk
N2: mount ocfs2 volume, wait for super lock
N1: write block fails because of storage link down, unlock
    super lock
N2: got super lock and also allocate slot 0,
    then unlock super lock
N1: mount fails and then dismounts; since osb->slot_num is 0,
    it tries to put the invalid slot to disk, and this will
    succeed if the storage link restores.
N2's slot info is now overwritten.
Once another node, say N3, mounts, it will find and allocate slot 0 again,
which will lead to a mount hang because the journal has already been
locked by N2. So when writing slot info fails, invalidate the slot in
advance to avoid overwriting it.
[akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 1247017f43a93eae3d64b7c25f3637dc545f5a47)
Xue jiufei [Thu, 14 Jan 2016 23:17:29 +0000 (15:17 -0800)]
ocfs2/dlm: return appropriate value when dlm_grab() returns NULL
dlm_grab() may return NULL when the node is unmounting. During code
review, we found that some dlm handlers may return an error to the caller
when dlm_grab() returns NULL, making the caller BUG or hit other problems.
Here is an example:
Node 1: receives migration message from node 3, and sends
        a migrate request to the other nodes
Node 2: starts unmounting
Node 2: receives the migrate request from node 1 and calls
        dlm_migrate_request_handler()
Node 2: unmount thread unregisters domain handlers and removes
        dlm_context from dlm_domains
Node 2: dlm_migrate_request_handler() returns -EINVAL to node 1
Node 1: exits migration without clearing the migration state or sending
        an assert master message to node 3, which causes node 3 to hang.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c372f2193a2e73d5936bf37259ae63ca388b4cbc)
jiangyiwen [Thu, 14 Jan 2016 23:17:23 +0000 (15:17 -0800)]
ocfs2/dlm: wait until DLM_LOCK_RES_SETREF_INPROG is cleared in dlm_deref_lockres_worker
Commit f3f854648de6 ("ocfs2_dlm: Ensure correct ordering of set/clear
refmap bit on lockres") still has a race which can't ensure the
ordering is exactly correct.
Node1 Node2 Node3
umount, migrate
lockres to Node2
migrate finished,
send migrate request
to Node3
received migrate request,
create a migration_mle,
respond to Node2.
set DLM_LOCK_RES_SETREF_INPROG
and send assert master to
Node3
delete migration_mle in
assert_master_handler,
Node3 umount without response
dlm_thread purge
this lockres, send drop
deref message to Node2
found the flag of
DLM_LOCK_RES_SETREF_INPROG
is set, dispatch
dlm_deref_lockres_worker to
clear refmap, but in function of
dlm_deref_lockres_worker,
only if node in refmap it wait
DLM_LOCK_RES_SETREF_INPROG
to be cleared. So worker is
done successfully
purge lockres, send
assert master response
to Node1, and finish umount
set Node3 in refmap, and it
won't be cleared forever, thus
lead to umount hung
so wait until DLM_LOCK_RES_SETREF_INPROG is cleared in
dlm_deref_lockres_worker.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b5560143385e18b4109ad6951c7719705e3dd995)
Xue jiufei [Thu, 14 Jan 2016 23:17:18 +0000 (15:17 -0800)]
ocfs2/dlm: fix a race between purge and migration
We found a race between purge and migration when doing code review.
Node A puts a lockres on the purge list before receiving the migrate
message from node B, which is the master. Node A then calls
dlm_mig_lockres_handler to handle this message.
dlm_mig_lockres_handler
dlm_lookup_lockres
>>>>>> race window: dlm_run_purge_list may run and send a
deref message to the master, then wait for the response
spin_lock(&res->spinlock);
res->state |= DLM_LOCK_RES_MIGRATING;
spin_unlock(&res->spinlock);
dlm_mig_lockres_handler returns
>>>>>> dlm_thread receives the response from master for the deref
message and triggers the BUG because the lockres has the state
DLM_LOCK_RES_MIGRATING with the following message:
xuejiufei [Tue, 29 Dec 2015 22:54:29 +0000 (14:54 -0800)]
ocfs2/dlm: clear migration_pending when migration target goes down
We have found a BUG on res->migration_pending when migrating lock
resources. The situation is as follows.
dlm_mark_lockres_migrating
    res->migration_pending = 1;
    __dlm_lockres_reserve_ast
    dlm_lockres_release_ast returns with res->migration_pending
    still set because other threads reserve asts
    wait until dlm_migration_can_proceed returns 1
    >>>>>>> o2hb finds that the target has gone down and removes
            the target from the domain_map
    dlm_migration_can_proceed returns 1
    dlm_mark_lockres_migrating returns -ESHUTDOWN with
    res->migration_pending still set.
When reentering dlm_mark_lockres_migrating(), it will trigger the BUG_ON
with res->migration_pending. So clear migration_pending when target is
down.
Signed-off-by: Jiufei Xue <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cc28d6d80f6ab494b10f0e2ec949eacd610f66e3)
Joseph Qi [Tue, 29 Dec 2015 22:54:06 +0000 (14:54 -0800)]
ocfs2: fix BUG when calculate new backup super
When resizing, ocfs2 first extends the last gd. Once it should back up
the super block in that gd, it calculates the new backup super and updates
the corresponding value.
But it currently doesn't consider the situation where the backup super has
already been done. In that case, it still sets the bit in the gd bitmap
and then decreases bg_free_bits_count, which leads to a corrupted gd and
triggers the BUG in ocfs2_block_group_set_bits:
alex chen [Fri, 6 Nov 2015 02:44:10 +0000 (18:44 -0800)]
ocfs2: should reclaim the inode if '__ocfs2_mknod_locked' returns an error
In ocfs2_mknod_locked, if __ocfs2_mknod_locked returns an error, we
should reclaim the inode successfully claimed above; otherwise, the
inode will never be reused. The case is described below:
ocfs2_mknod
  ocfs2_mknod_locked
    ocfs2_claim_new_inode
      Successfully claim the inode
    __ocfs2_mknod_locked
      ocfs2_journal_access_di
        Failed because of -ENOMEM or other reasons; the inode
        lockres has not been initialized yet.

  iput(inode)
    ocfs2_evict_inode
      ocfs2_delete_inode
        ocfs2_inode_lock
          ocfs2_inode_lock_full_nested
            __ocfs2_cluster_lock
              Return -EINVAL because the inode
              lockres has not been initialized.

  So the following operations are not performed:
    ocfs2_wipe_inode
      ocfs2_remove_inode
        ocfs2_free_dinode
          ocfs2_free_suballoc_bits
Signed-off-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b1529a41f777a48f95d4af29668b70ffe3360e1b)
Joseph Qi [Fri, 6 Nov 2015 02:44:07 +0000 (18:44 -0800)]
ocfs2: fix race between mount and delete node/cluster
There is a race between mount and delete node/cluster which can drive
o2hb_thread into a malfunctioning dead loop.
o2hb_thread
{
    o2nm_depend_this_node();
    <<<<<< race window: the node may have already been deleted, and then
           we enter the loop; the o2hb thread will malfunction because
           no configured nodes are found.
    while (!kthread_should_stop() &&
           !reg->hr_unclean_stop && !reg->hr_aborted_start) {
    }
}
So checking the return value of o2nm_depend_this_node() is needed. If the
node has been deleted, do not enter the loop and let the mount fail.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0986fe9b50f425ec81f25a1a85aaf3574b31d801)
Joseph Qi [Thu, 22 Oct 2015 20:32:29 +0000 (13:32 -0700)]
ocfs2/dlm: unlock lockres spinlock before dlm_lockres_put
dlm_lockres_put will call dlm_lockres_release if it is the last
reference, and then it may call dlm_print_one_lock_resource and
take lockres spinlock.
So unlock lockres spinlock before dlm_lockres_put to avoid deadlock.
Signed-off-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit b67de018b37a97548645a879c627d4188518e907)
ocfs2: avoid access invalid address when read o2dlm debug messages
The following case will lead to a lockres is freed but is still in use.
"cat" process: cat /sys/kernel/debug/o2dlm/locking_state
               lockres_seq_start
                 -> lock dlm->track_lock
                 -> get resA
dlm_thread:    resA->refs decrease to 0,
               call dlm_lockres_release,
               and wait for "cat" unlock.
"cat" process: although resA->refs is already set to 0,
               increase resA->refs, and then unlock
dlm_thread:    lock dlm->track_lock
                 -> list_del_init()
                 -> unlock
                 -> free resA
In such a race, an invalid address access may occur. So we should delete
res->tracking from the list before resA->refs decreases to 0.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f57a22ddecd6f26040a67e2c12880f98f88b6e00)
When running dirop_fileop_racer we found a case where an inode cannot
be removed.
Two nodes, say Node A and Node B, mount the same ocfs2 volume. Create
two dirs /race/1/ and /race/2/ in the filesystem.
Node A Node B
rm -r /race/2/
mv /race/1/ /race/2/
call ocfs2_unlink(), get
the EX mode of /race/2/
wait for B unlock /race/2/
decrease i_nlink of /race/2/ to 0,
and add inode of /race/2/ into
orphan dir, unlock /race/2/
got EX mode of /race/2/. because
/race/1/ is dir, so inc i_nlink
of /race/2/ and update into disk,
unlock /race/2/
because i_nlink of /race/2/
is not zero, this inode will
always remain in orphan dir
This patch fixes this case by testing whether i_nlink of the new dir is zero.
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Xue jiufei <xuejiufei@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 928dda1f9433f024ac48c3d97ae683bf83dd0e42)
The trusted extended attributes are only visible to processes which
have the CAP_SYS_ADMIN capability, but the check is missing in the ocfs2
xattr_handler trusted list. The check is important because this will be
used for implementing mechanisms in userspace to which other ordinary
processes should not have access.
Signed-off-by: Sanidhya Kashyap <sanidhya.gatech@gmail.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Taesoo kim <taesoo@gatech.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0f5e7b41f91814447defc34e915fc5d6e52266d9)
Xue jiufei [Wed, 24 Jun 2015 23:55:20 +0000 (16:55 -0700)]
ocfs2: fix NULL pointer dereference in function ocfs2_abort_trigger()
ocfs2_abort_trigger() uses bh->b_assoc_map to get the sb. But there's no
function that sets bh->b_assoc_map in ocfs2, so calling this function will
trigger a NULL pointer dereference. We can get the sb from
bh->b_bdev->bd_super instead of b_assoc_map.
[akpm@linux-foundation.org: update comment, per Joseph] Signed-off-by: joyce.xue <xuejiufei@huawei.com> Cc: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 74e364ad1b13fd518a0bd4e5aec56d5e8706152f)
Handle scenarios where PSIF generates batched
transmit completions.
In case of a Remote QP disconnect, follow this sequence:
-> If there are any pending transmit completions,
   explicitly transition the QP state to ERROR.
-> Wait for a maximum of 10 seconds for all pending
   completions
   (10 seconds derived from retrycount * local ack timeout)
Destroy the QP.
Wait for another 10 seconds (max) if completions are
not returned by hardware.
Synchronized calls to poll_tx.
Added more efficient handling of the TX queue full condition.
Handle the scenario where uVNIC removal can come in batches.
In aacraid's ioctl_send_fib() we do two fetches from userspace: one to
get the fib header's size and one for the fib itself. Later we use the
size field from the second fetch to further process the fib. If for some
reason the size from the second fetch is different from the first fetch,
we may encounter an out-of-bounds access in aac_fib_send(). We also
check the sender size to ensure it is not out of bounds. This was
reported in https://bugzilla.kernel.org/show_bug.cgi?id=116751 and was
assigned CVE-2016-6480.
Reported-by: Pengfei Wang <wpengfeinudt@gmail.com> Fixes: 7c00ffa31 '[SCSI] 2.6 aacraid: Variable FIB size (updated patch)' Cc: stable@vger.kernel.org Signed-off-by: Dave Carroll <david.carroll@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit fa00c437eef8dc2e7b25f8cd868cfa405fcc2bb3) Signed-off-by: Dan Duval <dan.duval@oracle.com>
Oracle discovered that the NVMe driver gets SQ completion errors,
eventually leading to the device being reset, taken out of the PCI bus
tree, or kernel panics, when using the default SQ size of 1024 entries
(64KB) with Samsung EPIC NVMe SSDs.
PCIe analyzer tracing by Oracle and Samsung revealed an errata in Samsung's
firmware for EPIC SSDs where these invalid completion entries can occur
when the queues straddle an 8MB DMA address boundary.
This patch works around the errata by detecting these specific devices and
limiting their descriptor queue depth to 64. This is only for the Samsung
NVMe controllers used in Oracle X-series servers.
There was no noticeable performance impact of reducing queue depths to 64
for these Samsung drives, Oracle X6-2 server, and Oracle VM Server 3.4.2.
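The shape of such a quirk is roughly the fragment below; the helper used to match the affected controllers is an illustrative assumption, and only the depth of 64 comes from the description above.

    /* sketch: cap the SQ depth so a queue can never straddle an 8MB
     * DMA address boundary on the affected Samsung controllers */
    if (pdev->vendor == PCI_VENDOR_ID_SAMSUNG && is_affected_epic_ssd(pdev))
            dev->q_depth = min_t(u32, dev->q_depth, 64);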
Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Signed-off-by: Bhavesh Davda <bhavesh.davda@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Waiman Long [Tue, 6 Sep 2016 17:22:10 +0000 (13:22 -0400)]
x86/hpet: Reduce HPET counter read contention
On a large system with many CPUs, using HPET as the clock source can
have a significant impact on the overall system performance because
of the following reasons:
1) There is a single HPET counter shared by all the CPUs.
2) HPET counter reading is a very slow operation.
Using HPET as the default clock source may happen when, for example,
the TSC clock calibration exceeds the allowable tolerance. Sometimes
the performance slowdown can be so severe that the system may crash
because of an NMI watchdog soft lockup, for example.
During the TSC clock calibration process, the default clock source
will be set temporarily to HPET. For systems with many CPUs, it is
possible that NMI watchdog soft lockup may occur occasionally during
that short time period where HPET clocking is active as is shown in
the kernel log below:
This patch addresses the above issues by reducing HPET read contention
using the fact that if more than one CPUs are trying to access HPET at
the same time, it will be more efficient when only one CPU in the group
reads the HPET counter and shares it with the rest of the group instead
of each group member trying to read the HPET counter individually.
This is done by using a combination quadword that contains a 32-bit
stored HPET value and a 32-bit spinlock. The CPU that gets the lock
will be responsible for reading the HPET counter and storing it in
the quadword. The others will monitor the change in HPET value and
lock status and grab the latest stored HPET value accordingly. This
change is only enabled on 64-bit SMP configuration.
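The mechanism is roughly the sketch below, simplified from the upstream change (details such as the NMI fallback path are omitted):

    union hpet_lock {
            struct {
                    arch_spinlock_t lock;   /* 32-bit spinlock */
                    u32 value;              /* last HPET counter value read */
            };
            u64 lockval;                    /* both halves read/written as one quadword */
    };

    static union hpet_lock hpet __cacheline_aligned = {
            { .lock = __ARCH_SPIN_LOCK_UNLOCKED, },
    };

    static u64 read_hpet(struct clocksource *cs)
    {
            unsigned long flags;
            union hpet_lock old, new;

            old.lockval = READ_ONCE(hpet.lockval);

            if (!arch_spin_is_locked(&old.lock)) {
                    local_irq_save(flags);
                    if (arch_spin_trylock(&hpet.lock)) {
                            /* this CPU reads the counter on behalf of the group */
                            new.value = hpet_readl(HPET_COUNTER);
                            WRITE_ONCE(hpet.value, new.value);
                            arch_spin_unlock(&hpet.lock);
                            local_irq_restore(flags);
                            return (u64)new.value;
                    }
                    local_irq_restore(flags);
            }

            /* contended: wait for the lock holder to publish a newer value */
            do {
                    cpu_relax();
                    new.lockval = READ_ONCE(hpet.lockval);
            } while ((new.value == old.value) && arch_spin_is_locked(&new.lock));

            return (u64)new.value;
    }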
On a 4-socket Haswell-EX box with 144 threads (HT on), running the
AIM7 compute workload (1500 users) on a 4.8-rc1 kernel (HZ=1000)
with and without the patch has the following performance numbers
(with HPET or TSC as clock source):
The perf profile showed a reduction of the %CPU time consumed by
read_hpet from 11.19% without patch to 1.24% with patch.
[ tglx: It's really sad that we need to have such hacks just to deal with
the fact that cpu vendors have not managed to fix the TSC wreckage
within 15+ years. Were They Forgetting? ]
Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Tested-by: Prarit Bhargava <prarit@redhat.com> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Douglas Hatch <doug.hatch@hpe.com> Cc: Randy Wright <rwright@hpe.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1473182530-29175-1-git-send-email-Waiman.Long@hpe.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit f99fd22e4d4bc84880a8a3117311bbf0e3a6a9dc)
When "primary_reselect" is set to "failure", primary interface should
not become active until current active slave is down. But if we set first
member of bond device as a "primary" interface and "primary_reselect"
is set to "failure" then whenever primary interface's link gets back(up)
it becomes active slave even if current active slave is still up.
With this patch, "bond_find_best_slave" will not traverse members if
primary interface is not candidate for failover/reselection and current
active slave is still up.
Reviewed-by: John Haxby <john.haxby@oracle.com> Signed-off-by: Mazhar Rana <mazhar.rana@cyberoam.com> Signed-off-by: Jay Vosburgh <j.vosburgh@gmail.com> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
clear_page_dirty_for_io() has accumulated writeback and memcg subtleties
since v2.6.16 first introduced page migration; and the set_page_dirty()
which completed its migration of PageDirty, later had to be moderated to
__set_page_dirty_nobuffers(); then PageSwapBacked had to skip that too.
No actual problems seen with this procedure recently, but if you look into
what the clear_page_dirty_for_io(page)+set_page_dirty(newpage) is actually
achieving, it turns out to be nothing more than moving the PageDirty flag,
and its NR_FILE_DIRTY stat from one zone to another.
It would be good to avoid a pile of irrelevant decrementations and
incrementations, and improper event counting, and unnecessary descent of
the radix_tree under tree_lock (to set the PAGECACHE_TAG_DIRTY which
radix_tree_replace_slot() left in place anyway).
Do the NR_FILE_DIRTY movement, like the other stats movements, while
interrupts still disabled in migrate_page_move_mapping(); and don't even
bother if the zone is the same. Do the PageDirty movement there under
tree_lock too, where old page is frozen and newpage not yet visible:
bearing in mind that as soon as newpage becomes visible in radix_tree, an
un-page-locked set_page_dirty() might interfere (or perhaps that's just
not possible: anything doing so should already hold an additional
reference to the old page, preventing its migration; but play safe).
But we do still need to transfer PageDirty in migrate_page_copy(), for
those who don't go the mapping route through migrate_page_move_mapping().
Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 42cb14b110a5698ccf26ce59c4441722605a3743) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
mm/migrate.c
Dongli Zhang [Thu, 17 Nov 2016 05:55:27 +0000 (13:55 +0800)]
xen-netfront: cast grant table reference first to type int
IS_ERR_VALUE() in commit 87557efc27f6a50140fb20df06a917f368ce3c66
("xen-netfront: do not cast grant table reference to signed short") would
not return true for an error code unless we first cast ref to type int.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Oracle-Bug: 25138361
upstream commit: 269ebce4531b8edc4224259a02143181a1c1d77c Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed by: Jack F. Vogel <jack.vogel@oracle.com> Acked-by: Joe Jin <joe.jin@oracle.com>
Dongli Zhang [Thu, 17 Nov 2016 05:54:19 +0000 (13:54 +0800)]
xen-netfront: do not cast grant table reference to signed short
While grant reference is of type uint32_t, xen-netfront erroneously casts
it to signed short in BUG_ON().
This would lead to the xen domU panic during boot-up or migration when it
is attached with lots of paravirtual devices.
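Taken together with the previous commit, the check being fixed ends up of roughly this shape (a sketch based on the two descriptions, not the literal diff):

    /* before: a 32-bit grant ref truncated to 16 bits can look negative,
     * falsely triggering the BUG_ON once many PV devices are attached */
    BUG_ON((signed short)ref < 0);

    /* after: only treat genuine error values as fatal; cast to int first
     * so IS_ERR_VALUE() sees a sign-extended error code */
    BUG_ON(IS_ERR_VALUE((unsigned long)(int)ref));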
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Oracle-Bug: 25138362
upstream commit: 87557efc27f6a50140fb20df06a917f368ce3c66 Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Reviewed by: Jack F. Vogel <jack.vogel@oracle.com> Acked-by: Joe Jin <joe.jin@oracle.com>
Rama Nichanamatlu [Thu, 10 Nov 2016 08:30:07 +0000 (00:30 -0800)]
IB/core: uverbs: Do not wait for uverbs dev closure during forced system shutdown.
Waiting for uverbs dev file closure in the system forced shutdown [e.g
reboot -f] path locks up the reboot process forever preventing system
shutdown as application closure of this device file is very unlikely at
this time. This closure happens in orderly shutdown [e.g reboot] as
processes are sent SIGTERM.
Linn Crosetto [Wed, 16 Nov 2016 20:33:52 +0000 (12:33 -0800)]
acpi: Disable ACPI table override if securelevel is set
From the kernel documentation (initrd_table_override.txt):
If the ACPI_INITRD_TABLE_OVERRIDE compile option is true, it is possible
to override nearly any ACPI table provided by the BIOS with an
instrumented, modified one.
When securelevel is set, the kernel should disallow any unauthenticated
changes to kernel space. ACPI tables contain code invoked by the kernel, so
do not allow ACPI tables to be overridden if securelevel is set.
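A hedged sketch of the shape of such a check in the table-override path; the get_securelevel() helper comes from the securelevel patch set and is an assumption here, not something verified against this tree:

    void __init acpi_initrd_override(void *data, size_t size)
    {
            /* refuse unauthenticated ACPI table overrides when securelevel is set */
            if (get_securelevel() > 0) {
                    pr_notice("ACPI: securelevel enabled, ignoring table override\n");
                    return;
            }

            /* ... normal initrd table override processing ... */
    }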
Signed-off-by: Linn Crosetto <linn@hpe.com>
Orabug: 25058372
CVE: CVE-2016-3699 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com> Reviewed-by: Guru Anbalagane <guru.anbalagane@oracle.com>
Francisco Triviño [Mon, 14 Nov 2016 16:46:14 +0000 (08:46 -0800)]
IB/cm: avoid query device in CM REQ/REP
The query device needed in CM REQ/REP is a bit expensive since it
involves a MAD query, and the result is not saved in the cache.
When the driver that holds the local device is different from sif,
there is no need to go through the query device. If the sif driver is
identified, then we still need to go through the query device in order
to get the specific vendor id. The latter is to make sure the software
workaround is applied only to the PSIF revisions that are affected.
This patch filters those cases and avoids unnecessary MAD queries
when the driver is different from sif.
Francisco Triviño [Mon, 14 Nov 2016 16:46:14 +0000 (08:46 -0800)]
IB/cm: return original rnr value when RNR WA for PSIF
With this patch, the original min_rnr_value set by the user is saved
in case it is later queried. The ib_qp flag has been re-purposed to
store the value as well. This patch makes the RNR WA implementation
completely transparent to the user.
Hans Westgaard Ry [Tue, 4 Oct 2016 12:09:17 +0000 (14:09 +0200)]
IB/core: Issue DREQ when receiving REQ/REP for stale QP
from "InfiBand Architecture Specifications Volume 1":
A QP is said to have a stale connection when only one side has
connection information. A stale connection may result if the remote CM
had dropped the connection and sent a DREQ but the DREQ was never
received by the local CM. Alternatively the remote CM may have lost
all record of past connections because its node crashed and rebooted,
while the local CM did not become aware of the remote node's reboot
and therefore did not clean up stale connections.
and:
A local CM may receive a REQ/REP for a stale connection. It shall
abort the connection, issuing REJ to the REQ/REP. It shall then issue
DREQ with "DREQ:remote QPN" set to the remote QPN from the REQ/REP.
This patch solves a problem with reuse of QPNs. The current codebase,
that is IPoIB, relies on a reap mechanism to do cleanup of the structures
in the CM. A problem with this is the time constants governing this
mechanism; they are up to 768 seconds, and the interface may look
unresponsive in that period. Issuing a DREQ (and receiving a DREP)
does the necessary cleanup and the interface comes up.
Wei Lin Guay [Mon, 31 Oct 2016 18:06:28 +0000 (19:06 +0100)]
sif: cq: cleanup cqe once a kernel qp is destroyed/reset
To ease the sqflush/rqflush workaround, sifdrv removes
all the associated cqes once a kernel qp is destroyed/reset.
Even though IB specification 10.2.4.4 mentions that
"Destroying a QP does not guarantee that CQEs of that
QP are deallocated from the CQ upon destruction.",
it also states that "Even if the CQEs are already on
the CQ, it might not be possible to retrieve them".
Thus, the IB spec indicates that this is a vendor-specific
implementation detail and the ULP should not assume that the
cqes are in the cq once a qp is destroyed/reset.
Wei Lin Guay [Thu, 3 Nov 2016 09:46:22 +0000 (10:46 +0100)]
sif: cq: sif_poll_cq might not drain cq completely
In a scenario where a duplicate completion is detected, sif_poll_cq
skips the remaining cqes in the cq. This bug was introduced in
commit "sif: sqflush: Handle duplicate completions in poll_cq".
Wei Lin Guay [Mon, 31 Oct 2016 20:25:26 +0000 (21:25 +0100)]
sif: rq: do not flush rq if it is an srq
sifdrv needs to flush a regular rq (non-srq) once it
detects that a qp is transitioned into ERR state.
Nevertheless, if a qp is created with srq and
with no event handler, the srq might be accidentally
flushed once the qp is transitioned into ERR sate.
Wei Lin Guay [Mon, 31 Oct 2016 19:38:25 +0000 (20:38 +0100)]
sif: cq: use refcnt to disable/enable cq polling
Due to a hardware bug, sifdrv must clean up the cq
before it is being polled by the user. Thus, sifdrv
uses CQ_POLLING_NOT_ALLOWED bitmask to disable/enable
cq polling.
Nevertheless, the bit mask operation is not sufficient in
a shared cq scenario (many qps to one cq). The cq cleanup
is performed by each qp and might be performed concurrently.
As a result, cq polling might be enabled before all qps
have cleaned up the cq.
Thus, this patch uses refcnt to disable/enable cq polling.
The CQ_POLLING_NOT_ALLOWED bitmask is kept in order to have
backward compatibility in the user library.
Knut Omang [Fri, 14 Oct 2016 11:27:26 +0000 (13:27 +0200)]
sif: pt: Add support for single thread modified page tables
Modifications to the page tables in sif_pt is protected by
a lock to allow multiple threads to add and subtract regions
to/from the page table in parallel. This functionality is
currently only needed/used by the special sq_cmpl page table
handling. In the future we might however need this also for
other cases, for instance to optimize further on page table
memory usage.
The kernel documentation for InfiniBand midlayer locking
requires that map_phys_fmr be callable from any context.
This prevents us from blocking on a lock, something that happens
if there is contention for the lock (e.g. more than one thread
involved in modifying the page table).
Implement another flag, thread_safe, in a pt that determines
whether a page table is going to need to be modified from multiple
threads simultaneously. For now, keep a BUG_ON if the code
is accessed in parallel for memory types
that should never see parallel access.
Knut Omang [Wed, 26 Oct 2016 09:44:00 +0000 (11:44 +0200)]
sif: pqp: Implement handling of PQPs in error.
The assumption is that any such situation that can arise in
production is due to an application that causes its CQ to go
to error and where the PQP subsequently tries to post a CQ
operation that affects the CQ that is in error. In these cases,
the PQP itself goes to error and an event is generated.
This commit refactors the modify_qp logic slightly, as well as
implementing a modification cycle to bring a privileged QP
back up again. It also adds a new pqp debugfs file and some statistics
to help monitoring the new PQP specific state as well.
The resurrect operation is queued on the sif workqueue
by the new handle_pqp_event function, which is now properly
wired up to accept all PQP events. When a PQP is detected as
being in error, its last_set_state is updated, and in addition
the write_only flag is set, which causes new send reqs not to
touch any collect buffer as part of the operation.
This flag was introduced to allow the resurrect to set the PQP in
RTS again while still not triggering any sends.
This way the implementation allows clients to continue to
post requests to the PQP while it is in error or in transition
back to RTS again by just accepting these requests into the PQP
send queue without any writes to the collect buffer.
When in the INIT state, the resurrect worker updates
the SQ pointers to skip the request that triggered
the PQP error.
Once back in RTS, the resurrect worker can take the single SQ lock
which serializes posts, check the size of the send queue
and if >= 0, trigger the send queue scheduler to start processing these.
Once the QP is in SQS mode, or just idle if the queue was empty,
it is safe for ordering purposes to let normal posting with
collect buffer writes commence.
Hakon Bugge [Tue, 1 Nov 2016 13:16:30 +0000 (14:16 +0100)]
sif: query_device: Return correct #SGEs for EoIB
ULPs that use LSO, can only create QPs with #SGE entries being one
less than what is supported in HW. This because PSIF uses one entry
for the LSO stencil. The driver attempts to detect the ULP and from
that derive if LSO will be used.
This commit adds support for XVE (Xsigo Virtual Ethernet), and a
query_device() from the XVE driver (not the connected mode part), will
return one less #SGEs.
Further, since the XVE ULP is detected, we amend the sysfs listing of
QPs with EoIB (for XVE datagram mode QPs) and EoIB_CM (for XVE
connected mode QPs).
Knut Omang [Sun, 23 Oct 2016 16:22:35 +0000 (18:22 +0200)]
sif: pqp: Make setup/teardown function ref sif_pqp_info directly
Take advantage of the new, cleaner and more separate PQP data structures:
Simplify/abstract pqp setup/teardown by pointing into sdev
for a direct ref to the sif_pqp_info data structure.
Knut Omang [Sun, 23 Oct 2016 15:35:14 +0000 (17:35 +0200)]
sif: Move the rest of the pqp setup and teardown to sif_pqp
This finishes the restructure started in the previous commit,
to consolidate PQP handling logic to make it simpler
to extend without spreading the complexity to other parts
of the code.
Knut Omang [Fri, 28 Oct 2016 03:29:28 +0000 (05:29 +0200)]
sif: Move sif_dfs_register beyond base init
The debugfs setup must be initialized prior to PQP operation,
but must also be deinitialized before base table takedown,
otherwise we are exposed to faults due to a race condition between
a user accessing debugfs tables and driver unload.
This commit moves the dfs init/deinit from sif_probe to
sif_hw_init to achieve this order.
Knut Omang [Sun, 23 Oct 2016 07:31:26 +0000 (09:31 +0200)]
sif: Refactor PQP state out of sif_dev.
Prepare handling of PQPs for the additional complexity
of resurrect upon errors and query for undetected
error conditions by moving some of the state
now directly in sif_dev into a new sif_pqp_info
struct.
Already complex logic for PQP handling is going to
increase in complexity. We need to consolidate
to be able to keep clean and easily maintainable
interfaces.
Chuck Anderson [Thu, 10 Nov 2016 14:27:04 +0000 (06:27 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
ecryptfs: don't allow mmap when the lower fs doesn't support it
Revert "ecryptfs: forbid opening files without mmap handler"
Jeff Mahoney [Tue, 5 Jul 2016 21:32:30 +0000 (17:32 -0400)]
ecryptfs: don't allow mmap when the lower fs doesn't support it
There are legitimate reasons to disallow mmap on certain files, notably
in sysfs or procfs. We shouldn't emulate mmap support on file systems
that don't offer support natively.
Chuck Anderson [Wed, 9 Nov 2016 22:19:53 +0000 (14:19 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
percpu: fix synchronization between synchronous map extension and chunk destruction
percpu: fix synchronization between chunk->map_extend_work and chunk destruction
ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt
ALSA: timer: Fix leak in events via snd_timer_user_ccallback
ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS
Tejun Heo [Wed, 25 May 2016 15:48:25 +0000 (11:48 -0400)]
percpu: fix synchronization between synchronous map extension and chunk destruction
For non-atomic allocations, pcpu_alloc() can try to extend the area
map synchronously after dropping pcpu_lock; however, the extension
wasn't synchronized against chunk destruction and the chunk might get
freed while extension is in progress.
This patch fixes the bug by putting most of non-atomic allocations
under pcpu_alloc_mutex to synchronize against pcpu_balance_work which
is responsible for async chunk management including destruction.
Tejun Heo [Wed, 25 May 2016 15:48:25 +0000 (11:48 -0400)]
percpu: fix synchronization between chunk->map_extend_work and chunk destruction
Atomic allocations can trigger async map extensions which is serviced
by chunk->map_extend_work. pcpu_balance_work which is responsible for
destroying idle chunks wasn't synchronizing properly against
chunk->map_extend_work and may end up freeing the chunk while the work
item is still in flight.
This patch fixes the bug by rolling async map extension operations
into pcpu_balance_work.
Kangjie Lu [Tue, 3 May 2016 20:44:32 +0000 (16:44 -0400)]
ALSA: timer: Fix leak in events via snd_timer_user_tinterrupt
The stack object “r1” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.
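The usual fix for this class of infoleak is to zero the whole stack object before filling it in, roughly:

    struct snd_timer_tread r1;

    memset(&r1, 0, sizeof(r1));     /* clear the padding before it can reach userspace */
    r1.event = SNDRV_TIMER_EVENT_RESOLUTION;
    r1.tstamp = *tstamp;
    r1.val = resolution;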
Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059885
CVE: CVE-2016-4578
Mainline v4.7 commit e4ec8cc8039a7063e24204299b462bd1383184a5 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Kangjie Lu [Tue, 3 May 2016 20:44:20 +0000 (16:44 -0400)]
ALSA: timer: Fix leak in events via snd_timer_user_ccallback
The stack object “r1” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059885
CVE: CVE-2016-4578
Mainline v4.7 commit 9a47e9cff994f37f7f0dbd9ae23740d0f64f9fe6 Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Kangjie Lu [Tue, 3 May 2016 20:44:07 +0000 (16:44 -0400)]
ALSA: timer: Fix leak in SNDRV_TIMER_IOCTL_PARAMS
The stack object “tread” has a total size of 32 bytes. Its field
“event” and “val” both contain 4 bytes padding. These 8 bytes
padding bytes are sent to user without being initialized.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: Takashi Iwai <tiwai@suse.de>
Orabug: 25059408
CVE: CVE-2016-4569
Mainline v4.7 commit cec8f96e49d9be372fdb0c3836dcf31ec71e457e Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
Allen Pais [Mon, 30 May 2016 07:42:15 +0000 (13:12 +0530)]
SPARC64: PORT LDMVSW DRIVER TO UEK4
Port of the new ldmvsw (Ldoms Virtual Switch) driver to UEK4.
This code has already been submitted and accepted
into the mainline Linux kernel.
The ldmvsw is very similar in function to the existing sunvnet driver. The
sunvnet driver is therefore split to put the code common to both drivers
into the kernel for use by both drivers when loaded (see sunvnet_common.c/h).
Signed-off-by: Aaron Young <aaron.young@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 361afffe35368dc23d2c9df6d7797ccf9af8fe57)
Rob Gardner [Sun, 1 Nov 2015 23:51:34 +0000 (16:51 -0700)]
SPARC64: Fix bad FP register calculation
An additional problem was found in handle_ldf_stq
after adding the fix for the SIGFPE on no-fault
load. The calculation for freg is incorrect when
a single precision load is being handled. This
causes %f1 to be seen as %f32 etc, and the incorrect
register ends up being overwritten. This code
sequence demonstrates the problem:
ldd [%g1], %f32 ! g1 = valid address
lda [%i3] ASI_PNF, %f1 ! i3 = invalid address
std %f32, [%g1] ! %f32 is mangled
This is corrected by basing the freg calculation on
the load size.
Rob Gardner [Fri, 30 Oct 2015 19:18:00 +0000 (13:18 -0600)]
SPARC64: Respect no-fault ASI for floating exceptions
Floating point load instructions using ASI_PNF or other
no-fault ASIs should never cause a SIGFPE. A store-quad
instruction should naturally fault if a non-quad register
is given, but this constraint should not apply to loads,
which may be single precision, double, or quad, and the
only constraint should be that the target register type
be appropriate for the precision of the load. A bug in
handle_ldf_stq() unnecessarily restricts no-fault loads
to quad registers, and causes a floating point exception
if one is not given. This restriction is removed.
Signed-off-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
The sysfs file /sys/devices/system/node/node0/cpulist is incorrect in the
single node case on sun4v machines, as the machine description record in
this case does not contain any NUMA information. A default list from 0 to
NR_CPUS was used previously. This file is read by utilities such as
'numactl --hardware' and lscpu to show the CPU-to-node assignment.
In order to fix this issue, the numa_cpumask_lookup_table is cleared at
bootup. Whenever an extra cpu is brought up via __cpu_up, the corresponding
cpu mask is set in the numa_cpumask_lookup_table.
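The second part of the fix amounts to something like the fragment below in the cpu bring-up path (a sketch; the exact hook point is an assumption):

    /* record the freshly onlined cpu in its node's cpumask so
     * /sys/devices/system/node/nodeN/cpulist stays accurate */
    cpumask_set_cpu(cpu, &numa_cpumask_lookup_table[cpu_to_node(cpu)]);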
chris hyser [Fri, 23 Sep 2016 16:27:07 +0000 (09:27 -0700)]
sparc64: Cleans up PRIQ error and debugging messages.
Given that the lowest level arch dependent interrupt routines cannot actually
propagate any error back to the calling driver in the case of irq
request/enable/disable and setting affinity, PRIQ error messages need to
communicate failures in a more traceable way. The original error messages which
were more for internal debugging than regular usage have also been improved as
well as made controllable via a command line parameter priq=dbg.
Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
(cherry picked from commit 89c31d4dd664cd2edc1f6d14aa62c75acfb0d172) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Kleikamp [Fri, 17 Jun 2016 14:51:04 +0000 (09:51 -0500)]
sparc: Remove console spam during kdump
Before executing the crash kernel, the panicking kernel cleans up the
irq state of the machine. This code contains a warning when cleaning up
unbound MSIs. Repeating this warning for each one floods the console and
can cause a waiting thread to time out before the other cpus have
completed.
This patch removes the warning and increases the time allowed for all
the cpus to complete the machine_capture_other_strands() function.
Dave Kleikamp [Mon, 27 Jun 2016 16:30:17 +0000 (11:30 -0500)]
sparc64: kdump: set crashing_cpu for panic
crashing_cpu was only being set in die_if_kernel() but not when a
crash dump is initiated from panic(). Move the initialization to
machine_crash_shutdown().
Also call bust_spinlocks() from die_if_kernel() to get rid of a warning
in smp_call_function_many(). It's already called in the panic path.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit 1b588be700fac73edd07c015ff53aecba5d92bec) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Kleikamp [Thu, 23 Jun 2016 19:04:30 +0000 (14:04 -0500)]
sparc: kexec: Don't mess with the tl register
I meddled with things I didn't fully understand while implementing commit
b43bc8f0 - "sparc64: add missing code for crash_setup_regs()".
I had changed the tl register in order to read tstate, tpc, etc. without
really knowing what I was doing. This can be a disaster if the crashing
thread takes another interrupt. Currently, the crash utility doesn't
even use those values; they are found on the stack instead.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit 80eb7e28d3c719bbe3af56de5a5a8c68b764dbb9) Signed-off-by: Allen Pais <allen.pais@oracle.com>
In the case of an ldom/hardware not supporting ldc, the ipmi_si module
will set the smi interface pointer to NULL after ldc channel
detection failure. However, the ipmi_si module will then crash during
unload due to the absence of a NULL check.
Add the smi interface NULL check and assign the workqueue to
NULL during cleanup to avoid a double free panic.
Signed-off-by: Atish Patra <atish.patra@oracle.com> Reviewed-by: David Aldridge <david.j.aldridge@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
(cherry picked from commit f2546771efb0c6402a5ea65dac9c5dbce18150e6) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Currently, the ipmi driver fakes itself as a userland process
to access the ipmi vldc channel.
This patch uses the new, cleaner vldc kernel interface that was added
for the ipmi driver.
Signed-off-by: Atish Patra <atish.patra@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 7a0d1deac3289130680a5ab1626c609b76c9f053)
The IPMI driver will have a stale vldc file pointer if the ILOM resets,
so the IPMI driver fails to work after the reset is complete.
The IPMI driver needs to close that file pointer and open another one
after the ILOM reset is complete.
This is achieved by trying to open the vldc file every 15 seconds
in a process context. As vldc or ldc cannot detect an ILOM reset,
this is the best possible approach for the problem.
This is based on Rob's patch for the mc reset fix. Signed-off-by: Atish Patra <atish.patra@oracle.com> Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
(cherry picked from commit cf5139791a8241fcab1f59c1da0a9058def661f2) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Vijay Kumar [Sun, 2 Oct 2016 21:40:18 +0000 (15:40 -0600)]
sparc64:mm/hugetlb: Set correct huge_pte_count index for 8M hugepages
Both set_huge_pte_at(...) and huge_ptep_get_and_clear(...)
call real_hugepage_size_to_pte_count_idx(hugepage_size) when adjusting
huge_pte_count. For 8MB/4MB the huge_pte_count index computed is 1(one).
This is incorrect because this index is for xl_hugepages. So the tsb
grow code in the mm fault path does not grow the tsb for 8MB/4MB
hugepages.
Signed-off-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Orabug: 24490586
(cherry picked from commit c928d6fccaa59bd4b6cffc904144fa67a4726ff6) Signed-off-by: Allen Pais <allen.pais@oracle.com>
As pages are allocated by a task, counters in the mm and mm_context
structures are used to track these allocations. These counters are
then used to size the task's TSBs. This patch addresses issues where
counts are not maintained properly, and TSBs of the incorrect size
are created for the task.
- hugetlb pages are not included in a task's RSS calculations. However,
the routine do_sparc64_fault() calculates the size of base TSB block
by subtracting total size of hugetlb pages from RSS. Since hugetlb
size is likely larger than RSS, a negative value is passed as an
unsigned value to the routine which allocates the TSB block. The
'negative unsigned' value appears as a really big value and results in
a maximum sized base TSB being allocated. This is the case for almost
all tasks using hugetlb pages.
THP pages are also counted in huge_pte_count[MM_PTES_HUGE]. And
unlike hugetlb pages, THP pages are included in a task's RSS.
Therefore, hugetlb and THP pages cannot both be accounted for in
huge_pte_count[MM_PTES_HUGE].
Add a new counter thp_pte_count for THP pages, and use this value for
adjusting RSS to size the base TSB.
- In order to save memory, THP makes use of a huge zero page. This huge
zero page does not count against a task's RSS, but it does consume TSB
entries. Therefore, count huge zero page entries in
huge_pte_count[MM_PTES_HUGE].
- Accounting of THP pages is done in the routine set_pmd_at().
Unfortunately, this does not catch the case where a THP page is split.
To handle this case, decrement the count in pmdp_invalidate().
pmdp_invalidate is only called when splitting a THP. However, 'sanity
checks' are added in case it is ever called for other purposes.
- huge_pte_count[MM_PTES_HUGE] tracks the number of HPAGE_SIZE (8M) pages
used by the task. This value is used to size the TSB for HPAGE_SIZE
pages. However, for each HPAGE_SIZE (8M) there are two REAL_HPAGE_SIZE
(4M) pages. The TSB contains an entry for each REAL_HPAGE_SIZE page.
Therefore, the number of REAL_HPAGE_SIZE pages used by the task should
be used to size the MM_PTES_HUGE TSB. A new compile time constant
REAL_HPAGE_PER_HPAGE is used to multiply huge_pte_count[MM_PTES_HUGE]
before sizing the TSB.
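The last point boils down to a scaling like the fragment below when growing the huge-page TSB (a sketch; the surrounding fault-path code is abbreviated):

    /* each 8M HPAGE is mapped by two 4M REAL_HPAGE TSB entries,
     * so scale the count before sizing the MM_TSB_HUGE tsb */
    tsb_grow(mm, MM_TSB_HUGE,
             mm->context.huge_pte_count[MM_PTES_HUGE] * REAL_HPAGE_PER_HPAGE);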
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Tested-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
(cherry picked from commit 417fc85e759b6d4c4602fbdbdd5375ec5ddf2cb0) Signed-off-by: Allen Pais <allen.pais@oracle.com>