Linus Torvalds [Thu, 13 Oct 2016 20:07:36 +0000 (13:07 -0700)]
mm: remove gup_flags FOLL_WRITE games from __get_user_pages()
This is an ancient bug that was actually attempted to be fixed once
(badly) by me eleven years ago in commit 4ceb5db9757a ("Fix
get_user_pages() race for write access") but that was then undone due to
problems on s390 by commit f33ea7f404e5 ("fix get_user_pages bug").
In the meantime, the s390 situation has long been fixed, and we can now
fix it by checking the pte_dirty() bit properly (and do it better). The
s390 dirty bit was implemented in abf09bed3cce ("s390/mm: implement
software dirty bits") which made it into v3.9. Earlier kernels will
have to look at the page state itself.
Also, the VM has become more scalable, and what used to be a purely
theoretical race back then has become easier to trigger.
To fix it, we introduce a new internal FOLL_COW flag to mark the "yes,
we already did a COW" case, rather than playing racy games with FOLL_WRITE
(which is very fundamental), and then use the pte dirty flag to validate
that the FOLL_COW flag is still valid.
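The heart of the fix is a small helper in mm/gup.c that only honours
FOLL_FORCE on a read-only pte once the COW has actually happened and left
the pte dirty (shown as in the upstream commit; backports may differ
slightly):

/*
 * FOLL_FORCE can write to even unwritable pte's, but only
 * after we've gone through a COW cycle and they are dirty.
 */
static inline bool can_follow_write_pte(pte_t pte, unsigned int flags)
{
        return pte_write(pte) ||
                ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pte_dirty(pte));
}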
Reported-and-tested-by: Phil "not Paul" Oester <kernel@linuxace.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Michal Hocko <mhocko@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 19be0eaffa3ac7d8eb6784ad9bdbc7d67ed8e619)
Orabug: 24926639
Conflicts:
include/linux/mm.h
mm/gup.c
Signed-off-by: Chuck Anderson <chuck.anderson@oracle.com>
The current NVME driver allocates an I/O queue per CPU and allocates an
IRQ per queue for the devices; because of this design, a large number of
IRQs will be allotted to the I/O queues on a large NUMA/SMP system with
multiple NVME devices installed.
This causes CPU hotplug operations to fail on the above-mentioned
systems: CPU cores can no longer be hotplugged after a
certain number of them are offlined, because the remaining online CPU cores
do not have enough IRQ vectors to accept the large number of migrated IRQs.
This patch fixes it by providing a way to reduce the NVME queue IRQs to
an acceptable number.
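A minimal sketch of the general approach, assuming the cap is exposed as a
module parameter; the parameter name and helper below are illustrative, not
necessarily what the patch uses:

/* Illustrative only: cap I/O queues (and hence IRQs) per device. */
static unsigned int max_io_queues; /* 0 = default: one queue per CPU */
module_param(max_io_queues, uint, 0444);
MODULE_PARM_DESC(max_io_queues, "Upper bound on NVME I/O queues per device");

static unsigned int nvme_nr_io_queues(void)
{
        unsigned int nr = num_online_cpus();

        if (max_io_queues && max_io_queues < nr)
                nr = max_io_queues;
        return nr;
}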
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Jens Axboe [Wed, 24 Aug 2016 21:38:01 +0000 (15:38 -0600)]
blk-mq: improve warning for running a queue on the wrong CPU
__blk_mq_run_hw_queue() currently warns if we are running the queue on a
CPU that isn't set in its mask. However, this can happen if a CPU is
being offlined, and the workqueue handling will place the work on CPU0
instead. Improve the warning so that it only triggers if the batch cpu
in the hardware queue is currently online. If it triggers for that
case, then it's indicative of a flow problem in blk-mq, so we want to
retain it for that case.
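The change boils down to qualifying the existing WARN_ON() with an online
check on the batch CPU, roughly as in the upstream commit:

        /* Warn only when the batch CPU for this hctx is still online;
         * otherwise the work was legitimately punted to another CPU. */
        WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
                cpu_online(hctx->next_cpu));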
There are several race conditions while freezing the queue.
When unfreezing the queue, there is a small window between decrementing
q->mq_freeze_depth to zero and the percpu_ref_reinit() call on
q->mq_usage_counter. If another context calls blk_mq_freeze_queue_start()
in that window, q->mq_freeze_depth is increased from zero to one and
percpu_ref_kill() is called on q->mq_usage_counter, which has already been
killed. The percpu refcount should be re-initialized before being killed again.
Also, there is a race condition while switching to percpu mode.
percpu_ref_switch_to_percpu() and percpu_ref_kill() must not be
executed at the same time as the following scenario is possible:
1. q->mq_usage_counter is initialized in atomic mode.
(atomic counter: 1)
2. After the disk registration, a process like systemd-udev starts
accessing the disk, and successfully increases the refcount
by percpu_ref_tryget_live() in blk_mq_queue_enter().
(atomic counter: 2)
3. In the final stage of initialization, q->mq_usage_counter is being
switched to percpu mode by percpu_ref_switch_to_percpu() in
blk_mq_finish_init(). But if CONFIG_PREEMPT_VOLUNTARY is enabled,
the process is rescheduled in the middle of switching when calling
wait_event() in __percpu_ref_switch_to_percpu().
(atomic counter: 2)
4. CPU hotplug handling for blk-mq calls percpu_ref_kill() to freeze
request queue. q->mq_usage_counter is decreased and marked as
DEAD. Wait until all requests have finished.
(atomic counter: 1)
5. The process rescheduled in the step 3. is resumed and finishes
all remaining work in __percpu_ref_switch_to_percpu().
A bias value is added to atomic counter of q->mq_usage_counter.
(atomic counter: PERCPU_COUNT_BIAS + 1)
6. A request issued in the step 2. is finished and q->mq_usage_counter
is decreased by blk_mq_queue_exit(). q->mq_usage_counter is DEAD,
so atomic counter is decreased and no release handler is called.
(atomic counter: PERCPU_COUNT_BIAS)
7. CPU hotplug handling in the step 4. will wait forever as
q->mq_usage_counter will never be zero.
Also, percpu_ref_reinit() and percpu_ref_kill() must not be executed
at the same time, because both functions could call
__percpu_ref_switch_to_percpu(), which adds the bias value and
initializes the percpu counter.
Fix those races by serializing with per-queue mutex.
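A sketch of that serialization, assuming a per-queue mutex named
mq_freeze_lock as in the patch description (the exact code in the patch may
differ):

void blk_mq_freeze_queue_start(struct request_queue *q)
{
        mutex_lock(&q->mq_freeze_lock);
        if (++q->mq_freeze_depth == 1)
                percpu_ref_kill(&q->mq_usage_counter);
        mutex_unlock(&q->mq_freeze_lock);
}

void blk_mq_unfreeze_queue(struct request_queue *q)
{
        mutex_lock(&q->mq_freeze_lock);
        WARN_ON_ONCE(q->mq_freeze_depth <= 0);
        if (--q->mq_freeze_depth == 0) {
                percpu_ref_reinit(&q->mq_usage_counter);
                wake_up_all(&q->mq_freeze_wq);
        }
        mutex_unlock(&q->mq_freeze_lock);
}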
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Ming Lei <tom.leiming@gmail.com>
(cherry picked from https://patchwork.kernel.org/patch/7269471/)
Signed-off-by: Shan Hai <shan.hai@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Ashok Vairavan [Mon, 10 Oct 2016 20:31:35 +0000 (13:31 -0700)]
nvme: Remove RCU namespace protection
We can't sleep with RCU read lock held, but we need to do potentially
blocking stuff to namespace queues when iterating the list. This patch
removes the RCU locking and holds a mutex instead.
To prevent deadlocks, this patch removes holding the mutex during
namespace scanning and removal. The unlocked namespace scanning is made
safe by holding a reference to the namespace being scanned.
List iteration that does IO has to be unlocked to allow error recovery.
The caller must ensure the list can not be manipulated during such an
event, so this patch adds a comment explaining this requirement to the
only function that iterates an unlocked list. All callers currently
meet this requirement, so no further changes are required.
List iterations that do not do IO can safely use the lock since it couldn't
block recovery from missing forced IO completions.
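The resulting locking discipline, as a sketch (the mutex name follows the
upstream commit; only non-blocking work may run under it):

mutex_lock(&ctrl->namespaces_mutex);
list_for_each_entry(ns, &ctrl->namespaces, list) {
        /* non-blocking work only: anything that can issue or wait on
         * IO must run on an unlocked, externally stabilized list */
}
mutex_unlock(&ctrl->namespaces_mutex);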
Reported-by: Ming Lin <mlin at kernel.org>
[fixes 0bf77e9 nvme: switch to RCU freeing the namespace]
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 32f0c4afb4363e31dad49202f1554ba591d649f2)
Ashok Vairavan [Mon, 10 Oct 2016 19:13:01 +0000 (12:13 -0700)]
NVMe: Implement namespace list scanning
The NVMe 1.1 specification provides an identify mode to return a
list of active namespaces. This is more efficient to discover which
namespace identifiers are active on a controller, providing potentially
significant improvement in scan time for controllers with sparsely
populated namespaces.
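In outline, one scan round asks for the next batch of up to 1024 active
NSIDs via Identify with CNS 0x02 (a hedged sketch: the Identify structure's
field widths vary across kernel versions, and the buffer name here is
illustrative):

struct nvme_command c = { };

c.identify.opcode = nvme_admin_identify;
c.identify.nsid = cpu_to_le32(prev_nsid); /* return active NSIDs > this */
c.identify.cns = 0x02;                    /* NVMe 1.1: active namespace ID list */
/* ns_list: page-sized buffer receiving up to 1024 __le32 NSIDs */
error = nvme_submit_sync_cmd(ctrl->admin_q, &c, ns_list, 0x1000);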
Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: add quirk for the broken Qemu Identify implementation. To be relaxed
later]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 540c801c65eb58e05e0ca38b6fd644a83d7e2b33)
dtrace: ensure new SDT info generation works on sparc64
Due to addresses on sparc64 being in the lower range of the memory map, the
calculated addresses for SDT probe points were represented with fewer hex
characters than their function base address counterparts (obtained from
objdump output). This messed up the sorting based on address values, and
resulted in all probe points being associated with the last function in the
list.
This commit ensures that all addresses are 16 hex characters long.
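The effect of the normalization, illustrated in C (the actual change lives
in the SDT info generation tooling): zero-pad every address to 16 hex
digits so that lexicographic and numeric ordering agree.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
        uint64_t probe_addr = 0x4321a8;   /* low sparc64 address: 6 digits */
        uint64_t func_addr  = 0x101c2f40; /* function base: 8 digits */

        /* "%x" would sort "4321a8" after "101c2f40" lexicographically,
         * even though it is numerically smaller -- the bug described. */
        printf("%016" PRIx64 "\n", probe_addr); /* 00000000004321a8 */
        printf("%016" PRIx64 "\n", func_addr);  /* 00000000101c2f40 */
        return 0;
}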
Currently, GRO can do unlimited recursion through the gro_receive
handlers. This was fixed for tunneling protocols by limiting tunnel GRO
to one level with encap_mark, but both VLAN and TEB still have this
problem. Thus, the kernel is vulnerable to a stack overflow if we
receive a packet composed entirely of VLAN headers.
This patch adds a recursion counter to the GRO layer to prevent stack
overflow. When a gro_receive function hits the recursion limit, GRO is
aborted for this skb and it is processed normally.
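The counter itself is tiny; the upstream change is essentially the helper
below, with every gro_receive call routed through a wrapper that aborts GRO
for the skb once the limit trips (sketch; see the upstream commit for the
full wrapper macros):

#define GRO_RECURSION_LIMIT 15

static inline int gro_recursion_inc_test(struct sk_buff *skb)
{
        return ++NAPI_GRO_CB(skb)->recursion_counter == GRO_RECURSION_LIMIT;
}

/* in each call_gro_receive()-style wrapper: */
if (unlikely(gro_recursion_inc_test(skb))) {
        NAPI_GRO_CB(skb)->flush |= 1;   /* abort GRO for this skb */
        return NULL;
}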
Fixes: 9b174d88c257 ("net: Add Transparent Ethernet Bridging GRO support.")
Fixes: 66e5133f19e9 ("vlan: Add GRO support for non hardware accelerated vlan")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
(cherry picked from commit e71f3b1fca2ae5d0ae9d9c1a02a93d52beaae322)
Root cause of this issue is the same as the one fixed by the last patch,
but this time the credits for the allocator inode and group descriptor may
not be consumed before the transaction is extended.
Every time, ocfs2_extend_trans() included a credit for the truncate log
inode, but as that inode had already been attached to the running jbd2
transaction the first time, it will not consume that credit until
jbd2_journal_restart(). Since the total credits to extend always included
the un-consumed ones, more and more un-consumed credits accumulate, until
at last jbd2_journal_restart() fails because the credit count exceeds half
of the maximum transaction credits.
The following error was caught when unlinking a large file with many extents.
Now function ocfs2_replay_truncate_records() first modifies tl_used,
then calls ocfs2_extend_trans() to extend the transaction for the gd and
alloc inodes used for freeing clusters. jbd2_journal_restart() may be
called, and it may then happen that tl_used in the truncate log has been
decreased but the clusters are not freed, which means these clusters are
lost. So we should avoid extending transactions in these two operations.
Signed-off-by: joyce.xue <xuejiufei@huawei.com>
Reviewed-by: Mark Fasheh <mfasheh@suse.de>
Acked-by: Joseph Qi <joseph.qi@huawei.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 102c2595aa193f598c0f4b1bf2037d168c80e551)
Johannes Thumshirn [Tue, 5 Apr 2016 09:50:45 +0000 (11:50 +0200)]
Revert "scsi: fix soft lockup in scsi_remove_target() on module removal"
Now that we've done a more comprehensive fix with the intermediate
target state we can remove the previous hack introduced with commit 90a88d6ef88e ("scsi: fix soft lockup in scsi_remove_target() on module
removal").
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Cc: stable@vger.kernel.org
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 305c2e71b3d733ec065cb716c76af7d554bd5571)
ashok.vairavan [Sat, 8 Oct 2016 01:47:01 +0000 (21:47 -0400)]
scsi: Add intermediate STARGET_REMOVE state to scsi_target_state
Add intermediate STARGET_REMOVE state to scsi_target_state to avoid
running into the BUG_ON() in scsi_target_reap(). The STARGET_REMOVE
state is only valid in the path from scsi_remove_target() to
scsi_target_destroy() indicating this target is going to be removed.
This re-fixes the problem introduced in commits bc3f02a795d3 ("[SCSI]
scsi_remove_target: fix softlockup regression on hot remove") and 40998193560d ("scsi: restart list search after unlock in
scsi_remove_target") in a more comprehensive way.
[mkp: Included James' fix for scsi_target_destroy()]
Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: 40998193560dab6c3ce8d25f4fa58a23e252ef38
Cc: stable@vger.kernel.org
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: James Bottomley <jejb@linux.vnet.ibm.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 24844559
Mainline commit: f05795d3d771f30a7bdc3a138bf714b06d42aa95
Conflicts:
The following changes are done to fix kabi breakage.
What appears to be happening is that something has pinned the target
so it can't go into STARGET_DEL via final release and the loop in
scsi_remove_target spins endlessly until that happens.
The fix for this soft lockup is to not keep looping over a device that
we've called remove on but which hasn't gone into DEL state. This
patch will retain a simplistic memory of the last target and not keep
looping over it.
Reported-by: Sebastian Herbszt <herbszt@gmx.de>
Tested-by: Sebastian Herbszt <herbszt@gmx.de>
Fixes: 40998193560dab6c3ce8d25f4fa58a23e252ef38
Cc: stable@vger.kernel.org
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
(cherry picked from commit 90a88d6ef88edcfc4f644dddc7eef4ea41bccf8b)
Christoph Hellwig [Mon, 19 Oct 2015 14:35:46 +0000 (16:35 +0200)]
scsi: restart list search after unlock in scsi_remove_target
When dropping a lock while iterating a list we must restart the search
as other threads could have manipulated the list under us. Without this
we can get stuck in an endless loop. This bug was introduced by commit
bc3f02a795d3 ("[SCSI] scsi_remove_target: fix softlockup regression on
hot remove").
Thanks go to Dan Williams <dan.j.williams@intel.com> for tracking all this
prior history down.
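The resulting pattern in scsi_remove_target() restarts the walk from
scratch every time the lock is dropped, roughly as in the upstream commit:

restart:
        spin_lock_irqsave(shost->host_lock, flags);
        list_for_each_entry(starget, &shost->__targets, siblings) {
                if (starget->state == STARGET_DEL)
                        continue;
                if (starget->dev.parent == dev || &starget->dev == dev) {
                        kref_get(&starget->reap_ref);
                        spin_unlock_irqrestore(shost->host_lock, flags);
                        __scsi_remove_target(starget);
                        scsi_target_reap(starget);
                        goto restart;   /* list may have changed meanwhile */
                }
        }
        spin_unlock_irqrestore(shost->host_lock, flags);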
Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Fixes: bc3f02a795d3b4faa99d37390174be2a75d091bd
Cc: stable@vger.kernel.org
Signed-off-by: James Bottomley <JBottomley@Odin.com>
(cherry picked from commit 40998193560dab6c3ce8d25f4fa58a23e252ef38)
RDS: add reconnect retry scheme for stalled connections
RDS IB connections get stalled at times, and the connections are left to
take their sweet time to reconnect. On the passive side, we wait 15 seconds
for such stalled connections, which is too slow relative to application
IO timeouts. IB connections are established in milliseconds, so we had
better drop these stuck connections early and retry.
The retry timeout is kept tunable via the reconnect_retry_ms sysctl. The
upper bound for retries is tunable via rds_sysctl_reconnect_max_retries.
A lower-IP and exponential back-off scheme was added to save on SM
queries because of races, but it doesn't do what it's intended to.
The exponential back-off scheme does a good job of backing off
for races. The code just falls back to the original scheme.
RDS: IB: suppress log prints for FLUSH_ERR/RETRY_EXC
Flush errors are normal while draining the QP, and retry exceeded
errors are normal for RC connections with finite transport retry.
No need to flood the log file. RDS in-memory trace already logs it.
Santosh Shilimkar [Sat, 27 Aug 2016 02:32:57 +0000 (19:32 -0700)]
RDS: use c_wq for all activities on a connection
RDS connection work, send work, recv work etc events have been
serialised by use of a single-threaded work queue. For loopback
connections, we created a separate thread, but only connection
work was moved onto it. This under-utilises the thread and creates
unnecessary contention between send/recv work and connection work
for loopback connections.
We move the remaining loopback work onto rds_local_wq as well, which
guarantees serialisation and delinks the loopback and non-loopback
work.
Santosh Shilimkar [Fri, 5 Aug 2016 22:02:55 +0000 (15:02 -0700)]
RDS: make the rds_{local_}wq part of rds_connection
Instead of sprinkling if/else for loopback all over the place,
let's just add c_wq as part of rds_connection. This will prevent
missed cases like commit edca33be359c ("RDS: move more queing for
loopback connections to separate queue"), commit 8502173071b6
("rds: schedule local connection activity in proper workqueue"),
or any future changes.
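A sketch of the resulting structure, assuming c_wq is pointed at either the
regular or the loopback workqueue when the connection is created; queueing
sites then become uniform:

struct rds_connection {
        /* ... */
        struct workqueue_struct *c_wq;  /* rds_wq, or rds_local_wq for loopback */
};

/* callers no longer need to test for loopback: */
queue_delayed_work(conn->c_wq, &conn->c_send_w, 0);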
Santosh Shilimkar [Fri, 5 Aug 2016 21:22:53 +0000 (14:22 -0700)]
RDS: make rds_conn_drop() take reason argument
This removes the need to modify the conn all over the place
and moves it inside rds_conn_drop(). There is actually almost
no need to carry the 'c_drop_source' information as part of
rds_connection, but since the shutdown thread wants to log this
info for debug, the field is left as-is for now.
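A minimal sketch of the reworked entry point, assuming the reason is simply
recorded in c_drop_source for the shutdown thread's debug logging:

void rds_conn_drop(struct rds_connection *conn, int reason)
{
        conn->c_drop_source = reason;   /* kept only for shutdown-thread debug */
        atomic_set(&conn->c_state, RDS_CONN_ERROR);
        queue_work(conn->c_wq, &conn->c_down_w);
}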
Santosh Shilimkar [Thu, 26 May 2016 23:22:29 +0000 (16:22 -0700)]
RDS: IB: use address change event for failover/failback
The RDS active bonding code had a few fundamental bugs which
have since been addressed, so the workaround(s) added earlier can be
removed. These workarounds were creating problems and races in the
connection management code, leading to occasional connection stalls.
Removing them makes the code behavior predictable and clean.
A few notable fixes to mention here:
- Taking care of all layers of events for ports before
marking them up/down.
- ARP cache related fixes.
- Local loopback connection hang fix
The patch almost makes the RDS active bonding failover/failback path
behave as intended from the beginning:
on failover/failback, the address change events need to
be sent, which then triggers all the necessary events to
re-establish the connection(s).
Santosh Shilimkar [Tue, 10 May 2016 05:51:11 +0000 (22:51 -0700)]
RDS: IB: drop workaround for loopback connection hangs
There is no need to modify the ARP cache directly under the
assumption that it helps to speed up the failover/failback for
loopback connections. The ARP cache is properly updated by core
IPv4 code, and one of the issues with the RDS active bonding code
and ARP has been addressed as part of
'commit 42a7becc725f ("RDS: IB: Make use of ARPOP_REQUEST instead
of ARPOP_REPLY in bonding code")'. Remove the workaround added as
part of bug 16979994 with patch "RDS: Local address resolution may
be delayed after IP has moved".
This patch validates the num_values parameter from userland during the
HIDIOCGUSAGES and HIDIOCSUSAGES commands. Previously, if the report ID was
set to HID_REPORT_ID_UNKNOWN, we would fail to validate the num_values
parameter, leading to a heap overflow.
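The added validation is essentially a bounds check before the multi-usage
copy (a sketch following the upstream commit; HID_MAX_MULTI_USAGES is the
existing array bound):

if (cmd == HIDIOCGUSAGES || cmd == HIDIOCSUSAGES) {
        if (uref_multi->num_values > HID_MAX_MULTI_USAGES ||
            uref->usage_index + uref_multi->num_values >
            field->report_count)
                goto inval;
}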
Cc: stable@vger.kernel.org Signed-off-by: Scott Bauer <sbauer@plzdonthack.me> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
(cherry picked from commit 93a2001bdfd5376c3dc2158653034c20392d15c5) Signed-off-by: Brian Maly <brian.maly@oracle.com>
1) Enable the EoIB QP property for uVNICs created on a Titan
card, using the is_eoib variable.
2) Use the qkey provided by OFOS for the EoIB QP; this QKEY comes
as part of the VNIC Install message.
3) If the Titan card has TSO enabled, then enable TSO using the
NETIF_F_TSO flag and notify the upstream network stack about it.
4) If the Titan card has the checksum offload feature, then notify the
network stack about it and set the IB_SEND_IP_CSUM bit on each WR.
5) Added some debug flags.
6) Print the Qkey information and EoIB information in proc stats.
In case of heartbeat loss due to Saturn or multicast issues, the
uVNIC driver needs to send XVE_NOTIFY_HBEAT_LOST as part of an
Operational Request to OFOS. This will enable OFOS to perform the
appropriate actions.
Konrad Rzeszutek Wilk [Mon, 3 Oct 2016 13:34:21 +0000 (09:34 -0400)]
Merge branch 'uek4/4.7-xen-backport' into topic/uek-4.1/xen
* uek4/4.7-xen-backport:
xenbus: simplify xenbus_dev_request_and_reply()
xenbus: don't bail early from xenbus_dev_request_and_reply()
xenbus: don't BUG() on user mode induced condition
xen-pciback: return proper values during BAR sizing
x86/xen: avoid m2p lookup when setting early page table entries
xen/pciback: Fix conf_space read/write overlap check.
x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()
xen/balloon: Fix declared-but-not-defined warning
xen-blkfront: fix resume issues after a migration
xen-blkfront: don't call talk_to_blkback when already connected to blkback
xen: use same main loop for counting and remapping pages
Xen: don't warn about 2-byte wchar_t in efi
xen/gntdev: reduce copy batch size to 16
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393
Hans Westgaard Ry [Fri, 30 Sep 2016 08:36:49 +0000 (10:36 +0200)]
sif: Retest last allocated entry with roundrobin allocation
The current codebase tests whether the next element is unused (freed) when
allocating round-robin. In certain cases where an element is allocated and
then deallocated immediately, it is convenient to reuse that element.
This commit introduces a new state variable 'in_error'
in the CQ state. When a CQ error event is detected,
cq->in_error is set.
This state is then checked before
* posting a req_notify_cq
* invoking any of the completion generating
workarounds.
Also, qp->last_set_state is set to ERR if a fatal QP event
is detected, to reduce further posting on that QP.
This commit reduces the risk of getting into a privileged
QP error scenario for some common cases of misbehaved
applications, and also enables the driver to terminate more
quickly if a priv.QP error has occurred.
CQ fullness is calculated via the last posted sequence number (last_seq)
minus the last completion sequence number (head_seq). Both last_seq
and head_seq are defined as u16. However, in the calculation that
verifies the CQ is not full, a wrong cast is performed.
This causes the CQ-full check to yield a false negative in the wrapped case.
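The wrap case is easy to demonstrate in isolation: when the subtraction is
promoted to int, the "used entries" count goes negative instead of wrapping
modulo 2^16.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint16_t last_seq = 10;     /* last posted, wrapped past 65535 */
        uint16_t head_seq = 65530;  /* last completed */

        int wrong = last_seq - head_seq;  /* -65520: full-check is fooled */
        uint16_t right = (uint16_t)(last_seq - head_seq); /* 16 in flight */

        printf("wrong=%d right=%u\n", wrong, right);
        return 0;
}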
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com>
When a QP is created, the HW state is zeroed, then certain fields are
initialized. After modify_qp(), other fields are set. When a QP is
handed over to HW, HW will potentially modify parts of the QP
state. The QP will eventually be transitioned into the RESET state.
From RESET, it is legitimate to resurrect the QP.
It is imperative that the QP state is equal to the state when it was
created once it has transitioned to RESET, in case it is resurrected.
For any state to RESET transition, the IBTA specification states: "QP
attributes are reset to the same values after the QP was created."
This commit re-factors create_qp() and reset_qp() so the driver
adheres to the specification.
Due to two HW bugs, SIF needs to add additional CQEs apart from
the requested N CQEs. The first issue is that the CQ cannot be full
when it is being invalidated. Hence, we need 1 extra entry.
The second issue is that HW might generate duplicate completions.
Thus, SIF needs 768 extra entries to cater for these duplicate
completions and 1 additional entry for the fence completion.
The SIF driver then rounds up to the nearest 2^n. The number of CQEs
available to the ULP/user will be the above 2^n - (768 + 1 + 1).
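The sizing arithmetic, as a sketch (the helper and macro names are
illustrative; roundup_pow_of_two() is the stock kernel helper):

#define SIF_CQE_SLACK (768 + 1 + 1)     /* duplicates + invalidate + fence */

static unsigned int sif_cq_hw_entries(unsigned int requested)
{
        return roundup_pow_of_two(requested + SIF_CQE_SLACK);
}
/* entries usable by the ULP/user: sif_cq_hw_entries(n) - SIF_CQE_SLACK */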
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
sif: qp: Clear the QP state cq_int_err bit upon reset
During reset of a QP, the cq_int_err bit was not reset.
This would cause a subsequent transition to RTR to fail
and hardware to set the QP back in error again.
sif: qp: Collapsed two log statements + removed incorrect port number print
With debug level bit 0x2000 (SIF_QP) set, the driver logs QP creation
info. Two log statements logging the same information are
collapsed. Also, incorrect logging of port number is removed for all
QPs except QP0/1, which are the only QPs that have a valid port number
at their creation time.
sif: Avoid using SIFMT_2M for allocation of any tables in no_huge_page mode
The feature mask no_huge_pages, enabled for Xen due to
DMA address alignment issues with huge pages, did not apply
to allocation of CQs, RQs, and SQs, only to the tableworks.
This causes allocation of queues of these types
larger than 4M in total size to fail on Xen PV domains
such as dom0.
Mark the duplicate completion status as PSIF_WC_STATUS_DUPL_COMPL_ERR if
the additional/duplicate completions are detected by walk_and_update_cqes.
Then translate_wr_id can identify the duplicate completion during
sif_poll_cq.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com>
The sif_fixup_cqes function, which updates the wr_id from the SQ handle, is
moved to reset_qp to cover the scenario where an IB user reuses a QP after
performing ib_modify_qp(RESET). This patch also handles a scenario in
sif_fixup_cqes where a QP has been reset multiple times but the IB user has
not polled the associated CQ completely.
Signed-off-by: Wei Lin Guay <wei.lin.guay@oracle.com> Reviewed-by: Knut Omang <knut.omang@oracle.com>
Francisco Triviño [Tue, 13 Sep 2016 15:09:40 +0000 (17:09 +0200)]
sif: eq: Implement threaded interrupt handler
The handler function (sif_intr) for a single interrupt mostly
processes completion notification events (CNEs) as long as there
are events in the queue. Sometimes the CNEs are received at a higher
rate than the handler is able to process them; it then keeps
processing events indefinitely until the queue runs full, which
leads to a fatal error, or until the watchdog triggers a kernel panic,
as shown in orabug 24657844.
This commit replaces request_irq by request_threaded_irq, which
allows the driver to specify a threaded handler (sif_intr_worker)
in addition. The original handler function (sif_intr) is called
in hard interrupt context and can return IRQ_HANDLED if the timeout
SIF_IRQ_HANDLER_TIMEOUT is not exceeded or IRQ_WAKE_THREAD otherwise.
The flag IRQF_ONESHOT is used to ensure that the interrupt is
disabled when IRQ_WAKE_THREAD is returned.
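A sketch of the split, using the names from the description (the
event-dispatch helper is hypothetical):

static irqreturn_t sif_intr(int irq, void *d)   /* hard-irq context */
{
        unsigned long deadline = jiffies + SIF_IRQ_HANDLER_TIMEOUT;

        while (sif_dispatch_one_cne(d)) {       /* hypothetical helper */
                if (time_after(jiffies, deadline))
                        return IRQ_WAKE_THREAD; /* hand off to the thread */
        }
        return IRQ_HANDLED;
}

/* IRQF_ONESHOT keeps the line masked until sif_intr_worker finishes. */
err = request_threaded_irq(irq, sif_intr, sif_intr_worker,
                           IRQF_ONESHOT, name, eq);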
The feature check_all_eqs_on_intr is no longer needed. This commit
removes this driver feature and hence simplifies the interrupt
handler implementation.
In ocfs2_lock_refcount_tree, if ocfs2_read_refcount_block() returns error,
we do ocfs2_refcount_tree_put twice (once in ocfs2_unlock_refcount_tree
and once outside it), thereby reducing the refcount of the refcount tree
twice, but we don't delete the tree in this case. This will make the refcnt
of the tree = 0 and the ocfs2_refcount_tree_put will eventually call
ocfs2_mark_lockres_freeing, setting OCFS2_LOCK_FREEING for the
refcount_tree->rf_lockres.
The error returned by ocfs2_read_refcount_block is propagated all the way
back and for next iteration of write, ocfs2_lock_refcount_tree gets the
same tree back from ocfs2_get_refcount_tree because we haven't deleted the
tree. Now we have the same tree, but OCFS2_LOCK_FREEING is set for
rf_lockres and eventually, when _ocfs2_lock_refcount_tree is called in
this iteration, BUG_ON( __ocfs2_cluster_lock:1395 ERROR: Cluster lock
called on freeing lockres T00000000000000000386019775b08d! flags 0x81) is
triggered.
Call stack:
(loop16,11155,0):ocfs2_lock_refcount_tree:482 ERROR: status = -5
(loop16,11155,0):ocfs2_refcount_cow_hunk:3497 ERROR: status = -5
(loop16,11155,0):ocfs2_refcount_cow:3560 ERROR: status = -5
(loop16,11155,0):ocfs2_prepare_inode_for_refcount:2111 ERROR: status = -5
(loop16,11155,0):ocfs2_prepare_inode_for_write:2190 ERROR: status = -5
(loop16,11155,0):ocfs2_file_write_iter:2331 ERROR: status = -5
(loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: bug expression:
lockres->l_flags & OCFS2_LOCK_FREEING
(loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on
freeing lockres T00000000000000000386019775b08d! flags 0x81
------------[ cut here ]------------
kernel BUG at fs/ocfs2/dlmglue.c:1395!
Sumit Saxena [Thu, 10 Mar 2016 10:14:37 +0000 (02:14 -0800)]
megaraid_sas: Don't issue kill adapter for MFI controllers in case of PD list DCMD failure
There are a few MFI adapters which do not support MR_DCMD_PD_LIST_QUERY, so
if an MFI adapter fails this DCMD it should not be considered FATAL and the
driver should not issue a kill adapter. Instead, the per-controller instance
variable pd_list_not_supported is set so that the same variable can be used
inside the slave_alloc and slave_configure functions to allow firmware scan.
Killing the adapter because of a DCMD failure when the DCMD is not supported
causes the driver's probe to fail. This issue was introduced by
commit 6d40afbc7d13 ("megaraid_sas: MFI IO timeout handling").
Killing the adapter in case of this DCMD failure should be limited to Fusion
adapters only. The per-controller instance variable allow_fw_scan is
removed, as pd_list_not_supported better reflects the purpose.
Fixes: 6d40afbc7d13359b30a5cd783e3db6ebefa5f40a
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Reviewed-by: Ewan Milne <emilne@redhat.com>
Reviewed-by: Tomas Henzl <thenzl@redhat.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 3084558658c2f9a48d7c460d57aeb30964c06b7e)
The dummy ruleset I used to test the original validation change was broken;
most rules were unreachable and were not tested by mark_source_chains().
In some cases rulesets that used to load in a few seconds now require
several minutes.
sample ruleset that shows the behaviour:
echo "*filter"
for i in $(seq 0 100000);do
printf ":chain_%06x - [0:0]\n" $i
done
for i in $(seq 0 100000);do
printf -- "-A INPUT -j chain_%06x\n" $i
printf -- "-A INPUT -j chain_%06x\n" $i
printf -- "-A INPUT -j chain_%06x\n" $i
done
echo COMMIT
[ pipe result into iptables-restore ]
This ruleset will be about 74mbyte in size, with ~500k searches
through all 500k[1] rule entries. iptables-restore will take forever
(gave up after 10 minutes)
Instead of always searching the entire blob for a match, fill an
array with the start offsets of every single ipt_entry struct,
then do a binary search to check if the jump target is present or not.
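In outline: record each rule's start offset during the single walk over the
blob, then binary-search that sorted array for every jump target. A
self-contained sketch of the idea (not the literal kernel code):

#include <stdbool.h>
#include <stdlib.h>

static int cmp_offset(const void *a, const void *b)
{
        unsigned int x = *(const unsigned int *)a;
        unsigned int y = *(const unsigned int *)b;

        return (x > y) - (x < y);
}

/* offsets[] holds the start offset of every ipt_entry, ascending */
static bool jump_target_valid(const unsigned int *offsets, size_t count,
                              unsigned int target)
{
        return bsearch(&target, offsets, count, sizeof(*offsets),
                       cmp_offset) != NULL;
}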
After this change ruleset restore times get again close to what one
gets when reverting 36472341017529e (~3 seconds on my workstation).
[1] every user-defined rule gets an implicit RETURN, so we get
300k jumps + 100k userchains + 100k returns -> 500k rule entries
Fixes: 36472341017529e ("netfilter: x_tables: validate targets of jumps")
Reported-by: Jeff Wu <wujiafu@gmail.com>
Tested-by: Jeff Wu <wujiafu@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 2686f12b26e217befd88357cf84e78d0ab3c86a1) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Make sure the table names via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.
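One way to guarantee the terminator, shown purely as an illustration (the
identifiers here are generic, not the literal patch):

/* illustration: nul-terminate a fixed-size table name before copyout */
char name[XT_TABLE_MAXNAMELEN];

strncpy(name, t->name, sizeof(name));
name[sizeof(name) - 1] = '\0';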
Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit b301f2538759933cf9ff1f7c4f968da72e3f0757) Signed-off-by: Brian Maly <brian.maly@oracle.com>
This looks like refactoring, but it's also a bug fix.
The problem is that the compat path (32bit iptables, 64bit kernel) lacks a few
sanity tests that are done in the normal path.
For example, we do not check for underflows and the base chain policies.
While it's possible to also add such checks to the compat path, it's more
copy & paste work; for instance, we cannot reuse the check_underflow() helper
as e->target_offset differs in the compat case.
The other problem is that it makes auditing for validation errors harder; two
places need to be checked and kept in sync.
At a high level 32 bit compat works like this:
1- initial pass over blob:
validate match/entry offsets, bounds checking
lookup all matches and targets
do bookkeeping wrt. size delta of 32/64bit structures
assign match/target.u.kernel pointer (points at kernel
implementation, needed to access ->compatsize etc.)
2- allocate memory according to the total bookkeeping size to
contain the translated ruleset
3- second pass over original blob:
for each entry, copy the 32bit representation to the newly allocated
memory. This also does any special match translations (e.g.
adjust 32bit to 64bit longs, etc).
4- check if ruleset is free of loops (chase all jumps)
5- first pass over translated blob:
call the checkentry function of all matches and targets.
The alternative implemented by this patch is to drop steps 3&4 from the
compat process; the translation is changed into an intermediate step
rather than a full 1:1 translate_table replacement.
In the 2nd pass (step #3), change the 64bit ruleset back to a kernel
representation, i.e. put() the kernel pointer and restore ->u.user.name.
This gets us a 64bit ruleset that is in the format generated by a 64bit
iptables userspace -- we can then use translate_table() to get the
'native' sanity checks.
This has two drawbacks:
1. we re-validate all the match and target entry structure sizes even
though compat translation is supposed to never generate bogus offsets.
2. we put and then re-lookup each match and target.
The upside is that we get all sanity tests and ruleset validations
provided by the normal path and can remove some duplicated compat code.
iptables-restore time of autogenerated ruleset with 300k chains of form
-A CHAIN0001 -m limit --limit 1/s -j CHAIN0002
-A CHAIN0002 -m limit --limit 1/s -j CHAIN0003
shows no noticeable differences in restore times:
old: 0m30.796s
new: 0m31.521s
64bit: 0m25.674s
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit af815d264b7ed1cdceb0e1fdf4fa7d3bd5fe9a99) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Quoting John Stultz:
In updating a 32bit arm device from 4.6 to Linus' current HEAD, I
noticed I was having some trouble with networking, and realized that
/proc/net/ip_tables_names was suddenly empty.
Digging through the registration process, it seems we're catching on the:
Validate that all matches (if any) add up to the beginning of
the target and that each match covers at least the base structure size.
The compat path should be able to safely re-use the function
as the structures only differ in alignment; added a
BUILD_BUG_ON just in case we have an arch that adds padding as well.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit a605e7476c66a13312189026e6977bad6ed3050d) Signed-off-by: Brian Maly <brian.maly@oracle.com>
We're currently asserting that targetoff + targetsize <= nextoff.
Extend it to also check that targetoff is >= sizeof(xt_entry).
Since this is generic code, add an argument pointing to the start of the
match/target, we can then derive the base structure size from the delta.
We also need the e->elems pointer in a followup change to validate matches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 451e4403bc4abc51539376d4314baa739ab9e996) Signed-off-by: Brian Maly <brian.maly@oracle.com>
We have targets and standard targets -- the latter carries a verdict.
The ip/ip6tables validation functions will access t->verdict for the
standard targets to fetch the jump offset or verdict for chainloop
detection, but this happens before the targets get checked/validated.
Thus we also need to check for verdict presence here, else t->verdict
can point right after a blob.
Spotted with UBSAN while testing malformed blobs.
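The added check boils down to requiring room for a full standard target
before t->verdict is dereferenced (a sketch; the upstream helper also
covers the compat layout):

/* XT_STANDARD_TARGET is the builtin target with the empty name */
if (strcmp(t->u.user.name, XT_STANDARD_TARGET) == 0 &&
    target_offset + sizeof(struct xt_standard_target) > next_offset)
        return -EINVAL; /* t->verdict would point past the blob */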
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 73bfda1c492bef7038a87adfa887b7e6b7cd6679) Signed-off-by: Brian Maly <brian.maly@oracle.com>
32bit rulesets have different layout and alignment requirements, so once
more integrity checks get added to xt_check_entry_offsets it will reject
well-formed 32bit rulesets.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit acbcf85306bd563910a2afe07f07d30381b031b0) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Once we add more sanity testing to xt_check_entry_offsets it
becomes relevant if we're expecting a 32bit 'config_compat' blob
or a normal one.
Since we already have a lot of similar-named functions (check_entry,
compat_check_entry, find_and_check_entry, etc.) and the current
incarnation is short, just fold its contents into the callers.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 801cd32774d12dccfcfc0c22b0b26d84ed995c6f) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Currently arp/ip and ip6tables each implement a short helper to check that
the target offset is large enough to hold one xt_entry_target struct and
that t->u.target_size fits within the current rule.
Unfortunately these checks are not sufficient.
To avoid adding new tests to all of ip/ip6/arptables move the current
checks into a helper, then extend this helper in followup patches.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit a471ac817cf0e0d6e87779ca1fee216ba849e613) Signed-off-by: Brian Maly <brian.maly@oracle.com>
In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.
Problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so it's supposed to return
an absolute verdict of either ACCEPT or DROP.
However, the function conditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.
An unconditional rule, however, must also not be using any matches
(no -m args).
The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for the presence of matches, and thus
proceeded to the next (non-existent) rule.
Unify this so that all the callers have the same idea of an
'unconditional rule'.
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit 850c377e0e2d76723884d610ff40827d26aa21eb) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/ipv4/netfilter/arp_tables.c
In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
is possible for a user-supplied ipt_entry structure to have a large
next_offset field. This field is not bounds checked prior to writing a
counter value at the supplied offset.
Base chains enforce an absolute verdict.
User-defined chains are supposed to end with an unconditional return;
xtables userspace adds them automatically.
But if such a return is missing, we will move to a non-existent next rule.
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
(cherry picked from commit cf756388f8f34e02a338356b3685c46938139871) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Ben Hawkes says:
integer overflow in xt_alloc_table_info, which on 32-bit systems can
lead to small structure allocation and a copy_from_user based heap
corruption.
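The fix is a classic post-addition wrap check in xt_alloc_table_info(),
roughly as upstream:

struct xt_table_info *xt_alloc_table_info(unsigned int size)
{
        struct xt_table_info *info = NULL;
        size_t sz = sizeof(*info) + size;

        if (sz < sizeof(*info)) /* the addition wrapped on 32-bit */
                return NULL;
        /* ... proceed with the allocation ... */
}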
Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Otherwise this function may read data beyond the ruleset blob.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit 6e94e0cfb0887e4013b3b930fa6ab1fe6bb6ba91) Signed-off-by: Brian Maly <brian.maly@oracle.com>
The root cause is that CPU_UP_PREPARE is completely the wrong notifier
action from which to access cpu_data(), because smp_store_cpu_info()
won't have been executed by the target CPU at that point, which in turn
means that ->x86_cache_max_rmid and ->x86_cache_occ_scale haven't been
filled out.
Instead let's invoke our handler from CPU_STARTING and rename it
appropriately.
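In the hotplug notifier this is just a matter of which action the (renamed)
handler hangs off; a sketch:

switch (action & ~CPU_TASKS_FROZEN) {
case CPU_STARTING:
        /* runs on the incoming CPU after smp_store_cpu_info(), so
         * cpu_data(cpu) is fully populated by now */
        intel_cqm_cpu_starting(cpu);
        break;
/* ... */
}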
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@intel.com>
Link: http://lkml.kernel.org/r/1438863163-14083-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Orabug: 24745516
(cherry picked from commit d7a702f0b1033cf402fef65bd6395072738f0844) Acked-by: Chuck Anderson <chuck.anderson@oracle.com>