Just call kmem_cache_zalloc() instead of calling kmem_cache_alloc().
We're just initializing most fields to 0, false and NULL later on
_anyway_, so to make the code mode readable and potentially gain
a bit of performance (completely untested claim), we should fill our
btrfs_trans_handle with zeros on allocation then just initialize
those five remaining fields (not counting the list_heads) as normal.
Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Alexandru Moise <00moses.alexander00@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
(cherry picked from commit f2f767e7345dfe56102d6809f647ba38a238f718) Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Conflicts:
fs/btrfs/transaction.c
Mike Snitzer [Thu, 25 Aug 2016 01:12:58 +0000 (21:12 -0400)]
dm flakey: fix reads to be issued if drop_writes configured
v4.8-rc3 commit 99f3c90d0d ("dm flakey: error READ bios during the
down_interval") overlooked the 'drop_writes' feature, which is meant to
allow reads to be issued rather than errored, during the down_interval.
Fixes: 99f3c90d0d ("dm flakey: error READ bios during the down_interval")
Orabug: 25444528
Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Aniket Alshi <aniket.alshi@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Chuck Anderson [Fri, 17 Feb 2017 04:44:51 +0000 (20:44 -0800)]
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/ofed:
Revert "RDS: Make message size limit compliant with spec"
RDS: ActiveBonding: Make its own thread for active active
RDS: correct condition check in reconnect_timeout()
RDS: ActiveBonding: Create a cluster sync point for failback
This partial revert is needed due to regression with some RDS
applications. It reverts RDS message size limit changes but retains the
QOS related changes in the data send path.
Santosh Shilimkar [Sat, 27 Aug 2016 03:19:43 +0000 (20:19 -0700)]
RDS: ActiveBonding: Make its own thread for active active
The IP related work has no relation to RDS internal workers and
the active active itself is for full HOST than just RDS. Move
all the IP specific work(s) to its own dedicated workqueue.
This will reduce queue contention and also further reduce direct
link between RDS and Active bonding and help to separate the
active active code from RDS.
Santosh Shilimkar [Tue, 8 Nov 2016 01:22:51 +0000 (17:22 -0800)]
RDS: ActiveBonding: Create a cluster sync point for failback
On hardware port linkups, at time the multi-cast joins fails
which delays the IP layer to bringup the interface quickly.
Subsequent multi-cast retry might succeed and then the IP
layer will be ready for IP migration. This happens very
sporadically on bare metal systems but more often on VM systems
and the number of multi-cast queries also goes up with number of VMs.
This create load of RC connection thrashing across the cluster
since the IP migration gets staggered which is not ideal for
active active. So we create a sync point so that entire cluster
gets synced up. This helps to reduce the thrashing and premature
failover attempts. Obviously its only applicable for failback
A user sysctl is provided "active_bonding_failback_ms"
in case there is a need to tune the sync point.
Chuck Anderson [Tue, 7 Feb 2017 05:56:11 +0000 (21:56 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
net: Documentation: Fix default value tcp_limit_output_bytes
tcp: double default TSQ output bytes limit
scsi: qla2xxx: Get mutex lock before checking optrom_state
kvm: x86: Check memopp before dereference (CVE-2016-8630)
firewire: net: guard against rx buffer overflows
USB: usbfs: fix potential infoleak in devio
usbnet: cleanup after bind() in probe()
cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind
cdc_ncm: Add support for moving NDP to end of NCM frame
x86/mm/32: Enable full randomization on i386 and X86_32
Paul Durrant [Thu, 12 May 2016 13:43:03 +0000 (14:43 +0100)]
xen-netback: fix extra_info handling in xenvif_tx_err()
Patch 562abd39 "xen-netback: support multiple extra info fragments
passed from frontend" contained a mistake which can result in an in-
correct number of responses being generated when handling errors
encountered when processing packets containing extra info fragments.
This patch fixes the problem.
Signed-off-by: Paul Durrant <paul.durrant@citrix.com> Reported-by: Jan Beulich <JBeulich@suse.com> Cc: Wei Liu <wei.liu2@citrix.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 72eec92accabe3ec34f27a9d3cd459bf5a877c33) Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Orabug: 25445336 Tested-by: Majid Valiollahzadeh <majid.valiollahzadeh@oracle.com>
Wei Liu [Wed, 3 Jun 2015 10:10:42 +0000 (11:10 +0100)]
tcp: double default TSQ output bytes limit
Xen virtual network driver has higher latency than a physical NIC.
Having only 128K as limit for TSQ introduced 30% regression in guest
throughput.
This patch raises the limit to 256K. This reduces the regression to 8%.
This buys us more time to work out a proper solution in the long run.
Signed-off-by: Wei Liu <wei.liu2@citrix.com> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c39c4c6abb89d24454b63798ccbae12b538206a5)
Oracle-Bug: 25424818 Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Tested-by: Raj Matharasi <raj.matharasi@oracle.com> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Milan P. Gandhi [Sat, 24 Dec 2016 16:32:46 +0000 (22:02 +0530)]
scsi: qla2xxx: Get mutex lock before checking optrom_state
There is a race condition with qla2xxx optrom functions where one thread
might modify optrom buffer, optrom_state while other thread is still
reading from it.
In couple of crashes, it was found that we had successfully passed the
following 'if' check where we confirm optrom_state to be
QLA_SREADING. But by the time we acquired mutex lock to proceed with
memory_read_from_buffer function, some other thread/process had already
modified that option rom buffer and optrom_state from QLA_SREADING to
QLA_SWAITING. Then we got ha->optrom_buffer 0x0 and crashed the system:
Below patch modifies qla2x00_sysfs_read_optrom,
qla2x00_sysfs_write_optrom functions to get the mutex_lock before
checking ha->optrom_state to avoid similar crashes.
The patch was applied and tested and same crashes were no longer
observed again.
Tested-by: Milan P. Gandhi <mgandhi@redhat.com> Signed-off-by: Milan P. Gandhi <mgandhi@redhat.com> Reviewed-by: Laurence Oberman <loberman@redhat.com> Acked-by: Himanshu Madhani <himanshu.madhani@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit c7702b8c22712a06080e10f1d2dee1a133ec8809) Signed-off-by: Ritika Srivastava <ritika.srivastava@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Commit 41061cdb98 ("KVM: emulate: do not initialize memopp") removes a
check for non-NULL under incorrect assumptions. An undefined instruction
with a ModR/M byte with Mod=0 and R/M-5 (e.g. 0xc7 0x15) will attempt
to dereference a null pointer here.
Fixes: 41061cdb98a0bec464278b4db8e894a3121671f5
Message-Id: <1477592752-126650-2-git-send-email-osh@google.com> Signed-off-by: Owen Hofmann <osh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit d9092f52d7e61dd1557f2db2400ddb430e85937e) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The IP-over-1394 driver firewire-net lacked input validation when
handling incoming fragmented datagrams. A maliciously formed fragment
with a respectively large datagram_offset would cause a memcpy past the
datagram buffer.
So, drop any packets carrying a fragment with offset + length larger
than datagram_size.
In addition, ensure that
- GASP header, unfragmented encapsulation header, or fragment
encapsulation header actually exists before we access it,
- the encapsulated datagram or fragment is of nonzero size.
Reported-by: Eyal Itkin <eyal.itkin@gmail.com> Reviewed-by: Eyal Itkin <eyal.itkin@gmail.com> Fixes: CVE 2016-8633 Cc: stable@vger.kernel.org Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de>
(cherry picked from commit 667121ace9dbafb368618dbabcf07901c962ddac) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
The stack object “ci” has a total size of 8 bytes. Its last 3 bytes
are padding bytes which are not initialized and leaked to userland
via “copy_to_user”.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 681fef8380eb818c0b845fca5d2ab1dcbab114ee) Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Oliver Neukum [Mon, 7 Mar 2016 10:31:10 +0000 (11:31 +0100)]
usbnet: cleanup after bind() in probe()
In case bind() works, but a later error forces bailing
in probe() in error cases work and a timer may be scheduled.
They must be killed. This fixes an error case related to
the double free reported in
http://www.spinics.net/lists/netdev/msg367669.html
and needs to go on top of Linus' fix to cdc-ncm.
Signed-off-by: Oliver Neukum <ONeukum@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1666984c8625b3db19a9abc298931d35ab7bc64b) Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Bjørn Mork [Mon, 7 Mar 2016 20:15:36 +0000 (21:15 +0100)]
cdc_ncm: do not call usbnet_link_change from cdc_ncm_bind
usbnet_link_change will call schedule_work and should be
avoided if bind is failing. Otherwise we will end up with
scheduled work referring to a netdev which has gone away.
Instead of making the call conditional, we can just defer
it to usbnet_probe, using the driver_info flag made for
this purpose.
cdc_ncm: Add support for moving NDP to end of NCM frame
NCM specs are not actually mandating a specific position in the frame for
the NDP (Network Datagram Pointer). However, some Huawei devices will
ignore our aggregates if it is not placed after the datagrams it points
to. Add support for doing just this, in a per-device configurable way.
While at it, update NCM subdrivers, disabling this functionality in all of
them, except in huawei_cdc_ncm where it is enabled instead.
We aren't making any distinction between different Huawei NCM devices,
based on what the vendor driver does. Standard NCM devices are left
unaffected: if they are compliant, they should be always usable, still
stay on the safe side.
This change has been tested and working with a Huawei E3131 device (which
works regardless of NDP position), a Huawei E3531 (also working both
ways) and an E3372 (which mandates NDP to be after indexed datagrams).
V1->V2:
- corrected wrong NDP acronym definition
- fixed possible NULL pointer dereference
- patch cleanup
V2->V3:
- Properly account for the NDP size when writing new packets to SKB
Signed-off-by: Enrico Mioso <mrkiko.rs@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4a0e3e989d66bb7204b163d9cfaa7fa96d0f2023) Signed-off-by: Kirtikar Kashyap <kirtikar.kashyap@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com>
Hector Marco-Gisbert [Thu, 10 Mar 2016 19:51:00 +0000 (20:51 +0100)]
x86/mm/32: Enable full randomization on i386 and X86_32
Currently on i386 and on X86_64 when emulating X86_32 in legacy mode, only
the stack and the executable are randomized but not other mmapped files
(libraries, vDSO, etc.). This patch enables randomization for the
libraries, vDSO and mmap requests on i386 and in X86_32 in legacy mode.
By default on i386 there are 8 bits for the randomization of the libraries,
vDSO and mmaps which only uses 1MB of VA.
This patch preserves the original randomness, using 1MB of VA out of 3GB or
4GB. We think that 1MB out of 3GB is not a big cost for having the ASLR.
The first obvious security benefit is that all objects are randomized (not
only the stack and the executable) in legacy mode which highly increases
the ASLR effectiveness, otherwise the attackers may use these
non-randomized areas. But also sensitive setuid/setgid applications are
more secure because currently, attackers can disable the randomization of
these applications by setting the ulimit stack to "unlimited". This is a
very old and widely known trick to disable the ASLR in i386 which has been
allowed for too long.
Another trick used to disable the ASLR was to set the ADDR_NO_RANDOMIZE
personality flag, but fortunately this doesn't work on setuid/setgid
applications because there is security checks which clear Security-relevant
flags.
This patch always randomizes the mmap_legacy_base address, removing the
possibility to disable the ASLR by setting the stack to "unlimited".
Mukesh Kacker [Wed, 23 Dec 2015 03:49:49 +0000 (19:49 -0800)]
ib_uverbs: Allocate pd in a lazy manner to conserve resources
For usnic devices devices where the maximum number of pd
resources are limited (usnic devices), its a waste to
allocate this resource on device initialization.
Chuck Anderson [Mon, 23 Jan 2017 22:29:42 +0000 (14:29 -0800)]
Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/drivers: (66 commits)
ib/mlx4: add msi-x allocation kernel msg logging
NVMe: reduce admin queue depth as workaround for Samsung EPIC SQ errata
nvme: Limit command retries
nvme: avoid cqe corruption when update at the same time as read
NVMe: Don't unmap controller registers on reset
net: ena: change the return type of ena_set_push_mode() to be void.
net: ena: Fix error return code in ena_device_init()
net: ena: Remove unnecessary pci_set_drvdata()
net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)
bnxt_en: Add interface to support RDMA driver.
bnxt_en: Refactor the driver registration function with firmware.
bnxt_en: Reserve RDMA resources by default.
bnxt_en: Improve completion ring allocation for VFs.
bnxt_en: Move function reset to bnxt_init_one().
bnxt_en: Enable MSIX early in bnxt_init_one().
bnxt_en: Add bnxt_set_max_func_irqs().
bnxt_en: Add PFC statistics.
bnxt_en: Implement DCBNL to support host-based DCBX.
bnxt_en: Update firmware header file to latest 1.6.0.
bnxt_en: Re-factor bnxt_setup_tc().
...
Chuck Anderson [Mon, 23 Jan 2017 22:28:05 +0000 (14:28 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
Don't feed anything but regular iovec's to blk_rq_map_user_iov
crypto: algif_hash - Only export and import on sockets with data
Qing Huang [Fri, 16 Dec 2016 00:03:58 +0000 (16:03 -0800)]
ib/mlx4: add msi-x allocation kernel msg logging
Kernel msg prints are added in the mlx4 driver when enabling msi-x
vectors during device initialization. This would help us to debug
issues when we encounter errors in this area on both bare metal and
VM.
Linus Torvalds [Wed, 7 Dec 2016 00:18:14 +0000 (16:18 -0800)]
Don't feed anything but regular iovec's to blk_rq_map_user_iov
In theory we could map other things, but there's a reason that function
is called "user_iov". Using anything else (like splice can do) just
confuses it.
Reported-and-tested-by: Johannes Thumshirn <jthumshirn@suse.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a0ac402cfcdc904f9772e1762b3fda112dcc56a0)
Orabug: 25230657
CVE: CVE-2016-9576 Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Conflicts:
block/blk-map.c
Chuck Anderson [Mon, 23 Jan 2017 07:24:30 +0000 (23:24 -0800)]
Merge branch topic/uek-4.1/sparc of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/sparc:
Revert "sparc64: struct adi_caps should use __u64, not u64"
SPARC64: ds driver: Make memory allocations ATOMIC and enhance debugging
sparc64: Add symbolic access to M7 performance counters to perf
sonoma: perf: add support for sonoma (s7) into perf
sparc64:M8 cpu recognition typo fix
sparc64: Add M7 hardware cache events into perf
sparc64: Fix the watchdog corrupting performance counters
sparc64: Fix incorrect counting when using multiple perf counters
sparc64: Fix a race condition when stopping performance counters
sparc64: Stop performance counter before updating
sparc64: enable cpu hotplug feature for UEK4
sparc64: release thirds level cache reference for cpu hotplug feature
sparc64: fix compile warning section mismatch in find_node()
sparc64: fix sun4v_build_irq NULL pointer dereference
SPARC64: ldmvsw: tx queue stuck in stopped state after LDC reset
sparc: Implement watchdog_nmi_enable and watchdog_nmi_disable
sparc64: Setup a scheduling domain for highest level cache.
David Vrabel [Fri, 9 Dec 2016 14:41:13 +0000 (14:41 +0000)]
xenbus: fix deadlock on writes to /proc/xen/xenbus
/proc/xen/xenbus does not work correctly. A read blocked waiting for
a xenstore message holds the mutex needed for atomic file position
updates. This blocks any writes on the same file handle, which can
deadlock if the write is needed to unblock the read.
Clear FMODE_ATOMIC_POS when opening this device to always get
character device like sematics.
Signed-off-by: David Vrabel <david.vrabel@citrix.com> Reviewed-by: Juergen Gross <jgross@suse.com> Signed-off-by: Juergen Gross <jgross@suse.com>
Orabug: 25425387
(cherry picked from commit 581d21a2d02a798ee34e56dbfa13f891b3a90c30)
Jira: OCC-36718 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Chuck Anderson [Mon, 23 Jan 2017 06:47:45 +0000 (22:47 -0800)]
Merge branch topic/uek-4.1/stable-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/stable-cherry-picks: (187 commits)
ext4: verify extent header depth
nfsd: check permissions when setting ACLs
posix_acl: Add set_posix_acl
sysv, ipc: fix security-layer leaking
dm: set DMF_SUSPENDED* _before_ clearing DMF_NOFLUSH_SUSPENDING
dm rq: fix the starting and stopping of blk-mq queues
dm flakey: error READ bios during the down_interval
CIFS: Fix a possible invalid memory access in smb2_query_symlink()
fs/cifs: make share unaccessible at root level mountable
Input: i8042 - break load dependency between atkbd/psmouse and i8042
module: Invalidate signatures on force-loaded modules
Documentation/module-signing.txt: Note need for version info if reusing a key
net/irda: fix NULL pointer dereference on memory allocation failure
fs/dcache.c: avoid soft-lockup in dput()
iscsi-target: Fix panic when adding second TCP connection to iSCSI session
audit: fix a double fetch in audit_log_single_execve_arg()
Fix broken audit tests for exec arg len
audit: Fix check of return value of strnlen_user()
cifs: fix crash due to race in hmac(md5) handling
dm: fix second blk_delay_queue() parameter to be in msec units not jiffies
...
Aaron Young [Wed, 23 Nov 2016 16:02:02 +0000 (11:02 -0500)]
SPARC64: ds driver: Make memory allocations ATOMIC and enhance debugging
This patch fixes the following issues:
1. BUG 25107317 - Kernel Panic: Watchdog HARD LOCKUP out of ds_cap_fini()
2. BUG 24787856 - Forward port 19811909 - Unnecessary
warning - ldom_req_sp_token
BUG 25107317 appears to be caused by the ds driver allocating memory using
the GFP_KERNEL flag (which can result in sleeping) while holding a spinlock.
This is a violation of rules and resulted in the panic.
To fix BUG 24787856, the error message in question was changed to a
printk_once() which will result in the message only appearing once
in the console log instead of repeatedly.
The debugging facility in the driver was also enhanced by adding 3 separate
debug levels for the ds driver debug messages.
Signed-off-by: Aaron Young <Aaron.Young@oracle.com> Reviewed-by: Alexandre Chartre <Alexandre.Chartre@oracle.com> Reviewed-By: Liam Merwick <Liam.Merwick@oracle.com>
Orabug: 25107317, 24787856
(cherry picked from commit f3bf272f0512120708a2966a7916b51c34efe56d) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Thu, 19 May 2016 10:54:58 +0000 (03:54 -0700)]
sparc64: Add symbolic access to M7 performance counters to perf
This commit provides symbolic access to every performance counter
provided in the M7. The 'perf list' command can be used to provide
a complete list of these new events, which will be reported as
shown below.
Br_mispred OR cpu/Br_mispred/ [Kernel PMU event]
Br_taken OR cpu/Br_taken/ [Kernel PMU event]
Br_tgt_mispred OR cpu/Br_tgt_mispred/ [Kernel PMU event]
Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com> Acked-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 39f70b2fa98ea10931133ab983f521c70cb7429f)
Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
(cherry picked from commit f39f00c4536c8c6ca0585a200a56894c2c158743) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Fri, 24 Jun 2016 13:17:25 +0000 (06:17 -0700)]
sparc64: Fix the watchdog corrupting performance counters
There is a race condition in the perf_event_grab_pmc() which
means that we do not increment the active_events count correctly
when a new event is added. Ultimately, we end up with a negative
value for the active_event count. This means that the next time
we try and add a new event the watchdog will not be stopped
correctly and corruption of the performance count will
be observed.
Note: In sparc64 land the watchdog is implemented using one
of the performance counters.
This issue is fixed by moving the mutex lock to make
sure it encompasses the whole critical section in the
perf_event_grab_pmc().
Dave Aldridge [Tue, 29 Mar 2016 10:57:14 +0000 (03:57 -0700)]
sparc64: Fix incorrect counting when using multiple perf counters
Commit 165050c1 introduced a change to the way we deal with
performance counter overflow interrupts. This change had the
side effect that when a performance counter overflow was
detected it assumed all performance counters in use
had overflowed. Thus, when using multiple performance
counters the event counting was incorrect.
This commit fixes this incorrect counting behaviour.
Dave Aldridge [Fri, 4 Nov 2016 16:56:07 +0000 (09:56 -0700)]
sparc64: Fix a race condition when stopping performance counters
When stopping a performance counter that is close to overflowing,
there is a race condition that can occur between writing to the
PCRx register to stop the counter (and also clearing the PCRx.ov
bit at the same time) vs the performance counter overflowing and
setting the PCRx.ov bit in the PCRx register.
The result of this race condition is that we occassionally miss
a performance counter overflow interrupt, which in turn leads
to incorrect event counting.
This race condition has been observed when counting cpu cycles.
To fix this issue when stopping a performance counter,
we simply allow it to continue counting and overflow before
stopping it. This allows the performance counter overflow
interrupt to be generated and acted upon.
This fix is applied for M7, T5 and T4 devices.
Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com> Signed-off-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com>
(cherry picked from commit e5b7619e1de2f3e0dd858f632bc08ce64c344245) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Fri, 4 Mar 2016 11:18:45 +0000 (03:18 -0800)]
sparc64: Stop performance counter before updating
In order to reliably clear the PCRx.ov bit when updating a
performance counter value, we need to stop it counting first.
If we do not do this, then we can miss performance counter
overflow events.
Signed-off-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit a53c94ca8afc7a7603ff3c1154d81abb113a9e71)
Reviewed-by: Chris Hyser <chris.hyser@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit c33aebff52457ee7d0bacc922dc23b07cee4139a)
Thomas Tai [Fri, 11 Nov 2016 15:46:10 +0000 (10:46 -0500)]
sparc64: fix compile warning section mismatch in find_node()
A compile warning is introduced by a commit to fix the find_node().
This patch fix the compile warning by moving find_node() into __init
section. Because find_node() is only used by memblock_nid_range() which
is only used by a __init add_node_ranges(). find_node() and
memblock_nid_range() should also be inside __init section.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
(cherry picked from commit e58d08f923190fc4dc2a1962710f84672c2bc9b2) Signed-off-by: Allen Pais <allen.pais@oracle.com>
sun4v_build_irq assume the given irq number is valid and use
it to get the handler pointer, the pointer is dereference
without being checked and cause kernel panic.
The cause of the invalid irq is that the tx/rx irq have never
been free during device removal. irq number end up exhausted during
continuous device add/removal test.
tx/rx irq is allocated during vio_device_probe() using irq_alloc()
and cookie_assign(). To free the tx/rx irq, cookie_unassign() and
irq_free() is called when the device is removed.
Signed-off-by: Thomas Tai <thomas.tai@oracle.com> Reviewed-by: Chris Hyser <chris.hyser@oracle.com>
(cherry picked from commit 80043637b8fb1eabc16ab5947019f4dcdbb8c79f) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Aaron Young [Wed, 2 Nov 2016 17:00:29 +0000 (13:00 -0400)]
SPARC64: ldmvsw: tx queue stuck in stopped state after LDC reset
The following patch fixes an issue with the ldmvsw driver where
the network connection of a guest domain becomes non-functional after
the guest domain has panic'd and rebooted.
The root cause was determined to be from the following series of
events:
1. Guest domain panics - resulting in the guest no longer processing
network packets (from ldmvsw driver)
2. The ldmvsw driver (in the control domain) eventually exerts flow
control due to no more available tx drings and stops the tx queue
for the guest domain
3. The LDC of the network connection for the guest is reset when
the guest domain reboots after the panic.
4. The LDC reset event is received by the ldmvsw driver and the ldmvsw
responds by clearing the tx queue for the guest.
5. ldmvsw waits indefinitely for a DATA ACK from the guest - which is
the normal method to re-enable the tx queue. But the ACK never comes
because the tx queue was cleared due to the LDC reset.
To fix this issue, in addition to clearing the tx queue, re-enable the
tx queue on a LDC reset. This prevents the ldmvsw from getting caught in
this deadlocked state of waiting for a DATA ACK which will never come.
Signed-off-by: Aaron Young <Aaron.Young@oracle.com> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Orabug: 24714685
(cherry picked from commit d84ad41602ceb070c05d2633bc09d81f66796e15) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Thu, 13 Oct 2016 17:36:48 +0000 (10:36 -0700)]
sparc: Implement watchdog_nmi_enable and watchdog_nmi_disable
Implement functions watchdog_nmi_enable and watchdog_nmi_disable
to enable/disable nmi watchdogs. Sparc uses arch specific nmi watchdog
handler. Currently, we do not have a way to enable/disable nmi watchdog
dynamically. With these patches we can enable or disable arch
specific nmi watchdogs using proc or sysctl interface.
Example commands.
To enable: echo 1 > /proc/sys/kernel/nmi_watchdog
To disable: echo 0 > /proc/sys/kernel/nmi_watchdog
It can also achieved using the sysctl parameter kernel.nmi_watchdog
Signed-off-by: Babu Moger <babu.moger@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 43e96774e0a338e883e9ced9e717424df126b153) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Atish Patra [Thu, 20 Oct 2016 00:33:29 +0000 (18:33 -0600)]
sparc64: Setup a scheduling domain for highest level cache.
Individual scheduler domain should consist different hierarchy
consisting of cores sharing similar property. Currently, no
scheduler domain is defined separately for the cores that shares
the last level cache. As a result, the scheduler fails to take
advantage of cache locality while migrating tasks during load
balancing.
Here are the cpu masks currently present for sparc that are/can
be used in scheduler domain construction.
cpu_core_map : set based on the cores that shares l1 cache.
core_core_sib_map : is set based on the socket id or max cache id.
The prior SPARC notion of socket was defined as highest level of
shared cache. However, the MD record on T7 platforms now describes
the CPUs that share the physical socket and this is no longer tied
to shared cache.
That's why a separate cpu mask needs to be created that truly
represent highest level of shared cache for all platforms.
Signed-off-by: Atish Patra <atish.patra@oracle.com> Reviewed-by: Chris Hyser <chris.hyser@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1e655ca52bb2727471f20cf4d8f62b4b9f69e6fc) Signed-off-by: Allen Pais <allen.pais@oracle.com>
PCIe analyzer tracing by Oracle and Samsung revealed an errata in Samsung's
firmware for EPIC SSDs where the invalid completion entries in admin queue
and IO queue can occur when the queues straddle an 8MB DMA address boundary.
This patch limits admin queue depth to 64 for EPIC SSDs.
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Keith Busch [Wed, 14 Dec 2016 22:38:35 +0000 (14:38 -0800)]
nvme: Limit command retries
Many controller implementations will return errors to commands that will
not succeed, but without the DNR bit set. The driver previously retried
these commands an unlimited number of times until the command timeout
has exceeded, which takes an unnecessarilly long period of time.
This patch limits the number of retries a command can have, defaulting
to 5, but is user tunable at load or runtime.
The struct request's 'retries' field is used to track the number of
retries attempted. This is in contrast with scsi's use of this field,
which indicates how many retries are allowed.
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@fb.com>
Orabug: 25256529
Conflicts:
Patched the commits manually due to the lack of core.c file
drivers/nvme/host/pci.c
drivers/nvme/host/nvme.h
Signed-off-by: Ashok Vairavan <ashok.vairavan@intel.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Marta Rybczynska [Sun, 18 Dec 2016 18:21:19 +0000 (10:21 -0800)]
nvme: avoid cqe corruption when update at the same time as read
Make sure the CQE phase (validity) is read before the rest of the
structure. The phase bit is the highest address and the CQE
read will happen on most platforms from lower to upper addresses
and will be done by multiple non-atomic loads. If the structure
is updated by PCI during the reads from the processor, the
processor may get a corrupted copy.
The addition of the new nvme_cqe_valid function that verifies
the validity bit also allows refactoring of the other CQE read
sequences.
Signed-off-by: Marta Rybczynska <marta.rybczynska@kalray.eu> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit d783e0bd02e700e7a893ef4fa71c69438ac1c276)
Orabug: 24960824
Conflicts:
nvme_poll() function is not available in UEK4QU2. Resolved
the conflicts around nvme poll function.
Signed-off-by: Ashok Vairavan <ashok.vairavan@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
This patch changes the return type of ena_set_push_mode() to be void,
as it always returns 0.
Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 184b49c89f39f5c5ad262a6456248284e10984c6) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Fix to return a negative error code from the invalid dma width
error handling case instead of 0.
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 6e22066fd02b675260b980b3e42b7d616a9839c5) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The driver core clears the driver data to NULL after device_release
or on probe failure. Thus, it is not needed to manually clear the
device driver data to NULL.
Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 557bc7d44d52d52374bc72e9cc3b0beb41026886) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
This is a driver for the ENA family of networking devices.
Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 1738cd3ed342294360d6a74d4e58800004bff854) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Although the extent tree depth of 5 should enough be for the worst
case of 2*32 extents of length 1, the extent tree code does not
currently to merge nodes which are less than half-full with a sibling
node, or to shrink the tree depth if possible. So it's possible, at
least in theory, for the tree depth to be greater than 5. However,
even in the worst case, a tree depth of 32 is highly unlikely, and if
the file system is maliciously corrupted, an insanely large eh_depth
can cause memory allocation failures that will trigger kernel warnings
(here, eh_depth = 65280):
Use set_posix_acl, which includes proper permission checks, instead of
calling ->set_acl directly. Without this anyone may be able to grant
themselves permissions to a file by setting the ACL.
Lock the inode to make the new checks atomic with respect to set_acl.
(Also, nfsd was the only caller of set_acl not locking the inode, so I
suspect this may fix other races.)
This also simplifies the code, and ensures our ACLs are checked by
posix_acl_valid.
The permission checks and the inode locking were lost with commit 4ac7249e, which changed nfsd to use the set_acl inode operation directly
instead of going through xattr handlers.
Reported-by: David Sinquin <david@sinquin.eu>
[agreunba@redhat.com: use set_posix_acl] Fixes: 4ac7249e Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 28a6d048eb841a6bca10558a6c9e30ec5ca2b1af) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
to receive rcu freeing function but used generic ipc_rcu_free() instead
of msg_rcu_free() which does security cleaning.
Running LTP msgsnd06 with kmemleak gives the following:
Otherwise, there is potential for both DMF_SUSPENDED* and
DMF_NOFLUSH_SUSPENDING to not be set during dm_suspend() -- which is
definitely _not_ a valid state.
This fix, in conjuction with "dm rq: fix the starting and stopping of
blk-mq queues", addresses the potential for request-based DM multipath's
__multipath_map() to see !dm_noflush_suspending() during suspend.
Reported-by: Bart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 834ced1a13bdf510eae34d08bae1f1ae49b33141) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Improve dm_stop_queue() to cancel any requeue_work. Also, have
dm_start_queue() and dm_stop_queue() clear/set the QUEUE_FLAG_STOPPED
for the blk-mq request_queue.
On suspend dm_stop_queue() handles stopping the blk-mq request_queue
BUT: even though the hw_queues are marked BLK_MQ_S_STOPPED at that point
there is still a race that is allowing block/blk-mq.c to call ->queue_rq
against a hctx that it really shouldn't. Add a check to
dm_mq_queue_rq() that guards against this rarity (albeit _not_
race-free).
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: stable@vger.kernel.org # must patch dm.c on < 4.8 kernels Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 655fe78746d0b9141fe763535fc16d6652665c13) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
When the corrupt_bio_byte feature was introduced it caused READ bios to
no longer be errored with -EIO during the down_interval. This had to do
with the complexity of needing to submit READs if the corrupt_bio_byte
feature was used.
Fix it so READ bios are properly errored with -EIO; doing so early in
flakey_map() as long as there isn't a match for the corrupt_bio_byte
feature.
During following a symbolic link we received err_buf from SMB2_open().
While the validity of SMB2 error response is checked previously
in smb2_check_message() a symbolic link payload is not checked at all.
Fix it by adding such checks.
Cc: Dan Carpenter <dan.carpenter@oracle.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit a3b180a9da61b9be52e1bcf8ff54b4cea3ce332c) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
if, when mounting //HOST/share/sub/dir/foo we can query /sub/dir/foo but
not any of the path components above:
- store the /sub/dir/foo prefix in the cifs super_block info
- in the superblock, set root dentry to the subpath dentry (instead of
the share root)
- set a flag in the superblock to remember it
- use prefixpath when building path from a dentry
fixes bso#8950
Signed-off-by: Aurelien Aptel <aaptel@suse.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilovsky@samba.org> Signed-off-by: Steve French <smfrench@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit b7e61a108f9fccd8d1b90ffe62704b929ba841eb) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
As explained in 1407814240-4275-1-git-send-email-decui@microsoft.com we
have a hard load dependency between i8042 and atkbd which prevents
keyboard from working on Gen2 Hyper-V VMs.
> hyperv_keyboard invokes serio_interrupt(), which needs a valid serio
> driver like atkbd.c. atkbd.c depends on libps2.c because it invokes
> ps2_command(). libps2.c depends on i8042.c because it invokes
> i8042_check_port_owner(). As a result, hyperv_keyboard actually
> depends on i8042.c.
>
> For a Generation 2 Hyper-V VM (meaning no i8042 device emulated), if a
> Linux VM (like Arch Linux) happens to configure CONFIG_SERIO_I8042=m
> rather than =y, atkbd.ko can't load because i8042.ko can't load(due to
> no i8042 device emulated) and finally hyperv_keyboard can't work and
> the user can't input: https://bugs.archlinux.org/task/39820
> (Ubuntu/RHEL/SUSE aren't affected since they use CONFIG_SERIO_I8042=y)
To break the dependency we move away from using i8042_check_port_owner()
and instead allow serio port owner specify a mutex that clients should use
to serialize PS/2 command stream.
Reported-by: Mark Laws <mdl@60hz.org> Tested-by: Mark Laws <mdl@60hz.org> Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit b5e8e7f655d3d01430c357bdebe72a6ddc19e9a7) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signing a module should only make it trusted by the specific kernel it
was built for, not anything else. Loading a signed module meant for a
kernel with a different ABI could have interesting effects.
Therefore, treat all signatures as invalid when a module is
force-loaded.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 6ac9857245bfb71d836f46db817b0c11e3e4bf69) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signing a module should only make it trusted by the specific kernel it
was built for, not anything else. If a module signing key is used for
multiple ABI-incompatible kernels, the modules need to include enough
version information to distinguish them.
Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Cc: stable@vger.kernel.org Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit e9071d07878c865449c5afb062e6d305c62d1e85) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
There is a double fetch problem in audit_log_single_execve_arg()
where we first check the execve(2) argumnets for any "bad" characters
which would require hex encoding and then re-fetch the arguments for
logging in the audit record[1]. Of course this leaves a window of
opportunity for an unsavory application to munge with the data.
This patch reworks things by only fetching the argument data once[2]
into a buffer where it is scanned and logged into the audit
records(s). In addition to fixing the double fetch, this patch
improves on the original code in a few other ways: better handling
of large arguments which require encoding, stricter record length
checking, and some performance improvements (completely unverified,
but we got rid of some strlen() calls, that's got to be a good
thing).
As part of the development of this patch, I've also created a basic
regression test for the audit-testsuite, the test can be tracked on
GitHub at the following link:
[1] If you pay careful attention, there is actually a triple fetch
problem due to a strnlen_user() call at the top of the function.
[2] This is a tiny white lie, we do make a call to strnlen_user()
prior to fetching the argument data. I don't like it, but due to the
way the audit record is structured we really have no choice unless we
copy the entire argument at once (which would require a rather
wasteful allocation). The good news is that with this patch the
kernel no longer relies on this strnlen_user() value for anything
beyond recording it in the log, we also update it with a trustworthy
value whenever possible.
Reported-by: Pengfei Wang <wpengfeinudt@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 634a3fc5f16470e9b78ccd7ce643305122d5ebb2) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The "fix" in commit 0b08c5e5944 ("audit: Fix check of return value of
strnlen_user()") didn't fix anything, it broke things. As reported by
Steven Rostedt:
"Yes, strnlen_user() returns 0 on fault, but if you look at what len is
set to, than you would notice that on fault len would be -1"
because we just subtracted one from the return value. So testing
against 0 doesn't test for a fault condition, it tests against a
perfectly valid empty string.
Also fix up the usual braindamage wrt using WARN_ON() inside a
conditional - make it part of the conditional and remove the explicit
unlikely() (which is already part of the WARN_ON*() logic, exactly so
that you don't have to write unreadable code.
Reported-and-tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Jan Kara <jack@suse.cz> Cc: Paul Moore <pmoore@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit a4664afa0dffd5340c61511d3da14e30bfd01517) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
strnlen_user() returns 0 when it hits fault, not -1. Fix the test in
audit_log_single_execve_arg(). Luckily this shouldn't ever happen unless
there's a kernel bug so it's mostly a cosmetic fix.
CC: Paul Moore <pmoore@redhat.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit a49b282f08d96cd73838e4e1a5ace747d432ba7d) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The secmech hmac(md5) structures are present in the TCP_Server_Info
struct and can be shared among multiple CIFS sessions. However, the
server mutex is not currently held when these structures are allocated
and used, which can lead to a kernel crashes, as in the scenario below:
Commit d548b34b062 ("dm: reduce the queue delay used in dm_request_fn
from 100ms to 10ms") always intended the value to be 10 msecs -- it
just expressed it in jiffies because earlier commit 7eaceaccab ("block:
remove per-queue plugging") did.
Signed-off-by: Tahsin Erdogan <tahsin@google.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Fixes: d548b34b062 ("dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms") Cc: stable@vger.kernel.org # 4.1+ -- stable@ backports must be applied to drivers/md/dm.c Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit abf9569225763ab83a538530454f7d280fd08e4a) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
If we encounter a filesystem error during orphan cleanup, we should stop.
Otherwise, we may end up in an infinite loop where the same inode is
processed again and again.
EXT4-fs (loop0): warning: checktime reached, running e2fsck is recommended
EXT4-fs error (device loop0): ext4_mb_generate_buddy:758: group 2, block bitmap and bg descriptor inconsistent: 6117 vs 0 free clusters
Aborting journal on device loop0-8.
EXT4-fs (loop0): Remounting filesystem read-only
EXT4-fs error (device loop0) in ext4_free_blocks:4895: Journal has aborted
EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
EXT4-fs error (device loop0) in ext4_ext_remove_space:3068: IO failure
EXT4-fs error (device loop0) in ext4_ext_truncate:4667: Journal has aborted
EXT4-fs error (device loop0) in ext4_orphan_del:2927: Journal has aborted
EXT4-fs error (device loop0) in ext4_do_update_inode:4893: Journal has aborted
EXT4-fs (loop0): Inode 16 (00000000618192a0): orphan list check failed!
[...]
EXT4-fs (loop0): Inode 16 (0000000061819748): orphan list check failed!
[...]
EXT4-fs (loop0): Inode 16 (0000000061819bf0): orphan list check failed!
[...]
See-also: c9eb13a9105 ("ext4: fix hang when processing corrupted orphaned inode list") Cc: Jan Kara <jack@suse.cz> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 881052c264a9a44481f1a29d4877478e54b4c690) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
When opening a file with O_CREAT flag, check to see if the file opened
is an existing directory.
This prevents the directory from being opened which subsequently causes
a crash when the close function for directories cifs_closedir() is called
which frees up the file->private_data memory while the file is still
listed on the open file list for the tcon.
Signed-off-by: Sachin Prabhu <sprabhu@redhat.com> Signed-off-by: Steve French <smfrench@gmail.com> CC: Stable <stable@vger.kernel.org> Reported-by: Xiaoli Feng <xifeng@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 9b01eafbc9514e71056e0a1a4714606385a431a4) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
If s_reserved_gdt_blocks is extremely large, it's possible for
ext4_init_block_bitmap(), which is called when ext4 sets up an
uninitialized block bitmap, to corrupt random kernel memory. Add the
same checks which e2fsck has --- it must never be larger than
blocksize / sizeof(__u32) --- and then add a backup check in
ext4_init_block_bitmap() in case the superblock gets modified after
the file system is mounted.
If ext4_fill_super() fails early, it's possible for ext4_evict_inode()
to call ext4_should_journal_data() before superblock options and flags
are fully set up. In that case, the iput() on the journal inode can
end up causing a BUG().
Work around this problem by reordering the tests so we only call
ext4_should_journal_data() after we know it's not the journal inode.
Fixes: 2d859db3e4 ("ext4: fix data corruption in inodes with journalled data") Fixes: 2b405bfa84 ("ext4: fix data=journal fast mount/umount hang") Cc: Jan Kara <jack@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit e19f0ec5aeb659cb7dd2bf6e4f1e842f5ad71fcf) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Commit 06bd3c36a733 (ext4: fix data exposure after a crash) uncovered a
deadlock in ext4_writepages() which was previously much harder to hit.
After this commit xfstest generic/130 reproduces the deadlock on small
filesystems.
The problem happens when ext4_do_update_inode() sets LARGE_FILE feature
and marks current inode handle as synchronous. That subsequently results
in ext4_journal_stop() called from ext4_writepages() to block waiting for
transaction commit while still holding page locks, reference to io_end,
and some prepared bio in mpd structure each of which can possibly block
transaction commit from completing and thus results in deadlock.
Fix the problem by releasing page locks, io_end reference, and
submitting prepared bio before calling ext4_journal_stop().
[ Changed to defer the call to ext4_journal_stop() only if the handle
is synchronous. --tytso ]
Reported-and-tested-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 906d6f4d9cdc8509c505f29f6146ec627fef2f06) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
An extent with lblock = 4294967295 and len = 1 will pass the
ext4_valid_extent() test:
ext4_lblk_t last = lblock + len - 1;
if (len == 0 || lblock > last)
return 0;
since last = 4294967295 + 1 - 1 = 4294967295. This would later trigger
the BUG_ON(es->es_lblk + es->es_len < es->es_lblk) in ext4_es_end().
We can simplify it by removing the - 1 altogether and changing the test
to use lblock + len <= lblock, since now if len = 0, then lblock + 0 ==
lblock and it fails, and if len > 0 then lblock + len > lblock in order
to pass (i.e. it doesn't overflow).
Fixes: 5946d0893 ("ext4: check for overlapping extents in ext4_valid_extent_entries()") Fixes: 2f974865f ("ext4: check for zero length extent explicitly") Cc: Eryu Guan <guaneryu@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit c580d82e1d5532b785e339450d43f82e5a8b4e79) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Backport of caaee6234d05a58c5b4d05e7bf766131b810a657 ("ptrace: use fsuid,
fsgid, effective creds for fs access checks") to v4.1 failed to update the
mode parameter in the mm_access() call in pagemap_read() to have one of the
new PTRACE_MODE_*CREDS flags.
Attempting to read any other process' pagemap results in a WARN()
radix_tree_iter_retry() resets slot to NULL, but it doesn't reset tags.
Then NULL slot and non-zero iter.tags passed to radix_tree_next_slot()
leading to crash:
Currently, osd_weight and osd_state fields are updated in the encoding
order. This is wrong, because an incremental map may look like e.g.
new_up_client: { osd=6, addr=... } # set osd_state and addr
new_state: { osd=6, xorstate=EXISTS } # clear osd_state
Suppose osd6's current osd_state is EXISTS (i.e. osd6 is down). After
applying new_up_client, osd_state is changed to EXISTS | UP. Carrying
on with the new_state update, we flip EXISTS and leave osd6 in a weird
"!EXISTS but UP" state. A non-existent OSD is considered down by the
mapping code
2087 for (i = 0; i < pg->pg_temp.len; i++) {
2088 if (ceph_osd_is_down(osdmap, pg->pg_temp.osds[i])) {
2089 if (ceph_can_shift_osds(pi))
2090 continue;
2091
2092 temp->osds[temp->size++] = CRUSH_ITEM_NONE;
and so requests get directed to the second OSD in the set instead of
the first, resulting in OSD-side errors like:
[WRN] : client.4239 192.168.122.21:0/2444980242 misdirected client.4239.1:2827 pg 2.5df899f2 to osd.4 not [1,4,6] in e680/680
and hung rbds on the client:
[ 493.566367] rbd: rbd0: write 400000 at 11cc00000 (0)
[ 493.566805] rbd: rbd0: result -6 xferred 400000
[ 493.567011] blk_update_request: I/O error, dev rbd0, sector 9330688
The fix is to decouple application from the decoding and:
- apply new_weight first
- apply new_state before new_up_client
- twiddle osd_state flags if marking in
- clear out some of the state if osd is destroyed
Fixes: http://tracker.ceph.com/issues/14901 Cc: stable@vger.kernel.org # 3.15+: 6dd74e44dc1d: libceph: set 'exists' flag for newly up osd Cc: stable@vger.kernel.org # 3.15+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 6831c98ce0b8a3e88db64aa224372effd0dcc694) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The size of individual keymap in drivers/tty/vt/keyboard.c is NR_KEYS,
which is currently 256, whereas number of keys/buttons in input device (and
therefor in key_down) is much larger - KEY_CNT - 768, and that can cause
out-of-bound access when we do
sym = U(key_maps[0][k]);
with large 'k'.
To fix it we should not attempt iterating beyond smaller of NR_KEYS and
KEY_CNT.
Also while at it let's switch to for_each_set_bit() instead of open-coding
it.
Fix a memory leak on probe error of the airspy usb device driver.
The problem is triggered when more than 64 usb devices register with
v4l2 of type VFL_TYPE_SDR or VFL_TYPE_SUBDEV.
The memory leak is caused by the probe function of the airspy driver
mishandeling errors and not freeing the corresponding control structures
when an error occours registering the device to v4l2 core.
A badusb device can emulate 64 of these devices, and then through
continual emulated connect/disconnect of the 65th device, cause the
kernel to run out of RAM and crash the kernel, thus causing a local DOS
vulnerability.
It's possible to isolate some freepages in a pageblock and then fail
split_free_page() due to the low watermark check. In this case, we hit
VM_BUG_ON() because the freeing scanner terminated early without a
contended lock or enough freepages.
This should never have been a VM_BUG_ON() since it's not a fatal
condition. It should have been a VM_WARN_ON() at best, or even handled
gracefully.
Regardless, we need to terminate anytime the full pageblock scan was not
done. The logic belongs in isolate_freepages_block(), so handle its
state gracefully by terminating the pageblock loop and making a note to
restart at the same pageblock next time since it was not possible to
complete the scan this time.
[rientjes@google.com: don't rescan pages in a pageblock] Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1607111244150.83138@chino.kir.corp.google.com Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1606291436300.145590@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Reported-by: Minchan Kim <minchan@kernel.org> Tested-by: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit fe071fb0d4e9fd40fe7c46c6a9f8f23d5f27e92f) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Handling the position where compaction free scanner should restart
(stored in cc->free_pfn) got more complex with commit e14c720efdd7 ("mm,
compaction: remember position within pageblock in free pages scanner").
Currently the position is updated in each loop iteration of
isolate_freepages(), although it should be enough to update it only when
breaking from the loop. There's also an extra check outside the loop
updates the position in case we have met the migration scanner.
This can be simplified if we move the test for having isolated enough
from the for-loop header next to the test for contention, and
determining the restart position only in these cases. We can reuse the
isolate_start_pfn variable for this instead of setting cc->free_pfn
directly. Outside the loop, we can simply set cc->free_pfn to current
value of isolate_start_pfn without any extra check.
Also add a VM_BUG_ON to catch possible mistake in the future, in case we
later add a new condition that terminates isolate_freepages_block()
prematurely without also considering the condition in
isolate_freepages().
Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit ca0d868322c49b0d6ee4dfaae94a28e12969552c) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The chmap ctls assigned to PCM streams are freed in the PCM disconnect
callback. However, since the disconnect callback isn't called when
the card gets freed before registering, the chmap ctls may still be
left assigned. They are eventually freed together with other ctls,
but it may cause an Oops at pcm_chmap_ctl_private_free(), as the
function refers to the assigned PCM stream, while the PCM objects have
been already freed beforehand.
The fix is to free the chmap ctls also at PCM free callback, not only
at PCM disconnect.
Right now when a new overlay inode is created, we initialize overlay
inode's ->i_mode from underlying inode ->i_mode but we retain only
file type bits (S_IFMT) and discard permission bits.
This patch changes it and retains permission bits too. This should allow
overlay to do permission checks on overlay inode itself in task context.
[SzM] It also fixes clearing suid/sgid bits on write.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay") Cc: <stable@vger.kernel.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit 31534f8fead7e0dff7ba68bd5dfcf6a9dfe908bc) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Before 4bacc9c9234c ("overlayfs: Make f_path...") file->f_path pointed to
the underlying file, hence suid/sgid removal on write worked fine.
After that patch file->f_path pointed to the overlay file, and the file
mode bits weren't copied to overlay_inode->i_mode. So the suid/sgid
removal simply stopped working.
The fix is to copy the mode bits, but then ovl_setattr() needs to clear
ATTR_MODE to avoid the BUG() in notify_change(). So do this first, then in
the next patch copy the mode.
Reported-by: Eryu Guan <eguan@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Fixes: 4bacc9c9234c ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay") Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
(cherry picked from commit cb75f65fe798bcac694f6bde299c52d31bdc8e96) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>