Chuck Anderson [Fri, 20 Jan 2017 10:12:44 +0000 (02:12 -0800)]
Merge branch topic/uek-4.1/rpm-build of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/rpm-build:
perf: build TUI by default by pulling in slang and linking it statically
configs: ol6: set with_headers, with_dtrace defaults to 0
smartpqi: enable driver in uek config files
Add the CONFIG_DEBUG_SET_MODULE_RONX option to OL6
Chuck Anderson [Fri, 20 Jan 2017 10:11:31 +0000 (02:11 -0800)]
Merge branch topic/uek-4.1/pmem of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/pmem: (222 commits)
libnvdimm, dax: record the specified alignment of a dax-device instance
libnvdimm, dax: reserve space to store labels for device-dax
libnvdimm, dax: introduce device-dax infrastructure
libnvdimm: cleanup nvdimm_namespace_common_probe(), kill 'host'
mm, dax: fix livelock, allow dax pmd mappings to become writeable
dax: fix lifetime of in-kernel dax mappings with dax_map_atomic()
tools/testing/libnvdimm: cleanup mock resource lookup
block: protect rw_page against device teardown
fix kABI breakage caused by "block: generic request_queue reference counting"
block: generic request_queue reference counting
fix kABI breakage caused by "block: use an atomic_t for mq_freeze_depth"
block: use an atomic_t for mq_freeze_depth
dax: guarantee page aligned results from bdev_direct_access()
dax: increase granularity of dax_clear_blocks() operations
pmem, dax: clean up clear_pmem()
xfs: fix recursive splice read locking with DAX
xfs: per-filesystem stats counter implementation
xfs: per-filesystem stats in sysfs
xfs: pass xfsstats structures to handlers and macros
xfs: consolidate sysfs ops
...
Chuck Anderson [Fri, 20 Jan 2017 10:10:57 +0000 (02:10 -0800)]
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/ofed:
RDS: don't commit to queue till transport connection is up
RDS: restrict socket connection reset to CAP_NET_ADMIN
xsigo: Fix crash in accessing xve proc l2 entries
xsigo: Fix race in freeing aged Forwarding table entry
xsigo: Schedule while uninterruptible
Chuck Anderson [Fri, 20 Jan 2017 10:09:31 +0000 (02:09 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks:
xfs: validate metadata LSNs against log on v5 superblocks
IB/ipoib: move back IB LL address into the hard header
net: preserve IP control block during GSO segmentation
xfs: fix broken multi-fsb buffer logging
xfs: Split default quota limits by quota type
Since the onset of v5 superblocks, the LSN of the last modification has
been included in a variety of on-disk data structures. This LSN is used
to provide log recovery ordering guarantees (e.g., to ensure an older
log recovery item is not replayed over a newer target data structure).
While this works correctly from the point a filesystem is formatted and
mounted, userspace tools have some problematic behaviors that defeat
this mechanism. For example, xfs_repair historically zeroes out the log
unconditionally (regardless of whether corruption is detected). If this
occurs, the LSN of the filesystem is reset and the log is now in a
problematic state with respect to on-disk metadata structures that might
have a larger LSN. Until either the log catches up to the highest
previously used metadata LSN or each affected data structure is modified
and written out without incident (which resets the metadata LSN), log
recovery is susceptible to filesystem corruption.
This problem is ultimately addressed and repaired in the associated
userspace tools. The kernel is still responsible to detect the problem
and notify the user that something is wrong. Check the superblock LSN at
mount time and fail the mount if it is invalid. From that point on,
trigger verifier failure on any metadata I/O where an invalid LSN is
detected. This results in a filesystem shutdown and guarantees that we
do not log metadata changes with invalid LSNs on disk. Since this is a
known issue with a known recovery path, present a warning to instruct
the user how to recover.
After the commit 9207f9d45b0a ("net: preserve IP control block
during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
That destroy the IPoIB address information cached there,
causing a severe performance regression, as better described here:
This change moves the data cached by the IPoIB driver from the
skb control lock into the IPoIB hard header, as done before
the commit 936d7de3d736 ("IPoIB: Stop lying about hard_header_len
and use skb->cb to stash LL addresses").
In order to avoid GRO issue, on packet reception, the IPoIB driver
stash into the skb a dummy pseudo header, so that the received
packets have actually a hard header matching the declared length.
To avoid changing the connected mode maximum mtu, the allocated
head buffer size is increased by the pseudo header length.
After this commit, IPoIB performances are back to pre-regression
value.
v2 -> v3: rebased
v1 -> v2: avoid changing the max mtu, increasing the head buf size
Fixes: 9207f9d45b0a ("net: preserve IP control block during GSO segmentation") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Skb_gso_segment() uses skb control block during segmentation.
This patch adds 32-bytes room for previous control block which
will be copied into all resulting segments.
This patch fixes kernel crash during fragmenting forwarded packets.
Fragmentation requires valid IP CB in skb for clearing ip options.
Also patch removes custom save/restore in ovs code, now it's redundant.
Santosh Shilimkar [Thu, 15 Dec 2016 21:21:09 +0000 (13:21 -0800)]
RDS: don't commit to queue till transport connection is up
A rouge application can flood the send queue by targeting a dead
or non-existing node. Don't commit any messages to the queue
till the legitimate connection to the peer is established.
Let application retry so that only legit connections can
get on to the send queue.
Todd Vierling [Tue, 20 Dec 2016 14:58:21 +0000 (09:58 -0500)]
configs: ol6: set with_headers, with_dtrace defaults to 0
This makes OL6 config match OL7 config for these default %defines. Later
directives conditionally set with_headers if sparc64, and with_dtrace if
sparc64|x86_64, overriding the defaults.
Samsung uses D2H command with Vendor Uniq Command (VUC) code F7
(the 0th bit of which is 1) for retrieving memory dump. In UEK4,
Bit 0 of the D2H command code has to be 0. Because of this voilation,
the nvmecli is unable to do crash and memory dumps in UEK4.
As the Samsung firmware can only understand VUC command code F7,
the IO direction is reversed for this vendor command code to
retrieve memory and crash dump.
Multi-block buffers are logged based on buffer offset in
xfs_trans_log_buf(). xfs_buf_item_log() ultimately walks each mapping in
the buffer and marks the associated range to be logged in the
xfs_buf_log_format bitmap for that mapping. This code is broken,
however, in that it marks the actual buffer offsets of the associated
range in each bitmap rather than shifting to the byte range for that
particular mapping.
For example, on a 4k fsb fs, buffer offset 4096 refers to the first byte
of the second mapping in the buffer. This means byte 0 of the second log
format bitmap should be tagged as dirty. Instead, the current code marks
byte offset 4096 of the second log format bitmap, which is invalid and
potentially out of range of the mapping.
As a result of this, the log item format code invoked at transaction
commit time is not be able to correctly identify what parts of the
buffer to copy into log vectors. This can lead to NULL log vector
pointer dereferences in CIL push context if the item format code was not
able to locate any dirty ranges at all. This crash has been reproduced
on a 4k FSB filesystem using 16k directory blocks where an unlink
operation happened not to log anything in the first block of the
mapping. The logged offsets were all over 4k, marked as such in the
subsequent log format mappings, and thus left the transaction with an
xfs_log_item that is marked DIRTY but without any logged regions.
Further, even when the logged regions are marked correctly in the buffer
log format bitmaps, the format code doesn't copy the correct ranges of
the buffer into the log. This means that any logged region beyond the
first block of a multi-block buffer is subject to corruption after a
crash and log recovery sequence. This is due to a failure to convert the
mapping bm_len field from basic blocks to bytes in the buffer offset
tracking code in xfs_buf_item_format().
Update xfs_buf_item_log() to convert buffer offsets to segment relative
offsets when logging multi-block buffers. This ensures that the modified
regions of a buffer are logged correctly and avoids the aforementioned
crash. Also update xfs_buf_item_format() to correctly track the source
offset into the buffer for the log vector formatting code. This ensures
that the correct data is copied into the log.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
Default quotas are globally set due historical reasons. IRIX only
supported user and project quotas, and default quota was only
applied to user quotas.
In Linux, when a default quota is set, all different quota types
inherits the same default value.
An user with a quota limit larger than the default quota value, will
still be limited to the default value because the group quotas also
inherits the default quotas. Unless the group which the user belongs
to have a custom quota limit set.
This patch aims to split the default quota value by quota type.
Allowing each quota type having different default values.
Default time limits are still set globally. XFS does not set a
per-user/group timer, but a single global timer. For changing this
behavior, some changes should be made in user-space tools another
bugs being fixed.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
When accessing l2tables using /proc/driver/xve/devices/
system panics if path is created and there is no associated
Transmit QP. Added a check for validating tx structure before
using it.
Jack Vogel [Fri, 13 Jan 2017 21:01:26 +0000 (13:01 -0800)]
Call i40e_client_get_params only after the instance is checked
We can avoid the minor bit of work by calling check params after we
check for the client instance, since we're about to return early in
cases where we do not have a client. But, more importantly, this
change prevents a panic from a NULL pointer when attempting to
get the client params.
In the ioaccel path, the calculation of the starting LBA for
READ(6)/WRITE(6) SCSI commands does not take into account the most
significant 5 bits of the LBA: it only uses the least significant 16
bits of the starting LBA.
Reported-by: Mahesh Rajashekhara <mahesh.rajashekhara@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e018ef572ba4ff17caa9e82d5e1b5cea0d76f903) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 699bed758b1313f97a5ac78848090e8357d12ab1) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Kevin Barnett <kevin.barnett@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Don Brace <don.brace@microsemi.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 425b490b2aa745740ea3618e1cdcc2bc37c0d996) Signed-off-by: Brian Maly <brian.maly@oracle.com>
The aacraid driver will not managage Microsemi smartpqi controllers, but
will still manage older aacraid devices.
Updated help section.
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Kevin Barnett <kevin.barnett@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e8a31ebae1669f05254430d2fced99d77c63fc10) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Some cache flush operations can take longer than the timeout value. Best
to not impose a time limit to handle all cases.
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d48f8fad1e435eff26c29e8e109c1a50c441e533) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 7d81d2b8714ec72462a99875acbf2f976402f3f1) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 4fbebf1a779d9f6890ddc1df90c497b161dfb34c) Signed-off-by: Brian Maly <brian.maly@oracle.com>
reformatted pqi_num_elements_free() to match the rest of the driver
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit df7a1fcfc4761e658b60739e2ff4cd148afcae89) Signed-off-by: Brian Maly <brian.maly@oracle.com>
the driver no longer waits for the firmware to consume
the event ack IU.
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 5e6429df9c8b3ab9e0a8d18af5248692ebe41871) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Fixes: 6c223761e 'smartpqi: initial commit of Microsemi smartpqi driver'
Fixed a bug where the driver would not free all of the
controller resources if the controller ever went offline.
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e57a1f9b2fa4326ec289f1d03c658184ed6addb8) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit ff6abb7383d2eec6c8c988ff661352e66f245686) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 14bb215d09de98a8e95fa2bb1b8f35b79672c5df) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit e58081a714275d1490e470bdaf1b5dc23043cf2a) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Removed the workaround for the transition to spanning.
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 77668f412dbcd6a9dd04c92f0b170c5b5182a5fb) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit b17f048658c9b1bc8ac1d9a54b223f740c70f8fd) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Tomas Henzl <thenzl@redhat.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit a60eec0251fe1bfc0cd549c073591e6657761158) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Reviewed-by: Scott Teel <scott.teel@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com>
This initial commit contains Microsemi's smartpqi module.
[mkp: Minor tweaks to apply to 4.9/scsi-queue]
Reviewed-by: Scott Benesh <scott.benesh@microsemi.com> Reviewed-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Kevin Barnett <kevin.barnett@microsemi.com> Signed-off-by: Don Brace <don.brace@microsemi.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Ewan D. Milne <emilne@redhat.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 6c223761eb5482dca2bd981d0a800c4aba3c9009) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Fixed a race in case of freeing an Aged Forwarding entry
When there is no path associated with a Forwarding entry and
if it ages out uVNIC driver starts aging it ,if a new path with
different Forwarding entry (different mac but same remote guid
and qpn) creates a path present code ends up deleting
that newly created path.
Added a variable del_in_progress to detect and avoid race if
forwarding table is accessed during deletion.
Added protection in one case, while accessing forwarding table entry.
Cleaned up unwanted messages
Jack Vogel [Fri, 16 Dec 2016 18:55:45 +0000 (10:55 -0800)]
Add the CONFIG_DEBUG_SET_MODULE_RONX option to OL6
Set the CONFIG_DEBUG_SET_MODULE_RONX option in the OL6 configuration,
this changes the page permissions of a loadable module, code and RO
data are set to RX, leaving off the write permission for the page,
this causes modify access to be trapped and provides enhanced
security.
Orabug: 24910950 Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Chuck Anderson [Wed, 18 Jan 2017 23:01:44 +0000 (15:01 -0800)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* topic/uek-4.1/upstream-cherry-picks: (113 commits)
packet: fix race condition in packet_set_ring
net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
ALSA: pcm : Call kill_fasync() in stream lock
netlink: Fix dump skb leak/double free
rcu: Fix soft lockup for rcu_nocb_kthread
mpi: Fix NULL ptr dereference in mpi_powm() [ver #3]
sctp: validate chunk len before actually using it
kvm: raise KVM_SOFT_MAX_VCPUS to support more vcpus
netfilter: nfnetlink: fix splat due to incorrect socket memory accounting in skbuff clones
netfilter: nfnetlink: avoid recurrent netns lookups in call_batch
netfilter: nf_tables: fix wrong destroy anonymous sets if binding fails
netfilter: nf_tables: use reverse traversal commit_list in nf_tables_abort
ixgbevf: Handle previously-freed msix_entries
PCI: pciehp: Prioritize data-link event over presence detect
PCI: pciehp: Leave power indicator on when enabling already-enabled slot
net: Fix use after free in the recvmmsg exit path
signals: avoid unnecessary taking of sighand->siglock
audit: fix a double fetch in audit_log_single_execve_arg()
KEYS: Fix short sprintf buffer in /proc/keys show function
tools/power turbostat: Replace MSR_NHM_TURBO_RATIO_LIMIT
...
Philip Pettersson [Wed, 30 Nov 2016 22:55:36 +0000 (14:55 -0800)]
packet: fix race condition in packet_set_ring
When packet_set_ring creates a ring buffer it will initialize a
struct timer_list if the packet version is TPACKET_V3. This value
can then be raced by a different thread calling setsockopt to
set the version to TPACKET_V1 before packet_set_ring has finished.
This leads to a use-after-free on a function pointer in the
struct timer_list when the socket is closed as the previously
initialized timer will not be deleted.
The bug is fixed by taking lock_sock(sk) in packet_setsockopt when
changing the packet version while also taking the lock at the start
of packet_set_ring.
Fixes: f6fb8f100b80 ("af-packet: TPACKET_V3 flexible buffer implementation.") Signed-off-by: Philip Pettersson <philip.pettersson@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 84ac7260236a49c79eede91617700174c2c19b0c)
Eric Dumazet [Fri, 2 Dec 2016 17:44:53 +0000 (09:44 -0800)]
net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
CAP_NET_ADMIN users should not be allowed to set negative
sk_sndbuf or sk_rcvbuf values, as it can lead to various memory
corruptions, crashes, OOM...
Note that before commit 82981930125a ("net: cleanups in
sock_setsockopt()"), the bug was even more serious, since SO_SNDBUF
and SO_RCVBUF were vulnerable.
This needs to be backported to all known linux kernels.
Again, many thanks to syzkaller team for discovering this gem.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b98b0bc8c431e3ceb4b26b0dfc8db509518fb290)
Takashi Iwai [Mon, 5 Dec 2016 19:16:37 +0000 (11:16 -0800)]
ALSA: pcm : Call kill_fasync() in stream lock
Currently kill_fasync() is called outside the stream lock in
snd_pcm_period_elapsed(). This is potentially racy, since the stream
may get released even during the irq handler is running. Although
snd_pcm_release_substream() calls snd_pcm_drop(), this doesn't
guarantee that the irq handler finishes, thus the kill_fasync() call
outside the stream spin lock may be invoked after the substream is
detached, as recently reported by KASAN.
As a quick workaround, move kill_fasync() call inside the stream
lock. The fasync is rarely used interface, so this shouldn't have a
big impact from the performance POV.
Ideally, we should implement some sync mechanism for the proper finish
of stream and irq handler. But this oneliner should suffice for most
cases, so far.
Herbert Xu [Mon, 16 May 2016 09:28:16 +0000 (17:28 +0800)]
netlink: Fix dump skb leak/double free
When we free cb->skb after a dump, we do it after releasing the
lock. This means that a new dump could have started in the time
being and we'll end up freeing their skb instead of ours.
This patch saves the skb and module before we unlock so we free
the right memory.
Fixes: 16b304f3404f ("netlink: Eliminate kmalloc in netlink dump operation.") Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 92964c79b357efd980812c4de5c1fd2ec8bb5520)
It turns out that the rcuos callback-offload kthread is busy processing
a very large quantity of RCU callbacks, and it is not reliquishing the
CPU while doing so. This commit therefore adds an cond_resched_rcu_qs()
within the loop to allow other tasks to run.
Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
[ paulmck: Substituted cond_resched_rcu_qs for cond_resched. ] Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
(cherry picked from commit bedc1969150d480c462cdac320fa944b694a7162)
Andrey Ryabinin [Thu, 24 Nov 2016 13:23:10 +0000 (13:23 +0000)]
mpi: Fix NULL ptr dereference in mpi_powm() [ver #3]
This fixes CVE-2016-8650.
If mpi_powm() is given a zero exponent, it wants to immediately return
either 1 or 0, depending on the modulus. However, if the result was
initalised with zero limb space, no limbs space is allocated and a
NULL-pointer exception ensues.
Fix this by allocating a minimal amount of limb space for the result when
the 0-exponent case when the result is 1 and not touching the limb space
when the result is 0.
This affects the use of RSA keys and X.509 certificates that carry them.
Andrey Konovalov reported that KASAN detected that SCTP was using a slab
beyond the boundaries. It was caused because when handling out of the
blue packets in function sctp_sf_ootb() it was checking the chunk len
only after already processing the first chunk, validating only for the
2nd and subsequent ones.
The fix is to just move the check upwards so it's also validated for the
1st chunk.
Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Reviewed-by: Xin Long <lucien.xin@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bf911e985d6bbaa328c20c3e05f4eb03de11fdd6) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
net/sctp/sm_statefuns.c
Orabug: 25142846
CVE: CVE-2016-9555 Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Despite the name, the kernel's manifest constant KVM_SOFT_MAX_VCPUS
defines the hard maximum number of vcpus that libvirt will support.
(When "--vcpus <n>" is specified on the virt-install command line,
the utility queries the kernel via ioctl(..., KVM_CAP_NR_VCPUS)
and uses the returned value as its limit. Right now, the ioctl
implementation simply returns KVM_SOFT_MAX_VCPUS.)
Signed-off-by: Dan Duval <dan.duval@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
(cherry picked from commit dc6eb6241ff41cd2e5e73bc6ebd71a754fb5333c) Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Chuck Anderson <chuck.anderson@oracle.com>
Conflicts:
arch/x86/include/asm/kvm_host.h
Pablo Neira Ayuso [Wed, 9 Dec 2015 11:09:56 +0000 (12:09 +0100)]
netfilter: nfnetlink: fix splat due to incorrect socket memory accounting in skbuff clones
If we attach the sk to the skb from nfnetlink_rcv_batch(), then
netlink_skb_destructor() will underflow the socket receive memory
counter and we get warning splat when releasing the socket.
$ cat /proc/net/netlink
sk Eth Pid Groups Rmem Wmem Dump Locks Drops Inode ffff8800ca903000 12 0 00000000 -54144 0 0 2 0 17942
^^^^^^
skb->sk was only needed to lookup for the netns, however we don't need
this anymore since 633c9a840d0b ("netfilter: nfnetlink: avoid recurrent
netns lookups in call_batch") so this patch removes this manual socket
assignment to resolve this problem.
Reported-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Tested-by: Arturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
(cherry picked from commit bd678e09dc1797bc0e2e536b6b268e7cf46e0184)
Liping Zhang [Sat, 11 Jun 2016 04:20:28 +0000 (12:20 +0800)]
netfilter: nf_tables: fix wrong destroy anonymous sets if binding fails
When we add a nft rule like follows:
# nft add rule filter test tcp dport vmap {1: jump test}
-ELOOP error will be returned, and the anonymous set will be
destroyed.
But after that, nf_tables_abort will also try to remove the
element and destroy the set, which was already destroyed and
freed.
If we add a nft wrong rule, nft_tables_abort will do the cleanup
work rightly, so nf_tables_set_destroy call here is redundant and
wrong, remove it.
Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit a02f424863610a0a7abd80c468839e59cfa4d0d8)
Xin Long [Mon, 7 Dec 2015 10:48:07 +0000 (18:48 +0800)]
netfilter: nf_tables: use reverse traversal commit_list in nf_tables_abort
When we use 'nft -f' to submit rules, it will build multiple rules into
one netlink skb to send to kernel, kernel will process them one by one.
meanwhile, it add the trans into commit_list to record every commit.
if one of them's return value is -EAGAIN, status |= NFNL_BATCH_REPLAY
will be marked. after all the process is done. it will roll back all the
commits.
now kernel use list_add_tail to add trans to commit, and use
list_for_each_entry_safe to roll back. which means the order of adding
and rollback is the same. that will cause some cases cannot work well,
even trigger call trace, like:
1. add a set into table foo [return -EAGAIN]:
commit_list = 'add set trans'
2. del foo:
commit_list = 'add set trans' -> 'del set trans' -> 'del tab trans'
then nf_tables_abort will be called to roll back:
firstly process 'add set trans':
case NFT_MSG_NEWSET:
trans->ctx.table->use--;
list_del_rcu(&nft_trans_set(trans)->list);
it will del the set from the table foo, but it has removed when del
table foo [step 2], then the kernel will panic.
the right order of rollback should be:
'del tab trans' -> 'del set trans' -> 'add set trans'.
which is opposite with commit_list order.
so fix it by rolling back commits with reverse order in nf_tables_abort.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
(cherry picked from commit a907e36d54e0ff836e55e04531be201bf6b4d8c8)
The msix_entries memory can be freed by a previous suspend or
remove, so don't crash on close when it isn't there. Also only
clear the interrupts when the interface is up, because there
aren't any when it is not up.
Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit eeffceee421b17a1d484679d738f278fbaa01384) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Ashok Raj [Sat, 19 Nov 2016 08:32:45 +0000 (00:32 -0800)]
PCI: pciehp: Prioritize data-link event over presence detect
If Slot Status indicates changes in both Data Link Layer Status and
Presence Detect, prioritize the Link status change.
When both events are observed, pciehp currently relies on the Slot Status
Presence Detect State (PDS) to agree with the Link Status Data Link Layer
Active status. The Presence Detect State, however, may be set to 1 through
out-of-band presence detect even if the link is down, which creates
conflicting events.
Since the Link Status accurately reflects the reachability of the
downstream bus, the Link Status event should take precedence over a
Presence Detect event. Skip checking the PDC status if we handled a link
event in the same handler.
Ashok Raj [Sat, 19 Nov 2016 08:32:46 +0000 (00:32 -0800)]
PCI: pciehp: Leave power indicator on when enabling already-enabled slot
If an error occurs when enabling a slot, pciehp_power_thread() turns off
the power indicator. But if the only error is that the slot was already
enabled, we should leave the power indicator on.
Return success if called to enable an already-enabled slot.
This is in the same spirit of the special handling for EEXISTS when
pciehp_configure_device() determines the slot devices already exist.
And, as Dmitry rightly assessed, that is because we can drop the
reference and then touch it when the underlying recvmsg calls return
some packets and then hit an error, which will make recvmmsg to set
sock->sk->sk_err, oops, fix it.
Reported-and-Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Kostya Serebryany <kcc@google.com> Cc: Sasha Levin <sasha.levin@oracle.com> Fixes: a2e2725541fa ("net: Introduce recvmmsg socket syscall")
http://lkml.kernel.org/r/20160122211644.GC2470@redhat.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 34b88a68f26a75e4fded796f1a49c40f82234b7d) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Waiman Long [Tue, 13 Dec 2016 01:52:45 +0000 (12:52 +1100)]
signals: avoid unnecessary taking of sighand->siglock
When running certain database workload on a high-end system with many
CPUs, it was found that spinlock contention in the sigprocmask syscalls
became a significant portion of the overall CPU cycles as shown below.
Looking further into the swapcontext function in glibc, it was found that
the function always call sigprocmask() without checking if there are
changes in the signal mask.
A check was added to the __set_current_blocked() function to avoid taking
the sighand->siglock spinlock if there is no change in the signal mask.
This will prevent unneeded spinlock contention when many threads are
trying to call sigprocmask().
With this patch applied, the spinlock contention in sigprocmask() was
gone.
Link: http://lkml.kernel.org/r/1474979209-11867-1-git-send-email-Waiman.Long@hpe.com Signed-off-by: Waiman Long <Waiman.Long@hpe.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Stas Sergeev <stsp@list.ru> Cc: Scott J Norton <scott.norton@hpe.com> Cc: Douglas Hatch <doug.hatch@hpe.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c7be96af89d4b53211862d8599b2430e8900ed92)
There is a double fetch problem in audit_log_single_execve_arg()
where we first check the execve(2) argumnets for any "bad" characters
which would require hex encoding and then re-fetch the arguments for
logging in the audit record[1]. Of course this leaves a window of
opportunity for an unsavory application to munge with the data.
This patch reworks things by only fetching the argument data once[2]
into a buffer where it is scanned and logged into the audit
records(s). In addition to fixing the double fetch, this patch
improves on the original code in a few other ways: better handling
of large arguments which require encoding, stricter record length
checking, and some performance improvements (completely unverified,
but we got rid of some strlen() calls, that's got to be a good
thing).
As part of the development of this patch, I've also created a basic
regression test for the audit-testsuite, the test can be tracked on
GitHub at the following link:
[1] If you pay careful attention, there is actually a triple fetch
problem due to a strnlen_user() call at the top of the function.
[2] This is a tiny white lie, we do make a call to strnlen_user()
prior to fetching the argument data. I don't like it, but due to the
way the audit record is structured we really have no choice unless we
copy the entire argument at once (which would require a rather
wasteful allocation). The good news is that with this patch the
kernel no longer relies on this strnlen_user() value for anything
beyond recording it in the log, we also update it with a trustworthy
value whenever possible.
Reported-by: Pengfei Wang <wpengfeinudt@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com> Signed-off-by: Dan Duval <dan.duval@oracle.com>
(cherry picked from commit 43761473c254b45883a64441dd0bc85a42f3645c)
Conflict:
kernel/auditsc.c
Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Fix a short sprintf buffer in proc_keys_show(). If the gcc stack protector
is turned on, this can cause a panic due to stack corruption.
The problem is that xbuf[] is not big enough to hold a 64-bit timeout
rendered as weeks:
(gdb) p 0xffffffffffffffffULL/(60*60*24*7)
$2 = 30500568904943
That's 14 chars plus NUL, not 11 chars plus NUL.
Expand the buffer to 16 chars.
I think the unpatched code apparently works if the stack-protector is not
enabled because on a 32-bit machine the buffer won't be overflowed and on a
64-bit machine there's a 64-bit aligned pointer at one side and an int that
isn't checked again on the other side.
Reported-by: Ondrej Kozina <okozina@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Ondrej Kozina <okozina@redhat.com>
cc: stable@vger.kernel.org Signed-off-by: James Morris <james.l.morris@oracle.com>
(cherry picked from commit 03dab869b7b239c4e013ec82aea22e181e441cfc) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Replace MSR_NHM_TURBO_RATIO_LIMIT with MSR_TURBO_RATIO_LIMIT.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ebf5926a001fd0d0d3fdfc2c39c0c8ff0c846c1a) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
When run
make -C tools DESTDIR=/my/nice/dir turbostat_install
get a message
install: cannot create regular file '/usr/bin/turbostat': Permission denied
Allow user to alter DESTDIR and PREFIX variables.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ac485cb4b3b79fb7ef979cb243ee16e1a1ffa767) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Sometimes the rc6 sysfs counter spontaneously resets,
causing turbostat prints a very large number
as it tries to calcuate % = 100 * (old - new) / interval
When we see (old > new), print ***.**% instead
of a bogus huge number.
Note that this detection is not fool-proof, as the counter
could reset several times and still result in new > old.
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 9185e988e9d5bb70b690362e84bb2e4a9d71f2c5) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit cdc57272ea0a0e952c4609b56e157e4d0ec8e956) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ec53e594c65ab099ca784d62b6f4c191e3a4d7cc) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit e8efbc80db5e824ce2382d5e65429b6b493e71e2) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit e4085d543e256aff6606ba99ed257f7c06685f3b) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Some processors use the Interrupt Response Time Limit (IRTL) MSR value
to describe the maximum IRQ response time latency for deep
package C-states. (Though others have the register, but do not use it)
Lets print it out to give insight into the cases where it is used.
IRTL begain in SNB, with PC3/PC6/PC7, and HSW added PC8/PC9/PC10.
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 5a63426e2a18775ed05b20e3bc90c68bacb1f68a) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The CPUID.SGX bit was printed, even if --debug was used
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit 8ae7225591fd15aac89769cbebb3b5ecc8b12fe5) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
MSR_CONFIG_TDP_NOMINAL:
should print all 8 bits of base_ratio (bit 0:7) 0xFF
MSR_CONFIG_TDP_LEVEL_1:
should print all 15 bits of PKG_MIN_PWR_LVL1 (bit 48:62) 0x7FFF
should print all 15 bits of PKG_MAX_PWR_LVL1 (bit 32:46) 0x7FFF
should print all 8 bits of LVL1_RATIO (bit 16:23) 0xFF
should print all 15 bits of PKG_TDP_LVL1 (bit 0:14) 0x7FFF
And the same modification to MSR_CONFIG_TDP_LEVEL_2.
MSR_TURBO_ACTIVATION_RATIO:
should print all 8 bits of MAX_NON_TURBO_RATIO (bit 0:7) 0xFF
Signed-off-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 685b535b2cdb9cdf354321f8af9ed17dcf19d19f) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008008 (...pkg-cstate-limit=0: unlimited)
should print as
MSR_NHM_SNB_PKG_CST_CFG_CTL: 0x1e008008 (...pkg-cstate-limit=8: unlimited)
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 6c34f160df82ae07bfff5b4c51f50622e9c2759e) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
turbostat already checks whether calling each cpuid leavf is legal,
and it doesn't look at the function return value,
so call the simpler gcc intrinsic __cpuid() instead of __get_cpuid().
syntax only, no functional change
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 5aea2f7f645b27635b856311dee5b775d277c686) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
SGX presence is related to a SKL power workaround,
so lets show when that is enabled.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit aa8d8cc79af16e16da04efff1c1a72b1ea4a9e7e) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The accuracy of Bzy_Mhz and Busy% depend on reading
the TSC, APERF, and MPERF close together in time.
When there is a very short measurement interval,
or a large system is profoundly idle, the changes
in APERF and MPERF may be very small.
They can be small enough that an expensive interrupt
between reading APERF and MPERF can cause the APERF/MPERF
ratio to become inaccurate, resulting in invalid
calculation and display of Bzy_MHz.
A dummy APERF read of APERF makes this problem
much more rare. Apparently this 1st systemn call
after exiting a long stretch of idle is when we
typically see expensive timer interrupts that cause
large jitter.
For the cases that dummy APERF read fails to prevent,
we compare the latency of the APERF and MPERF reads.
If they differ by more than 2x, we re-issue them.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 0102b06747c7d24e334d2b27c4b43eed693676f1) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The column "GFX%c6" show the percentage of time the GPU
is in the "render C6" state, rc6. Deep package C-states on several
systems depend on the GPU being in RC6.
This information comes from the counter
/sys/class/drm/card0/power/rc6_residency_ms,
as read before and after the measurement interval.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit fdf676e51f301d207586d9bac509b8ce055bae8a) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Under the column "GFXMHz", show a snapshot of this attribute:
/sys/class/graphics/fb0/device/drm/card0/gt_cur_freq_mhz
This is an instantaneous snapshot of what sysfs presents
at the end of the measurement interval. turbostat does
not average or otherwise perform any math on this value.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 27d47356b6dfa92042a17a0b474f08910d4c8e8f) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The new IRQ column shows how many interrupts have occurred on each CPU
during the measurement inteval. This information comes from
the difference between /proc/interrupts shapshots made before
and after the measurement interval.
The first row, the system summary, shows the sum of the IRQS
for all CPUs during that interval.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 562a2d377bb9882c49debc9e1be7127a1717e242) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
skip the open(2)/close(2) on each msr read
by keeping the /dev/cpu/*/msr files open.
The remaining read(2) is generally far fewer cycles
than the removed open(2) system call.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 36229897ba966bb0dc9e060222ff17b198252367) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 58cc30a4e6086d542302e04eea39b2a438e4c5dd) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
tools/power/x86/turbostat/turbostat.c
Turbostat --debug gconfiguration info goes to stderr.
In FORK mode, turbostat statistics go to stderr.
In PERIODIC mode, turbostat statistics go to stdout.
These defaults do not change, but an option "--out file"
will send all output above only to the specified file.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit b7d8c1483bbf6ec9d2dd76d6a1c91a38c3f6ac35) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
some tools processing turbostat output
have difficulty with items that begin with %...
Reported-by: Jacob Pan <jacob.jun.pan@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 75d2e44e60490ba1fee076a5f4dcfbdc8598e8c4) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Following changes have been made:
- changed MSR_NHM_TURBO_RATIO_LIMIT to MSR_TURBO_RATIO_LIMIT in debug print
for consistency with Developer Manual
- updated definition of bitfields in MSR_TURBO_RATIO_LIMIT and appropriate
parsing code
- added x200 to list of architectures that do not support Nahlem compatible
definition of MSR_TURBO_RATIO_LIMIT register (x200 has the register but
bits definition is custom)
- fixed typo in code that parses MSR_TURBO_RATIO_LIMIT
(logical instead of bitwise operator)
- changed MSR_TURBO_RATIO_LIMIT parsing algorithm so the print out had the
same order as implementations for other platforms
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit cbf97abaf3689652bcddc0741dc49629d1838142) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
x200 does not enable any way to programmatically obtain bus clock
speed. Bclk for the architecture has a fixed value of 100 MHz.
At the same time x200 cannot be included in has_snb_msrs since
it does not support C7 idle state.
prior to this patch, MHz values reported on this chip
were erroneously calculated using bclk of 133MHz,
causing MHz values to be reported 33% higher than actual.
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 121b48bb77187cf2ed3053e147d2e6c1e864083c) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
will sample and display statistics every interval_sec.
interval_sec used to be a whole number of seconds,
but now we accept a decimal, as small as 0.001 sec (1 ms).
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 2a0609c02e6558df6075f258af98a54a74b050ff) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
When building with gcc 6 we're getting various build warnings that just
require some trivial function declaration and call fixes:
turbostat.c: In function ‘dump_cstate_pstate_config_info’:
turbostat.c:1973:1: warning: type of ‘family’ defaults to ‘int’
dump_cstate_pstate_config_info(family, model)
turbostat.c:1973:1: warning: type of ‘model’ defaults to ‘int’
turbostat.c: In function ‘get_tdp’:
turbostat.c:2145:8: warning: type of ‘model’ defaults to ‘int’
double get_tdp(model)
turbostat.c: In function ‘perf_limit_reasons_probe’:
turbostat.c:2259:6: warning: type of ‘family’ defaults to ‘int’
void perf_limit_reasons_probe(family, model)
turbostat.c:2259:6: warning: type of ‘model’ defaults to ‘int’
Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-wbicer8n0s9qe6ql8h9x478e@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
(cherry picked from commit 1b69317d2dc80bc8e1d005e1a771c4f5bff3dabe) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
This MSR is helpful to show if P-state HW coordination
is enabled or disabled.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit f0057310b40efe9f797ff337e9464e6a6fb9d782) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 7f5c258e1ce1e0909d3195694ac79f143051e513) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 61a87ba7893a256d86c7eea6a7ab10d38ccac9b2) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
tools/power/x86/turbostat/turbostat.c
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 69807a638f91524ed75027f808cd277417ecee7a) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
MSR_PLATFORM_INFO is the new name for MSR_NHM_PLATFORM_INFO
no functional change
Signed-off-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ec0adc539b8bf59b7c00db0748671f6594b77843) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
MSR_TURBO_ACTIVATION_RATIO: 0x00000016 (MAX_NON_TURBO_RATIO=6 lock=0)
should print all 7 bits of MAX_NON_TURBO_RATIO (in decimal):
MSR_TURBO_ACTIVATION_RATIO: 0x00000016 (MAX_NON_TURBO_RATIO=22 lock=0)
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 759d2a932b82009a7039ef5567e7dcba153ce123) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
base_mhz is calculated directly from the base_ratio
reported in MSR_NHM_PLATFORM_INFO * bclk,
and bclk is discovered via MSR or cpuid.
This reduces the dependency of Bzy_MHz calculation on the TSC.
Previously, there were 4 TSC readings required in each caculation,
the raw TSC delta combined with the measurement_interval.
This also removes the "tsc_tweak" correction factor used when
TSC runs on a different base clock from the CPU's bclk.
After this change, tsc_tweak is used only for %Busy.
The end-result should be a Bzy_MHz result slightly less prone to jitter.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 21ed5574d1622118b49b0c6342acc8d27d0799be) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
On a Skylake with 1500MHz base frequency,
the TSC runs at 1512MHz.
This is because the TSC is no longer in the n*100 MHz BCLK domain,
but is now in the m*24MHz crystal clock domain. (24 MHz * 63 = 1512 MHz)
This adds error to several calculations in turbostat,
unless the TSC sample sizes are adjusted for this difference.
Note that calculations in the time domain are immune
from this issue, as the timing sub-system has already
calibrated the TSC against a known wall clock.
AVG_MHz = APERF_delta/measurement_interval
need no adjustment. APERF_delta is in the BCLK domain,
and measurement_interval is in the time domain.
TSC_MHz = TSC_delta/measurement_interval
needs no adjustment -- as we really do want to report
the actual measured TSC delta here, and measurement_interval
is in the accurate time domain.
%Busy = MPERF_delta/TSC_delta
needs adjustment to use TSC_BCLK_DOMAIN_delta.
TSC_BCLK_DOMAIN_delta = TSC_delta * base_hz / tsc_hz
Note that in this example, the "Before" Bzy_MHz
was reported as exceeding the 2200 max turbo rate.
Also, even a pinned spin loop would not be reported
as over 99% busy.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit a2b7b74945dbfe5d734eafe8aa52f9f1f8bc6931) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
KNL increments APERF and MPERF every 1024 clocks.
This is compliant with the architecture specification,
which requires that only the ratio of APERF/MPERF need be valid.
However, turbostat takes advantage of the fact that these
two MSRs increment every un-halted clock
at the actual and base frequency:
AVG_MHz = APERF_delta/measurement_interval
%Busy = MPERF_delta/TSC_delta
This quirk is needed for these calculations to also work on KNL,
which would otherwise show a value 1024x smaller than expected.
Signed-off-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit b2b34dfe4d9aa4c468fc363b3b666974783ed1f9) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Staring in Linux-4.3-rc1,
commit 6fb3143b561c ("tools/power turbostat: dump CONFIG_TDP")
touches MSR 0x648, which is not supported on IVB-Xeon.
This results in "turbostat --debug" exiting on those systems:
Remove IVB-Xeon from the list of machines supporting with that MSR.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 756357b8e4b072fd5ee86421f794e071a348802b) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Len Brown [Fri, 24 Jul 2015 14:35:23 +0000 (10:35 -0400)]
tools/power turbostat: fix typo on DRAM column in Joules-mode
< RAM_W
> RAM_J
Reported-by: Hubert Chrzaniuk <hubert.chrzaniuk@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit bd6906ed3d7a00d55c9bd368a640ef83bb487d1d) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
turbostat supports forked command when sampling cpu state. However,
the forked command is not allowed to be executed with options, otherwise
turbostat might regard these options as invalid turbostat options.
Reported-by: Chen Yu <yu.c.chen@intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit a01e72fbc41e322ed229465de8b595a7e3ec6549) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
Config TDP is a feature that allows parts to be configured
for different thermal limits after they have left the factory.
This can have an effect on the operation of the part,
particularly in determiniing...
Max Non-turbo Ratio
Turbo Activation Ratio
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit 6fb3143b561c4a7865e5513eeb02d42ef38e8173) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>
The --debug option reads a number of per-package MSRs.
Previously we explicitly read them on cpu0, but recently
turbostat changed to read them on the current "base_cpu".
Update the print-out to reflect base_cpu, rather than
the hard-coded cpu0.
Signed-off-by: Len Brown <len.brown@intel.com>
(cherry picked from commit bfae2052265cde825afaba35eb3a4d3889432734) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com>