'conn_info' is malloced in qedi_iscsi_update_conn() and should be freed
before leaving via the error handling paths, otherwise it will cause a
memory leak.
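The shape of the fix, as a hedged sketch (the update step is condensed
into a hypothetical qedi_do_update() placeholder):

static int qedi_iscsi_update_conn(struct qedi_ctx *qedi,
				  struct qedi_conn *qedi_conn)
{
	struct qed_iscsi_params_update *conn_info;
	int rval;

	conn_info = kzalloc(sizeof(*conn_info), GFP_KERNEL);
	if (!conn_info)
		return -ENOMEM;

	rval = qedi_do_update(qedi, qedi_conn, conn_info); /* placeholder */
	if (rval) {
		kfree(conn_info);	/* previously leaked on this path */
		return -EINVAL;
	}

	kfree(conn_info);
	return 0;
}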
Fixes: ace7f46ba5fd ("scsi: qedi: Add QLogic FastLinQ offload iSCSI driver framework.") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Acked-by: Manish Rangankar <Manish.Rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Although on most systems va_end is a no-op, it is good practice to use
va_end on the function return path, especially since the va_start
documentation states:
"Each invocation of va_start() must be matched by a corresponding
invocation of va_end() in the same function."
Found with static analysis by CoverityScan, CIDs 1389477-1389479
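A minimal user-space illustration of the pairing rule (not the driver
code):

#include <stdarg.h>

static int sum_positive(int count, ...)
{
	va_list ap;
	int i, total = 0;

	va_start(ap, count);
	for (i = 0; i < count; i++) {
		int v = va_arg(ap, int);

		if (v < 0) {
			va_end(ap);	/* early return path needs it too */
			return -1;
		}
		total += v;
	}
	va_end(ap);			/* normal return path */
	return total;
}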
Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Manish Rangankar <manish.rangankar@cavium.com> Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: QLogic-Storage-Upstream@cavium.com Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Brian Maly <brian.maly@oracle.com>
Log a message when we enter this situation:
1) we already allocated the max number of available grants from hypervisor
and
2) we still need more (but the request fails because of 1)).
Sometimes the lack of grants causes IO hangs in xen_blkfront devices.
Adding this log would help debugging.
Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
We call skb_cow_data, which is good anyway to ensure we can actually
modify the skb (fixing another pre-existing error). Now that we have the
number of fragments required, we can safely allocate exactly that amount
of memory.
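A hedged sketch of the pattern (helper name hypothetical, error handling
condensed):

static int count_and_map_sg(struct sk_buff *skb, struct scatterlist **psg)
{
	struct sk_buff *trailer;
	struct scatterlist *sg;
	int nsg;

	/* makes the skb writable and returns its fragment count,
	 * including any frag_list chain */
	nsg = skb_cow_data(skb, 0, &trailer);
	if (nsg < 0)
		return nsg;

	/* allocate exactly the required entries, instead of a fixed
	 * MAX_SKB_FRAGS + 1 array */
	sg = kmalloc_array(nsg, sizeof(*sg), GFP_ATOMIC);
	if (!sg)
		return -ENOMEM;

	sg_init_table(sg, nsg);
	skb_to_sgvec(skb, sg, 0, skb->len);
	*psg = sg;
	return nsg;
}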
Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5294b83086cc1c35b4efeca03644cf9d12282e5b) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/macsec.c
While this may appear as a humdrum one line change, it's actually quite
important. An sk_buff stores data in three places:
1. A linear chunk of allocated memory in skb->data. This is the easiest
one to work with, but it precludes using scatterdata since the memory
must be linear.
2. The array skb_shinfo(skb)->frags, which is of maximum length
MAX_SKB_FRAGS. This is nice for scattergather, since these fragments
can point to different pages.
3. skb_shinfo(skb)->frag_list, which is a pointer to another sk_buff,
which in turn can have data in either (1) or (2).
The first two are rather easy to deal with, since they're of a fixed
maximum length, while the third one is not, since there can be
potentially limitless chains of fragments. Fortunately dealing with
frag_list is opt-in for drivers, so drivers don't actually have to deal
with this mess. For whatever reason, macsec decided it wanted pain, and
so it explicitly specified NETIF_F_FRAGLIST.
Because dealing with (1), (2), and (3) is insane, most users of sk_buff
doing any sort of crypto or paging operations call a convenient function
called skb_to_sgvec (which happens to be recursive if (3) is in use!).
This takes a sk_buff as input, and writes into its output pointer an
array of scattergather list items. Sometimes people like to declare a
fixed size scattergather list on the stack; other times people like to
allocate a fixed size scattergather list on the heap. However, if you're
doing it in a fixed-size fashion, you really shouldn't be using
NETIF_F_FRAGLIST too (unless you're also ensuring the sk_buff and its
frag_list children aren't shared and then you check the number of
fragments in total required.)
Specifying MAX_SKB_FRAGS + 1 is the right answer usually, but not if you're
using NETIF_F_FRAGLIST, in which case the call to skb_to_sgvec will
overflow the heap, and disaster ensues.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: stable@vger.kernel.org Cc: security@kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4d6fa57b4dab0d77f4d8e9d9c73d1e63f6fe8fee) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/macsec.c
A client can append random data to the end of an NFSv2 or NFSv3 RPC call
without our complaining; we'll just stop parsing at the end of the
expected data and ignore the rest.
Encoded arguments and replies are stored together in an array of pages,
and if a call is too large it could leave inadequate space for the
reply. This is normally OK because NFS RPC's typically have either
short arguments and long replies (like READ) or long arguments and short
replies (like WRITE). But a client that sends an incorrectly long call
can violate those assumptions. This was observed to cause crashes.
Also, several operations increment rq_next_page in the decode routine
before checking the argument size, which can leave rq_next_page pointing
well past the end of the page array, causing trouble later in
svc_free_pages.
So, following a suggestion from Neil Brown, add a central check to
enforce our expectation that no NFSv2/v3 call has both a large call and
a large reply.
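The check looks approximately like the following sketch: reject any
v2/v3 call whose arguments exceed a page when the reply can also be
large.

static bool nfs_request_too_big(struct svc_rqst *rqstp,
				struct svc_procedure *proc)
{
	/* the ACL code does more careful bounds checking */
	if (rqstp->rq_prog != NFS_PROGRAM)
		return false;
	/* ditto NFSv4 */
	if (rqstp->rq_vers >= 4)
		return false;
	/* the reply is known to be small, so a large call is fine */
	if (proc->pc_xdrressize > 0 &&
	    proc->pc_xdrressize < XDR_QUADLEN(PAGE_SIZE))
		return false;

	return rqstp->rq_arg.len > PAGE_SIZE;
}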
As a followup we may also want to rewrite the encoding routines to check
more carefully that they aren't running off the end of the page array.
We may also consider rejecting calls that have any extra garbage
appended. That would be safer, and within our rights by spec, but given
the age of our server and the NFS protocol, and the fact that we've
never enforced this before, we may need to balance that against the
possibility of breaking some oddball client.
Reported-by: Tuomas Haanpää <thaan@synopsys.com> Reported-by: Ari Kauppi <ari@synopsys.com> Cc: stable@vger.kernel.org Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
(cherry picked from commit e6838a29ecb484c97e4efef9429643b9851fba6e) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Paolo Abeni [Thu, 27 Apr 2017 17:29:34 +0000 (19:29 +0200)]
bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal
On slave list updates, the bonding driver computes its hard_header_len
as the maximum of all the enslaved devices' hard_header_len.
If the slave list is empty, e.g. on last enslaved device removal,
ETH_HLEN is used.
Since the bonding header_ops are set only when the first enslaved
device is attached, the above can lead to header_ops->create()
being called with the wrong skb headroom in place.
If bond0 is configured on top of ipoib devices, with the
following commands:
ifup bond0
for slave in $BOND_SLAVES_LIST; do
ip link set dev $slave nomaster
done
ping -c 1 <ip on bond0 subnet>
we will obtain a skb_under_panic() with a similar call trace:
skb_push+0x3d/0x40
push_pseudo_header+0x17/0x30 [ib_ipoib]
ipoib_hard_header+0x4e/0x80 [ib_ipoib]
arp_create+0x12f/0x220
arp_send_dst.part.19+0x28/0x50
arp_solicit+0x115/0x290
neigh_probe+0x4d/0x70
__neigh_event_send+0xa7/0x230
neigh_resolve_output+0x12e/0x1c0
ip_finish_output2+0x14b/0x390
ip_finish_output+0x136/0x1e0
ip_output+0x76/0xe0
ip_local_out+0x35/0x40
ip_send_skb+0x19/0x40
ip_push_pending_frames+0x33/0x40
raw_sendmsg+0x7d3/0xb50
inet_sendmsg+0x31/0xb0
sock_sendmsg+0x38/0x50
SYSC_sendto+0x102/0x190
SyS_sendto+0xe/0x10
do_syscall_64+0x67/0x180
entry_SYSCALL64_slow_path+0x25/0x25
This change addresses the issue by avoiding updating the bonding device
hard_header_len when the slave list becomes empty, forbidding it to
shrink below the value used by header_ops->create().
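A hedged sketch of the computation (helper name hypothetical; upstream
does this inside bond_compute_features()):

static void bond_update_hard_header_len(struct bonding *bond,
					struct net_device *bond_dev)
{
	struct list_head *iter;
	struct slave *slave;
	/* seed with the current value, not bare ETH_HLEN, so an empty
	 * slave list can never shrink hard_header_len below what
	 * header_ops->create() was set up for */
	unsigned short max_len = max((u16)ETH_HLEN,
				     bond_dev->hard_header_len);

	bond_for_each_slave(bond, slave, iter)
		max_len = max(max_len, slave->dev->hard_header_len);

	bond_dev->hard_header_len = max_len;
}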
The bug is there since commit 54ef31371407 ("[PATCH] bonding: Handle large
hard_header_len") but the panic can be triggered only since
commit fc791b633515 ("IB/ipoib: move back IB LL address into the hard
header").
Reported-by: Norbert P <noe@physik.uzh.ch> Fixes: 54ef31371407 ("[PATCH] bonding: Handle large hard_header_len") Fixes: fc791b633515 ("IB/ipoib: move back IB LL address into the hard header") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 19cdead3e2ef8ed765c5d1ce48057ca9d97b5094)
Rama Nichanamatlu [Tue, 27 Jun 2017 12:34:16 +0000 (05:34 -0700)]
[PATCH] RDS: Print failed rdma op details if failure is remote access
Improves diagnosability when an RDMA op fails, allowing this print to be
matched with prints on the responder side, which report the RDMA keys
that were prematurely purged by the owning process.
Rama Nichanamatlu [Tue, 27 Jun 2017 12:22:46 +0000 (05:22 -0700)]
[PATCH] RDS: When RDS socket is closed, print unreleased MR's
Improves diagnosability when the requester's RDMA operation fails with a
remote access error. This commit prints the RDMA credentials that were
prematurely purged by the owning process.
Qing Huang [Thu, 18 May 2017 23:33:53 +0000 (16:33 -0700)]
RDMA/core: not to set page dirty bit if it's already set.
This change will optimize kernel memory deregistration operations.
__ib_umem_release() used to call set_page_dirty_lock() against every
writable page in its memory region. Its purpose is to keep data
synced between CPU and DMA device when swapping happens after mem
deregistration ops. Now we choose not to set page dirty bit if it's
already set by the kernel prior to calling __ib_umem_release(). This
reduces memory deregistration time by half or even more when running an
application simulation test program.
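The release loop after the change, roughly (surrounding details
abbreviated):

for_each_sg(umem->sg_head.sgl, sg, umem->npages, i) {
	page = sg_page(sg);
	/* only pay for set_page_dirty_lock() when the page is not
	 * already dirty */
	if (!PageDirty(page) && umem->writable && dirty)
		set_page_dirty_lock(page);
	put_page(page);
}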
8K + 256 (8448B) is an important message size for the RDBMS workload. Since
Infiniband supports scatter-gather in hardware, there is no reason to
fragment each RDS message into PAGE_SIZE work requests. Hence, RDS fragment
sizes up to 16K have been introduced.
Fixes: 23f90cccfba4 ("RDS: fix the sg allocation based on actual msg sz")
Previous behavior was allocating a contiguous memory buffer, corresponding
to the size of the RDS message. Although this was functionally correct, it
introduced hard pressure on the memory allocation system, which was
not needed.
This commit fixes that drawback by only allocating
the buffer according to RDS_MAX_FRAG_SIZE.
Sagi Grimberg [Tue, 6 Oct 2015 16:52:37 +0000 (19:52 +0300)]
xprtrdma: Don't require LOCAL_DMA_LKEY support for fastreg
There is no need to require LOCAL_DMA_LKEY support as the
PD allocation makes sure that there is a local_dma_lkey. Also
correctly set a return value in the error path.
This caused a NULL pointer dereference in mlx5 which removed
the support for LOCAL_DMA_LKEY.
Fixes: bb6c96d72879 ("xprtrdma: Replace global lkey with lkey local to PD") Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Reviewed-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Orabug: 26151481
Alexander Duyck [Mon, 30 Jan 2017 20:29:35 +0000 (12:29 -0800)]
i40e/i40evf: Add support for mapping pages with DMA attributes
This patch adds support for DMA_ATTR_SKIP_CPU_SYNC and
DMA_ATTR_WEAK_ORDERING. By enabling both of these for the Rx path we
are able to see performance improvements on architectures that implement
either one due to the fact that page mapping and unmapping only has to
sync what is actually being used instead of the entire buffer. In
addition, enabling the weak ordering attribute brings a performance
improvement on architectures that can associate a memory ordering with a
DMA buffer, such as SPARC.
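The Rx mapping then looks roughly like this (using the modern
unsigned-long attribute form; the backport note below mentions the older
dma_attr data type):

#define RX_DMA_ATTR (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING)

dma = dma_map_page_attrs(rx_ring->dev, page, 0, PAGE_SIZE,
			 DMA_FROM_DEVICE, RX_DMA_ATTR);

/* since CPU syncs are skipped at map/unmap time, sync only the
 * region the hardware actually wrote */
dma_sync_single_range_for_cpu(rx_ring->dev, dma, offset, size,
			      DMA_FROM_DEVICE);

dma_unmap_page_attrs(rx_ring->dev, dma, PAGE_SIZE,
		     DMA_FROM_DEVICE, RX_DMA_ATTR);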
Change-ID: If176824e8231c5b24b8a5d55b339a6026738fc75 Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Orabug: 26396243
Local modifications to account for dma_attr data type difference.
(cherry picked from commit 59605bc09630c2b577858c371edf89c099b5f925) Signed-off-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Christoph Hellwig [Fri, 23 Jun 2017 17:41:41 +0000 (10:41 -0700)]
block: defer timeouts to a workqueue
Timer context is not very useful for drivers to perform any meaningful abort
action from. So instead of calling the driver from this useless context,
defer it to a workqueue as soon as possible.
Note that while a delayed_work item would seem the right thing here I didn't
dare to use it due to the magic in blk_add_timer that pokes deep into timer
internals. But maybe this encourages Tejun to add a sensible API for that to
the workqueue API and we'll all be fine in the end :)
Contains a major update from Keith Busch:
"This patch removes synchronizing the timeout work so that the timer can
start a freeze on its own queue. The timer enters the queue, so timer
context can only start a freeze, but not wait for frozen."
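The resulting shape, approximately (the timer only queues work; the
driver's timeout handling then runs in process context):

static void blk_rq_timed_out_timer(unsigned long data)
{
	struct request_queue *q = (struct request_queue *)data;

	kblockd_schedule_work(&q->timeout_work);
}

/* at queue init time, the deferred handler is wired up with: */
INIT_WORK(&q->timeout_work, blk_timeout_work);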
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@fb.com>
(cherry picked from commit 287922eb0b186e2a5bf54fdd04b734c25c90035c)
Rob Gardner [Fri, 9 Jun 2017 04:36:24 +0000 (00:36 -0400)]
sparc64: Set valid bytes of misaligned no-fault loads
If a misaligned no-fault load (ldm* from ASI 0x82, primary no fault)
crosses a page boundary, and one of the pages causes an MMU miss
that cannot be resolved, then the kernel must load bytes from the
valid page into the high (or low) bytes of the destination register,
and must load zeros into the low (or high) bytes of the register.
Signed-off-by: Rob Gardner <rob.gardner@oracle.com> Reviewed-by: Steve Sistare steven.sistare@oracle.com Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Babu Moger [Wed, 21 Jun 2017 23:22:09 +0000 (17:22 -0600)]
fs/fuse: Fix for correct number of numa nodes
When the fuse filesystem is mounted it sets up data structures
for all the available numa nodes (with -o numa). However,
it uses nr_node_ids, which is set to MAX_NUMNODES (16). This
causes the following panic when kmalloc_node is called.
Pavel Tatashin [Thu, 15 Jun 2017 14:40:59 +0000 (10:40 -0400)]
sparc64: broken %tick frequency on spitfire cpus
After the early boot time stamps project, the %tick frequency is detected
incorrectly on spitfire cpus.
We must use cpuid of boot cpu to find corresponding cpu node in OpenBoot,
and extract clock-frequency property from there.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit eea9833453bd39e2f35325abb985d00486c8aa69) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Thu, 15 Jun 2017 14:40:58 +0000 (10:40 -0400)]
sparc64: use prom interface to get %stick frequency
We initialize time early, so we must use the prom interface instead of
the open firmware driver, which is not yet initialized.
Also, use prom_getintdefault() instead of prom_getint() to be compatible
with the code before early boot timestamps project.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Reviewed-by: Steven Sistare <steven.sistare@oracle.com>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit fca4afe400cb68fe5a7f0a97fb1ba5cfdcb81675) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Mon, 12 Jun 2017 20:41:48 +0000 (16:41 -0400)]
sparc64: optimize functions that access tick
Replace read tick function pointers with the new hot-patched get_tick().
This optimizes the performance of functions such as sched_clock().
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit eae3fc9871111e9bbc77dad5481a3e805e02ac46) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Mon, 12 Jun 2017 20:41:47 +0000 (16:41 -0400)]
sparc64: add hot-patched and inlined get_tick()
Add the new get_tick() function that is hot-patched during boot based on
the processor we are booting on.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit 4929c83a6ce6584cb64381bf1407c487f67d588a) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Mon, 12 Jun 2017 20:41:46 +0000 (16:41 -0400)]
sparc64: initialize time early
In Linux it is possible to configure printk() to output a timestamp next
to every line. This is very useful for determining the slow parts of the
boot process, and also for avoiding regressions, as boot time is visible
to everyone.
Also, there are scripts that change these time stamps to intervals.
However, on larger machines these timestamps start appearing many seconds,
and even minutes into the boot process. This patch gets the
stick-frequency property early from OpenBoot, and uses its value to
initialize time stamps
before the first printk() messages are printed.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit 83e8eb99d908da78e6eff7dd141f26626fe01d12) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Mon, 12 Jun 2017 20:41:45 +0000 (16:41 -0400)]
sparc64: improve modularity tick options
This patch prepares the code for early boot time stamps by making it more
modular.
- init_tick_ops() to initialize struct sparc64_tick_ops
- new sparc64_tick_ops operation get_frequency() which returns a
frequency
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit 89108c3423e8047cd0da73182ea09b9da190b57e) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Mon, 12 Jun 2017 20:41:44 +0000 (16:41 -0400)]
sparc64: optimize loads in clock_sched()
In clock sched we now have three loads:
- Function pointer
- quotient for multiplication
- offset
However, it is possible to improve performance substantially, by
guaranteeing that all three loads are from the same cacheline.
By moving these three values first in sparc64_tick_ops, and by having
tick_operations 64-byte aligned we guarantee this.
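A sketch of the layout (field names approximate):

struct sparc64_tick_ops {
	/* the three sched_clock() loads, packed at the front ... */
	unsigned long long (*get_tick)(void);
	unsigned long ticks_per_nsec_quotient;
	unsigned long offset;
	/* ... colder operations follow */
};

/* 64-byte alignment puts all three loads in one cacheline */
static struct sparc64_tick_ops tick_operations __cacheline_aligned;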
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit 178bf2b9a20e866677bbca5cb521b09a8498c1d7) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Mon, 12 Jun 2017 20:41:43 +0000 (16:41 -0400)]
sparc64: show time stamps from zero
On most platforms, time is shown from the beginning of boot. This patch
adds an offset to sched_clock() for SPARC, to also show time from 0.
This means we will have one more load, but we saved one in an earlier
patch.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit b5dd4d807f0fe7da67c5cc67b2ec681b60e4994b) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit b8a83fcb78c859b99807af4c8b0ab09f0f827a40) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Mon, 12 Jun 2017 20:41:41 +0000 (16:41 -0400)]
sparc64: remove trailing white spaces
A few changes reported by checkpatch: remove all trailing white
spaces in these two files.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Orabug: 24401250
Orabug: 25637776
(cherry picked from commit 68a792174d7f67c7d2108bf1cc55ab8a63fc4678) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Wed, 31 May 2017 15:25:25 +0000 (11:25 -0400)]
sparc64: delete old wrap code
The old method, which used xcall and softint to get a new context id,
is deleted; it is replaced by a method that uses per_cpu_secondary_mm
without xcall to perform the context wrap.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0197e41ce70511dc3b71f7fefa1a676e2b5cd60b) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Wed, 31 May 2017 15:25:24 +0000 (11:25 -0400)]
sparc64: new context wrap
The current wrap implementation has a race issue: it is called outside of
the ctx_alloc_lock, and also does not wait for all CPUs to complete the
wrap. This means that a thread can get a new context with a new version
and another thread might still be running with the same context. The
problem is especially severe on CPUs with shared TLBs, like sun4v. I used
the following test to very quickly reproduce the problem:
- start over 8K processes (must be more than context IDs)
- write and read values at a memory location in every process.
Very quickly memory corruptions start happening, and what we read back
does not equal what we wrote.
Several approaches were explored before settling on this one:
Approach 1:
Move smp_new_mmu_context_version() inside ctx_alloc_lock, and wait for
every process to complete the wrap. (Note: every CPU must WAIT before
leaving smp_new_mmu_context_version_client() until every one arrives).
This approach ends up with deadlocks, as some threads own locks which other
threads are waiting for, and they never receive softint until these threads
exit smp_new_mmu_context_version_client(). Since we do not allow the exit,
deadlock happens.
Approach 2:
Handle wrap right during the mondo interrupt. Use etrap/rtrap to enter
into C code, and issue new versions to every CPU.
This approach adds some overhead to runtime: in switch_mm() we must add
some checks to make sure that versions have not changed due to wrap while
we were loading the new secondary context. (could be protected by PSTATE_IE
but that degrades performance, as on M7 and older CPUs it takes 50 cycles
for each access). Also, we still need a global per-cpu array of MMs to know
where we need to load new contexts, otherwise we can change context to a
thread that is going away (if we received a mondo between switch_mm() and
switch_to() time). Finally, there are some issues with window registers in
rtrap() when context IDs are changed during CPU mondo time.
The approach in this patch is the simplest and has almost no impact on
runtime. We use the array with mm's where last secondary contexts were
loaded onto CPUs and bump their versions to the new generation without
changing context IDs. If a new process comes in to get a context ID, it
will go through get_new_mmu_context() because of version mismatch. But the
running processes do not need to be interrupted. And wrap is quicker as we
do not need to xcall and wait for everyone to receive and complete wrap.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a0582f26ec9dfd5360ea2f35dd9a1b026f8adda0) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7a5b4bbf49fe86ce77488a70c5dccfe2d50d7a2d) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Wed, 31 May 2017 15:25:22 +0000 (11:25 -0400)]
sparc64: redefine first version
CTX_FIRST_VERSION defines the first context version, but it also defines
the first context. This patch redefines it to only include the first context
version.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c4415235b2be0cc791572e8e7f7466ab8f73a2bf) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Wed, 31 May 2017 15:25:21 +0000 (11:25 -0400)]
sparc64: combine activate_mm and switch_mm
The only difference between these two functions is that in activate_mm we
unconditionally flush context. However, there is no need to keep this
difference after fixing a bug where cpumask was not reset on a wrap. So, in
this patch we combine these.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 14d0334c6748ff2aedb3f2f7fdc51ee90a9b54e7) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Pavel Tatashin [Wed, 31 May 2017 15:25:20 +0000 (11:25 -0400)]
sparc64: reset mm cpumask after wrap
After a wrap (getting a new context version) a process must get a new
context id, which means that we would need to flush the context id from
the TLB before running for the first time with this ID on every CPU. But,
we use mm_cpumask to determine if this process has been running on this CPU
before, and this mask is not reset after a wrap. So, there are two possible
fixes for this issue:
1. Clear mm cpumask whenever mm gets a new context id (the fix applied
here; a sketch follows below)
2. Unconditionally flush context every time a process runs on a CPU
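A sketch of the chosen fix, applied where get_new_mmu_context() detects
a version bump (helper name hypothetical):

static void reset_cpumask_on_wrap(struct mm_struct *mm, bool new_version)
{
	/* forget which CPUs ever ran this mm; each CPU then flushes
	 * the stale context ID from its TLB on first use */
	if (unlikely(new_version))
		cpumask_clear(mm_cpumask(mm));
}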
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 588974857359861891f478a070b1dc7ae04a3880) Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Borislav Petkov [Mon, 23 Jan 2017 18:35:07 +0000 (19:35 +0100)]
x86/ras/therm_throt: Do not log a fake MCE for thermal events
We log a fake bank 128 MCE to note that we're handling a CPU thermal
event. However, this confuses people into thinking that their hardware
generates MCEs. Hijacking MCA for logging thermal events is a gross
misuse anyway and it shouldn't have been done in the first place. And
besides we have other means for dealing with thermal events which are
much more suitable.
Trond Myklebust [Mon, 1 Aug 2016 17:36:08 +0000 (13:36 -0400)]
SUNRPC: Handle EADDRNOTAVAIL on connection failures
If the connect attempt immediately fails with an EADDRNOTAVAIL error, then
that means our choice of source port number was bad.
This error is expected when we set the SO_REUSEPORT socket option and we
have 2 sockets sharing the same source and destination address and port
combinations.
Kris Van Hees [Mon, 12 Jun 2017 13:33:05 +0000 (09:33 -0400)]
dtrace: add kprobe-unsafe addresses to FBT blacklist
By means of the newly introduced API to add entries to the FBT
blacklist, we make sure to register addresses that are unsafe for
kprobes with the FBT blacklist because they are unsafe there also.
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com> Reviewed-by: Dhaval Giani <dhaval.giani@oracle.com>
Kris Van Hees [Mon, 12 Jun 2017 13:29:16 +0000 (09:29 -0400)]
dtrace: convert FBT blacklist to RB-tree
The blacklist for FBT was implemented as a sorted list, populated from
a static list of functions. In order to allow functions to be added
from other places (i.e. programmatically), it has been converted to an
RB-tree with an API to add functions and to traverse the list. It is
still possible to add functions by address or to add them by symbol
name, to be resolved into the corresponding address.
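A hedged sketch of the insertion path (types and names hypothetical;
entries are keyed by function address):

struct fbt_bl_entry {
	struct rb_node node;
	unsigned long addr;
};

static struct rb_root fbt_blacklist = RB_ROOT;

static void fbt_blacklist_add(struct fbt_bl_entry *new)
{
	struct rb_node **p = &fbt_blacklist.rb_node, *parent = NULL;

	while (*p) {
		struct fbt_bl_entry *e;

		parent = *p;
		e = rb_entry(parent, struct fbt_bl_entry, node);
		if (new->addr < e->addr)
			p = &parent->rb_left;
		else if (new->addr > e->addr)
			p = &parent->rb_right;
		else
			return;		/* already blacklisted */
	}
	rb_link_node(&new->node, parent, p);
	rb_insert_color(&new->node, &fbt_blacklist);
}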
Orabug: 26190412 Signed-off-by: Tomas Jedlicka <tomas.jedlicka@oracle.com> Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com> Acked-by: Nick Alcock <nick.alcock@oracle.com>
Sanath Kumar [Tue, 20 Jun 2017 03:17:29 +0000 (22:17 -0500)]
sparc64: Enable MGAG200 driver support
This driver enables a video console on T7 systems that use the
MGA G200e video device. The console can be used to view
kernel boot prints and to log in to the system.
Reviewed-by: Eric Saint-Etienne <eric.saint.etienne@oracle.com> Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Orabug: 26170808 Reviewed-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Tom Hromatka [Wed, 21 Jun 2017 14:46:53 +0000 (08:46 -0600)]
memory: sparc64: Add privileged ADI driver
This patch adds an ADI driver for reading/writing MCD versions
using physical addresses from privileged user space processes.
This file maps linearly to physical memory at a ratio of
1:adi_blksz. A read (or write) of offset K in the file operates
upon physical address K * adi_blksz. The version information
is encoded as one version per byte. Intended consumers are
makedumpfile and crash.
Orabug: 26170808 Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com> Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Tom Hromatka [Wed, 21 Jun 2017 14:45:28 +0000 (08:45 -0600)]
sparc64: Export the adi_state structure
Orabug: 26170808 Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Jag Raman [Wed, 21 Jun 2017 15:23:50 +0000 (11:23 -0400)]
sparc64: sunvdc: skip vdisk response validation upon error
Skip validating the vdisk IO response from the vdisk server if the IO
request has failed.
sunvdc checks whether the size of the request processed by the
server matches the size of the request sent by vdc. This
is to ensure that partial IO completions are caught, since
they are not expected. In the case where the server reports an
error, it could set the size of IO processed to zero.
Therefore, validating the size of the request processed in the
case of an error could mis-classify the problem.
Introduce DAX2 support in the driver. This involves negotiating the right
version with hypervisor as well as exposing a new INIT_V2 ioctl. This new
ioctl will return failure if DAX2 is not present on the system, otherwise
it will attempt to initialize the DAX2. A user should first call INIT_V2
and on failure call INIT_V1. See Documentation/sparc/dax.txt for more
detail.
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com> Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Tue, 30 May 2017 14:59:02 +0000 (08:59 -0600)]
sparc64: Exclude perf user callchain during critical sections
This fixes another cause of random segfaults and bus errors that
may occur while running perf with the callgraph (-g) option.
Critical sections beginning with spin_lock_irqsave() raise the
interrupt level to PIL_NORMAL_MAX (14) and intentionally do not block
performance counter interrupts, which arrive at PIL_NMI (15). So perf
code must be very careful about what it does since it might execute in
the middle of one of these critical sections. In particular, the
perf_callchain_user() path is problematic because it accesses user
space and may cause TLB activity as well as faults as it unwinds the
user stack.
One particular critical section occurs in switch_mm:
If a perf interrupt arrives in between load_secondary_context() and
tsb_context_switch(), then perf_callchain_user() could execute with
the context ID of one process, but with an active tsb for a different
process. When the user stack is accessed, it is very likely to
incur a TLB miss, since the h/w context ID has been changed. The TLB
will then be reloaded with a translation from the TSB for one process,
but using a context ID for another process. This exposes memory from
one process to another, and since it is a mapping for stack memory,
this usually causes the new process to crash quickly.
Some potential solutions are:
1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
This would certainly eliminate the problem, but it would also prevent
perf from having any visibility into code running in these critical
sections, and it seems clear that PIL_NORMAL_MAX is used for just
this reason.
2) Protect this particular critical section by masking all interrupts,
either by setting %pil to PIL_NMI or by clearing pstate.ie around the
calls to load_secondary_context() and tsb_context_switch(). This approach
has a few drawbacks:
- It would only address this particular critical section, and would
have to be repeated in several other known places. There might be
other such critical sections that are not known.
- It has a performance cost which would be incurred at every context
switch, since it would require additional accesses to %pil or
%pstate.
- Turning off pstate.ie would require changing __tsb_context_switch(),
which expects to be called with pstate.ie on.
3) Avoid user space MMU activity entirely in perf_callchain_user() by
implementing a new copy_from_user() function that accesses the user
stack via physical addresses. This works, but requires quite a bit of
new code to get it to perform reasonably, i.e., caching of translations,
etc.
4) Allow the perf interrupt to happen in existing critical sections as
it does now, but have perf code detect that this is happening, and
skip any user callchain processing. This approach was deemed best, as
the change is extremely localized and covers both known and unknown
instances of perf interrupting critical sections. Perf has interrupted
a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
interrupt.
Ordinarily, a function that has a pt_regs passed in can simply examine
(regs->tstate & TSTATE_PIL) to extract the interrupt level that was in
effect at the time of the perf interrupt. However, when the perf
interrupt occurs while executing in the kernel, the pt_regs pointer is
replaced with task_pt_regs(current) in perf_callchain(), and so
perf_callchain_user() does not have access to the state of the machine
at the time of the perf interrupt. To work around this, we check
(regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
in to the arch independent part of perf, and temporarily change the
event attributes so that user callchain processing is avoided. Though
a user stack sample is not collected, this loss is not statistically
significant. Kernel call graph collection is not affected.
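The detection itself is tiny; a hedged sketch (helper name
hypothetical):

static bool perf_interrupted_critical_section(struct pt_regs *regs)
{
	/* PIL lives in bits 23:20 of %tstate (TSTATE_PIL) */
	return (regs->tstate & TSTATE_PIL) ==
	       ((unsigned long)PIL_NORMAL_MAX << 20);
}

When this returns true, per the description above,
perf_event_nmi_handler() temporarily masks off the user-callchain
request before calling into the arch independent part of perf.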
Signed-off-by: Rob Gardner <rob.gardner@oracle.com> Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com> Reviewed-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Anthony Yznaga [Tue, 13 Jun 2017 20:47:06 +0000 (13:47 -0700)]
sparc64: rtrap must set PSTATE.mcde before handling outstanding user work
The kernel must execute with PSTATE.mcde=1 for ADI version checking to
be enabled when the kernel reads or writes user memory mapped with ADI
enabled using versioned addresses. If PSTATE.mcde=0 then the MMU
interprets version bits in an address as address bits, and an access
attempt results in a data access exception. Until now setting
PSTATE.mcde=1 in the kernel has been handled only by patching etrap to
ensure that is set on entry into the kernel. However, there are code
paths in rtrap that overwrite PSTATE and inadvertently clear PSTATE.mcde
before additional execution happens in the kernel.
rtrap is executed to exit the kernel and return to user mode execution.
Before restoring registers and returning to user mode, rtrap checks for
work to do. The check is done with interrupts disabled, and if there is
work to do, then interrupts are enabled before calling a function to
complete the work after which interrupts are disabled again and the
check is repeated. Interrupts are disabled and enabled by overwriting
PSTATE. Possible work includes (but is not limited to) preemption,
signal delivery, and writing out buffered user register windows to the
stack. All of these may lead to accessing user addresses. In the case
of preemption, a resumed thread will run with PSTATE.mcde=0 until it
completes a return to user mode or is rescheduled on a CPU where
PSTATE.mcde is set. If the thread accesses ADI-enabled user memory with
a versioned address (e.g. to complete some I/O) in that timeframe then
the access will fail. To fix the problem, patch rtrap to set
PSTATE.mcde when interrupts are enabled before handling the work.
Orabug: 25853545 Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com> Reviewed-by: Steve Sistare <steven.sistare@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Shannon Nelson [Wed, 14 Jun 2017 22:43:37 +0000 (15:43 -0700)]
sunvnet: restrict advertized checksum offloads to just IP
As much as we'd like to play well with others, we really aren't
handling the checksums on non-IP protocol packets very well. This
is easily seen when trying to do TCP over ipv6 - the checksums are
garbage.
Here we restrict the checksum feature flag to just IP traffic so
that we aren't given work we can't yet do.
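A hedged sketch of the feature flags (exact set abbreviated):

/* advertise IP-only checksum offload; anything else (e.g. TCP over
 * IPv6) falls back to software checksums in the stack */
dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM;  /* was NETIF_F_HW_CSUM */
dev->features |= dev->hw_features;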
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit 7e9191c54a36c864b901ea8ce56dc42f10c2f5ae) Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Wed, 21 Jun 2017 09:48:25 +0000 (15:18 +0530)]
arch/sparc: Avoid DCTI Couples
Avoid unintended DCTI couples. Use of DCTI couples is deprecated per
Oracle SPARC Architecture notes below(Section 6.3.4.7 - DCTI Couples).
"A delayed control transfer instruction (DCTI) in the delay slot of another
DCTI is referred to as a DCTI couple. The use of DCTI couples is deprecated
in the Oracle SPARC Architecture; no new software should place a DCTI in
the delay slot of another DCTI, because on future Oracle SPARC Architecture
implementations DCTI couples may execute either slowly or differently than
the programmer assumes it will."
Babu Moger [Wed, 11 Jan 2017 00:13:02 +0000 (16:13 -0800)]
net/rds: Fix minor linker warnings
Fixes: fcdaab66 ("{IB/{core,ipoib},net/{mlx4,rds}}: Mark unload_allowed as __initdata variable")
Seeing this warning while building the kernel. Fix it.
MODPOST 1555 modules
WARNING: net/rds/rds_rdma.o(.text+0x1d8): Section mismatch in
reference from the function rds_rdma_init() to the variable
.init.data:unload_allowed
The function rds_rdma_init() references
the variable __initdata unload_allowed.
This is often because rds_rdma_init lacks a __initdata
annotation or the annotation of unload_allowed is wrong.
Babu Moger [Wed, 30 Mar 2016 20:28:41 +0000 (13:28 -0700)]
drivers/usb: Skip auto handoff for TI and RENESAS usb controllers
I have never seen the auto handoff work on TI and RENESAS xhci
cards; eventually, we force the handoff anyway. This code forces the
handoff unconditionally. It saves 5 seconds of boot time for each card.
Added vendor/device id checks for the cards which I have tested.
Vijay Kumar [Thu, 6 Oct 2016 19:06:11 +0000 (12:06 -0700)]
usb/core: Added devspec sysfs entry for devices behind the usb hub
Grub finds an incorrect of_node path for devices behind a usb hub.
Added a devspec sysfs entry for devices behind a usb hub so that the
right of_node path is returned during the grub sysfs walk for these
devices.
Vijay Kumar [Fri, 28 Oct 2016 20:59:57 +0000 (13:59 -0700)]
USB: core: let USB device know device node
Although most USB devices are hot-pluggable, there are still some
devices that are hard wired on the board, e.g., HSIC and SSIC interface
USB devices. If these kinds of USB devices are multi-function, and they
can supply other interfaces like i2c or gpios for other devices, we may
need to describe these in the device tree.
In this commit, we use "reg" in the dts as the physical port number to
match the physical port number decided by the USB core; if they are the
same, then the device node is for the device we are creating for the
USB core.
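A sketch close to the matching walk (helper name approximate):

static struct device_node *usb_of_find_child(struct device_node *parent,
					     int portnum)
{
	struct device_node *node;
	u32 reg;

	for_each_child_of_node(parent, node) {
		if (of_property_read_u32(node, "reg", &reg))
			continue;
		if (reg == portnum)
			return node;	/* returned with a reference held */
	}
	return NULL;
}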
Signed-off-by: Peter Chen <peter.chen@freescale.com> Acked-by: Philipp Zabel <p.zabel@pengutronix.de> Acked-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: Rob Herring <robh@kernel.org> Acked-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 69bec725985324e79b1c47ea287815ac4ddb0521)
Conflicts:
include/linux/usb/of.h
Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Reviewed-by: Babu Moger <babu.moger@oracle.com>
Orabug: 24785721 Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Wed, 21 Jun 2017 09:16:50 +0000 (14:46 +0530)]
Improves clear_huge_page() using work queues
The idea is to exploit the parallelism available on large
multicore systems such as SPARC T7 systems to clear huge pages
in parallel with multiple worker threads.
Reviewed-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: Kishore Pusukuri <kishore.kumar.pusukuri@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Shannon Nelson [Tue, 4 Apr 2017 18:41:35 +0000 (11:41 -0700)]
bnxt: add dma mapping attributes
On the SPARC platform we need to use the DMA_ATTR_WEAK_ORDERING
attribute in our dma mapping in order to get the expected performance
out of the receive path. Setting this boosts a simple iperf receive
session from 2 Gbe to 23.4 Gbe.
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com> Reviewed-by: Tushar Dave <tushar.n.dave@oracle.com> Reviewed-by: Tom Saeger <tom.saeger@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
(backported from commit 0495c3d367944e4af053983ff3cdf256b567b053) Reviewed-by: Tom Saeger <tom.saeger@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Shamir Rabinovitch [Wed, 29 Mar 2017 10:21:59 +0000 (06:21 -0400)]
IB/IPoIB: ibX: failed to create mcg debug file
When udev renames the netdev devices, the ipoib debugfs entries do not
get renamed. As a result, if a subsequent probe of an ipoib device reuses
the name, then creating a debugfs entry for the new device will fail.
Also, ipoib_create_debug_files and ipoib_delete_debug_files were moved
into the ipoib event handling in order to avoid any race condition
between them.
Fixes: 1732b0ef3b3a ([IPoIB] add path record information in debugfs) Cc: stable@vger.kernel.org # 2.6.15+ Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com> Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
(cherry picked from git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma
commit 771a52584096c45e4565e8aabb596eece9d73d61)
Reviewed-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Shamir Rabinovitch <shamir.rabinovitch@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
Tom Hromatka [Tue, 16 Aug 2016 16:46:56 +0000 (09:46 -0700)]
ftrace: remove unnecessary __maybe_unused from waitfd() parameters
__maybe_unused is not required for unused function parameters. In
fact, using __maybe_unused confuses the ftrace parser. Thus, this
change removes these superfluous descriptors.
Reviewed-by: Allen Pais <allen.pais@oracle.com> Signed-off-by: Tom Hromatka <tom.hromatka@oracle.com>
(cherry picked from commit 2f772dbf28c6bd410ca0855fc3260331030b6d9a) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Hugh Dickins [Tue, 20 Jun 2017 09:10:44 +0000 (02:10 -0700)]
mm: fix new crash in unmapped_area_topdown()
Trinity gets kernel BUG at mm/mmap.c:1963! in about 3 minutes of
mmap testing. That's the VM_BUG_ON(gap_end < gap_start) at the
end of unmapped_area_topdown(). Linus points out how MAP_FIXED
(which does not have to respect our stack guard gap intentions)
could result in gap_end below gap_start there. Fix that, and
the similar case in its alternative, unmapped_area().
Cc: stable@vger.kernel.org Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas") Reported-by: Dave Jones <davej@codemonkey.org.uk> Debugged-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit f4cb767d76cf7ee72f97dd76f6cfa6c76a5edc89)
Orabug: 26161422
CVE: CVE-2017-1000364 Signed-off-by: Dhaval Giani <dhaval.giani@oracle.com> Reviewed-by: Jack Vogel <jack.vogel@oracle.com> Reviewed-by: John Haxby <john.haxby@oracle.com>
Hugh Dickins [Mon, 19 Jun 2017 11:03:24 +0000 (04:03 -0700)]
mm: larger stack guard gap, between vmas
Stack guard page is a useful feature to reduce a risk of stack smashing
into a different mapping. We have been using a single page gap which
is sufficient to prevent having stack adjacent to a different mapping.
But this seems to be insufficient in the light of the stack usage in
userspace. E.g. glibc uses as large as 64kB alloca() in many commonly
used functions. Others use constructs like gid_t buffer[NGROUPS_MAX]
which is 256kB or stack strings with MAX_ARG_STRLEN.
This will become especially dangerous for suid binaries and the default
no limit for the stack size limit because those applications can be
tricked to consume a large portion of the stack and a single glibc call
could jump over the guard page. These attacks are not theoretical,
unfortunately.
Make those attacks less probable by increasing the stack guard gap
to 1MB (on systems with 4k pages; but make it depend on the page size
because systems with larger base pages might cap stack allocations in
the PAGE_SIZE units) which should cover larger alloca() and VLA stack
allocations. It is obviously not a full fix because the problem is
somehow inherent, but it should reduce attack space a lot.
One could argue that the gap size should be configurable from userspace,
but that can be done later when somebody finds that the new 1MB is wrong
for some special case applications. For now, add a kernel command line
option (stack_guard_gap) to specify the stack gap size (in page units).
Implementation wise, first delete all the old code for stack guard page:
because although we could get away with accounting one extra page in a
stack vma, accounting a larger gap can break userspace - case in point,
a program run with "ulimit -S -v 20000" failed when the 1MB gap was
counted for RLIMIT_AS; similar problems could come with RLIMIT_MLOCK
and strict non-overcommit mode.
Instead of keeping gap inside the stack vma, maintain the stack guard
gap as a gap between vmas: using vm_start_gap() in place of vm_start
(or vm_end_gap() in place of vm_end if VM_GROWSUP) in just those few
places which need to respect the gap - mainly arch_get_unmapped_area(),
and the vma tree's subtree_gap support for that.
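From the patch, the VM_GROWSDOWN side of that helper looks approximately
like:

unsigned long stack_guard_gap = 256UL << PAGE_SHIFT; /* stack_guard_gap= */

static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
	unsigned long vm_start = vma->vm_start;

	if (vma->vm_flags & VM_GROWSDOWN) {
		vm_start -= stack_guard_gap;
		if (vm_start > vma->vm_start)	/* underflow: clamp to 0 */
			vm_start = 0;
	}
	return vm_start;
}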
Wei Lin Guay [Mon, 15 May 2017 11:52:47 +0000 (13:52 +0200)]
net/rds: prioritize the base connection establishment
As of today, all the TOS connections can only be established after their
base connections are up. This is due to the fact that TOS connections rely
on their base connections to perform route resolution. Nevertheless, when
all the connections drop/reconnect(e.g., ADDR_CHANGE event), the TOS
connections establishment consume the CPU resources by constantly retrying
the connection establishment until their base connections are up.
Thus, this patch delays all the TOS connections if their associated base
connections are not up. By doing so, the priority is given to the base
connection establishment. Consequently, the base connections can be
established faster, and subsequently their associated TOS connections.
Wei Lin Guay [Mon, 15 May 2017 11:42:56 +0000 (13:42 +0200)]
net/rds: determine active/passive connection with IP addresses
This patch changes RDS to use randomized backoff only in the first
attempt to reconnect. This means both ends try to be active by sending
out a REQ to the peer within a random t seconds. If the connection can't
be established due to a race, a comparison of the peer IP addresses is
used to determine active/passive connection establishment (e.g.
IP_A > IP_B), as sketched after the timeline below.
The following description illustrates the connection establishment,
t1randA: 192.168.1.A (active) --------------> 192.168.1.B (passive)
t1randB: 192.168.1.A (passive) <------------- 192.168.1.B (active)
t2 : 192.168.1.A (active) ---------------> REJ
t3 : 192.168.1.B (active) ---------------> REJ
t4 : Connection between A,B is not up.
t5 : 192.168.1.A (active) --------------> 192.168.1.B (passive)
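A minimal sketch of the deterministic tie-break (helper name
hypothetical; c_laddr and c_faddr are the connection's local and foreign
IPv4 addresses):

static bool rds_conn_stays_active(struct rds_connection *conn)
{
	/* after the first randomized attempt, the side with the
	 * numerically greater address keeps initiating */
	return ntohl(conn->c_laddr) > ntohl(conn->c_faddr);
}

Both ends evaluate the same comparison, so they agree on who is active
without further REJ cycles.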
RDS uses rds_wq for various operations. During an ADDR_CHANGE event,
each connection queues at least two tasks into this single-threaded
workqueue. Furthermore, the TOS connections have a dependency on their
base connections. Thus, a separate workqueue is created specifically
for the base connections to speed up base connection establishment.
Commit "RDS: add reconnect retry scheme for stalled connections" introduces
the rds_reconnect_timeout to retry the connection establishment after
sysctl_reconnect_retry_ms (default is 1000 ms). Nevertheless, this
proactive mechanism is overkill and causes long brownout times in the
virtualized environment. In short, below are the justifications for
reverting commit 5acb959ad59966b0b6905802ed720d26c560c3c5.
a) The retry counter starts ticking after RDS receives an ADDR_CHANGE
event. After receiving an ADDR_CHANGE event, RDS needs to perform shutdown
via shutdown_worker. Then, initiate a new connection via connect_worker.
Eventually, a CM REQ is only sent out after rds received
RDMA_CM_EVENT_ADDR_RESOLVED and RDMA_CM_EVENT_ROUTE_RESOLVED events. If the
retry is made to cater for stalled connections due to missing CM messages,
the retry should only happen after a CM REQ is sent. With the current
retry scheme (and with the default 1000 ms) that happens after ADDR_CHANGE
event, it introduces congestion in the single threaded workqueue.
b) Assume that we modify the retry counter to start ticking after a CM
REQ message is sent out. Introducing another retry timeout
complicates system tuning. Why? First, sysctl_reconnect_retry_ms
relies on the underlying cma_response_timeout. Any modification of
cma_response_timeout requires retuning sysctl_reconnect_retry_ms. Second,
it is hard to find a universal timing that fits all configurations
(bare-metal, virtualized, mixed environments, and homo/heterogeneous
systems).
Håkon Bugge [Mon, 19 Jun 2017 10:23:03 +0000 (12:23 +0200)]
IB/mlx4: Fix CM REQ retries in paravirt mode
CM REQs cannot be successfully retried, because a new pv_cm_id is
created for each request, without checking if one already exists.
This commit fixes this, by checking if an id exists before creating
one.
This bug can be provoked by running an RDMA CM user-land application,
but inserting a five seconds delay before the rdma_accept() call on
the passive side. This delay is larger than the default CMA timeout,
and triggers a retry from the active side. The retried REQ will use
another pv_cm_id (the cm_id on the wire). This confuses the CM
protocol and two REJs are sent from the passive side.
This commit is required to achieve the reduced HA Brownout time,
needed by Exadata. The Brownout issue is tracked by orabug 25521901.
Xiubo Li [Fri, 31 Mar 2017 02:35:25 +0000 (10:35 +0800)]
tcmu: Skip Data-Out blocks before gathering Data-In buffer for BIDI case
For the bidirectional case, the Data-Out buffer blocks will always be at
the head of the tcmu_cmd's bitmap, and before gathering the Data-In
buffer we must first skip the Data-Out ones, or devices supporting BIDI
commands won't work.
Xiubo Li [Mon, 27 Mar 2017 09:07:41 +0000 (17:07 +0800)]
tcmu: Fix wrongly calculating of the base_command_size
t_data_nents and t_bidi_data_nents are the numbers of segments, but
we cannot be sure that the block size equals the segment size.
In the worst case, all the blocks are discontiguous and the same
number of iovecs is needed, that is to say: blocks == iovs.
So just set the number of iovs to the block count needed by the tcmu
cmd.
Tested-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Cc: stable@vger.kernel.org # 3.18+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit abe342a5b4b5aa579f6bf40ba73447c699e6b579) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
Xiubo Li [Mon, 27 Mar 2017 09:07:40 +0000 (17:07 +0800)]
tcmu: Fix possible overwrite of t_data_sg's last iov[]
If there is BIDI data, its first iov[] will overwrite the last
iov[] for se_cmd->t_data_sg.
To fix this, we can just increase the iov pointer, but this may
introduce a new memory leakage bug: if se_cmd->data_length
and se_cmd->t_bidi_data_sg->length are both not aligned up to
DATA_BLOCK_SIZE, the actual length needed may be larger than just
the sum of them.
This can be avoided by rounding all the data lengths up
to DATA_BLOCK_SIZE.
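A sketch close to the resulting helper:

static inline size_t tcmu_cmd_get_data_length(struct tcmu_cmd *tcmu_cmd)
{
	struct se_cmd *se_cmd = tcmu_cmd->se_cmd;
	/* rounding both directions up to DATA_BLOCK_SIZE ensures the
	 * block (and thus iov) count is never under-estimated */
	size_t data_length = round_up(se_cmd->data_length, DATA_BLOCK_SIZE);

	if (se_cmd->se_cmd_flags & SCF_BIDI)
		data_length += round_up(se_cmd->t_bidi_data_sg->length,
					DATA_BLOCK_SIZE);
	return data_length;
}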
Reviewed-by: Mike Christie <mchristi@redhat.com> Tested-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Reviewed-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com> Signed-off-by: Xiubo Li <lixiubo@cmss.chinamobile.com> Cc: stable@vger.kernel.org # 3.18+ Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit ab22d2604c86ceb01bb2725c9860b88a7dd383bb) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
Mike Christie [Thu, 9 Mar 2017 08:42:09 +0000 (02:42 -0600)]
tcmu: make cmd timeout configurable
A single daemon could implement multiple types of devices
using multiple types of real devices that may not support
restarting from crashes and/or handling tcmu timeouts. This
makes the cmd timeout configurable, so handlers that do not
support it can turn it off for now.
Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit af980e46a26ac8805685bb70c8572dbc47abb126) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
Mike Christie [Thu, 9 Mar 2017 08:42:08 +0000 (02:42 -0600)]
tcmu: add helper to check if dev was configured
This adds a helper to check whether the dev was configured. It will be
used in the next patch to prevent updates to some config settings after
the device has been set up.
Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 972c7f167974fa41ea8a2eed4b857cc59f59c42c) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
Conflicts:
drivers/target/target_core_user.c
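The helper itself is small; a sketch, assuming a configured TCMU device
always has a registered UIO device behind it:

  static bool tcmu_dev_configured(struct tcmu_dev *udev)
  {
          /* uio_dev is only set once device configuration has run */
          return udev->uio_info.uio_dev ? true : false;
  }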
Mike Christie [Thu, 2 Mar 2017 05:14:39 +0000 (23:14 -0600)]
tcmu: allow hw_max_sectors greater than 128
tcmu hard codes the hw_max_sectors to 128, which is a little small.
Userspace uses max_sectors to report the optimal IO size, and some
initiators perform better with larger IOs (open-iscsi seems to do
better with 256 to 512, depending on the test).
(Fix do not display hw max sectors twice - MNC)
Signed-off-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 3abaa2bfdb1e6bb33d38a2e82cf3bb82ec0197bf) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
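A sketch of the intent (names assumed): keep 128 only as a default
rather than a hard cap, so a user-supplied value set before the device
is enabled survives:

  /* Keep 128 as the default, but honor a user-configured value. */
  if (!dev->dev_attrib.hw_max_sectors)
          dev->dev_attrib.hw_max_sectors = 128;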
Andy Grover [Tue, 22 Nov 2016 00:35:30 +0000 (16:35 -0800)]
target/user: Fix use-after-free of tcmu_cmds if they are expired
Don't free the cmd in tcmu_check_expired_cmd(); it's still referenced
by an entry in our cmd_id->cmd idr, and if userspace ever resumes
processing, tcmu_handle_completions() would use the now-invalid cmd
pointer.
Instead, leave the cmd allocated: it will be freed by
tcmu_handle_completions() if userspace ever recovers, or by
tcmu_free_device() if not.
Cc: stable@vger.kernel.org Reported-by: Bryant G Ly <bgly@us.ibm.com> Tested-by: Bryant G Ly <bgly@us.ibm.com> Signed-off-by: Andy Grover <agrover@redhat.com> Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Orabug: 25395066
(cherry picked from commit d0905ca757bc40bd1ebc261a448a521b064777d7) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
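A sketch of the corrected expiry path, assuming a TCMU_CMD_BIT_EXPIRED
flag marks timed-out commands (approximate, not the verbatim patch):

  static int tcmu_check_expired_cmd(int id, void *p, void *data)
  {
          struct tcmu_cmd *cmd = p;

          if (test_bit(TCMU_CMD_BIT_EXPIRED, &cmd->flags))
                  return 0;

          set_bit(TCMU_CMD_BIT_EXPIRED, &cmd->flags);
          target_complete_cmd(cmd->se_cmd, SAM_STAT_CHECK_CONDITION);
          cmd->se_cmd = NULL;

          /* Do NOT kmem_cache_free(tcmu_cmd_cache, cmd) here: the idr
           * still references it; completion handling or device
           * teardown frees it later. */
          return 0;
  }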
Bart Van Assche [Fri, 18 Nov 2016 23:32:59 +0000 (15:32 -0800)]
target/user: Fix a data type in tcmu_queue_cmd()
This patch keeps sparse from reporting the following warnings:
drivers/target/target_core_user.c:547:13: warning: incorrect type in assignment (different base types)
drivers/target/target_core_user.c:547:13: expected int [signed] ret
drivers/target/target_core_user.c:547:13: got restricted sense_reason_t
drivers/target/target_core_user.c:548:20: warning: restricted sense_reason_t degrades to integer
drivers/target/target_core_user.c:557:16: warning: incorrect type in return expression (different base types)
drivers/target/target_core_user.c:557:16: expected restricted sense_reason_t
drivers/target/target_core_user.c:557:16: got int [signed] ret
Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
Orabug: 25395066
(cherry picked from commit ecaf597b411e9a7b071bf7a36a4cf750c529cd28) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
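The fix amounts to carrying sense_reason_t end-to-end instead of int; a
sketch:

  sense_reason_t ret;     /* was: int ret */

  ret = tcmu_queue_cmd_ring(tcmu_cmd);
  if (ret != TCM_NO_SENSE)
          kmem_cache_free(tcmu_cmd_cache, tcmu_cmd);

  return ret;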
Andy Grover [Thu, 25 Aug 2016 15:55:53 +0000 (08:55 -0700)]
target/user: Return an error if cmd data size is too large
Userspace should be implementing VPD B0 (Block Limits) to inform the
initiator of max data size, but just in case we do get a too-large request,
do what the spec says and return INVALID_CDB_FIELD.
Make sure to unlock udev->cmdr_lock before returning.
Signed-off-by: Andy Grover <agrover@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Christie <mchristi@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 554617b2bbe25c3fb3c80c28fe7a465884bb40b1) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
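A sketch of the guard, assuming the command/data ring sizes tracked on
the device (names approximate):

  if (command_size > (udev->cmdr_size / 2) ||
      data_length > udev->data_size) {
          pr_warn("TCMU: request too big for the cmd/data ring\n");
          spin_unlock_irq(&udev->cmdr_lock);  /* don't leak the lock */
          return TCM_INVALID_CDB_FIELD;
  }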
Nicholas Bellinger [Sun, 28 Feb 2016 02:25:22 +0000 (18:25 -0800)]
target/user: Fix size_t format-spec build warning
Fix the following printk size_t warning as per 0-day build:
All warnings (new ones prefixed by >>):
drivers/target/target_core_user.c: In function 'is_ring_space_avail':
>> drivers/target/target_core_user.c:385:12: warning: format '%lu'
>> expects argument of type 'long unsigned int', but argument 3 has type
>> 'size_t {aka unsigned int}' [-Wformat=]
pr_debug("no data space: only %lu available, but ask for %lu\n",
^
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 0241fd39ce7bc9b82b7e57305cb0d6bb1364d45b) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
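One portable fix is %zu, the conversion specifier for size_t:

  pr_debug("no data space: only %zu available, but ask for %zu\n",
           space, data_needed);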
The data_bitmap was introduced to support asynchronous access to the
data area.
We divide the mailbox data area into blocks and use data_bitmap to
track their usage. Each new command's data starts at a new block, which
may leave some unusable space after its end, but this is easy to track
using data_bitmap.
Now we can allocate data area space for asynchronous access from
userspace, since we can track the allocations using data_bitmap. The
userspace part would be the same as Maxim's previous asynchronous
implementation.
Signed-off-by: Sheng Yang <sheng@yasker.org> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 26418649eead52619d8dd6cbc6760a1b144dbcd2) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
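A minimal sketch of bitmap-backed block allocation, assuming
DATA_BLOCK_BITS blocks of DATA_BLOCK_SIZE bytes each (names
approximate):

  static int tcmu_get_empty_block(struct tcmu_dev *udev)
  {
          int block = find_first_zero_bit(udev->data_bitmap,
                                          DATA_BLOCK_BITS);

          if (block >= DATA_BLOCK_BITS)
                  return -ENOSPC;              /* data area exhausted */

          set_bit(block, udev->data_bitmap);   /* block now in use */
          return block;
  }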
Arnd Bergmann [Mon, 1 Feb 2016 16:29:45 +0000 (17:29 +0100)]
target/user: Fix cast from pointer to phys_addr_t
The uio_mem structure has a member that is a phys_addr_t, but can
be a number of other types too. The target core driver attempts
to assign a pointer from vmalloc() to it, by casting it to
phys_addr_t, but that causes a warning when phys_addr_t is longer
than a pointer:
drivers/target/target_core_user.c: In function 'tcmu_configure_device':
drivers/target/target_core_user.c:906:22: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
This adds another cast to uintptr_t to shut up the warning.
A nicer fix might be to have additional fields in uio_mem
for the different purposes, so we can assign a pointer directly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 0633e123465b61a12a262b742bebf2a9945f7964) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
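The double cast in question, roughly:

  /* Going through uintptr_t keeps the pointer-to-integer conversion
   * well-defined even when phys_addr_t is wider than a pointer. */
  info->mem[0].addr = (phys_addr_t)(uintptr_t)udev->mb_addr;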
Sheng Yang [Mon, 28 Dec 2015 19:57:39 +0000 (11:57 -0800)]
target/user: Allow user to set block size before enabling device
The capability of setting hw_block_size was added along with 9c1cd1b68
"target/user: Only support full command pass-through", but the default
overrode the user-specified value while the device was being enabled:
target_configure_device() set block_size to match hw_block_size, so the
user could not set any block size other than the default 512.
This patch uses the existing hw_block_size value if it is already set;
otherwise it falls back to the default (512).
Update: Fix the coding style issue.
(Drop unnecessary re-export of dev->dev_attrib.block_size - nab)
Signed-off-by: Sheng Yang <sheng@yasker.org> Cc: Andy Grover <agrover@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 81ee28de860095cc0c063b92eea53cb97771f796) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
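A sketch of the check (names approximate): fall back to 512 only when
the user has not already set hw_block_size:

  /* Honor a user-configured hw_block_size; default to 512 bytes. */
  if (!dev->dev_attrib.hw_block_size)
          dev->dev_attrib.hw_block_size = 512;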
Andy Grover [Fri, 13 Nov 2015 18:42:20 +0000 (10:42 -0800)]
target/user: Do not set unused fields in tcmu_ops
TCMU sets TRANSPORT_FLAG_PASSTHROUGH, so INQUIRY commands will not be
emulated by LIO but passed up to userspace. Therefore TCMU should not
set these fields, just as pscsi doesn't.
Signed-off-by: Andy Grover <agrover@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 6ba4bd297d99ad522a6414001e6837ddaa8753fd) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
Pointers that are mapped by kmap_atomic() plus an offset must be
unmapped without the offset; unmapping with the offset would cause
problems when the SG element length exceeds the PAGE_SIZE limit.
Signed-off-by: Sagi Grimberg <sagig@mellanox.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit e2e21bd8f979a24462070cc89fae11e819cae90a) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
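The rule, in sketch form: keep the page-aligned mapping separate from
the offset pointer, and unmap the former:

  void *vaddr = kmap_atomic(sg_page(sg));      /* page-aligned base */

  memcpy(to, vaddr + sg->offset, copy_bytes);  /* offset for access */

  kunmap_atomic(vaddr);  /* unmap the base, not vaddr + sg->offset */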
target/user: Add support for bidirectional commands
Enable TCMU to handle bidirectional SCSI commands. In such cases,
entries in iov[] cover both the Data-In and the Data-Out buffers. The
first iov_cnt entries correspond to the Data-Out buffer, while the
remaining iov_bidi_cnt entries correspond to the Data-In buffer.
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Vangelis Koukis <vkoukis@arrikto.com> Reviewed-by: Andy Grover <agrover@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit e4648b014e03baee45d5f5146c1219b19e4e5f2f) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
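The resulting ring-entry layout, sketched (field names as in the TCMU
UAPI header, usage approximate):

  /* iov[0 .. iov_cnt-1]                      Data-Out buffer
   * iov[iov_cnt .. iov_cnt+iov_bidi_cnt-1]   Data-In buffer   */
  entry->req.iov_cnt = iov_cnt;
  entry->req.iov_bidi_cnt = iov_bidi_cnt;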
Introduce alloc_and_scatter_data_area()/gather_and_free_data_area()
functions that allocate/deallocate space from the data area and copy
data to/from a given scatter-gather list. These functions are needed so
the next patch, introducing support for bidirectional commands in TCMU,
can use the same code path both for t_data_sg and for t_bidi_data_sg.
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Vangelis Koukis <vkoukis@arrikto.com> Reviewed-by: Andy Grover <agrover@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit f97ec7db1606875666366bfcba8476f8c917db96) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
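The rough shape of the two helpers (signatures approximate):

  /* Allocate data-area space, build iovecs pointing at it, and
   * optionally copy the SG list's contents in. */
  static void alloc_and_scatter_data_area(struct tcmu_dev *udev,
          struct scatterlist *data_sg, unsigned int data_nents,
          struct iovec **iov, int *iov_cnt, bool copy_data);

  /* Copy data back out to the SG list and release the space. */
  static void gather_and_free_data_area(struct tcmu_dev *udev,
          struct scatterlist *data_sg, unsigned int data_nents);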
driver/user: Don't warn for DMA_NONE data direction
Some SCSI commands (for example the TEST UNIT READY command) do not
carry data, so their data_direction is DMA_NONE. Patch TCMU to not
print a warning message about an unknown data direction when it is
DMA_NONE.
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com> Signed-off-by: Vangelis Koukis <vkoukis@arrikto.com> Reviewed-by: Andy Grover <agrover@redhat.com> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org>
Orabug: 25395066
(cherry picked from commit 2bc396a2529ae8a2287f17a49d893ce790e19110) Signed-off-by: Kyle Fortin <kyle.fortin@oracle.com> Reviewed-by: Shan Hai <shan.hai@oracle.com>
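The check, sketched: warn only for directions that are genuinely
unknown:

  if (se_cmd->data_direction == DMA_TO_DEVICE) {
          /* scatter the Data-Out buffer into the ring */
  } else if (se_cmd->data_direction == DMA_FROM_DEVICE) {
          /* reserve data-area space for Data-In */
  } else if (se_cmd->data_direction != DMA_NONE) {
          /* DMA_NONE (e.g. TEST UNIT READY) is expected; anything
           * else is genuinely unknown and worth a warning. */
          pr_warn("TCMU: data direction was %d!\n",
                  se_cmd->data_direction);
  }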