]> www.infradead.org Git - users/willy/linux.git/log
users/willy/linux.git
6 years agoRDMA/hns: remove obsolete Kconfig comment
YueHaibing [Wed, 7 Aug 2019 03:22:28 +0000 (11:22 +0800)]
RDMA/hns: remove obsolete Kconfig comment

Since commit a07fc0bb483e ("RDMA/hns: Fix build error")
these kconfig comment is obsolete, so just remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20190807032228.6788-1-yuehaibing@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/{cxgb3, cxgb4, i40iw}: Remove common code
Kamal Heib [Wed, 7 Aug 2019 10:31:38 +0000 (13:31 +0300)]
RDMA/{cxgb3, cxgb4, i40iw}: Remove common code

Now that we have a common iWARP query port function we can remove the
common code from the iWARP drivers.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Acked-by: Potnuri Bharat Teja <bharat@chelsio.com>
Link: https://lore.kernel.org/r/20190807103138.17219-5-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/core: Add common iWARP query port
Kamal Heib [Wed, 7 Aug 2019 10:31:37 +0000 (13:31 +0300)]
RDMA/core: Add common iWARP query port

Add support for a common iWARP query port function, the new function
includes a common code that is used by the iWARP devices to update the
port attributes like max_mtu, active_mtu, state, and phys_state, the
function also includes a call for the driver-specific query_port callback
to query the device-specific port attributes.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Link: https://lore.kernel.org/r/20190807103138.17219-4-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/cxgb3: Use ib_device_set_netdev()
Kamal Heib [Wed, 7 Aug 2019 10:31:36 +0000 (13:31 +0300)]
RDMA/cxgb3: Use ib_device_set_netdev()

This change is required to associate the cxgb3 ib_dev with the
underlying net_device, so in the upcoming patch we can call
ib_device_get_netdev().

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Link: https://lore.kernel.org/r/20190807103138.17219-3-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA: Introduce ib_port_phys_state enum
Kamal Heib [Wed, 7 Aug 2019 10:31:35 +0000 (13:31 +0300)]
RDMA: Introduce ib_port_phys_state enum

In order to improve readability, add ib_port_phys_state enum to replace
the use of magic numbers.

Signed-off-by: Kamal Heib <kamalheib1@gmail.com>
Reviewed-by: Andrew Boyer <aboyer@tobark.org>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Acked-by: Bernard Metzler <bmt@zurich.ibm.com>
Link: https://lore.kernel.org/r/20190807103138.17219-2-kamalheib1@gmail.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/efa: Rate limit admin queue error prints
Gal Pressman [Thu, 1 Aug 2019 17:14:47 +0000 (20:14 +0300)]
RDMA/efa: Rate limit admin queue error prints

Admin queue error prints should never happen unless something wrong
happened to the device. However, in the unfortunate case that it does,
we should take extra care not to flood the log with error messages.

Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Link: https://lore.kernel.org/r/20190801171447.54440-3-galpress@amazon.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/core: Introduce ratelimited ibdev printk functions
Gal Pressman [Thu, 1 Aug 2019 17:14:46 +0000 (20:14 +0300)]
RDMA/core: Introduce ratelimited ibdev printk functions

Add ratelimited helpers to the ibdev_* printk functions.
Implementation inspired by counterpart dev_*_ratelimited functions.

Signed-off-by: Gal Pressman <galpress@amazon.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190801171447.54440-2-galpress@amazon.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/hns: Remove not used UAR assignment
Leon Romanovsky [Thu, 1 Aug 2019 11:48:27 +0000 (14:48 +0300)]
RDMA/hns: Remove not used UAR assignment

UAR in CQ is not used and generates the following compilation
warning, clean the code by removing uar assignment.

drivers/infiniband/hw/hns/hns_roce_cq.c: In function _create_user_cq_:
drivers/infiniband/hw/hns/hns_roce_cq.c:305:27: warning: parameter _uar_ set but not used [-Wunused-but-set-parameter]
  305 |      struct hns_roce_uar *uar,
      |      ~~~~~~~~~~~~~~~~~~~~~^~~

Fixes: 4f8f0d5e33dd ("RDMA/hns: Package the flow of creating cq")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190801114827.24263-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agordma: Enable ib_alloc_cq to spread work over a device's comp_vectors
Chuck Lever [Mon, 29 Jul 2019 17:22:09 +0000 (13:22 -0400)]
rdma: Enable ib_alloc_cq to spread work over a device's comp_vectors

Send and Receive completion is handled on a single CPU selected at
the time each Completion Queue is allocated. Typically this is when
an initiator instantiates an RDMA transport, or when a target
accepts an RDMA connection.

Some ULPs cannot open a connection per CPU to spread completion
workload across available CPUs and MSI vectors. For such ULPs,
provide an API that allows the RDMA core to select a completion
vector based on the device's complement of available comp_vecs.

ULPs that invoke ib_alloc_cq() with only comp_vector 0 are converted
to use the new API so that their completion workloads interfere less
with each other.

Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Cc: <linux-cifs@vger.kernel.org>
Cc: <v9fs-developer@lists.sourceforge.net>
Link: https://lore.kernel.org/r/20190729171923.13428.52555.stgit@manet.1015granger.net
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agomlx5: Fix formats with line continuation whitespace
Joe Perches [Thu, 1 Nov 2018 07:24:08 +0000 (00:24 -0700)]
mlx5: Fix formats with line continuation whitespace

The line continuations unintentionally add whitespace so
instead use coalesced formats to remove the whitespace.

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/f14db3287b23ed8af9bdbf8001e2e2fe7ae9e43a.camel@perches.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/hns: remove set but not used variable 'irq_num'
YueHaibing [Wed, 31 Jul 2019 07:37:48 +0000 (15:37 +0800)]
RDMA/hns: remove set but not used variable 'irq_num'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/infiniband/hw/hns/hns_roce_hw_v2.c: In function hns_roce_v2_cleanup_eq_table:
drivers/infiniband/hw/hns/hns_roce_hw_v2.c:5920:6:
 warning: variable irq_num set but not used [-Wunused-but-set-variable]

It is not used since
commit 33db6f94847c ("RDMA/hns: Refactor eq table init for hip08")

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20190731073748.17664-1-yuehaibing@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/mlx5: Remove DEBUG ODP code
Leon Romanovsky [Wed, 31 Jul 2019 11:56:27 +0000 (14:56 +0300)]
RDMA/mlx5: Remove DEBUG ODP code

Delete DEBUG ODP dead code which is leftover from development
stage and doesn't need to be part of the upstream kernel.

Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731115627.5433-1-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/core: fix spelling mistake "Nelink" -> "Netlink"
Colin Ian King [Wed, 31 Jul 2019 08:01:44 +0000 (09:01 +0100)]
RDMA/core: fix spelling mistake "Nelink" -> "Netlink"

There is a spelling mistake in a warning message, fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190731080144.18327-1-colin.king@canonical.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoinfiniband: Remove dev_err() usage after platform_get_irq()
Stephen Boyd [Tue, 30 Jul 2019 18:15:20 +0000 (11:15 -0700)]
infiniband: Remove dev_err() usage after platform_get_irq()

We don't need dev_err() messages when platform_get_irq() fails now that
platform_get_irq() prints an error message itself when something goes
wrong. Let's remove these prints with a simple semantic patch.

// <smpl>
@@
expression ret;
struct platform_device *E;
@@

ret =
(
platform_get_irq(E, ...)
|
platform_get_irq_byname(E, ...)
);

if ( \( ret < 0 \| ret <= 0 \) )
{
(
-if (ret != -EPROBE_DEFER)
-{ ...
-dev_err(...);
-... }
|
...
-dev_err(...);
)
...
}
// </smpl>

While we're here, remove braces on if statements that only have one
statement (manually).

Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: linux-rdma@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Link: https://lore.kernel.org/r/20190730181557.90391-21-swboyd@chromium.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/efa: Expose device statistics
Gal Pressman [Thu, 25 Jul 2019 13:03:53 +0000 (16:03 +0300)]
RDMA/efa: Expose device statistics

Expose hardware statistics through the sysfs api:
/sys/class/infiniband/efa_0/hw_counters/*.
/sys/class/infiniband/efa_0/ports/1/hw_counters/*.

Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
Link: https://lore.kernel.org/r/20190725130353.11544-1-galpress@amazon.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoIB/bnxt_re: Do not notifify GID change event
Parav Pandit [Fri, 26 Jul 2019 18:26:52 +0000 (13:26 -0500)]
IB/bnxt_re: Do not notifify GID change event

GID table entry operations such as add/remove/modify are triggered
by the IB core for RoCE ports.
Hence, remove GID change notification from hw driver.

Reviewed-by: Selvin Xavier <selvin.xavier@broadcom.com>
Tested-by: Selvin Xavier <selvin.xavier@broadcom.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Link: https://lore.kernel.org/r/20190726182652.50037-1-parav@mellanox.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoMerge branch 'wip/dl-for-rc' into wip/dl-for-next
Doug Ledford [Mon, 29 Jul 2019 17:38:42 +0000 (13:38 -0400)]
Merge branch 'wip/dl-for-rc' into wip/dl-for-next

The fix for IB port statistics initialization ("IB/core: Fix querying
total rdma stats") is needed before we take a follow-on patch to
for-next.

Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoDo not dereference 'siw_crypto_shash' before checking
Bernard Metzler [Sat, 27 Jul 2019 10:38:32 +0000 (12:38 +0200)]
Do not dereference 'siw_crypto_shash' before checking

Reported-by: "Dan Carpenter" <dan.carpenter@oracle.com>
Fixes: f29dd55b0236 ("rdma/siw: queue pair methods")
Link: https://lore.kernel.org/r/OF61E386ED.49A73798-ON00258444.003BD6A6-00258444.003CC8D9@notes.na.collabserv.com
Signed-off-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/qedr: Fix the hca_type and hca_rev returned in device attributes
Michal Kalderon [Sun, 28 Jul 2019 11:13:38 +0000 (14:13 +0300)]
RDMA/qedr: Fix the hca_type and hca_rev returned in device attributes

There was a place holder for hca_type and vendor was returned
in hca_rev. Fix the hca_rev to return the hw revision and fix
the hca_type to return an informative string representing the
hca.

Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Link: https://lore.kernel.org/r/20190728111338.21930-1-michal.kalderon@marvell.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoRDMA/hns: Fix build error
YueHaibing [Wed, 24 Jul 2019 06:54:43 +0000 (14:54 +0800)]
RDMA/hns: Fix build error

If INFINIBAND_HNS_HIP08 is selected and HNS3 is m,
but INFINIBAND_HNS is y, building fails:

drivers/infiniband/hw/hns/hns_roce_hw_v2.o: In function `hns_roce_hw_v2_exit':
hns_roce_hw_v2.c:(.exit.text+0xd): undefined reference to `hnae3_unregister_client'
drivers/infiniband/hw/hns/hns_roce_hw_v2.o: In function `hns_roce_hw_v2_init':
hns_roce_hw_v2.c:(.init.text+0xd): undefined reference to `hnae3_register_client'

Also if INFINIBAND_HNS_HIP06 is selected and HNS_DSAF
is m, but INFINIBAND_HNS is y, building fails:

drivers/infiniband/hw/hns/hns_roce_hw_v1.o: In function `hns_roce_v1_reset':
hns_roce_hw_v1.c:(.text+0x39fa): undefined reference to `hns_dsaf_roce_reset'
hns_roce_hw_v1.c:(.text+0x3a25): undefined reference to `hns_dsaf_roce_reset'

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: dd74282df573 ("RDMA/hns: Initialize the PCI device for hip08 RoCE")
Fixes: 08805fdbeb2d ("RDMA/hns: Split hw v1 driver from hns roce driver")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20190724065443.53068-1-yuehaibing@huawei.com
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoIB/mlx5: Support per device q counters in switchdev mode
Parav Pandit [Tue, 23 Jul 2019 07:31:17 +0000 (10:31 +0300)]
IB/mlx5: Support per device q counters in switchdev mode

When parent mlx5_core_dev is in switchdev mode, q_counters are not
applicable to multiple non uplink vports.
Hence, have make them limited to device level.

While at it, correct __mlx5_ib_qp_set_counter() and
__mlx5_ib_modify_qp() to use u16 set_id as defined by the device.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190723073117.7175-3-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoIB/mlx5: Refactor code for counters allocation
Parav Pandit [Tue, 23 Jul 2019 07:31:16 +0000 (10:31 +0300)]
IB/mlx5: Refactor code for counters allocation

To support per device counters in switchdev mode (instead of
per port counter), refactor query routines to work on mlx5_ib_counter
structure instead of mlx5_ib_port structure.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Link: https://lore.kernel.org/r/20190723073117.7175-2-leon@kernel.org
Signed-off-by: Doug Ledford <dledford@redhat.com>
6 years agoqed*: Change dpi_addr to be denoted with __iomem
Michal Kalderon [Tue, 9 Jul 2019 14:17:33 +0000 (17:17 +0300)]
qed*: Change dpi_addr to be denoted with __iomem

Several casts were required around dpi_addr parameter in qed_rdma_if.h
This is an address on the doorbell bar and should therefore be marked with
__iomem.

Link: https://lore.kernel.org/r/20190709141735.19193-5-michal.kalderon@marvell.com
Reported-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Add CREATE_PSV/DESTROY_PSV for devx interface
Max Gurtovoy [Tue, 23 Jul 2019 07:04:12 +0000 (10:04 +0300)]
IB/mlx5: Add CREATE_PSV/DESTROY_PSV for devx interface

Limit the number of PSV's created through devx to 1, to create a symmetry
between create/destroy cmds. In the kernel, one can create up to 4 PSV's
using CREATE_PSV cmd but the destruction is one by one. Add a protection
for this a-symmetric definition for devx.

Link: https://lore.kernel.org/r/20190723070412.6385-1-leon@kernel.org
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/core: Support netlink commands in non init_net net namespaces
Parav Pandit [Tue, 23 Jul 2019 07:02:05 +0000 (10:02 +0300)]
RDMA/core: Support netlink commands in non init_net net namespaces

Now that IB core supports RDMA device binding with specific net namespace,
enable IB core to accept netlink commands in non init_net namespaces.

This is done by having per net namespace netlink socket.

At present only netlink device handling client RDMA_NL_NLDEV supports
device handling in multiple net namespaces.  Hence do not accept netlink
messages for other clients in non init_net net namespaces.

Link: https://lore.kernel.org/r/20190723070205.6247-1-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/mlx4: Annotate boolean arguments as bool and not int
Leon Romanovsky [Thu, 4 Jul 2019 13:09:36 +0000 (16:09 +0300)]
RDMA/mlx4: Annotate boolean arguments as bool and not int

Information provided by qp_has_rq() and used latter is boolean, so update
callers to proper type.

Link: https://lore.kernel.org/r/20190704130936.8705-3-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/mlx4: Separate creation of RWQ and QP
Leon Romanovsky [Thu, 4 Jul 2019 13:09:35 +0000 (16:09 +0300)]
RDMA/mlx4: Separate creation of RWQ and QP

The mlx4 WQ is implemented with HW QP without special HW object.  Current
implementation which tried to reuse the code did it with common QP
creation flows. Such decision caused to the absence of mlx4_ib_wq struct,
which is needed to ensure proper allocation of ib_wq inside of IB/core.

Separate create_qp_common() to pure QP flow and to create_rq() for RWQ.

Link: https://lore.kernel.org/r/20190704130936.8705-2-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/usnic: Use dev_get_drvdata
Chuhong Yuan [Tue, 23 Jul 2019 11:49:28 +0000 (19:49 +0800)]
IB/usnic: Use dev_get_drvdata

Instead of using to_pci_dev + pci_get_drvdata, use dev_get_drvdata to make
the code simpler.

Link: https://lore.kernel.org/r/20190723114928.18424-1-hslester96@gmail.com
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA: Make most headers compile stand alone
Jason Gunthorpe [Mon, 22 Jul 2019 17:01:30 +0000 (17:01 +0000)]
RDMA: Make most headers compile stand alone

So that rdma can work with CONFIG_KERNEL_HEADER_TEST and
CONFIG_HEADERS_CHECK.

Link: https://lore.kernel.org/r/20190722170126.GA16453@ziepe.ca
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/qedr: Remove Unneeded variable rc
Hariprasad Kelam [Tue, 16 Jul 2019 17:37:12 +0000 (23:07 +0530)]
RDMA/qedr: Remove Unneeded variable rc

Fix the below warning reported by coccicheck:

drivers/infiniband/hw/qedr/verbs.c:2454:5-7: Unneeded variable: "rc".
Return "0" on line 2499

Link: https://lore.kernel.org/r/20190716173712.GA12949@hari-Inspiron-1545
Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Acked-by: Michal Kalderon <michal.kalderon@marvell.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/qib: Unneeded variable ret
Hariprasad Kelam [Tue, 16 Jul 2019 17:29:25 +0000 (22:59 +0530)]
RDMA/qib: Unneeded variable ret

Fix the below warning reported by coccicheck:

drivers/infiniband/hw/qib/qib_file_ops.c:1792:5-8: Unneeded variable:
"ret". Return "0" on line 1876

Link: https://lore.kernel.org/r/20190716172924.GA12241@hari-Inspiron-1545
Signed-off-by: Hariprasad Kelam <hariprasad.kelam@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Refactor eq table init for hip08
Yixian Liu [Mon, 8 Jul 2019 13:41:25 +0000 (21:41 +0800)]
RDMA/hns: Refactor eq table init for hip08

To make the code more readable, move the part of naming irq and request
irq out of eq table init into a separate function.

Link: https://lore.kernel.org/r/1562593285-8037-10-git-send-email-oulijun@huawei.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Refactor hem table mhop check and calculation
Lijun Ou [Mon, 8 Jul 2019 13:41:24 +0000 (21:41 +0800)]
RDMA/hns: Refactor hem table mhop check and calculation

The calculation of mhop for hem is duplicated in hns_roce_init_hem_table
and hns_roce_calc_hem_mhop, extracting it from them to a separate
function. Moreover, this patch refactors hns_roce_check_whether_mhop to
reduce complexity.

Link: https://lore.kernel.org/r/1562593285-8037-9-git-send-email-oulijun@huawei.com
Signed-off-by: Yixian Liu <liuyixian@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Package for hns_roce_rereg_user_mr function
Lijun Ou [Mon, 8 Jul 2019 13:41:23 +0000 (21:41 +0800)]
RDMA/hns: Package for hns_roce_rereg_user_mr function

Move some code of the hns_roce_rereg_user_mr() function into an
independent function in oder to improve readability.

Link: https://lore.kernel.org/r/1562593285-8037-8-git-send-email-oulijun@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Optimize hns_roce_mhop_alloc function.
chenglang [Mon, 8 Jul 2019 13:41:22 +0000 (21:41 +0800)]
RDMA/hns: Optimize hns_roce_mhop_alloc function.

Move some lines for allocating multi-hop addressing into independent
functions in order to improve readability.

Link: https://lore.kernel.org/r/1562593285-8037-7-git-send-email-oulijun@huawei.com
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: optimize the duplicated code for qpc setting flow
Xi Wang [Mon, 8 Jul 2019 13:41:21 +0000 (21:41 +0800)]
RDMA/hns: optimize the duplicated code for qpc setting flow

Currently, more than 20 lines of duplicate code exist in function
'modify_qp_init_to_init' and function 'modify_qp_reset_to_init', which
affects the readability of the code. Consolidate them.

Link: https://lore.kernel.org/r/1562593285-8037-6-git-send-email-oulijun@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Use a separated function for setting extend sge paramters
Lijun Ou [Mon, 8 Jul 2019 13:41:20 +0000 (21:41 +0800)]
RDMA/hns: Use a separated function for setting extend sge paramters

Moves the related lines of setting extended sge size into a separate
function as well as remove the unused variables.

Link: https://lore.kernel.org/r/1562593285-8037-5-git-send-email-oulijun@huawei.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Refactor for hns_roce_v2_modify_qp function
Lijun Ou [Mon, 8 Jul 2019 13:41:19 +0000 (21:41 +0800)]
RDMA/hns: Refactor for hns_roce_v2_modify_qp function

Move some lines which exist hns_roce_v2_modify_qp function into a new
function. The code refactored mainly includes some absolute fields of qp
context and some optional fields of qp context.

Link: https://lore.kernel.org/r/1562593285-8037-4-git-send-email-oulijun@huawei.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Refactor the code of creating srq
Lijun Ou [Mon, 8 Jul 2019 13:41:18 +0000 (21:41 +0800)]
RDMA/hns: Refactor the code of creating srq

Move the related codes of creating user srq and kernel srq into two
independent functions as well as remove some unused code and
simplifications.

Link: https://lore.kernel.org/r/1562593285-8037-3-git-send-email-oulijun@huawei.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Package the flow of creating cq
Lijun Ou [Mon, 8 Jul 2019 13:41:17 +0000 (21:41 +0800)]
RDMA/hns: Package the flow of creating cq

Moves the related code of creating cq into separate functions in order to
improve comprehensibility.

Link: https://lore.kernel.org/r/1562593285-8037-2-git-send-email-oulijun@huawei.com
Signed-off-by: Lijun Ou <oulijun@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Avoid unnecessary typecast
Parav Pandit [Tue, 23 Jul 2019 06:57:31 +0000 (09:57 +0300)]
IB/mlx5: Avoid unnecessary typecast

IB device pointer is already available while deallocating IB device,
Hence do not typecast it.

Link: https://lore.kernel.org/r/20190723065733.4899-9-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Daniel Jurgens <danielj@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/core: Annotate destroy of mutex to ensure that it is released as unlocked
Parav Pandit [Tue, 23 Jul 2019 06:57:24 +0000 (09:57 +0300)]
RDMA/core: Annotate destroy of mutex to ensure that it is released as unlocked

While compiled with CONFIG_DEBUG_MUTEXES, the kernel ensures that mutex is
not held during destroy. Hence add mutex_destroy() for mutexes used in
RDMA modules.

Link: https://lore.kernel.org/r/20190723065733.4899-2-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Fix RSS Toeplitz setup to be aligned with the HW specification
Yishai Hadas [Tue, 23 Jul 2019 06:57:29 +0000 (09:57 +0300)]
IB/mlx5: Fix RSS Toeplitz setup to be aligned with the HW specification

The specification for the Toeplitz function doesn't require to set the key
explicitly to be symmetric. In case a symmetric functionality is required
a symmetric key can be simply used.

Wrongly forcing the algorithm to symmetric causes the wrong packet
distribution and a performance degradation.

Link: https://lore.kernel.org/r/20190723065733.4899-7-leon@kernel.org
Cc: <stable@vger.kernel.org> # 4.7
Fixes: 28d6137008b2 ("IB/mlx5: Add RSS QP support")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Alex Vainman <alexv@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/counters: Always initialize the port counter object
Parav Pandit [Tue, 23 Jul 2019 06:57:33 +0000 (09:57 +0300)]
IB/counters: Always initialize the port counter object

Port counter objects should be initialized even if alloc_stats is
unsupported, otherwise QP bind operations in user space can trigger a NULL
pointer deference if they try to bind QP on RDMA device which doesn't
support counters.

Fixes: f34a55e497e8 ("RDMA/core: Get sum value of all counters when perform a sysfs stat read")
Link: https://lore.kernel.org/r/20190723065733.4899-11-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/core: Fix querying total rdma stats
Parav Pandit [Tue, 23 Jul 2019 06:57:32 +0000 (09:57 +0300)]
IB/core: Fix querying total rdma stats

rdma_counter_init() may fail for a device. In such case while calculating
total sum, ignore NULL hstats.

This fixes below observed call trace.

BUG: kernel NULL pointer dereference, address: 00000000000000a0
PGD 8000001009b30067 P4D 8000001009b30067 PUD 10549c9067 PMD 0
Oops: 0000 [#1] SMP PTI
CPU: 55 PID: 20887 Comm: cat Kdump: loaded Not tainted 5.2.0-rc6-jdc+ #13
RIP: 0010:rdma_counter_get_hwstat_value+0xf2/0x150 [ib_core]
Call Trace:
 show_hw_stats+0x5e/0x130 [ib_core]
 dev_attr_show+0x15/0x50
 sysfs_kf_seq_show+0xc6/0x1a0
 seq_read+0x132/0x370
 vfs_read+0x89/0x140
 ksys_read+0x5c/0xd0
 do_syscall_64+0x5a/0x240
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: f34a55e497e8 ("RDMA/core: Get sum value of all counters when perform a sysfs stat read")
Link: https://lore.kernel.org/r/20190723065733.4899-10-leon@kernel.org
Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Mark Zhang <markz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Prevent concurrent MR updates during invalidation
Moni Shoua [Tue, 23 Jul 2019 06:57:30 +0000 (09:57 +0300)]
IB/mlx5: Prevent concurrent MR updates during invalidation

The device requires that memory registration work requests that update the
address translation table of a MR will be fenced if posted together.  This
scenario can happen when address ranges are invalidated by the mmu in
separate concurrent calls to the invalidation callback.

We prefer to block concurrent address updates for a single MR over fencing
since making the decision if a WQE needs fencing will be more expensive
and fencing all WQEs is a too radical choice.

Further, it isn't clear that this code can even run safely concurrently,
so a lock is a safer choice.

Fixes: b4cfe447d47b ("IB/mlx5: Implement on demand paging by adding support for MMU notifiers")
Link: https://lore.kernel.org/r/20190723065733.4899-8-leon@kernel.org
Signed-off-by: Moni Shoua <monis@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Fix clean_mr() to work in the expected order
Yishai Hadas [Tue, 23 Jul 2019 06:57:28 +0000 (09:57 +0300)]
IB/mlx5: Fix clean_mr() to work in the expected order

Any dma map underlying the MR should only be freed once the MR is fenced
at the hardware.

As of the above we first destroy the MKEY and just after that can safely
call to dma_unmap_single().

Link: https://lore.kernel.org/r/20190723065733.4899-6-leon@kernel.org
Cc: <stable@vger.kernel.org> # 4.3
Fixes: 8a187ee52b04 ("IB/mlx5: Support the new memory registration API")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Move MRs to a kernel PD when freeing them to the MR cache
Yishai Hadas [Tue, 23 Jul 2019 06:57:27 +0000 (09:57 +0300)]
IB/mlx5: Move MRs to a kernel PD when freeing them to the MR cache

Fix unreg_umr to move the MR to a kernel owned PD (i.e. the UMR PD) which
can't be accessed by userspace.

This ensures that nothing can continue to access the MR once it has been
placed in the kernels cache for reuse.

MRs in the cache continue to have their HW state, including DMA tables,
present. Even though the MR has been invalidated, changing the PD provides
an additional layer of protection against use of the MR.

Link: https://lore.kernel.org/r/20190723065733.4899-5-leon@kernel.org
Cc: <stable@vger.kernel.org> # 3.10
Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Use direct mkey destroy command upon UMR unreg failure
Yishai Hadas [Tue, 23 Jul 2019 06:57:26 +0000 (09:57 +0300)]
IB/mlx5: Use direct mkey destroy command upon UMR unreg failure

Use a direct firmware command to destroy the mkey in case the unreg UMR
operation has failed.

This prevents a case that a mkey will leak out from the cache post a
failure to be destroyed by a UMR WR.

In case the MR cache limit didn't reach a call to add another entry to the
cache instead of the destroyed one is issued.

In addition, replaced a warn message to WARN_ON() as this flow is fatal
and can't happen unless some bug around.

Link: https://lore.kernel.org/r/20190723065733.4899-4-leon@kernel.org
Cc: <stable@vger.kernel.org> # 4.10
Fixes: 49780d42dfc9 ("IB/mlx5: Expose MR cache for mlx5_ib")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Fix unreg_umr to ignore the mkey state
Yishai Hadas [Tue, 23 Jul 2019 06:57:25 +0000 (09:57 +0300)]
IB/mlx5: Fix unreg_umr to ignore the mkey state

Fix unreg_umr to ignore the mkey state and do not fail if was freed.  This
prevents a case that a user space application already changed the mkey
state to free and then the UMR operation will fail leaving the mkey in an
inappropriate state.

Link: https://lore.kernel.org/r/20190723065733.4899-3-leon@kernel.org
Cc: <stable@vger.kernel.org> # 3.19
Fixes: 968e78dd9644 ("IB/mlx5: Enhance UMR support to allow partial page table update")
Signed-off-by: Yishai Hadas <yishaih@mellanox.com>
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Fix comparison of unsigned long variable 'end' with less than zero
Colin Ian King [Fri, 31 May 2019 09:21:00 +0000 (10:21 +0100)]
RDMA/hns: Fix comparison of unsigned long variable 'end' with less than zero

Currently the comparison of end with less than zero is always false
because end is an unsigned long.  Also, replace checks of end with
non-zero with end > 0 as it is possible that the #defined decrement may be
changed in the future causing end to step over zero and go negative.

The initialization of end with 0 is also redundant as this value is never
read and is later set to HW_SYNC_TIMEOUT_MSECS, so fix this by
initializing it with this value to begin with.

Link: https://lore.kernel.org/r/20190531092101.28772-1-colin.king@canonical.com
Addresses-Coverity: ("Unsigned compared against 0")
Fixes: 669cefb654cb ("RDMA/hns: Remove jiffies operation in disable interrupt context")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/mlx4: Untag user pointers in mlx4_get_umem_mr
Andrey Konovalov [Tue, 23 Jul 2019 17:58:48 +0000 (19:58 +0200)]
RDMA/mlx4: Untag user pointers in mlx4_get_umem_mr

This patch is a part of a series that extends kernel ABI to allow to pass
tagged user pointers (with the top byte set to something else other than
0x00) as syscall arguments.

mlx4_get_umem_mr() uses provided user pointers for vma lookups, which can
only by done with untagged pointers.

Untag user pointers in this function.

Link: https://lore.kernel.org/r/7969018013a67ddbbf784ac7afeea5a57b1e2bcb.1563904656.git.andreyknvl@google.com
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Remove unused define
Mike Marciniszyn [Mon, 15 Jul 2019 16:45:52 +0000 (12:45 -0400)]
IB/hfi1: Remove unused define

The patch noted in Fixes missed deleting the define it obsoleted.

Fixes: da9de5f8527f ("IB/hfi1: Close PSM sdma_progress sleep window")
Link: https://lore.kernel.org/r/20190715164552.74174.99396.stgit@awfm-01.aw.intel.com
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Do not update hcrc for a KDETH packet during fault injection
Kaike Wan [Mon, 15 Jul 2019 16:45:46 +0000 (12:45 -0400)]
IB/hfi1: Do not update hcrc for a KDETH packet during fault injection

When a KDETH packet is subject to fault injection during transmission,
HCRC is supposed to be omitted from the packet so that the hardware on the
receiver side would drop the packet. When creating pbc, the PbcInsertHcrc
field is set to be PBC_IHCRC_NONE if the KDETH packet is subject to fault
injection, but overwritten with PBC_IHCRC_LKDETH when update_hcrc() is
called later.

This problem is fixed by not calling update_hcrc() when the packet is
subject to fault injection.

Fixes: 6b6cf9357f78 ("IB/hfi1: Set PbcInsertHcrc for TID RDMA packets")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20190715164546.74174.99296.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/siw: Remove set but not used variables 'rv'
Mao Wenan [Fri, 19 Jul 2019 01:29:38 +0000 (09:29 +0800)]
RDMA/siw: Remove set but not used variables 'rv'

Fixes gcc '-Wunused-but-set-variable' warning:

drivers/infiniband/sw/siw/siw_cm.c: In function siw_cep_set_inuse:
drivers/infiniband/sw/siw/siw_cm.c:223:6: warning: variable rv set but not used [-Wunused-but-set-variable]

Fixes: 6c52fdc244b5 ("rdma/siw: connection management")
Link: https://lore.kernel.org/r/20190719012938.100628-1-maowenan@huawei.com
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/mlx5: Replace kfree with kvfree
Chuhong Yuan [Wed, 17 Jul 2019 08:21:01 +0000 (16:21 +0800)]
IB/mlx5: Replace kfree with kvfree

Memory allocated by kvzalloc should not be freed by kfree(), use kvfree()
instead.

Fixes: 813e90b1aeaa ("IB/mlx5: Add advise_mr() support")
Link: https://lore.kernel.org/r/20190717082101.14196-1-hslester96@gmail.com
Signed-off-by: Chuhong Yuan <hslester96@gmail.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/bnxt_re: Honor vlan_id in GID entry comparison
Selvin Xavier [Mon, 15 Jul 2019 09:19:13 +0000 (05:19 -0400)]
RDMA/bnxt_re: Honor vlan_id in GID entry comparison

A GID entry consists of GID, vlan, netdev and smac.  Extend GID duplicate
check comparisons to consider vlan_id as well to support IPv6 VLAN based
link local addresses. Introduce a new structure (bnxt_qplib_gid_info) to
hold gid and vlan_id information.

The issue is discussed in the following thread
https://lore.kernel.org/r/AM0PR05MB4866CFEDCDF3CDA1D7D18AA5D1F20@AM0PR05MB4866.eurprd05.prod.outlook.com

Fixes: 823b23da7113 ("IB/core: Allow vlan link local address based RoCE GIDs")
Cc: <stable@vger.kernel.org> # v5.2+
Link: https://lore.kernel.org/r/20190715091913.15726-1-selvin.xavier@broadcom.com
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Co-developed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Selvin Xavier <selvin.xavier@broadcom.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Drop all TID RDMA READ RESP packets after r_next_psn
Kaike Wan [Mon, 15 Jul 2019 16:45:40 +0000 (12:45 -0400)]
IB/hfi1: Drop all TID RDMA READ RESP packets after r_next_psn

When a TID sequence error occurs while receiving TID RDMA READ RESP
packets, all packets after flow->flow_state.r_next_psn should be dropped,
including those response packets for subsequent segments.

The current implementation will drop the subsequent response packets for
the segment to complete next, but may accept packets for subsequent
segments and therefore mistakenly advance the r_next_psn fields for the
corresponding software flows. This may result in failures to complete
subsequent segments after the current segment is completed.

The fix is to only use the flow pointed by req->clear_tail for checking
KDETH PSN instead of finding a flow from the request's flow array.

Fixes: b885d5be9ca1 ("IB/hfi1: Unify the software PSN check for TID RDMA READ/WRITE")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20190715164540.74174.54702.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Field not zero-ed when allocating TID flow memory
Kaike Wan [Mon, 15 Jul 2019 16:45:34 +0000 (12:45 -0400)]
IB/hfi1: Field not zero-ed when allocating TID flow memory

The field flow->resync_npkts is added for TID RDMA WRITE request and
zero-ed when a TID RDMA WRITE RESP packet is received by the requester.
This field is used to rewind a request during retry in the function
hfi1_tid_rdma_restart_req() shared by both TID RDMA WRITE and TID RDMA
READ requests. Therefore, when a TID RDMA READ request is retried, this
field may not be initialized at all, which causes the retry to start at an
incorrect psn, leading to the drop of the retry request by the responder.

This patch fixes the problem by zeroing out the field when the flow memory
is allocated.

Fixes: 838b6fd2d9ca ("IB/hfi1: TID RDMA RcvArray programming and TID allocation")
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20190715164534.74174.6177.stgit@awfm-01.aw.intel.com
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Unreserve a flushed OPFN request
Kaike Wan [Mon, 15 Jul 2019 16:45:28 +0000 (12:45 -0400)]
IB/hfi1: Unreserve a flushed OPFN request

When an OPFN request is flushed, the request is completed without
unreserving itself from the send queue. Subsequently, when a new
request is post sent, the following warning will be triggered:

WARNING: CPU: 4 PID: 8130 at rdmavt/qp.c:1761 rvt_post_send+0x72a/0x880 [rdmavt]
Call Trace:
[<ffffffffbbb61e41>] dump_stack+0x19/0x1b
[<ffffffffbb497688>] __warn+0xd8/0x100
[<ffffffffbb4977cd>] warn_slowpath_null+0x1d/0x20
[<ffffffffc01c941a>] rvt_post_send+0x72a/0x880 [rdmavt]
[<ffffffffbb4dcabe>] ? account_entity_dequeue+0xae/0xd0
[<ffffffffbb61d645>] ? __kmalloc+0x55/0x230
[<ffffffffc04e1a4c>] ib_uverbs_post_send+0x37c/0x5d0 [ib_uverbs]
[<ffffffffc04e5e36>] ? rdma_lookup_put_uobject+0x26/0x60 [ib_uverbs]
[<ffffffffc04dbce6>] ib_uverbs_write+0x286/0x460 [ib_uverbs]
[<ffffffffbb6f9457>] ? security_file_permission+0x27/0xa0
[<ffffffffbb641650>] vfs_write+0xc0/0x1f0
[<ffffffffbb64246f>] SyS_write+0x7f/0xf0
[<ffffffffbbb74ddb>] system_call_fastpath+0x22/0x27

This patch fixes the problem by moving rvt_qp_wqe_unreserve() into
rvt_qp_complete_swqe() to simplify the code and make it less
error-prone.

Fixes: ca95f802ef51 ("IB/hfi1: Unreserve a reserved request when it is completed")
Link: https://lore.kernel.org/r/20190715164528.74174.31364.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoIB/hfi1: Check for error on call to alloc_rsm_map_table
John Fleck [Mon, 15 Jul 2019 16:45:21 +0000 (12:45 -0400)]
IB/hfi1: Check for error on call to alloc_rsm_map_table

The call to alloc_rsm_map_table does not check if the kmalloc fails.
Check for a NULL on alloc, and bail if it fails.

Fixes: 372cc85a13c9 ("IB/hfi1: Extract RSM map table init from QOS")
Link: https://lore.kernel.org/r/20190715164521.74174.27047.stgit@awfm-01.aw.intel.com
Cc: <stable@vger.kernel.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: John Fleck <john.fleck@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/hns: Fix sg offset non-zero issue
Xi Wang [Thu, 11 Jul 2019 01:32:17 +0000 (09:32 +0800)]
RDMA/hns: Fix sg offset non-zero issue

When run perftest in many times, the system will report a BUG as follows:

   BUG: Bad rss-counter state mm:(____ptrval____) idx:0 val:-1
   BUG: Bad rss-counter state mm:(____ptrval____) idx:1 val:1

We tested with different kernel version and found it started from the the
following commit:

commit d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in
SGEs")

In this commit, the sg->offset is always 0 when sg_set_page() is called in
ib_umem_get() and the drivers are not allowed to change the sgl, otherwise
it will get bad page descriptor when unfolding SGEs in __ib_umem_release()
as sg_page_count() will get wrong result while sgl->offset is not 0.

However, there is a weird sgl usage in the current hns driver, the driver
modified sg->offset after calling ib_umem_get(), which caused we iterate
past the wrong number of pages in for_each_sg_page iterator.

This patch fixes it by correcting the non-standard sgl usage found in the
hns_roce_db_map_user() function.

Fixes: d10bcf947a3e ("RDMA/umem: Combine contiguous PAGE_SIZE regions in SGEs")
Fixes: 0425e3e6e0c7 ("RDMA/hns: Support flush cqe for hip08 in kernel space")
Link: https://lore.kernel.org/r/1562808737-45723-1-git-send-email-oulijun@huawei.com
Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoRDMA/siw: Fix error return code in siw_init_module()
Wei Yongjun [Thu, 18 Jul 2019 09:27:10 +0000 (09:27 +0000)]
RDMA/siw: Fix error return code in siw_init_module()

Fix to return a negative error code from the error handling
case instead of 0, as done elsewhere in this function.

Fixes: bdcf26bf9b3a ("rdma/siw: network and RDMA core interface")
Link: https://lore.kernel.org/r/20190718092710.85709-1-weiyongjun1@huawei.com
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
6 years agoLinus 5.3-rc1 v5.3-rc1
Linus Torvalds [Sun, 21 Jul 2019 21:05:38 +0000 (14:05 -0700)]
Linus 5.3-rc1

6 years agoMerge tag 'devicetree-fixes-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sun, 21 Jul 2019 17:28:39 +0000 (10:28 -0700)]
Merge tag 'devicetree-fixes-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull Devicetree fixes from Rob Herring:
 "Fix several warnings/errors in validation of binding schemas"

* tag 'devicetree-fixes-for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: pinctrl: stm32: Fix missing 'clocks' property in examples
  dt-bindings: iio: ad7124: Fix dtc warnings in example
  dt-bindings: iio: avia-hx711: Fix avdd-supply typo in example
  dt-bindings: pinctrl: aspeed: Fix AST2500 example errors
  dt-bindings: pinctrl: aspeed: Fix 'compatible' schema errors
  dt-bindings: riscv: Limit cpus schema to only check RiscV 'cpu' nodes
  dt-bindings: Ensure child nodes are of type 'object'

6 years agoMerge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sun, 21 Jul 2019 17:09:43 +0000 (10:09 -0700)]
Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull vfs documentation typo fix from Al Viro.

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  typo fix: it's d_make_root, not d_make_inode...

6 years agoMerge tag '5.3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Linus Torvalds [Sun, 21 Jul 2019 17:01:17 +0000 (10:01 -0700)]
Merge tag '5.3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:
 "Two fixes for stable, one that had dependency on earlier patch in this
  merge window and can now go in, and a perf improvement in SMB3 open"

* tag '5.3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: update internal module number
  cifs: flush before set-info if we have writeable handles
  smb3: optimize open to not send query file internal info
  cifs: copy_file_range needs to strip setuid bits and update timestamps
  CIFS: fix deadlock in cached root handling

6 years agoiommu/amd: fix a crash in iova_magazine_free_pfns
Qian Cai [Thu, 11 Jul 2019 16:17:45 +0000 (12:17 -0400)]
iommu/amd: fix a crash in iova_magazine_free_pfns

The commit b3aa14f02254 ("iommu: remove the mapping_error dma_map_ops
method") incorrectly changed the checking from dma_ops_alloc_iova() in
map_sg() causes a crash under memory pressure as dma_ops_alloc_iova()
never return DMA_MAPPING_ERROR on failure but 0, so the error handling
is all wrong.

   kernel BUG at drivers/iommu/iova.c:801!
    Workqueue: kblockd blk_mq_run_work_fn
    RIP: 0010:iova_magazine_free_pfns+0x7d/0xc0
    Call Trace:
     free_cpu_cached_iovas+0xbd/0x150
     alloc_iova_fast+0x8c/0xba
     dma_ops_alloc_iova.isra.6+0x65/0xa0
     map_sg+0x8c/0x2a0
     scsi_dma_map+0xc6/0x160
     pqi_aio_submit_io+0x1f6/0x440 [smartpqi]
     pqi_scsi_queue_command+0x90c/0xdd0 [smartpqi]
     scsi_queue_rq+0x79c/0x1200
     blk_mq_dispatch_rq_list+0x4dc/0xb70
     blk_mq_sched_dispatch_requests+0x249/0x310
     __blk_mq_run_hw_queue+0x128/0x200
     blk_mq_run_work_fn+0x27/0x30
     process_one_work+0x522/0xa10
     worker_thread+0x63/0x5b0
     kthread+0x1d2/0x1f0
     ret_from_fork+0x22/0x40

Fixes: b3aa14f02254 ("iommu: remove the mapping_error dma_map_ops method")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 years agohexagon: switch to generic version of pte allocation
Mike Rapoport [Tue, 30 Apr 2019 14:27:50 +0000 (17:27 +0300)]
hexagon: switch to generic version of pte allocation

The hexagon implementation pte_alloc_one(), pte_alloc_one_kernel(),
pte_free_kernel() and pte_free() is identical to the generic except of
lack of __GFP_ACCOUNT for the user PTEs allocation.

Switch hexagon to use generic version of these functions.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
6 years agoMerge tag 'ntb-5.3' of git://github.com/jonmason/ntb
Linus Torvalds [Sun, 21 Jul 2019 16:46:59 +0000 (09:46 -0700)]
Merge tag 'ntb-5.3' of git://github.com/jonmason/ntb

Pull NTB updates from Jon Mason:
 "New feature to add support for NTB virtual MSI interrupts, the ability
  to test and use this feature in the NTB transport layer.

  Also, bug fixes for the AMD and Switchtec drivers, as well as some
  general patches"

* tag 'ntb-5.3' of git://github.com/jonmason/ntb: (22 commits)
  NTB: Describe the ntb_msi_test client in the documentation.
  NTB: Add MSI interrupt support to ntb_transport
  NTB: Add ntb_msi_test support to ntb_test
  NTB: Introduce NTB MSI Test Client
  NTB: Introduce MSI library
  NTB: Rename ntb.c to support multiple source files in the module
  NTB: Introduce functions to calculate multi-port resource index
  NTB: Introduce helper functions to calculate logical port number
  PCI/switchtec: Add module parameter to request more interrupts
  PCI/MSI: Support allocating virtual MSI interrupts
  ntb_hw_switchtec: Fix setup MW with failure bug
  ntb_hw_switchtec: Skip unnecessary re-setup of shared memory window for crosslink case
  ntb_hw_switchtec: Remove redundant steps of switchtec_ntb_reinit_peer() function
  NTB: correct ntb_dev_ops and ntb_dev comment typos
  NTB: amd: Silence shift wrapping warning in amd_ntb_db_vector_mask()
  ntb_hw_switchtec: potential shift wrapping bug in switchtec_ntb_init_sndev()
  NTB: ntb_transport: Ensure qp->tx_mw_dma_addr is initaliazed
  NTB: ntb_hw_amd: set peer limit register
  NTB: ntb_perf: Clear stale values in doorbell and command SPAD register
  NTB: ntb_perf: Disable NTB link after clearing peer XLAT registers
  ...

6 years agotypo fix: it's d_make_root, not d_make_inode...
Al Viro [Sun, 21 Jul 2019 03:17:30 +0000 (23:17 -0400)]
typo fix: it's d_make_root, not d_make_inode...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
6 years agodt-bindings: pinctrl: stm32: Fix missing 'clocks' property in examples
Rob Herring [Tue, 16 Jul 2019 21:34:40 +0000 (15:34 -0600)]
dt-bindings: pinctrl: stm32: Fix missing 'clocks' property in examples

Now that examples are validated against the DT schema, an error with
required 'clocks' property missing is exposed:

Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.example.dt.yaml: \
pinctrl@40020000: gpio@0: 'clocks' is a required property
Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.example.dt.yaml: \
pinctrl@50020000: gpio@1000: 'clocks' is a required property
Documentation/devicetree/bindings/pinctrl/st,stm32-pinctrl.example.dt.yaml: \
pinctrl@50020000: gpio@2000: 'clocks' is a required property

Add the missing 'clocks' properties to the examples to fix the errors.

Fixes: 2c9239c125f0 ("dt-bindings: pinctrl: Convert stm32 pinctrl bindings to json-schema")
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: linux-gpio@vger.kernel.org
Cc: linux-stm32@st-md-mailman.stormreply.com
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Rob Herring <robh@kernel.org>
6 years agodt-bindings: iio: ad7124: Fix dtc warnings in example
Rob Herring [Tue, 16 Jul 2019 20:21:56 +0000 (14:21 -0600)]
dt-bindings: iio: ad7124: Fix dtc warnings in example

With the conversion to DT schema, the examples are now compiled with
dtc. The ad7124 binding example has the following warning:

Documentation/devicetree/bindings/iio/adc/adi,ad7124.example.dts:19.11-21: \
Warning (reg_format): /example-0/adc@0:reg: property has invalid length (4 bytes) (#address-cells == 1, #size-cells == 1)

There's a default #size-cells and #address-cells values of 1 for
examples. For examples needing different values such as this one on a
SPI bus, they need to provide a SPI bus parent node.

Fixes: 26ae15e62d3c ("Convert AD7124 bindings documentation to YAML format.")
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: linux-iio@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
6 years agodt-bindings: iio: avia-hx711: Fix avdd-supply typo in example
Rob Herring [Tue, 16 Jul 2019 20:13:29 +0000 (14:13 -0600)]
dt-bindings: iio: avia-hx711: Fix avdd-supply typo in example

Now that examples are validated against the DT schema, a typo in
avia-hx711 example generates a warning:

Documentation/devicetree/bindings/iio/adc/avia-hx711.example.dt.yaml: weight: 'avdd-supply' is a required property

Fix the typo.

Fixes: 5150ec3fe125 ("avia-hx711.yaml: transform DT binding to YAML")
Cc: Andreas Klinger <ak@it-klinger.de>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: linux-iio@vger.kernel.org
Signed-off-by: Rob Herring <robh@kernel.org>
6 years agodt-bindings: pinctrl: aspeed: Fix AST2500 example errors
Rob Herring [Mon, 15 Jul 2019 22:48:41 +0000 (16:48 -0600)]
dt-bindings: pinctrl: aspeed: Fix AST2500 example errors

The schema examples are now validated against the schema itself. The
AST2500 pinctrl schema has a couple of errors:

Documentation/devicetree/bindings/pinctrl/aspeed,ast2500-pinctrl.example.dt.yaml: \
example-0: $nodename:0: 'example-0' does not match '^(bus|soc|axi|ahb|apb)(@[0-9a-f]+)?$'
Documentation/devicetree/bindings/pinctrl/aspeed,ast2500-pinctrl.example.dt.yaml: \
pinctrl: aspeed,external-nodes: [[1, 2]] is too short

Fixes: 0a617de16730 ("dt-bindings: pinctrl: aspeed: Convert AST2500 bindings to json-schema")
Cc: Andrew Jeffery <andrew@aj.id.au>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Joel Stanley <joel@jms.id.au>
Cc: linux-aspeed@lists.ozlabs.org
Cc: linux-gpio@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Acked-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Rob Herring <robh@kernel.org>
6 years agodt-bindings: pinctrl: aspeed: Fix 'compatible' schema errors
Rob Herring [Mon, 15 Jul 2019 22:37:25 +0000 (16:37 -0600)]
dt-bindings: pinctrl: aspeed: Fix 'compatible' schema errors

The Aspeed pinctl schema have errors in the 'compatible' schema:

Documentation/devicetree/bindings/pinctrl/aspeed,ast2400-pinctrl.yaml: \
properties:compatible:enum: ['aspeed', 'ast2400-pinctrl', 'aspeed', 'g4-pinctrl'] has non-unique elements
Documentation/devicetree/bindings/pinctrl/aspeed,ast2500-pinctrl.yaml: \
properties:compatible:enum: ['aspeed', 'ast2500-pinctrl', 'aspeed', 'g5-pinctrl'] has non-unique elements

Flow style sequences have to be quoted if the vales contain ','. Fix
this by using the more common one line per entry formatting.

Fixes: 0a617de16730 ("dt-bindings: pinctrl: aspeed: Convert AST2500 bindings to json-schema")
Fixes: 07457937bb5c ("dt-bindings: pinctrl: aspeed: Convert AST2400 bindings to json-schema")
Cc: Andrew Jeffery <andrew@aj.id.au>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Joel Stanley <joel@jms.id.au>
Cc: linux-aspeed@lists.ozlabs.org
Cc: linux-gpio@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Acked-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Rob Herring <robh@kernel.org>
6 years agodt-bindings: riscv: Limit cpus schema to only check RiscV 'cpu' nodes
Rob Herring [Wed, 26 Jun 2019 23:57:59 +0000 (17:57 -0600)]
dt-bindings: riscv: Limit cpus schema to only check RiscV 'cpu' nodes

Matching on the 'cpus' node was a bad choice because the schema is
incorrectly applied to non-RiscV cpus nodes. As we now have a common cpus
schema which checks the general structure, it is also redundant to do so
in the Risc-V CPU schema.

The downside is one could conceivably mix different architecture's cpu
nodes or have typos in the compatible string. The latter problem pretty
much exists for every schema.

Acked-by: Paul Walmsley <paul.walmsley@sifive.com>
Signed-off-by: Rob Herring <robh@kernel.org>
6 years agodt-bindings: Ensure child nodes are of type 'object'
Rob Herring [Wed, 3 Jul 2019 20:17:06 +0000 (14:17 -0600)]
dt-bindings: Ensure child nodes are of type 'object'

Properties which are child node definitions need to have an explict
type. Otherwise, a matching (DT) property can silently match when an
error is desired. Fix this up tree-wide. Once this is fixed, the
meta-schema will enforce this on any child node definitions.

Cc: Chen-Yu Tsai <wens@csie.org>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Marek Vasut <marek.vasut@gmail.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: linux-mtd@lists.infradead.org
Cc: linux-gpio@vger.kernel.org
Cc: linux-stm32@st-md-mailman.stormreply.com
Cc: linux-spi@vger.kernel.org
Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
Acked-by: Maxime Ripard <maxime.ripard@bootlin.com>
Acked-by: Mark Brown <broonie@kernel.org>
Acked-by: Alexandre TORGUE <alexandre.torgue@st.com>
Signed-off-by: Rob Herring <robh@kernel.org>
6 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Linus Torvalds [Sat, 20 Jul 2019 19:22:30 +0000 (12:22 -0700)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input

Pull more input updates from Dmitry Torokhov:

 - Apple SPI keyboard and trackpad driver for newer Macs

 - ALPS driver will ignore trackpoint-only devices to give the
   trackpoint driver a chance to handle them properly

 - another Lenovo is switched over to SMbus from PS/2

 - assorted driver fixups.

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  Input: alps - fix a mismatch between a condition check and its comment
  Input: psmouse - fix build error of multiple definition
  Input: applespi - remove set but not used variables 'sts'
  Input: add Apple SPI keyboard and trackpad driver
  Input: alps - don't handle ALPS cs19 trackpoint-only device
  Input: hyperv-keyboard - remove dependencies on PAGE_SIZE for ring buffer
  Input: adp5589 - initialize GPIO controller parent device
  Input: iforce - remove empty multiline comments
  Input: synaptics - fix misuse of strlcpy
  Input: auo-pixcir-ts - switch to using  devm_add_action_or_reset()
  Input: gtco - bounds check collection indent level
  Input: mtk-pmic-keys - add of_node_put() before return
  Input: sun4i-lradc-keys - add of_node_put() before return
  Input: synaptics - whitelist Lenovo T580 SMBus intertouch

6 years agoMerge tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping
Linus Torvalds [Sat, 20 Jul 2019 19:09:52 +0000 (12:09 -0700)]
Merge tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fixes from Christoph Hellwig:
 "Fix various regressions:

   - force unencrypted dma-coherent buffers if encryption bit can't fit
     into the dma coherent mask (Tom Lendacky)

   - avoid limiting request size if swiotlb is not used (me)

   - fix swiotlb handling in dma_direct_sync_sg_for_cpu/device (Fugang
     Duan)"

* tag 'dma-mapping-5.3-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: correct the physical addr in dma_direct_sync_sg_for_cpu/device
  dma-direct: only limit the mapping size if swiotlb could be used
  dma-mapping: add a dma_addressing_limited helper
  dma-direct: Force unencrypted DMA under SME for certain DMA masks

6 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 20 Jul 2019 18:24:49 +0000 (11:24 -0700)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Thomas Gleixner:
 "A set of x86 specific fixes and updates:

   - The CR2 corruption fixes which store CR2 early in the entry code
     and hand the stored address to the fault handlers.

   - Revert a forgotten leftover of the dropped FSGSBASE series.

   - Plug a memory leak in the boot code.

   - Make the Hyper-V assist functionality robust by zeroing the shadow
     page.

   - Remove a useless check for dead processes with LDT

   - Update paravirt and VMware maintainers entries.

   - A few cleanup patches addressing various compiler warnings"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/entry/64: Prevent clobbering of saved CR2 value
  x86/hyper-v: Zero out the VP ASSIST PAGE on allocation
  x86, boot: Remove multiple copy of static function sanitize_boot_params()
  x86/boot/compressed/64: Remove unused variable
  x86/boot/efi: Remove unused variables
  x86/mm, tracing: Fix CR2 corruption
  x86/entry/64: Update comments and sanity tests for create_gap
  x86/entry/64: Simplify idtentry a little
  x86/entry/32: Simplify common_exception
  x86/paravirt: Make read_cr2() CALLEE_SAVE
  MAINTAINERS: Update PARAVIRT_OPS_INTERFACE and VMWARE_HYPERVISOR_INTERFACE
  x86/process: Delete useless check for dead process with LDT
  x86: math-emu: Hide clang warnings for 16-bit overflow
  x86/e820: Use proper booleans instead of 0/1
  x86/apic: Silence -Wtype-limits compiler warnings
  x86/mm: Free sme_early_buffer after init
  x86/boot: Fix memory leak in default_get_smp_config()
  Revert "x86/ptrace: Prevent ptrace from clearing the FS/GS selector" and fix the test

6 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 20 Jul 2019 18:06:12 +0000 (11:06 -0700)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf tooling updates from Thomas Gleixner:
 "A set of perf improvements and fixes:

  perf db-export:
   - Improvements in how COMM details are exported to databases for post
     processing and use in the sql-viewer.py UI.

   - Export switch events to the database.

  BPF:
   - Bump rlimit(MEMLOCK) for 'perf test bpf' and 'perf trace', just
     like selftests/bpf/bpf_rlimit.h do, which makes errors due to
     exhaustion of this limit, which are kinda cryptic (EPERM sometimes)
     less frequent.

  perf version:
   - Fix segfault due to missing OPT_END(), noticed on PowerPC.

  perf vendor events:
   - Add JSON files for IBM s/390 machine type 8561.

  perf cs-etm (ARM):
   - Fix two cases of error returns not bing done properly: Invalid
     ERR_PTR() use and loss of propagation error codes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
  perf version: Fix segfault due to missing OPT_END()
  perf vendor events s390: Add JSON files for machine type 8561
  perf cs-etm: Return errcode in cs_etm__process_auxtrace_info()
  perf cs-etm: Remove errnoeous ERR_PTR() usage in cs_etm__process_auxtrace_info
  perf scripts python: export-to-postgresql.py: Export switch events
  perf scripts python: export-to-sqlite.py: Export switch events
  perf db-export: Export switch events
  perf db-export: Factor out db_export__threads()
  perf script: Add scripting operation process_switch()
  perf scripts python: exported-sql-viewer.py: Use new 'has_calls' column
  perf scripts python: exported-sql-viewer.py: Remove redundant semi-colons
  perf scripts python: export-to-postgresql.py: Add has_calls column to comms table
  perf scripts python: export-to-sqlite.py: Add has_calls column to comms table
  perf db-export: Also export thread's current comm
  perf db-export: Factor out db_export__comm()
  perf scripts python: export-to-postgresql.py: Export comm details
  perf scripts python: export-to-sqlite.py: Export comm details
  perf db-export: Export comm details
  perf db-export: Fix a white space issue in db_export__sample()
  perf db-export: Move export__comm_thread into db_export__sample()
  ...

6 years agoMerge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 20 Jul 2019 17:45:15 +0000 (10:45 -0700)]
Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core fixes from Thomas Gleixner:

 - A collection of objtool fixes which address recent fallout partially
   exposed by newer toolchains, clang, BPF and general code changes.

 - Force USER_DS for user stack traces

[ Note: the "objtool fixes" are not all to objtool itself, but for
  kernel code that triggers objtool warnings.

  Things like missing function size annotations, or code that confuses
  the unwinder etc.   - Linus]

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  objtool: Support conditional retpolines
  objtool: Convert insn type to enum
  objtool: Fix seg fault on bad switch table entry
  objtool: Support repeated uses of the same C jump table
  objtool: Refactor jump table code
  objtool: Refactor sibling call detection logic
  objtool: Do frame pointer check before dead end check
  objtool: Change dead_end_function() to return boolean
  objtool: Warn on zero-length functions
  objtool: Refactor function alias logic
  objtool: Track original function across branches
  objtool: Add mcsafe_handle_tail() to the uaccess safe list
  bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
  x86/uaccess: Remove redundant CLACs in getuser/putuser error paths
  x86/uaccess: Don't leak AC flag into fentry from mcsafe_handle_tail()
  x86/uaccess: Remove ELF function annotation from copy_user_handle_tail()
  x86/head/64: Annotate start_cpu0() as non-callable
  x86/entry: Fix thunk function ELF sizes
  x86/kvm: Don't call kvm_spurious_fault() from .fixup
  x86/kvm: Replace vmx_vmenter()'s call to kvm_spurious_fault() with UD2
  ...

6 years agoMerge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 20 Jul 2019 17:43:03 +0000 (10:43 -0700)]
Merge branch 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull smp fix from Thomas Gleixner:
 "Add warnings to the smp function calls so callers from wrong contexts
  get detected"

* 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Warn on function calls from softirq context

6 years agoMerge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 20 Jul 2019 17:33:44 +0000 (10:33 -0700)]
Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull CONFIG_PREEMPT_RT stub config from Thomas Gleixner:
 "The real-time preemption patch set exists for almost 15 years now and
  while the vast majority of infrastructure and enhancements have found
  their way into the mainline kernel, the final integration of RT is
  still missing.

  Over the course of the last few years, we have worked on reducing the
  intrusivenness of the RT patches by refactoring kernel infrastructure
  to be more real-time friendly. Almost all of these changes were
  benefitial to the mainline kernel on their own, so there was no
  objection to integrate them.

  Though except for the still ongoing printk refactoring, the remaining
  changes which are required to make RT a first class mainline citizen
  are not longer arguable as immediately beneficial for the mainline
  kernel. Most of them are either reordering code flows or adding RT
  specific functionality.

  But this now has hit a wall and turned into a classic hen and egg
  problem:

     Maintainers are rightfully wary vs. these changes as they make only
     sense if the final integration of RT into the mainline kernel takes
     place.

  Adding CONFIG_PREEMPT_RT aims to solve this as a clear sign that RT
  will be fully integrated into the mainline kernel. The final
  integration of the missing bits and pieces will be of course done with
  the same careful approach as we have used in the past.

  While I'm aware that you are not entirely enthusiastic about that, I
  think that RT should receive the same treatment as any other widely
  used out of tree functionality, which we have accepted into mainline
  over the years.

  RT has become the de-facto standard real-time enhancement and is
  shipped by enterprise, embedded and community distros. It's in use
  throughout a wide range of industries: telecommunications, industrial
  automation, professional audio, medical devices, data acquisition,
  automotive - just to name a few major use cases.

  RT development is backed by a Linuxfoundation project which is
  supported by major stakeholders of this technology. The funding will
  continue over the actual inclusion into mainline to make sure that the
  functionality is neither introducing regressions, regressing itself,
  nor becomes subject to bitrot. There is also a lifely user community
  around RT as well, so contrary to the grim situation 5 years ago, it's
  a healthy project.

  As RT is still a good vehicle to exercise rarely used code paths and
  to detect hard to trigger issues, you could at least view it as a QA
  tool if nothing else"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/rt, Kconfig: Introduce CONFIG_PREEMPT_RT

6 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Sat, 20 Jul 2019 17:20:27 +0000 (10:20 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Paolo Bonzini:
 "Mostly bugfixes, but also:

   - s390 support for KVM selftests

   - LAPIC timer offloading to housekeeping CPUs

   - Extend an s390 optimization for overcommitted hosts to all
     architectures

   - Debugging cleanups and improvements"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
  KVM: x86: Add fixed counters to PMU filter
  KVM: nVMX: do not use dangling shadow VMCS after guest reset
  KVM: VMX: dump VMCS on failed entry
  KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
  KVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup
  KVM: Boost vCPUs that are delivering interrupts
  KVM: selftests: Remove superfluous define from vmx.c
  KVM: SVM: Fix detection of AMD Errata 1096
  KVM: LAPIC: Inject timer interrupt via posted interrupt
  KVM: LAPIC: Make lapic timer unpinned
  KVM: x86/vPMU: reset pmc->counter to 0 for pmu fixed_counters
  KVM: nVMX: Ignore segment base for VMX memory operand when segment not FS or GS
  kvm: x86: ioapic and apic debug macros cleanup
  kvm: x86: some tsc debug cleanup
  kvm: vmx: fix coccinelle warnings
  x86: kvm: avoid constant-conversion warning
  x86: kvm: avoid -Wsometimes-uninitized warning
  KVM: x86: expose AVX512_BF16 feature to guest
  KVM: selftests: enable pgste option for the linker on s390
  KVM: selftests: Move kvm_create_max_vcpus test to generic code
  ...

6 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Linus Torvalds [Sat, 20 Jul 2019 17:04:58 +0000 (10:04 -0700)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI fixes from James Bottomley:
 "This is the final round of mostly small fixes in our initial submit.

  It's mostly minor fixes and driver updates. The only change of note is
  adding a virt_boundary_mask to the SCSI host and host template to
  parametrise this for NVMe devices instead of having them do a call in
  slave_alloc. It's a fairly straightforward conversion except in the
  two NVMe handling drivers that didn't set it who now have a virtual
  infinity parameter added"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (24 commits)
  scsi: megaraid_sas: set an unlimited max_segment_size
  scsi: mpt3sas: set an unlimited max_segment_size for SAS 3.0 HBAs
  scsi: IB/srp: set virt_boundary_mask in the scsi host
  scsi: IB/iser: set virt_boundary_mask in the scsi host
  scsi: storvsc: set virt_boundary_mask in the scsi host template
  scsi: ufshcd: set max_segment_size in the scsi host template
  scsi: core: take the DMA max mapping size into account
  scsi: core: add a host / host template field for the virt boundary
  scsi: core: Fix race on creating sense cache
  scsi: sd_zbc: Fix compilation warning
  scsi: libfc: fix null pointer dereference on a null lport
  scsi: zfcp: fix GCC compiler warning emitted with -Wmaybe-uninitialized
  scsi: zfcp: fix request object use-after-free in send path causing wrong traces
  scsi: zfcp: fix request object use-after-free in send path causing seqno errors
  scsi: megaraid_sas: Update driver version to 07.710.50.00
  scsi: megaraid_sas: Add module parameter for FW Async event logging
  scsi: megaraid_sas: Enable msix_load_balance for Invader and later controllers
  scsi: megaraid_sas: Fix calculation of target ID
  scsi: lpfc: reduce stack size with CONFIG_GCC_PLUGIN_STRUCTLEAK_VERBOSE
  scsi: devinfo: BLIST_TRY_VPD_PAGES for SanDisk Cruzer Blade
  ...

6 years agoMerge tag 'kbuild-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy...
Linus Torvalds [Sat, 20 Jul 2019 16:34:55 +0000 (09:34 -0700)]
Merge tag 'kbuild-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - match the directory structure of the linux-libc-dev package to that
   of Debian-based distributions

 - fix incorrect include/config/auto.conf generation when Kconfig
   creates it along with the .config file

 - remove misleading $(AS) from documents

 - clean up precious tag files by distclean instead of mrproper

 - add a new coccinelle patch for devm_platform_ioremap_resource
   migration

 - refactor module-related scripts to read modules.order instead of
   $(MODVERDIR)/*.mod files to get the list of created modules

 - remove MODVERDIR

 - update list of header compile-test

 - add -fcf-protection=none flag to avoid conflict with the retpoline
   flags when CONFIG_RETPOLINE=y

 - misc cleanups

* tag 'kbuild-v5.3-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (25 commits)
  kbuild: add -fcf-protection=none when using retpoline flags
  kbuild: update compile-test header list for v5.3-rc1
  kbuild: split out *.mod out of {single,multi}-used-m rules
  kbuild: remove 'prepare1' target
  kbuild: remove the first line of *.mod files
  kbuild: create *.mod with full directory path and remove MODVERDIR
  kbuild: export_report: read modules.order instead of .tmp_versions/*.mod
  kbuild: modpost: read modules.order instead of $(MODVERDIR)/*.mod
  kbuild: modsign: read modules.order instead of $(MODVERDIR)/*.mod
  kbuild: modinst: read modules.order instead of $(MODVERDIR)/*.mod
  scsi: remove pointless $(MODVERDIR)/$(obj)/53c700.ver
  kbuild: remove duplication from modules.order in sub-directories
  kbuild: get rid of kernel/ prefix from in-tree modules.{order,builtin}
  kbuild: do not create empty modules.order in the prepare stage
  coccinelle: api: add devm_platform_ioremap_resource script
  kbuild: compile-test headers listed in header-test-m as well
  kbuild: remove unused hostcc-option
  kbuild: remove tag files by distclean instead of mrproper
  kbuild: add --hash-style= and --build-id unconditionally
  kbuild: get rid of misleading $(AS) from documents
  ...

6 years agoMerge branch 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Linus Torvalds [Sat, 20 Jul 2019 16:15:51 +0000 (09:15 -0700)]
Merge branch 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull dcache and mountpoint updates from Al Viro:
 "Saner handling of refcounts to mountpoints.

  Transfer the counting reference from struct mount ->mnt_mountpoint
  over to struct mountpoint ->m_dentry. That allows us to get rid of the
  convoluted games with ordering of mount shutdowns.

  The cost is in teaching shrink_dcache_{parent,for_umount} to cope with
  mixed-filesystem shrink lists, which we'll also need for the Slab
  Movable Objects patchset"

* 'work.dcache2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch the remnants of releasing the mountpoint away from fs_pin
  get rid of detach_mnt()
  make struct mountpoint bear the dentry reference to mountpoint, not struct mount
  Teach shrink_dcache_parent() to cope with mixed-filesystem shrink lists
  fs/namespace.c: shift put_mountpoint() to callers of unhash_mnt()
  __detach_mounts(): lookup_mountpoint() can't return ERR_PTR() anymore
  nfs: dget_parent() never returns NULL
  ceph: don't open-code the check for dead lockref

6 years agox86/entry/64: Prevent clobbering of saved CR2 value
Thomas Gleixner [Sat, 20 Jul 2019 08:56:41 +0000 (10:56 +0200)]
x86/entry/64: Prevent clobbering of saved CR2 value

The recent fix for CR2 corruption introduced a new way to reliably corrupt
the saved CR2 value.

CR2 is saved early in the entry code in RDX, which is the third argument to
the fault handling functions. But it missed that between saving and
invoking the fault handler enter_from_user_mode() can be called. RDX is a
caller saved register so the invoked function can freely clobber it with
the obvious consequences.

The TRACE_IRQS_OFF call is safe as it calls through the thunk which
preserves RDX, but TRACE_IRQS_OFF_DEBUG is not because it also calls into
C-code outside of the thunk.

Store CR2 in R12 instead which is a callee saved register and move R12 to
RDX just before calling the fault handler.

Fixes: a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption")
Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1907201020540.1782@nanos.tec.linutronix.de
6 years agosmp: Warn on function calls from softirq context
Peter Zijlstra [Thu, 18 Jul 2019 09:20:09 +0000 (11:20 +0200)]
smp: Warn on function calls from softirq context

It's clearly documented that smp function calls cannot be invoked from
softirq handling context. Unfortunately nothing enforces that or emits a
warning.

A single function call can be invoked from softirq context only via
smp_call_function_single_async().

The only legit context is task context, so add a warning to that effect.

Reported-by: luferry <luferry@163.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190718160601.GP3402@hirez.programming.kicks-ass.net
6 years agoKVM: x86: Add fixed counters to PMU filter
Eric Hankland [Thu, 18 Jul 2019 18:38:18 +0000 (11:38 -0700)]
KVM: x86: Add fixed counters to PMU filter

Updates KVM_CAP_PMU_EVENT_FILTER so it can also whitelist or blacklist
fixed counters.

Signed-off-by: Eric Hankland <ehankland@google.com>
[No need to check padding fields for zero. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6 years agoKVM: nVMX: do not use dangling shadow VMCS after guest reset
Paolo Bonzini [Fri, 19 Jul 2019 16:41:10 +0000 (18:41 +0200)]
KVM: nVMX: do not use dangling shadow VMCS after guest reset

If a KVM guest is reset while running a nested guest, free_nested will
disable the shadow VMCS execution control in the vmcs01.  However,
on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync
the VMCS12 to the shadow VMCS which has since been freed.

This causes a vmptrld of a NULL pointer on my machime, but Jan reports
the host to hang altogether.  Let's see how much this trivial patch fixes.

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6 years agoKVM: VMX: dump VMCS on failed entry
Paolo Bonzini [Fri, 19 Jul 2019 16:15:08 +0000 (18:15 +0200)]
KVM: VMX: dump VMCS on failed entry

This is useful for debugging, and is ratelimited nowadays.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6 years agoKVM: x86/vPMU: refine kvm_pmu err msg when event creation failed
Like Xu [Thu, 18 Jul 2019 05:35:14 +0000 (13:35 +0800)]
KVM: x86/vPMU: refine kvm_pmu err msg when event creation failed

If a perf_event creation fails due to any reason of the host perf
subsystem, it has no chance to log the corresponding event for guest
which may cause abnormal sampling data in guest result. In debug mode,
this message helps to understand the state of vPMC and we may not
limit the number of occurrences but not in a spamming style.

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Like Xu <like.xu@linux.intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6 years agoKVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup
Wanpeng Li [Thu, 18 Jul 2019 11:39:07 +0000 (19:39 +0800)]
KVM: s390: Use kvm_vcpu_wake_up in kvm_s390_vcpu_wakeup

Use kvm_vcpu_wake_up() in kvm_s390_vcpu_wakeup().

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6 years agoKVM: Boost vCPUs that are delivering interrupts
Wanpeng Li [Thu, 18 Jul 2019 11:39:06 +0000 (19:39 +0800)]
KVM: Boost vCPUs that are delivering interrupts

Inspired by commit 9cac38dd5d (KVM/s390: Set preempted flag during
vcpu wakeup and interrupt delivery), we want to also boost not just
lock holders but also vCPUs that are delivering interrupts. Most
smp_call_function_many calls are synchronous, so the IPI target vCPUs
are also good yield candidates.  This patch introduces vcpu->ready to
boost vCPUs during wakeup and interrupt delivery time; unlike s390 we do
not reuse vcpu->preempted so that voluntarily preempted vCPUs are taken
into account by kvm_vcpu_on_spin, but vmx_vcpu_pi_put is not affected
(VT-d PI handles voluntary preemption separately, in pi_pre_block).

Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M

            vanilla     boosting    improved
1VM          21443       23520         9%
2VM           2800        8000       180%
3VM           1800        3100        72%

Testing on my Haswell desktop 8 HT, with 8 vCPUs VM 8GB RAM, two VMs,
one running ebizzy -M, the other running 'stress --cpu 2':

w/ boosting + w/o pv sched yield(vanilla)

            vanilla     boosting   improved
              1570         4000      155%

w/ boosting + w/ pv sched yield(vanilla)

            vanilla     boosting   improved
              1844         5157      179%

w/o boosting, perf top in VM:

 72.33%  [kernel]       [k] smp_call_function_many
  4.22%  [kernel]       [k] call_function_i
  3.71%  [kernel]       [k] async_page_fault

w/ boosting, perf top in VM:

 38.43%  [kernel]       [k] smp_call_function_many
  6.31%  [kernel]       [k] async_page_fault
  6.13%  libc-2.23.so   [.] __memcpy_avx_unaligned
  4.88%  [kernel]       [k] call_function_interrupt

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Marc Zyngier <maz@kernel.org>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6 years agoKVM: selftests: Remove superfluous define from vmx.c
Thomas Huth [Thu, 18 Jul 2019 11:55:27 +0000 (13:55 +0200)]
KVM: selftests: Remove superfluous define from vmx.c

The code in vmx.c does not use "program_invocation_name", so there
is no need to "#define _GNU_SOURCE" here.

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6 years agoKVM: SVM: Fix detection of AMD Errata 1096
Liran Alon [Tue, 16 Jul 2019 23:56:58 +0000 (02:56 +0300)]
KVM: SVM: Fix detection of AMD Errata 1096

When CPU raise #NPF on guest data access and guest CR4.SMAP=1, it is
possible that CPU microcode implementing DecodeAssist will fail
to read bytes of instruction which caused #NPF. This is AMD errata
1096 and it happens because CPU microcode reading instruction bytes
incorrectly attempts to read code as implicit supervisor-mode data
accesses (that is, just like it would read e.g. a TSS), which are
susceptible to SMAP faults. The microcode reads CS:RIP and if it is
a user-mode address according to the page tables, the processor
gives up and returns no instruction bytes.  In this case,
GuestIntrBytes field of the VMCB on a VMEXIT will incorrectly
return 0 instead of the correct guest instruction bytes.

Current KVM code attemps to detect and workaround this errata, but it
has multiple issues:

1) It mistakenly checks if guest CR4.SMAP=0 instead of guest CR4.SMAP=1,
which is required for encountering a SMAP fault.

2) It assumes SMAP faults can only occur when guest CPL==3.
However, in case guest CR4.SMEP=0, the guest can execute an instruction
which reside in a user-accessible page with CPL<3 priviledge. If this
instruction raise a #NPF on it's data access, then CPU DecodeAssist
microcode will still encounter a SMAP violation.  Even though no sane
OS will do so (as it's an obvious priviledge escalation vulnerability),
we still need to handle this semanticly correct in KVM side.

Note that (2) *is* a useful optimization, because CR4.SMAP=1 is an easy
triggerable condition and guests usually enable SMAP together with SMEP.
If the vCPU has CR4.SMEP=1, the errata could indeed be encountered onlt
at guest CPL==3; otherwise, the CPU would raise a SMEP fault to guest
instead of #NPF.  We keep this condition to avoid false positives in
the detection of the errata.

In addition, to avoid future confusion and improve code readbility,
include details of the errata in code and not just in commit message.

Fixes: 05d5a4863525 ("KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)")
Cc: Singh Brijesh <brijesh.singh@amd.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
6 years agoKVM: LAPIC: Inject timer interrupt via posted interrupt
Wanpeng Li [Sat, 6 Jul 2019 01:26:51 +0000 (09:26 +0800)]
KVM: LAPIC: Inject timer interrupt via posted interrupt

Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside.  There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits.  This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.

In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.

The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially.  Without patch

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time   Avg time
EXTERNAL_INTERRUPT    42916    49.43%   39.30%   0.47us   106.09us   0.71us ( +-   1.09% )

While with patch:

            VM-EXIT  Samples  Samples%  Time%   Min Time  Max Time         Avg time
EXTERNAL_INTERRUPT    6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +-   4.02% )

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>