www.infradead.org Git - users/jedix/linux-maple.git/log

Merge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/nfs-rdma' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

config: Enable CONFIG_XEN_PCIDEV_BACKEND by to be built-in.

When doing PCI passthrough using the xen-pciback.hide=
parameters works great - except when the code is a module
- at which point you have to do a lot of 'unbind'.

OraBug: 22338679 - UEK4: CONFIG_XEN_PCIDEV_BACKEND should be set to y
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

kernel: VirtBox workaround for dynamically allocated text

Orabug: 22377612

VirtualBox dynamically allocates space for executable text within the
kernel.  When these text addresses are encountered by stack trace back
code, they are dropped or the stack trace is terminated.

Ideally, an interface would be created so that routines could registered
to validate kernel text addresses.  Drivers like VirtualBox, would register
routines which know about dynamically allocated text.  However, this will
need more use cases from code in the mainline linux tree.

As a workaround, assume executable pages in vmalloc or module areas are
valid text addresses.  The #ifdef CONFIG_X86 is acceptable as VirtualBox
only supports x86 architecture.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE

Now that the NFS server advertises a maximum payload size of 1MB
for RPC/RDMA again, it crashes in svc_process_common() when NFS
client sends a 1MB NFS WRITE on an NFS/RDMA mount.

The server has set up a 259 element array of struct page pointers
in rq_pages[] for each incoming request. The last element of the
array is NULL.

When an incoming request has been completely received,
rdma_read_complete() attempts to set the starting page of the
incoming page vector:

rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];

and the page to use for the reply:

rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];

But the value of page_no has already accounted for head->hdr_count.
Thus rq_respages now points past the end of the incoming pages.

For NFS WRITE operations smaller than the maximum, this is harmless.
But when the NFS WRITE operation is as large as the server's max
payload size, rq_respages now points at the last entry in rq_pages,
which is NULL.

Fixes: cc9a903d915c ('svcrdma: Change maximum server payload . . .')
BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@dev.mellanox.co.il>
Reviewed-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Shirley Ma <shirley.ma@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Orabug: 22204799

RDS: Add interface for receive MSG latency trace

Socket option to tap receive path latency.
SO_RDS: SO_RDS_MSG_RXPATH_LATENCY
with parameter,
struct rds_rx_trace_so {
u8 rx_traces;
        u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
}

CMSG:
RDS_CMSG_RXPATH_LATENCY(recvmsg)
Returns rds message latencies in various stages of receive
path in nS. Its set per socket using SO_RDS_MSG_RXPATH_LATENCY
socket option. Legitimate points are defined in
enum rds_message_rxpath_latency. More points can be added in
future.

CSMG format:
struct rds_cmsg_rx_trace {
        u8 rx_traces;
        u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
        u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
}

Receive MSG trace points: RDS message Receive Path Latency points
enum rds_message_rxpath_latency {
RDS_MSG_RX_HDR_TO_DGRAM_START = 0,
RDS_MSG_RX_DGRAM_REASSEMBLE,
RDS_MSG_RX_DGRAM_DELIVERED,
RDS_MSG_RX_DGRAM_TRACE_MAX
}

Tested-by: Namrata Jampani <namrata.jampani@oracle.com>
Reviewed-by: Ajaykumar Hotchandani <ajaykumar.hotchandani@oracle.com>
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Orabug: 22630180
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

ocfs2: call ocfs2_abort when journal abort

orabug: 22293201

journal can not recover from abort state, so we should take following action to
prevent file system from corruption:

1. change to readonly filesystem when local mount. We can not afford further
   write, so change to RO state is reasonable.

2. panic when cluster mount. Because we can not release lock resource in this
   state, other node will hung when it require a lock owned by this node. So
   panic and remaster is a reasonable choise.

ocfs2_abort() will do all the above work.

Signed-off-by: Ryan Ding <ryan.ding@oracle.com>

ocfs2: o2hb: increase unsteady iterations

Oracle-bug: 21886612

When run multiple xattr test of ocfs2-test on a three-nodes cluster,
mount failed sometimes with the following message.

o2hb: Unable to stabilize heartbeart on region D18B775E758D4D80837E8CF3D086AD4A (xvdb)

Stabilize heartbeat depends on the timing order to mount ocfs2 from
cluster nodes and how fast the tcp connections are established. So
increase unsteady interations to leave more time for it.

Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mfasheh@suse.de>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit a84ac334dcb44c76f0b051513a6c27a2d747f883)
Reviewed-by: Ryan Ding <ryan.ding@oracle.com>

vmxnet3: Bump up driver version number

Bump up the driver version number to reflect the changes done to
work with vmxnet3 adapter version 2

Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit a694717437c14efd489566540e821bc83ec234f3)

Orabug: 22380674
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

vmxnet3: Changes for vmxnet3 adapter version 2 (fwd)

Make the driver understand adapter version 2.

Cc: Rachel Lunnon <rachel_lunnon@stormagic.com>
Signed-off-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 45dac1d6ea045ae56e4df8d9c70c92c7412bd4fc)

Orabug: 22380674
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

vmxnet3: Fix memory leaks in rx path (fwd)

If rcd length was zero, the page used for frag was not being released. It
was being replaced with a newly allocated page. This change takes care
of that memory leak.

Signed-off-by: Guolin Yang <gyang@vmware.com>
Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c41fcce997d2caa039a46495d40423348c51ad61)

Orabug: 22380674
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

vmxnet3: Register shutdown handler for device (fwd)

Implement a handler for pci shutdown so that the driver has an
opportunity to make sure that device is quiesced before the PCI
switches to legacy IRQs. This way the possibility of
"screaming interrupt" is avoided.

Acked-by: Shrikrishna Khare <skhare@vmware.com>
Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e9ba47bfe381888d8dc79123a20b2ec8b6751a47)

Orabug: 22380674
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

lpfc driver updates for UEK4 R1 11.0.0.13

Orabug: 22493326

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Use kzalloc instead of kmalloc

This patch is to the lpfc_els.c which resolves following warning
reported by coccicheck:

WARNING: kzalloc should be used for rdp_context, instead of
kmalloc/memset

Signed-off-by: Punit Vara <punitvara@gmail.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Reviewed-by: Matthew R. Ochs <mrochs@linux.vnet.ibm.com>
Reviewed-by: Sebastian Herbszt <herbszt@gmx.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 699acd6220ea5b20b25d5eec0ab448827d745357)

Orabug: 22493326

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Add logging for misconfigured optics.

Add logging for misconfigured optics acqe reported by fw.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 448193b5b5e2471fc90ea11e78c39bcfd167efb6)

Orabug: 22493326

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix external loopback failure.

Fix external loopback failure.

Rx sequence reassembly was incorrect.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 4360ca9c24388e44cb0e14861a62fff43cf225c0)

Orabug: 22493326

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix mbox reuse in PLOGI completion

Fix mbox reuse in PLOGI completion. Moved allocations so that buffer
properly init'd.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 01c73bbcd7cc4f31f45a1b0caeacdba46acd9c9c)

Orabug: 22493326

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Use new FDMI speed definitions for 10G, 25G and 40G FCoE.

Use new FDMI speed definitions for 10G, 25G and 40G FCoE.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit a085e87c814567c94e5d375e7362f9f25030aac1)

Orabug: 22493326

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Make write check error processing more resilient

Make write check error processing more resilient.

Checks to catch writes that fw reports weren't fully complete yet SCSI
status indicated fine needed correction.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 5afab6bbf3f026b7d50451acbfdc12300c5f4353)

Orabug: 22493326
Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix RDP ACC being too long.

Fix RDP ACC being too long.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit eb8d68c9930f7f9c8f3f4a6059b051b32077a735)

Orabug: 22493326
Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix RDP Speed reporting.

Fix RDP Speed reporting.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 81e7517723fc17396ba91f59312b3177266ddbda)

Orabug: 22493326
Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Modularize and cleanup FDMI code in driver

Modularize, cleanup, add comments - for FDMI code in driver

Note: I don't like the comments with leading # - but as we have a lot if
present, I'm deferring to handle it in one big fix later.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 4258e98ee3862ca7036654b43c839ab7668043e0)

Orabug: 22493326
Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix crash in fcp command completion path.

Fix crash in fcp command completion path.

Missed null check.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit c90261dcd86e4eb5c9c1627fde037e902db8aefa)

Orabug: 22493326
Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix driver crash when module parameter lpfc_fcp_io_channel set to 16

Fix driver crash when module parameter lpfc_fcp_io_channel set to 16

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 6690e0d4fc5cccf74534abe0c9f9a69032bc02f0)

Orabug: 22493326
Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix RegLogin failed error seen on Lancer FC during port bounce

Fix RegLogin failed error seen on Lancer FC during port bounce

Fix the statemachine and ref counting.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 4b7789b71c916f79a3366da080101014473234c3)

Orabug: 22493326
Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix the FLOGI discovery logic to comply with T11 standards

Fix the FLOGI discovery logic to comply with T11 standards

We weren't properly setting fabric parameters, such as R_A_TOV and E_D_TOV,
when we registered the vfi object in default configs and pt2pt configs.
Revise to now pass service params with the values to the firmware and
ensure they are reset on link bounce. Required reworking the call sequence
in the discovery threads.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit d6de08cc46269899988b4f40acc7337279693d4b)

Orabug: 22493326

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix FCF Infinite loop in lpfc_sli4_fcf_rr_next_index_get.

Orabug: 22493326

Fix FCF Infinite loop in lpfc_sli4_fcf_rr_next_index_get.

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Reviewed-by: Hannes Reinicke <hare@suse.de>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit f5cb5304eb26d307c9b30269fb0e007e0b262b7d)

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: fix memory leak and NULL dereference

Orabug: 22493326

kmalloc() can return NULL and without checking we were dereferencing it.
Moreover if kmalloc succeeds but the function fails in other parts then
we were returning the error code but we missed freeing lcb_context.
While at it fixed one related checkpatch warning.

Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
Reviewed-by: James Smart <james.smart@avagotech.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
(cherry picked from commit e79504236548e4c909959ba444f87a12224555ac)

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

lpfc: Fix rport leak.

Orabug: 22493326

Correct locking and refcounting in tracking our rports

Signed-off-by: Dick Kennedy <dick.kennedy@avagotech.com>
Signed-off-by: James Smart <james.smart@avagotech.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
(cherry picked from commit 466e840b7809e00ab3a1af9b4a5b5751e681730d)

Signed-off-by: Manjunath Govindashetty <manjunath.govindashetty@oracle.com>

Drivers: hv: vmbus: Fix a Host signaling bug

Currently we have two policies for deciding when to signal the host:
One based on the ring buffer state and the other based on what the
VMBUS client driver wants to do. Consider the case when the client
wants to explicitly control when to signal the host. In this case,
if the client were to defer signaling, we will not be able to signal
the host subsequently when the client does want to signal since the
ring buffer state will prevent the signaling. Implement logic to
have only one signaling policy in force for a given channel.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Tested-by: Haiyang Zhang <haiyangz@microsoft.com>
Cc: <stable@vger.kernel.org> # v4.2+
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8599846d73997cdbccf63f23394d871cfad1e5e6)

Orabug: 22725962
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

cpu-hotplug: export cpu_hotplug_enable/cpu_hotplug_disable

Hyper-V module needs to disable cpu hotplug (offlining) as there is no
support from hypervisor side to reassign already opened event channels
to a different CPU. Currently it is been done by altering
smp_ops.cpu_disable but it is hackish.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 32145c4677d2c46b9d877a33ae82c6fcacd002f9)

As hv driver depends on it
Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

clockevents: Add helpers to check the state of a clockevent device

Some clockevent drivers, once migrated to use per-state callbacks,
need to check the state of the clockevent device in their callbacks or
interrupt handler.

Add accessor functions clockevent_state_*() to get this information.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/04a717d490335c688dd7af899fbcede97e1bb8ee.1432192527.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 3434d23b694e5cb6e44e966914563406c31c4053)

As hv drivers depends on it
Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

clockevents: Introduce CLOCK_EVT_STATE_ONESHOT_STOPPED state

When no timers/hrtimers are pending, the expiry time is set to a
special value: 'KTIME_MAX'. This normally happens with
NO_HZ_{IDLE|FULL} in both LOWRES/HIGHRES modes.

When 'expiry == KTIME_MAX', we either cancel the 'tick-sched' hrtimer
(NOHZ_MODE_HIGHRES) or skip reprogramming clockevent device
(NOHZ_MODE_LOWRES). But, the clockevent device is already
reprogrammed from the tick-handler for next tick.

As the clock event device is programmed in ONESHOT mode it will at
least fire one more time (unnecessarily). Timers on few
implementations (like arm_arch_timer, etc.) only support PERIODIC mode
and their drivers emulate ONESHOT over that. Which means that on these
platforms we will get spurious interrupts periodically (at last
programmed interval rate, normally tick rate).

In order to avoid spurious interrupts, the clockevent device should be
stopped or its interrupts should be masked.

A simple (yet hacky) solution to get this fixed could be: update
hrtimer_force_reprogram() to always reprogram clockevent device and
update clockevent drivers to STOP generating events (or delay it to
max time) when 'expires' is set to KTIME_MAX. But the drawback here is
that every clockevent driver has to be hacked for this particular case
and its very easy for new ones to miss this.

However, Thomas suggested to add an optional state ONESHOT_STOPPED to
solve this problem: lkml.org/lkml/2014/5/9/508.

This patch adds support for ONESHOT_STOPPED state in clockevents
core. It will only be available to drivers that implement the
state-specific callbacks instead of the legacy ->set_mode() callback.

Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
Cc: linaro-kernel@lists.linaro.org
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Kevin Hilman <khilman@linaro.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/b8b383a03ac07b13312c16850b5106b82e4245b5.1428031396.git.viresh.kumar@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 8fff52fd50934580c5108afed12043a774edf728)

As hv driver depends on it
Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: util: introduce hv_utils_transport abstraction

The intention is to make KVP/VSS drivers work through misc char devices.
Introduce an abstraction for kernel/userspace communication to make the
migration smoother. Transport operational mode (netlink or char device)
is determined by the first received message. To support driver upgrades
the switch from netlink to chardev operational mode is supported.

Every hv_util daemon is supposed to register 2 callbacks:
1) on_msg() to get notified when the userspace daemon sent a message;
2) on_reset() to get notified when the userspace daemon drops the connection.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 14b50f80c32dd4e84b6baeaa8bf4049cc5ecf56d)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: util: introduce state machine for util drivers

KVP/VSS/FCOPY drivers work in fully serialized mode: we wait till userspace
daemon registers, wait for a message from the host, send this message to the
daemon, get the reply, send it back to host, wait for another message.
Introduce enum hvutil_device_state to represend this state in all 3 drivers.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 636c88da6df3bb2f978b48d3a7ed55423da84d19)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: use cpu_hotplug_enable/disable

Commit e513229b4c38 ("Drivers: hv: vmbus: prevent cpu offlining on newer
hypervisors") was altering smp_ops.cpu_disable to prevent CPU offlining.
We can bo better by using cpu_hotplug_enable/disable functions instead of
such hard-coding.

Reported-by: Radim Kr.má <rkrcmar@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit f39c4280a3872b0e6c7b01076132c12ad7a90392)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: add a sysfs attr to show the binding of channel/VP

This is useful to analyze performance issue.

Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 042ab0313bbb7e776e9510da3f07fb300d08a8ba)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: Implement a clocksource based on the TSC page

The current Hyper-V clock source is based on the per-partition reference counter
and this counter is being accessed via s synthetic MSR - HV_X64_MSR_TIME_REF_COUNT.
Hyper-V has a more efficient way of computing the per-partition reference
counter value that does not involve reading a synthetic MSR. We implement
a time source based on this mechanism.

Tested-by: Vivek Yadav <vyadav@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ca9357bd26c2f8e7b909321eedd651f52cc30d04)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

drivers/hv: Migrate to new 'set-state' interface

Migrate hv driver to the new 'set-state' interface provided by
clockevents core, the earlier 'set-mode' interface is marked obsolete
now.

This also enables us to implement callbacks for new states of clockevent
devices, for example: ONESHOT_STOPPED.

Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: devel@linuxdriverproject.org
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit bc609cb47fb2e74654e23cef0a1d4db38b6570a3)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv_vmbus: Fix signal to host condition

Fixes a bug where previously hv_ringbuffer_read would pass in the old
number of bytes available to read instead of the expected old read index
when calculating when to signal to the host that the ringbuffer is empty.
Since the previous write size is already saved, also changes the
hv_need_to_signal_on_read to use the previously read value rather than
recalculating it.

Signed-off-by: Christopher Oo <t-chriso@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a5cca686ce0ef4909deaee4ed46dd991e3a9ece4)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: Improve the CPU affiliation for channels

The current code tracks the assigned CPUs within a NUMA node in the context of
the primary channel. So, if we have a VM with a single NUMA node with 8 VCPUs, we may
end up unevenly distributing the channel load. Fix the issue by tracking affiliations
globally.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9f01ec53458d9e9b68f1c555e773b5d1a1f66e94)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

drivers:hv: Move MMIO range picking from hyper_fb to hv_vmbus

This patch deletes the logic from hyperv_fb which picked a range of MMIO space
for the frame buffer and adds new logic to hv_vmbus which picks ranges for
child drivers. The new logic isn't quite the same as the old, as it considers
more possible ranges.

Signed-off-by: Jake Oshins <jakeo@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3546448338e76a52d4f86eb3680cb2934e22d89b)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

drivers:hv: Modify hv_vmbus to search for all MMIO ranges available.

This patch changes the logic in hv_vmbus to record all of the ranges in the
VM's firmware (BIOS or UEFI) that offer regions of memory-mapped I/O space for
use by paravirtual front-end drivers.  The old logic just found one range
above 4GB and called it good.  This logic will find any ranges above 1MB.

It would have been possible with this patch to just use existing resource
allocation functions, rather than keep track of the entire set of Hyper-V
related MMIO regions in VMBus.  This strategy, however, is not sufficient
when the resource allocator needs to be aware of the constraints of a
Hyper-V virtual machine, which is what happens in the next patch in the series.
So this first patch exists to show the first steps in reworking the MMIO
allocation paths for Hyper-V front-end drivers.

Signed-off-by: Jake Oshins <jakeo@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 7f163a6fd957a85f7f66a129db1ad243a44399ee)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: Consider ND NIC in binding channels to CPUs

We cycle through all the "high performance" channels to distribute
load across the available CPUs. Process the NetworkDirect as a
high performance device.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 379e4f756b915bcc35958365e5d1326b3b54efce)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

mshyperv: fix recognition of Hyper-V guest crash MSR's

Hypervisor Top Level Functional Specification v3.1/4.0 notes that cpuid
(0x40000003) EDX's 10th bit should be used to check that Hyper-V guest
crash MSR's functionality available.

This patch should fix this recognition. Currently the code checks EAX
register instead of EDX.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit cc2dd4027a43bb36c846f195a764edabc0828602)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: prefer 'die' notification chain to 'panic'

current_pt_regs() sometimes returns regs of the userspace process and in
case of a kernel crash this is not what we need to report. E.g. when we
trigger crash with sysrq we see the following:
...
RIP: 0010:[<ffffffff815b8696>] [<ffffffff815b8696>] sysrq_handle_crash+0x16/0x20
RSP: 0018:ffff8800db0a7d88 EFLAGS: 00010246
RAX: 000000000000000f RBX: ffffffff820a0660 RCX: 0000000000000000
...
at the same time current_pt_regs() give us:
ip=7f899ea7e9e0, ax=ffffffffffffffda, bx=26c81a0, cx=7f899ea7e9e0, ...
These registers come from the userspace process triggered the crash. As we
don't even know which process it was this information is rather useless.

When kernel crash happens through 'die' proper regs are being passed to
all receivers on the die_chain (and panic_notifier_list is being notified
with the string passed to panic() only). If panic() is called manually
(e.g. on BUG()) we won't get 'die' notification so keep the 'panic'
notification reporter as well but guard against double reporting.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 510f7aef65bb7ed22cf9c7f94f955727f963ede4)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: unregister panic notifier on module unload

Commit 96c1d0581d00f7abe033350edb021a9d947d8d81 ("Drivers: hv: vmbus: Add
support for VMBus panic notifier handler") introduced
atomic_notifier_chain_register() call on module load. We also need to call
atomic_notifier_chain_unregister() on module unload as otherwise the following
crash is observed when we bring hv_vmbus back:

[   39.788877] BUG: unable to handle kernel paging request at ffffffffa00078a8
[   39.788877] IP: [<ffffffff8109d63f>] notifier_call_chain+0x3f/0x80
...
[   39.788877] Call Trace:
[   39.788877]  [<ffffffff8109de7d>] __atomic_notifier_call_chain+0x5d/0x90
...
[   39.788877]  [<ffffffff8109d788>] ? atomic_notifier_chain_register+0x38/0x70
[   39.788877]  [<ffffffff8109d767>] ? atomic_notifier_chain_register+0x17/0x70
[   39.788877]  [<ffffffffa002814f>] hv_acpi_init+0x14f/0x1000 [hv_vmbus]
[   39.788877]  [<ffffffff81002144>] do_one_initcall+0xd4/0x210

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 096c605feb3d85b309e95db2afc01584b967cc23)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: fix typo in hv_port_info struct

This fixes a typo: base_flag_bumber to base_flag_number

Signed-off-by: Nik Nyby <nikolas@gnu.org>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit e26009aad095feae45a6e79bb022c55a969ecded)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: Permit sending of packets without payload

The guest may have to send a completion packet back to the host.
To support this usage, permit sending a packet without a payload -
we would be only sending the descriptor in this case.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b81658cf5d44e07c70c93e3b2aefe848eaaba99f)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: balloon: Enable dynamic memory protocol negotiation with Windows 10 hosts

Support Win10 protocol for Dynamic Memory. Thia patch allows guests on Win10 hosts
to hot-add memory even when dynamic memory is not enabled on the guest.

Signed-off-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b6ddeae1603dfa55e857ba1520f5acea83f8cf1c)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: don't do hypercalls when hypercall_page is NULL

At the very late stage of kexec a driver (which are not being unloaded) can
try to post a message or signal an event. This will crash the kernel as we
already did hv_cleanup() and the hypercall page is NULL.

Move all common (between 32 and 64 bit code) declarations to the beginning
of the do_hypercall() function. Unfortunately we have to write the
!hypercall_page check twice to not mix declarations and code.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit d7646eaa7678fe5adc42247b4bdfbe9d9db8c253)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: add special kexec handler

When general-purpose kexec (not kdump) is being performed in Hyper-V guest
the newly booted kernel fails with an MCE error coming from the host. It
is the same error which was fixed in the "Drivers: hv: vmbus: Implement
the protocol for tearing down vmbus state" commit - monitor pages remain
special and when they're being written to (as the new kernel doesn't know
these pages are special) bad things happen. We need to perform some
minimalistic cleanup before booting a new kernel on kexec. To do so we
need to register a special machine_ops.shutdown handler to be executed
before the native_machine_shutdown(). Registering a shutdown notification
handler via the register_reboot_notifier() call is not sufficient as it
happens to early for our purposes. machine_ops is not being exported to
modules (and I don't think we want to export it) so let's do this in
mshyperv.c

The minimalistic cleanup consists of cleaning up clockevents, synic MSRs,
guest os id MSR, and hypercall MSR.

Kdump doesn't require all this stuff as it lives in a separate memory
space.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2517281d63a2b09d94aedfb522943617048f337e)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: remove hv_synic_free_cpu() call from hv_synic_cleanup()

We already have hv_synic_free() which frees all per-cpu pages for all
CPUs, let's remove the hv_synic_free_cpu() call from hv_synic_cleanup()
so it will be possible to do separate cleanup (writing to MSRs) and final
freeing. This is going to be used to assist kexec.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 06210b42f33ea1c29a90f4db2d88be91c511154b)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: kill tasklets on module unload

Explicitly kill tasklets we create on module unload.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1959a28e2671004c1e9c30ccd2914b868f100742)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: Add structs and handlers for VF messages

This patch adds data structures and handlers for messages related
to SRIOV Virtual Function.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 71790a2792c8772e29bf5aa726215d9256ef93dc)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: Wait for sub-channels to be processed during probe

The current code returns from probe without waiting for the proper handling
of subchannels that may be requested. If the netvsc driver were to be rapidly
loaded/unloaded, we can trigger a panic as the unload will be tearing
down state that may not have been fully setup yet. We fix this issue by making
sure that we return from the probe call only after ensuring that the
sub-channel offers in flight are properly handled.

Reviewed-and-tested-by: Haiyang Zhang <haiyangz@microsoft.com
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b3e6b82a0099dfef038e40c630a554ed1e402504)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: Add close of RNDIS filter into change mtu call

The current change mtu call only stops tx before removing RNDIS filter.
In case ringbufer is not empty, the rndis_filter_device_remove() may
hang on removing the buffers.

This patch adds close of RNDIS filter before removing it, also a
gradual waiting loop until the ring is empty. The change_mtu hang
issue under heavy traffic is solved by this patch.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2de8530ba0c71a2fba02590681af0f3a2a187a9b)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

x86: hyperv: add CPUID bit for crash handlers

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 5d75a747596be046546eeb9d6ba39a3af851a1af)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: Add support to set MTU reservation from guest side

When packet encapsulation is in use, the MTU needs to be reduced for
headroom reservation.
The existing code takes the updated MTU value only from the host side.
But vSwitch extensions, such as Open vSwitch, require the flexibility
to change the MTU to different values from within a guest during the
lifecycle of a vNIC, when the encapsulation protocol is changed. The
patch supports this kind of MTU changes.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f9cbce34c34bcc05ea0dd78c8999bfe88b5b6b86)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

kvm: add hyper-v crash msrs values

Added Hyper-V crash msrs values - HV_X64_MSR_CRASH*.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Gleb Natapov <gleb@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit a88464a8b0ffb2f8dfb69d3ab982169578b50f22)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

storvsc: use shost_for_each_device() instead of open coding

Comment in struct Scsi_Host says that drivers are not supposed to access
__devices directly. storvsc_host_scan() doesn't happen in irq context
so we can just use shost_for_each_device().

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Long Li <longli@microsoft.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
(cherry picked from commit 8d6a9f5676f0e734967ac3739f5c6a28a0b047d9)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

storvsc: be more picky about scmnd->sc_data_direction

Under the 'default' case in scmnd->sc_data_direction we have 3 options:
- DMA_NONE which we handle correctly.
- DMA_BIDIRECTIONAL which is never supposed to be set by SCSI stack.
- Garbage value.

Do WARN() and return -EINVAL in the last two cases. virtio_scsi does
BUG_ON() here but it looks like an overkill.

Reported-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: James Bottomley <JBottomley@Odin.com>
(cherry picked from commit cb1cf0804fe582f8a626c3cc591cb3127536137c)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: Allocate ring buffer memory in NUMA aware fashion

Allocate ring buffer memory from the NUMA node assigned to the channel.
Since this is a performance and not a correctness issue, if the node specific
allocation were to fail, fall back and allocate without specifying the node.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 294409d20572e9bcf857328286433f851168d54a)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: Implement NUMA aware CPU affinity for channels

Channels/sub-channels can be affinitized to VCPUs in the guest. Implement
this affinity in a way that is NUMA aware. The current protocol distributed
the primary channels uniformly across all available CPUs. The new protocol
is NUMA aware: primary channels are distributed across the available NUMA
nodes while the sub-channels within a primary channel are distributed amongst
CPUs within the NUMA node assigned to the primary channel.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1f656ff3fdddc2f59649cc84b633b799908f1f7b)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: Use the vp_index map even for channels bound to CPU 0

Map target_cpu to target_vcpu using the mapping table.
We should use the mapping table to transform guest CPU ID to VP Index
as is done for the non-performance critical channels.
While the value CPU 0 is special and will
map to VP index 0, it is good to be consistent.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 9c6e64adf200d3bac0dd47d52cdbd3bd428384a5)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: distribute subchannels among all vcpus

Primary channels are distributed evenly across all vcpus we have. When the host
asks us to create subchannels it usually makes us num_cpus-1 offers and we are
supposed to distribute the work evenly among the channel itself and all its
subchannels. Make sure they are all assigned to different vcpus.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ce59fec836a9b4dc51cbcf9cb245b59e0ef53bea)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: balloon: check if ha_region_mutex was acquired in MEM_CANCEL_ONLINE case

Memory notifiers are being executed in a sequential order and when one of
them fails returning something different from NOTIFY_OK the remainder of
the notification chain is not being executed. When a memory block is being
onlined in online_pages() we do memory_notify(MEM_GOING_ONLINE, ) and if
one of the notifiers in the chain fails we end up doing
memory_notify(MEM_CANCEL_ONLINE, ) so it is possible for a notifier to see
MEM_CANCEL_ONLINE without seeing the corresponding MEM_GOING_ONLINE event.
E.g. when CONFIG_KASAN is enabled the kasan_mem_notifier() is being used
to prevent memory hotplug, it returns NOTIFY_BAD for all MEM_GOING_ONLINE
events. As kasan_mem_notifier() comes before the hv_memory_notifier() in
the notification chain we don't see the MEM_GOING_ONLINE event and we do
not take the ha_region_mutex. We, however, see the MEM_CANCEL_ONLINE event
and unconditionally try to release the lock, the following is observed:

[  110.850927] =====================================
[  110.850927] [ BUG: bad unlock balance detected! ]
[  110.850927] 4.1.0-rc3_bugxxxxxxx_test_xxxx #595 Not tainted
[  110.850927] -------------------------------------
[  110.850927] systemd-udevd/920 is trying to release lock
(&dm_device.ha_region_mutex) at:
[  110.850927] [<ffffffff81acda0e>] mutex_unlock+0xe/0x10
[  110.850927] but there are no more locks to release!

At the same time we can have the ha_region_mutex taken when we get the
MEM_CANCEL_ONLINE event in case one of the memory notifiers after the
hv_memory_notifier() in the notification chain failed so we need to add
the mutex_is_locked() check. In case of MEM_ONLINE we are always supposed
to have the mutex locked.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4e4bd36f97b1492f19b3329ac74ed313da13de34)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: Allocate the sendbuf in a NUMA aware way

Allocate the send buffer in a NUMA aware way.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 5defde5946676ee23cd6a9d0e1de899410f4a33f)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: Allocate the receive buffer from the correct NUMA node

Allocate the receive bufer from the NUMA node assigned to the primary
channel.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 0a726c2b499e390b1c1fc3092bd789f2192a2d03)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: Properly size the vrss queues

The current algorithm for deciding on the number of VRSS channels is
not optimal since we open up the min of number of CPUs online and the
number of VRSS channels the host is offering. So on a 32 VCPU guest
we could potentially open 32 VRSS subchannels. Experimentation has
shown that it is best to limit the number of VRSS channels to the number
of CPUs within a NUMA node.

Here is the new algorithm for deciding on the number of sub-channels we
would open up:
        1) Pick the minimum of what the host is offering and what the driver
           in the guest is specifying as the default value.
        2) Pick the minimum of (1) and the numbers of CPUs in the NUMA
           node the primary channel is bound to.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit e01ec2199ef22e2cabd7d6e68a192f3eb728029f)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus:Update preferred vmbus protocol version to windows 10.

Add support for Windows 10.

Signed-off-by: Keith Mange <keith.mange@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 6c4e5f9c9ff41ea997fd0f345b3b2b88c113eb68)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: use per_cpu stats to calculate TX/RX data

Current code does not lock anything when calculating the TX and RX stats.
As a result, the RX and TX data reported by ifconfig are not accuracy in a
system with high network throughput and multiple CPUs (in my test,
RX/TX = 83% between 2 HyperV VM nodes which have 8 vCPUs and 40G Ethernet).

This patch fixed the above issue by using per_cpu stats.
netvsc_get_stats64() summarizes TX and RX data by iterating over all CPUs
to get their respective stats.

This v2 patch addressed David's comments on the cleanup path when
netdev_alloc_pcpu_stats() failed.

Signed-off-by: Simon Xiao <sixiao@microsoft.com>
Reviewed-by: K. Y. Srinivasan <kys@microsoft.com>
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7eafd9b4005643cfc24f1daf78f4dd56ff71f559)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

hv_netvsc: Use the xmit_more skb flag to optimize signaling the host

Based on the information given to this driver (via the xmit_more skb flag),
we can defer signaling the host if more packets are on the way. This will help
make the host more efficient since it can potentially process a larger batch of
packets. Implement this optimization.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 82fa3c776e5abba7ed6e4b4f4983d14731c37d6a)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: move init_vp_index() call to vmbus_process_offer()

We need to call init_vp_index() after we added the channel to the appropriate
list (global or subchannel) to be able to use this information when assigning
the channel to the particular vcpu. To do so we need to move a couple of
functions around. The only real change is the init_vp_index() call. This is a
small refactoring without a functional change.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit f38e7dd72337d83cced910cfbf6016475ef85bf7)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: briefly comment num_sc and next_oc

next_oc and num_sc fields of struct vmbus_channel deserve a description. Move
them closer to sc_list as these fields are related to it.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit fea844a2b0edd6540d5cde2cd54a8a3c86e9c53f)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: unify calls to percpu_channel_enq()

Remove some code duplication, no functional change intended.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8dfd332674758135039d0d2d2a7479934ff0b9c5)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: do cleanup on all vmbus_open() failure paths

In case there was an error reported in the response to the CHANNELMSG_OPENCHANNEL
call we need to do the cleanup as a vmbus_open() user won't be doing it after
receiving an error. The cleanup should be done on all failure paths. We also need
to avoid returning open_info->response.open_result.status as the return value as
all other errors we return from vmbus_open() are -EXXX and vmbus_open() callers
are not supposed to analyze host error codes.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit ffc151f3c83c25ec06d5ad13a78d0fc066c7167e)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: vmbus: Implement the protocol for tearing down vmbus state

Implement the protocol for tearing down the monitor state established with
the host.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 2db84eff127e3f4b3635edc589cd6a56db8755a3)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

drivers: hv: vmbus: Get rid of some unused definitions

Get rid of some unused definitions.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit db9ba2088f6507fee370904f02db1eb9b49bd088)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: fcopy: full handshake support

Introduce FCOPY_VERSION_1 to support kernel replying to the negotiation
message with its own version.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit a4d1ee5b0255a135fead1d62a7fc7e6fe718b66e)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Tools: hv: vss: use misc char device to communicate with kernel

Use /dev/vmbus/hv_vss instead of netlink.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit f5722b9bd418e29b7429bd9a43bd100599b26d4f)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Tools: hv: kvp: use misc char device to communicate with kernel

Use /dev/vmbus/hv_kvp instead of netlink.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8ddca8088586303cfe3db4209a4682f7a4cf7d2d)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: fcopy: convert to hv_utils_transport

Unify the code with the recently introduced hv_utils_transport. Netlink
communication is disabled for fcopy.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit c7e490fc23eb44e6ae6ab41b9fd450e361f8a01f)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: fcopy: set .owner reference for file operations

Get an additional reference otherwise a crash is observed when hv_utils module is being unloaded while
fcopy daemon is still running. .owner gives us an additional reference when
someone holds a descriptor for the device.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3f0dccf86cab38d96b67efdfc944954f5490d057)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: fcopy: switch to using the hvutil_device_state state machine

Switch to using the hvutil_device_state state machine from using 3 different state variables:
fcopy_transaction.active, opened, and in_hand_shake.

State transitions are:
-> HVUTIL_DEVICE_INIT when driver loads or on device release
-> HVUTIL_READY if the handshake was successful
-> HVUTIL_HOSTMSG_RECEIVED when there is a non-negotiation message from the host
-> HVUTIL_USERSPACE_REQ after userspace daemon read the message
-> HVUTIL_USERSPACE_RECV after/if userspace has replied
-> HVUTIL_READY after we respond to the host
-> HVUTIL_DEVICE_DYING on driver unload

In hv_fcopy_onchannelcallback() process ICMSGTYPE_NEGOTIATE messages even when
the userspace daemon is disconnected, otherwise we can make the host think
we don't support FCOPY and disable the service completely.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 4c93ccccf47e32d605b81207eb575cce3b12facc)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: fcopy: rename fcopy_work -> fcopy_timeout_work

'fcopy_work' (and fcopy_work_func) is a misnomer as it sounds like we expect
this useful work to happen and in reality it is just an emergency escape when
timeout happens.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 1d072339fa9e32365d282d8ade32822de7d21e5f)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: fcopy: process deferred messages when we complete the transaction

In theory, the host is not supposed to issue any requests before be reply to
the previous one. In KVP we, however, support the following scenarios:
1) A message was received before userspace daemon registered;
2) A message was received while the previous one is still being processed.
In FCOPY we support only the former. Add support for the later, use
hv_poll_channel() to do the job.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 242f31221d48793d07e161bc668e1aabd502c18b)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: kvp: switch to using the hvutil_device_state state machine

Switch to using the hvutil_device_state state machine from using 2 different state variables: kvp_transaction.active and
in_hand_shake.

State transitions are:
-> HVUTIL_DEVICE_INIT when driver loads or on device release
-> HVUTIL_READY if the handshake was successful
-> HVUTIL_HOSTMSG_RECEIVED when there is a non-negotiation message from the host
-> HVUTIL_USERSPACE_REQ after we sent the message to the userspace daemon
-> HVUTIL_USERSPACE_RECV after/if the userspace daemon has replied
-> HVUTIL_READY after we respond to the host
-> HVUTIL_DEVICE_DYING on driver unload

In hv_kvp_onchannelcallback() process ICMSGTYPE_NEGOTIATE messages even when
the userspace daemon is disconnected, otherwise we can make the host think
we don't support KVP and disable the service completely.

Unfortunately there is no good way we can figure out that the userspace daemon
has died (unless we start treating all timeouts as such). In case the daemon
restarts we skip the negotiation procedure (so the daemon is supposed to has
the same version). This behavior is unchanged from in_handshake approach.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 97bf16cd309805ebf82ffcc4063a65e06169651f)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: kvp: rename kvp_work -> kvp_timeout_work

'kvp_work' (and kvp_work_func) is a misnomer as it sounds like we expect
this useful work to happen and in reality it is just an emergency escape when
timeout happens.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 68c8b39a0db71e0e76295bf277e8280ae36e7c10)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: kvp: move poll_channel() to hyperv_vmbus.h

Move poll_channel() to hyperv_vmbus.h and make it inline and rename it to hv_poll_channel() so it can be reused
in other hv_util modules.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 8efe78fdb1490e271615fab32433ebc0f15fa822)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: util: move kvp/vss function declarations to hyperv_vmbus.h

These declarations are internal to hv_util module and hv_fcopy_* declarations
already reside there.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 3647a83d9dcf00b8e17777ec8aa1e48f1ed4fe06)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

Drivers: hv: kvp: reset kvp_context

We set kvp_context when we want to postpone receiving a packet from vmbus due
to the previous transaction being unfinished. We, however, never reset this
state, all consequent kvp_respond_to_host() calls will result in poll_channel()
calling hv_kvp_onchannelcallback(). This doesn't cause real issues as:
1) Host is supposed to serialize transactions as well
2) If no message is pending vmbus_recvpacket() will return 0 recvlen.
This is just a cleanup.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Alex Ng <alexng@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit 5fa97480b9b125bb9bdb446930289b3274ed7eb7)

Orabug: 21886720
Signed-off-by: Jason Luo <zhangqing.luo@oracle.com>

veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

Packets that arrive from real hardware devices have ip_summed ==
CHECKSUM_UNNECESSARY if the hardware verified the checksums, or
CHECKSUM_NONE if the packet is bad or it was unable to verify it. The
current version of veth will replace CHECKSUM_NONE with
CHECKSUM_UNNECESSARY, which causes corrupt packets routed from hardware to
a veth device to be delivered to the application. This caused applications
at Twitter to receive corrupt data when network hardware was corrupting
packets.

We believe this was added as an optimization to skip computing and
verifying checksums for communication between containers. However, locally
generated packets have ip_summed == CHECKSUM_PARTIAL, so the code as
written does nothing for them. As far as we can tell, after removing this
code, these packets are transmitted from one stack to another unmodified
(tcpdump shows invalid checksums on both sides, as expected), and they are
delivered correctly to applications. We didn’t test every possible network
configuration, but we tried a few common ones such as bridging containers,
using NAT between the host and a container, and routing from hardware
devices to containers. We have effectively deployed this in production at
Twitter (by disabling RX checksum offloading on veth devices).

This code dates back to the first version of the driver, commit
<e314dbdc1c0dc6a548ecf> ("[NET]: Virtual ethernet device driver"), so I
suspect this bug occurred mostly because the driver API has evolved
significantly since then. Commit <0b7967503dc97864f283a> ("net/veth: Fix
packet checksumming") (in December 2010) fixed this for packets that get
created locally and sent to hardware devices, by not changing
CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming
in from hardware devices.

Co-authored-by: Evan Jones <ej@evanjones.ca>
Signed-off-by: Evan Jones <ej@evanjones.ca>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Phil Sutter <phil@nwl.cc>
Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Vijay Pandurangan <vijayp@vijayp.ca>
Acked-by: Cong Wang <cwang@twopensource.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit ce8c839b74e3017996fad4e1b7ba2e2625ede82f)

Orabug: 22720928

Signed-off-by: Thomas Tanaka <thomas.tanaka@oracle.com>

btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size

Current code will always truncate tailing page if its alloc_start is
smaller than inode size.

For example, the file extent layout is like:
0 4K 8K 16K 32K
|<-----Extent A---------------->|
|<--Inode size: 18K---------->|

But if calling fallocate even for range [0,4K), it will cause btrfs to
re-truncate the range [16,32K), causing COW and a new extent.

0 4K 8K 16K 32K
|///////| <- Fallocate call range
|<-----Extent A-------->|<--B-->|

The cause is quite easy, just a careless btrfs_truncate_inode() in a
else branch without extra judgment.
Fix it by add judgment on whether the fallocate range is beyond isize.

Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: Chris Mason <clm@fb.com>
Orabug: 22573877
(cherry picked from mainline commit 0f6925fa2907df58496cabc33fa4677c635e2223)
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>

Btrfs: send, fix file corruption due to incorrect cloning operations

If we have a file that shares an extent with other files, when processing
the extent item relative to a shared extent, we blindly issue a clone
operation that will target a length matching the length in the extent item
and uses as a source some other file the receiver already has and points
to the same extent. However that range in the other file might not
exclusively point only to the shared extent, and so using that length
will result in the receiver getting a file with different data from the
one in the send snapshot. This issue happened both for incremental and
full send operations.

So fix this by issuing clone operations with lengths that don't cover
regions of the source file that point to different extents (or have holes).

The following test case for fstests reproduces the problem.

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1 # failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root
  _require_cp_reflink
  _require_xfs_io_command "fpunch"

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  # Create our test file with a single 100K extent.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K" \
     $SCRATCH_MNT/foo | _filter_xfs_io

  # Clone our file into a new file named bar.
  cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Now overwrite parts of our foo file.
  $XFS_IO_PROG -c "pwrite -S 0xbb 50K 10K" \
     -c "pwrite -S 0xcc 90K 10K" \
     -c "fpunch 70K 10k" \
     $SCRATCH_MNT/foo | _filter_xfs_io

  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
     $SCRATCH_MNT/snap

  echo "File digests in the original filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  _run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap

  # Now recreate the filesystem by receiving the send stream and verify
  # we get the same file contents that the original filesystem had.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap

  # We expect the destination filesystem to have exactly the same file
  # data as the original filesystem.
  # The btrfs send implementation had a bug where it sent a clone
  # operation from file foo into file bar covering the whole [0, 100K[
  # range after creating and writing the file foo. This was incorrect
  # because the file bar now included the updates done to file foo after
  # we cloned foo to bar, breaking the COW nature of reflink copies
  # (cloned extents).
  echo "File digests in the new filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  status=0
  exit

Another test case that reproduces the problem when we have compressed
extents:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  tmp=/tmp/$$
  status=1 # failure is the default!
  trap "_cleanup; exit \$status" 0 1 2 3 15

  _cleanup()
  {
      rm -fr $send_files_dir
      rm -f $tmp.*
  }

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter

  # real QA test starts here
  _supported_fs btrfs
  _supported_os Linux
  _require_scratch
  _need_to_be_root
  _require_cp_reflink

  send_files_dir=$TEST_DIR/btrfs-test-$seq

  rm -f $seqres.full
  rm -fr $send_files_dir
  mkdir $send_files_dir

  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  # Create our file with an extent of 100K starting at file offset 0K.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 100K"       \
                  -c "fsync"                        \
                  $SCRATCH_MNT/foo | _filter_xfs_io

  # Rewrite part of the previous extent (its first 40K) and write a new
  # 100K extent starting at file offset 100K.
  $XFS_IO_PROG -c "pwrite -S 0xbb 0K 40K"    \
          -c "pwrite -S 0xcc 100K 100K"      \
          $SCRATCH_MNT/foo | _filter_xfs_io

  # Our file foo now has 3 file extent items in its metadata:
  #
  # 1) One covering the file range 0 to 40K;
  # 2) One covering the file range 40K to 100K, which points to the first
  #    extent we wrote to the file and has a data offset field with value
  #    40K (our file no longer uses the first 40K of data from that
  #    extent);
  # 3) One covering the file range 100K to 200K.

  # Now clone our file foo into file bar.
  cp --reflink=always $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Create our snapshot for the send operation.
  _run_btrfs_util_prog subvolume snapshot -r $SCRATCH_MNT \
          $SCRATCH_MNT/snap

  echo "File digests in the original filesystem:"
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  _run_btrfs_util_prog send $SCRATCH_MNT/snap -f $send_files_dir/1.snap

  # Now recreate the filesystem by receiving the send stream and verify we
  # get the same file contents that the original filesystem had.
  # Btrfs send used to issue a clone operation from foo's range
  # [80K, 140K[ to bar's range [40K, 100K[ when cloning the extent pointed
  # to by foo's second file extent item, this was incorrect because of bad
  # accounting of the file extent item's data offset field. The correct
  # range to clone from should have been [40K, 100K[.
  _scratch_unmount
  _scratch_mkfs >>$seqres.full 2>&1
  _scratch_mount "-o compress"

  _run_btrfs_util_prog receive $SCRATCH_MNT -f $send_files_dir/1.snap

  echo "File digests in the new filesystem:"
  # Must match the digests we got in the original filesystem.
  md5sum $SCRATCH_MNT/snap/foo | _filter_scratch
  md5sum $SCRATCH_MNT/snap/bar | _filter_scratch

  status=0
  exit

Orabug: 22579887

Signed-off-by: Filipe Manana <fdmanana@suse.com>
(cherry picked from mainline commit d906d49fc5f49a0129527e8fbc13312f36b9b9ce)
Signed-off-by: Ashish Samant <ashish.samant@oracle.com>