www.infradead.org Git - users/jedix/linux-maple.git/log

xenbus: don't bail early from xenbus_dev_request_and_reply()

xenbus_dev_request_and_reply() needs to track whether a transaction is
open. For XS_TRANSACTION_START messages it calls transaction_start()
and for XS_TRANSACTION_END messages it calls transaction_end().

If sending an XS_TRANSACTION_START message fails or responds with an
an error, the transaction is not open and transaction_end() must be
called.

If sending an XS_TRANSACTION_END message fails, the transaction is
still open, but if an error response is returned the transaction is
closed.

Commit 027bd7e89906 ("xen/xenbus: Avoid synchronous wait on XenBus
stalling shutdown/restart") introduced a regression where failed
XS_TRANSACTION_START messages were leaving the transaction open. This
can cause problems with suspend (and migration) as all transactions
must be closed before suspending.

It appears that the problematic change was added accidentally, so just
remove it.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 7469be95a487319514adce2304ad2af3553d2fc9)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xenbus: don't BUG() on user mode induced condition

Inability to locate a user mode specified transaction ID should not
lead to a kernel crash. For other than XS_TRANSACTION_START also
don't issue anything to xenbus if the specified ID doesn't match that
of any active transaction.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 0beef634b86a1350c31da5fcc2992f0d7c8a622b)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xen-pciback: return proper values during BAR sizing

Reads following writes with all address bits set to 1 should return all
changeable address bits as one, not the BAR size (nor, as was the case
for the upper half of 64-bit BARs, the high half of the region's end
address). Presumably this didn't cause any problems so far because
consumers use the value to calculate the size (usually via val & -val),
and do nothing else with it.

But also consider the exception here: Unimplemented BARs should always
return all zeroes.

And finally, the check for whether to return the sizing address on read
for the ROM BAR should ignore all non-address bits, not just the ROM
Enable one.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit d2bd05d88d245c13b64c3bf9c8927a1c56453d8c)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

x86/xen: avoid m2p lookup when setting early page table entries

When page tables entries are set using xen_set_pte_init() during early
boot there is no page fault handler that could handle a fault when
performing an M2P lookup.

In 64 bit guests (usually dom0) early_ioremap() would fault in
xen_set_pte_init() because an M2P lookup faults because the MFN is in
MMIO space and not mapped in the M2P. This lookup is done to see if
the PFN in in the range used for the initial page table pages, so that
the PTE may be set as read-only.

The M2P lookup can be avoided by moving the check (and clear of RW)
earlier when the PFN is still available.

Reported-by: Kevin Moraga <kmoragas@riseup.net>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit d6b186c1e2d852a92c43f090d0d8fad4704d51ef)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xen/pciback: Fix conf_space read/write overlap check.

Current overlap check is evaluating to false a case where a filter
field is fully contained (proper subset) of a r/w request. This
change applies classical overlap check instead to include all the
scenarios.

More specifically, for (Hilscher GmbH CIFX 50E-DP(M/S)) device driver
the logic is such that the entire confspace is read and written in 4
byte chunks. In this case as an example, CACHE_LINE_SIZE,
LATENCY_TIMER and PCI_BIST are arriving together in one call to
xen_pcibk_config_write() with offset == 0xc and size == 4. With the
exsisting overlap check the LATENCY_TIMER field (offset == 0xd, length
== 1) is fully contained in the write request and hence is excluded
from write, which is incorrect.

Signed-off-by: Andrey Grodzovsky <andrey2805@gmail.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Jan Beulich <JBeulich@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 02ef871ecac290919ea0c783d05da7eedeffc10e)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

x86/xen: fix upper bound of pmd loop in xen_cleanhighmap()

xen_cleanhighmap() is operating on level2_kernel_pgt only. The upper
bound of the loop setting non-kernel-image entries to zero should not
exceed the size of level2_kernel_pgt.

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 1cf38741308c64d08553602b3374fb39224eeb5a)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xen/balloon: Fix declared-but-not-defined warning

Fix a declared-but-not-defined warning when building with
XEN_BALLOON_MEMORY_HOTPLUG=n. This fixes a regression introduced by
commit dfd74a1edfab ("xen/balloon: Fix crash when ballooning on x86 32
bit PAE").

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Acked-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 842775f1509054ea969f1787f38d6a0ec2ccfaba)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xen-blkfront: fix resume issues after a migration

After a migrate to another host (which may not have multiqueue
support), the number of rings (block hardware queues)
may be changed and the ring info structure will also be reallocated.

This patch fixes two related bugs:
* call blk_mq_update_nr_hw_queues() to make blk-core know the number
of hardware queues have been changed.
* Don't store rinfo pointer to hctx->driver_data, because rinfo may be
reallocated so use hctx->queue_num to get the rinfo structure instead.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 2a6f71ad99cabe436e70c3f5fcf58072cb3bc07f)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xen-blkfront: don't call talk_to_blkback when already connected to blkback

Sometimes blkfront may twice receive blkback_changed() notification
(XenbusStateConnected) after migration, which will cause
talk_to_blkback() to be called twice too and confuse xen-blkback.

The flow is as follow:
   blkfront                                        blkback
blkfront_resume()
> talk_to_blkback()
  > Set blkfront to XenbusStateInitialised
                                                front changed()
                                                 > Connect()
                                                  > Set blkback to XenbusStateConnected

blkback_changed()
> Skip talk_to_blkback()
   because frontstate == XenbusStateInitialised
> blkfront_connect()
  > Set blkfront to XenbusStateConnected

-----
And here we get another XenbusStateConnected notification leading
to:
-----
blkback_changed()
> because now frontstate != XenbusStateInitialised
   talk_to_blkback() is also called again
  > blkfront state changed from
  XenbusStateConnected to XenbusStateInitialised
    (Which is not correct!)

front_changed():
                                                 > Do nothing because blkback
                                                   already in XenbusStateConnected

Now blkback is in XenbusStateConnected but blkfront is still
in XenbusStateInitialised - leading to no disks.

Poking of the XenbusStateConnected state is allowed (to deal with
block disk change) and has to be dealt with. The most likely
cause of this bug are custom udev scripts hooking up the disks
and then validating the size.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit efd1535270c1deb0487527bf0c3c827301a69c93)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xen: use same main loop for counting and remapping pages

Instead of having two functions for cycling through the E820 map in
order to count to be remapped pages and remap them later, just use one
function with a caller supplied sub-function called for each region to
be processed. This eliminates the possibility of a mismatch between
both loops which showed up in certain configurations.

Suggested-by: Ed Swierk <eswierk@skyportsystems.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit dd14be92fbf5bc1ef7343f34968440e44e21b46a)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

Xen: don't warn about 2-byte wchar_t in efi

The XEN UEFI code has become available on the ARM architecture
recently, but now causes a link-time warning:

ld: warning: drivers/xen/efi.o uses 2-byte wchar_t yet the output is to use 4-byte wchar_t; use of wchar_t values across objects may fail

This seems harmless, because the efi code only uses 2-byte
characters when interacting with EFI, so we don't pass on those
strings to elsewhere in the system, and we just need to
silence the warning.

It is not clear to me whether we actually need to build the file
with the -fshort-wchar flag, but if we do, then we should also
pass --no-wchar-size-warning to the linker, to avoid the warning.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Fixes: 37060935dc04 ("ARM64: XEN: Add a function to initialize Xen specific UEFI runtime services")
(cherry picked from commit 971a69db7dc02faaeed325c195f5db5da597cb58)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xen/gntdev: reduce copy batch size to 16

IOCTL_GNTDEV_GRANT_COPY batches copy operations to reduce the number
of hypercalls.  The stack is used to avoid a memory allocation in a
hot path. However, a batch size of 24 requires more than 1024 bytes of
stack which in some configurations causes a compiler warning.

    xen/gntdev.c: In function ‘gntdev_ioctl_grant_copy’:
    xen/gntdev.c:949:1: warning: the frame size of 1248 bytes is
    larger than 1024 bytes [-Wframe-larger-than=]

This is a harmless warning as there is still plenty of stack spare,
but people keep trying to "fix" it.  Reduce the batch size to 16 to
reduce stack usage to less than 1024 bytes.  This should have minimal
impact on performance.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 36ae220aa62d382a8bacbf7ec080d9d36a2b4d49)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23585393

xen/x86: actually allocate legacy interrupts on PV guests

b4ff8389ed14 is incomplete: relies on nr_legacy_irqs() to get the number
of legacy interrupts when actually nr_legacy_irqs() returns 0 after
probe_8259A(). Use NR_IRQS_LEGACY instead.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
CC: stable@vger.kernel.org
(cherry picked from commit 702f926067d2a4b28c10a3c41a1172dd62d9e735)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23527575 - UEK4 can't run under Xen nested virtualization

xen/x86: don't lose event interrupts

On slow platforms with unreliable TSC, such as QEMU emulated machines,
it is possible for the kernel to request the next event in the past. In
that case, in the current implementation of xen_vcpuop_clockevent, we
simply return -ETIME. To be precise the Xen returns -ETIME and we pass
it on. However the result of this is a missed event, which simply causes
the kernel to hang.

Instead it is better to always ask the hypervisor for a timer event,
even if the timeout is in the past. That way there are no lost
interrupts and the kernel survives. To do that, remove the
VCPU_SSHOTTMR_future flag.

Signed-off-by: Stefano Stabellini <sstabellini@kernel.org>
Acked-by: Juergen Gross <jgross@suse.com>
(cherry picked from commit c06b6d70feb32d28f04ba37aa3df17973fd37b6b)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23527575 - UEK4 can't run under Xen nested virtualization

xen/events: Don't move disabled irqs

Commit ff1e22e7a638 ("xen/events: Mask a moving irq") open-coded
irq_move_irq() but left out checking if the IRQ is disabled. This broke
resuming from suspend since it tries to move a (disabled) irq without
holding the IRQ's desc->lock. Fix it by adding in a check for disabled
IRQs.

The resulting stacktrace was:
kernel BUG at /build/linux-UbQGH5/linux-4.4.0/kernel/irq/migration.c:31!
invalid opcode: 0000 [#1] SMP
Modules linked in: xenfs xen_privcmd ...
CPU: 0 PID: 9 Comm: migration/0 Not tainted 4.4.0-22-generic #39-Ubuntu
Hardware name: Xen HVM domU, BIOS 4.6.1-xs125180 05/04/2016
task: ffff88003d75ee00 ti: ffff88003d7bc000 task.ti: ffff88003d7bc000
RIP: 0010:[<ffffffff810e26e2>]  [<ffffffff810e26e2>] irq_move_masked_irq+0xd2/0xe0
RSP: 0018:ffff88003d7bfc50  EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff88003d40ba00 RCX: 0000000000000001
RDX: 0000000000000001 RSI: 0000000000000100 RDI: ffff88003d40bad8
RBP: ffff88003d7bfc68 R08: 0000000000000000 R09: ffff88003d000000
R10: 0000000000000000 R11: 000000000000023c R12: ffff88003d40bad0
R13: ffffffff81f3a4a0 R14: 0000000000000010 R15: 00000000ffffffff
FS:  0000000000000000(0000) GS:ffff88003da00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fd4264de624 CR3: 0000000037922000 CR4: 00000000003406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffff88003d40ba38 0000000000000024 0000000000000000 ffff88003d7bfca0
ffffffff814c8d92 00000010813ef89d 00000000805ea732 0000000000000009
0000000000000024 ffff88003cc39b80 ffff88003d7bfce0 ffffffff814c8f66
Call Trace:
[<ffffffff814c8d92>] eoi_pirq+0xb2/0xf0
[<ffffffff814c8f66>] __startup_pirq+0xe6/0x150
[<ffffffff814ca659>] xen_irq_resume+0x319/0x360
[<ffffffff814c7e75>] xen_suspend+0xb5/0x180
[<ffffffff81120155>] multi_cpu_stop+0xb5/0xe0
[<ffffffff811200a0>] ? cpu_stop_queue_work+0x80/0x80
[<ffffffff811203d0>] cpu_stopper_thread+0xb0/0x140
[<ffffffff810a94e6>] ? finish_task_switch+0x76/0x220
[<ffffffff810ca731>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
[<ffffffff810a3935>] smpboot_thread_fn+0x105/0x160
[<ffffffff810a3830>] ? sort_range+0x30/0x30
[<ffffffff810a0588>] kthread+0xd8/0xf0
[<ffffffff810a04b0>] ? kthread_create_on_node+0x1e0/0x1e0
[<ffffffff8182568f>] ret_from_fork+0x3f/0x70
[<ffffffff810a04b0>] ? kthread_create_on_node+0x1e0/0x1e0

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit 97fef3f8a75eaed293890474f8fc45605101f26d)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Orabug: 23214472

xen/events: Mask a moving irq

Moving an unmasked irq may result in irq handler being invoked on both
source and target CPUs.

With 2-level this can happen as follows:

On source CPU:
        evtchn_2l_handle_events() ->
            generic_handle_irq() ->
                handle_edge_irq() ->
                   eoi_pirq():
                       irq_move_irq(data);

                       /***** WE ARE HERE *****/

                       if (VALID_EVTCHN(evtchn))
                           clear_evtchn(evtchn);

If at this moment target processor is handling an unrelated event in
evtchn_2l_handle_events()'s loop it may pick up our event since target's
cpu_evtchn_mask claims that this event belongs to it *and* the event is
unmasked and still pending. At the same time, source CPU will continue
executing its own handle_edge_irq().

With FIFO interrupt the scenario is similar: irq_move_irq() may result
in a EVTCHNOP_unmask hypercall which, in turn, may make the event
pending on the target CPU.

We can avoid this situation by moving and clearing the event while
keeping event masked.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit ff1e22e7a638a0782f54f81a6c9cb139aca2da35)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Orabug: 23214472

xen-netback: reduce log spam

Remove the "prepare for reconnect" pr_info in xenbus.c. It's largely
uninteresting and the states of the frontend and backend can easily be
observed by watching the (o)xenstored log.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 8e4ee59c1e75b74966476dcc3552c3b30d2768e7)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

xen-netback: support multiple extra info fragments passed from frontend

The code does not currently support a frontend passing multiple extra info
fragments to the backend in a tx request. The xenvif_get_extras() function
handles multiple extra_info fragments but make_tx_response() assumes there
is only ever a single extra info fragment.

This patch modifies xenvif_get_extras() to pass back a count of extra
info fragments, which is then passed to make_tx_response() (after
possibly being stashed in pending_tx_info for deferred responses).

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 562abd39a1902745bdcab266c7824cd6c5bc34d3)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

xen-netback: implement dynamic multicast control

My recent patch to the Xen Project documents a protocol for 'dynamic
multicast control' in netif.h. This extends the previous multicast control
protocol to not require a shared ring reconnection to turn the feature off.
Instead the backend watches the "request-multicast-control" key in xenstore
and turns the feature off if the key value is written to zero.

This patch adds support for dynamic multicast control in xen-netback.

Signed-off-by: Paul Durrant <paul.durrant@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Acked-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 22fae97d863679994b951799dd4bbe7afd95897b)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

xen-blkfront: rename indirect descriptor parameter

"max" is rather ambiguous and carries pretty little meaning, the more
that there are also "max_queues" and "max_ring_page_order". Make this
"max_indirect_segments" instead, and at once change the type from int
to uint (to match the respective variable's type).

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 14e710fe7897e37762512d336ab081c57de579a4)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

xen-blkback: advertise indirect segment support earlier

There's no reason to defer this until the connect phase, and in fact
there are frontend implementations expecting this to be available
earlier. Move it into the probe function.

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5a7058450cbc8702f976d1f444974485c70cb525)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

xen/blback: Fit the important information of the thread in 17 characters

The processes names are truncated to 17, while we had the length
of the process as name 20 - which meant that while we filled
it out with various details - the last 3 characters (which had
the queue number) never surfaced to the user-space.

To simplify this and be able to fit the device name, domain id,
and the queue number we remove the 'blkback' from the name.

Prior to this patch the device name is "blkback.<domid>.<name>"
for example: blkback.8.xvda, blkback.11.hda.

With the multiqueue block backend we add "-%d" for the queue.
But sadly this is already way past the limit so it gets stripped.

Possible solution had been identified by Ian:
http://lists.xenproject.org/archives/html/xen-devel/2015-05/msg03516.html

  "
  If you are pressed for space then the "xvd" is probably a bit redundant
  in a string which starts blkbk.

  The guest may not even call the device xvdN (iirc BSD has another
  prefix) any how, so having blkback say so seems of limited use anyway.

  Since this seems to not include a partition number how does this work in
  the split partition scheme? (i.e. one where the guest is given xvda1 and
  xvda2 rather than xvda with a partition table)

[It will be 'blkback.8.xvda1', and 'blkback.11.xvda2']

  Perhaps something derived from one of the schemes in
  http://xenbits.xen.org/docs/unstable/misc/vbd-interface.txt might be a
  better fit?

After a bit of discussion (see
http://lists.xenproject.org/archives/html/xen-devel/2015-12/msg01588.html)
we settled on dropping the "blback" part.

This will make it possible to have the <domid>.<name>-<queue>:

[1.xvda-0]
[1.xvda-1]

And we enough space to make it go up to:

[32100.xvdfg9-5]

Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit fa3184b898717d696242241541b8cbcb65c5d497)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

xen/balloon: Fix crash when ballooning on x86 32 bit PAE

Commit 55b3da98a40dbb3776f7454daf0d95dde25c33d2 (xen/balloon: find
non-conflicting regions to place hotplugged memory) caused a
regression in 4.4.

When ballooning on an x86 32 bit PAE system with close to 64 GiB of
memory, the address returned by allocate_resource may be above 64 GiB.
When using CONFIG_SPARSEMEM, this setup is limited to using physical
addresses < 64 GiB. When adding memory at this address, it runs off
the end of the mem_section array and causes a crash. Instead, fail
the ballooning request.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Cc: <stable@vger.kernel.org> # 4.4+
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit dfd74a1edfaba5864276a2859190a8d242d18952)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

xen/x86: Zero out .bss for PV guests

ELF spec is unclear about whether .bss must me cleared by the loader.
Currently the domain builder does it when loading the guest but because
it is not (or rather may not be) guaranteed we should zero it out
explicitly.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 04b6b4a56884327c1648c517f1f46a2638f04c9d)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

xen/evtchn: fix ring resize when binding new events

The copying of ring data was wrong for two cases: For a full ring
nothing got copied at all (as in that case the canonicalized producer
and consumer indexes are identical). And in case one or both of the
canonicalized (after the resize) indexes would point into the second
half of the buffer, the copied data ended up in the wrong (free) part
of the new buffer. In both cases uninitialized data would get passed
back to the caller.

Fix this by simply copying the old ring contents twice: Once to the
low half of the new buffer, and a second time to the high half.

This addresses the inability to boot a HVM guest with 64 or more
vCPUs. This regression was caused by 8620015499101090 (xen/evtchn:
dynamically grow pending event channel ring).

Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: <stable@vger.kernel.org> # 4.4+
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 27e0e6385377c4dc68a4ddaf1a35a2dfa951f3c5)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23268939

Revert "x86/paravirt: Remove paravirt ops pmd_update[_defer] and pte_update_defer"

This reverts commit 9ff226047d9f2fb8366b71daf42039911f5aa088.

As it breaks the kABI: pv_mmu_ops

And we don't need this patch.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Revert "x86/paravirt: Remove unused pv_apic_ops structure"

This reverts commit a3c33f534a6dde2a88cb34e95a94fd2d803cb390.

As it breaks the kABI pv_mmu_ops.

We don't this patch so lets revert.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Merge branch 'uek4/4.5-xen-backport.v2' of git://ca-git.us.oracle.com/linux-dkiper-public into uek4/4.5-xen-backport

* 'uek4/4.5-xen-backport.v2' of git://ca-git.us.oracle.com/linux-dkiper-public: (42 commits)
  xen: fix potential integer overflow in queue_reply
  xen/scsiback: avoid warnings when adding multiple LUNs to a domain
  xen/scsiback: correct frontend counting
  xen/blkfront: realloc ring info in blkif_resume
  xen-netfront: request Tx response events more often
  cleancache: constify cleancache_ops structure
  xen-netback: free queues after freeing the net device
  xen-netback: delete NAPI instance when queue fails to initialize
  xen-netback: use skb to determine number of required guest Rx requests
  xen/gntdev: add ioctl for grant copy
  xen/blkfront: Fix crash if backend doesn't follow the right states.
  xen/blkback: Fix two memory leaks.
  xen/blkback: make st_ statistics per ring
  xen/blkfront: Handle non-indirect grant with 64KB pages
  xen-blkfront: Introduce blkif_ring_get_request
  xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()
  xen/blkback: Free resources if connect_ring failed.
  xen/blocks: Return -EXX instead of -1
  xen/blkback: make pool of persistent grants and free pages per-queue
  xen/blkback: get the number of hardware queues/rings from blkfront
  ...

OraBug: 23268939 - Backport Linux v4.5 Linux Xen patches

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

Revert "xen/events: Mask a moving irq"

This reverts commit f01077a059d5a39e4eed4095a332ceb028a16fb0.

Requested by Boris. It causes an regression upstream. Awaiting a fix.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

xen: fix potential integer overflow in queue_reply

When len is greater than UINT_MAX - sizeof(*rb), in next allocation,
it can overflow integer range and allocates small size of heap.
After that, memcpy will overflow the allocated heap.
Therefore, it needs to check the size of given length.

Signed-off-by: Insu Yun <wuninsu@gmail.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 85c0a87cd117e83361932b2b160c9af178fdb21a)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/scsiback: avoid warnings when adding multiple LUNs to a domain

When adding more than one LUN to a frontend a warning for a failed
assignment is issued in dom0 for each already existing LUN. Avoid this
warning by checking for a LUN already existing when existence is
allowed (scsiback_do_add_lun() called with try == 1).

As the LUN existence check is needed now for a third time, factor it
out into a function. This in turn leads to a more or less complete
rewrite of scsiback_del_translation_entry() which will now return a
proper error code in case of failure.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit c9e2f531be000af652927ee0af3a0f24f8e9e046)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Conflicts:
drivers/xen/xen-scsiback.c

xen/scsiback: correct frontend counting

When adding a new frontend to xen-scsiback don't decrement the number
of active frontends in case of no error. Doing so results in a failure
when trying to remove the xen-pvscsi nexus even if no domain is using
it.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit f285aa8db7cc4432c1a03f8b55ff34fe96317c11)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: realloc ring info in blkif_resume

Need to reallocate ring info in the resume path, because info->rinfo was freed
in blkif_free(). And 'multi-queue-max-queues' backend reports may have been
changed.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 3db70a853202c252a8ebefa71ccb088ad149cdd2)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen-netfront: request Tx response events more often

Trying to batch Tx response events results in poor performance because
this delays freeing the transmitted skbs.

Instead use the standard RING_FINAL_CHECK_FOR_RESPONSES() macro to be
notified once the next Tx response is placed on the ring.

Signed-off-by: Malcolm Crossley <malcolm.crossley@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7d0105b5334b9722b7d33acad613096dfcf3330e)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

cleancache: constify cleancache_ops structure

The cleancache_ops structure is never modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit b3c6de492b9ea30a8dcc535d4dc2eaaf0bb3f116)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen-netback: free queues after freeing the net device

If a queue still has a NAPI instance added to the net device, freeing
the queues early results in a use-after-free.

The shouldn't ever happen because we disconnect and tear down all queues
before freeing the net device, but doing this makes it obviously safe.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 9c6f3ffe8200327d1cf2aad2ff2b414adaacbe96)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen-netback: delete NAPI instance when queue fails to initialize

When xenvif_connect() fails it may leave a stale NAPI instance added to
the device. Make sure we delete it in the error path.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 4a658527271bce43afb1cf4feec89afe6716ca59)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen-netback: use skb to determine number of required guest Rx requests

Using the MTU or GSO size to determine the number of required guest Rx
requests for an skb was subtly broken since these value may change at
runtime.

After 1650d5455bd2dc6b5ee134bd6fc1a3236c266b5b (xen-netback: always
fully coalesce guest Rx packets) we always fully pack a packet into
its guest Rx slots. Calculating the number of required slots from the
packet length is then easy.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 99a2dea50d5deff134b6c346f53a3ad1f583ee96)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/gntdev: add ioctl for grant copy

Add IOCTL_GNTDEV_GRANT_COPY to allow applications to copy between user
space buffers and grant references.

This interface is similar to the GNTTABOP_copy hypercall ABI except
the local buffers are provided using a virtual address (instead of a
GFN and offset). To avoid userspace from having to page align its
buffers the driver will use two or more ops if required.

If the ioctl returns 0, the application must check the status of each
segment with the segments status field. If the ioctl returns a -ve
error code (EINVAL or EFAULT), the status of individual ops is
undefined.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit a4cdb556cae05cd3e7b602b3a44c01420c4e2258)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: Fix crash if backend doesn't follow the right states.

We have split the setting up of all the resources in two steps:
1) talk_to_blkback  - which figures out the num_ring_pages (from
   the default value of zero), sets up shadow and so
2) blkfront_connect - does the real part of filling out the
   internal structures.

The problem is if we bypass the 1) step and go straight to 2)
and call blkfront_setup_indirect where we use the macro
BLK_RING_SIZE - which returns an negative value (because
sz is zero  - since num_ring_pages is zero - since it has never
been set).

We can fix this by making sure that we always have called
talk_to_blkback before going to blkfront_connect.

Or we could set in blkfront_probe info->nr_ring_pages = 1
to have a default value. But that looks odd - as we haven't
actually negotiated any ring size.

This patch changes XenbusStateConnected state to detect if
we haven't done the initial handshake - and if so continue
on as if were in XenbusStateInitWait state.

We also roll the error recovery (freeing the structure) into
talk_to_blkback error path - which is safe since that function
is only called from blkback_changed.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit c31ecf6c126dbc7f30234eaf6c4a079649a38de7)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkback: Fix two memory leaks.

This patch fixs two memleaks:
  backtrace:
    [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
    [<ffffffff81205e3b>] kmem_cache_alloc+0xbb/0x1d0
    [<ffffffff81534028>] xen_blkbk_probe+0x58/0x230
    [<ffffffff8146adb6>] xenbus_dev_probe+0x76/0x130
    [<ffffffff81511716>] driver_probe_device+0x166/0x2c0
    [<ffffffff815119bc>] __device_attach_driver+0xac/0xb0
    [<ffffffff8150fa57>] bus_for_each_drv+0x67/0x90
    [<ffffffff81511ab7>] __device_attach+0xc7/0x120
    [<ffffffff81511b23>] device_initial_probe+0x13/0x20
    [<ffffffff8151059a>] bus_probe_device+0x9a/0xb0
    [<ffffffff8150f0a1>] device_add+0x3b1/0x5c0
    [<ffffffff8150f47e>] device_register+0x1e/0x30
    [<ffffffff8146a9e8>] xenbus_probe_node+0x158/0x170
    [<ffffffff8146abaf>] xenbus_dev_changed+0x1af/0x1c0
    [<ffffffff8146b1bb>] backend_changed+0x1b/0x20
    [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
unreferenced object 0xffff880007ba8ef8 (size 224):

  backtrace:
    [<ffffffff817ba5e8>] kmemleak_alloc+0x28/0x50
    [<ffffffff81205c73>] __kmalloc+0xd3/0x1e0
    [<ffffffff81534d87>] frontend_changed+0x2c7/0x580
    [<ffffffff8146af12>] xenbus_otherend_changed+0xa2/0xb0
    [<ffffffff8146b2c0>] frontend_changed+0x10/0x20
    [<ffffffff81468ca6>] xenwatch_thread+0xb6/0x160
    [<ffffffff810d3e97>] kthread+0xd7/0xf0
    [<ffffffff817c4a9f>] ret_from_fork+0x3f/0x70
    [<ffffffffffffffff>] 0xffffffffffffffff
unreferenced object 0xffff8800048dcd38 (size 224):

The first leak is caused by not put() the be->blkif reference
which we had gotten in xen_blkif_alloc(), while the second is
us not freeing blkif->rings in the right place.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Reported-and-Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 93bb277f97a6d319361766bde228717faf4abdeb)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkback: make st_ statistics per ring

Make st_* statistics per ring and the VBD sysfs would iterate over all the
rings.

Note: xenvbd_sysfs_delif() is called in xen_blkbk_remove() before all rings
are torn down, so it's safe.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Aligned the variables on the same column.
(cherry picked from commit db6fbc106786f26d95889c50c18b1f28aa543a17)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: Handle non-indirect grant with 64KB pages

The minimal size of request in the block framework is always PAGE_SIZE.
It means that when 64KB guest is support, the request will at least be
64KB.

Although, if the backend doesn't support indirect descriptor (such as QDISK
in QEMU), a ring request is only able to accommodate 11 segments of 4KB
(i.e 44KB).

The current frontend is assuming that an I/O request will always fit in
a ring request. This is not true any more when using 64KB page
granularity and will therefore crash during boot.

On ARM64, the ABI is completely neutral to the page granularity used by
the domU. The guest has the choice between different page granularity
supported by the processors (for instance on ARM64: 4KB, 16KB, 64KB).
This can't be enforced by the hypervisor and therefore it's possible to
run guests using different page granularity.

So we can't mandate the block backend to support indirect descriptor
when the frontend is using 64KB page granularity and have to fix it
properly in the frontend.

The solution exposed below is based on modifying directly the frontend
guest rather than asking the block framework to support smaller size
(i.e < PAGE_SIZE). This is because the change is the block framework are
not trivial as everything seems to relying on a struct *page (see [1]).
Although, it may be possible that someone succeed to do it in the future
and we would therefore be able to use it.

Given that a block request may not fit in a single ring request, a
second request is introduced for the data that cannot fit in the first
one. This means that the second ring request should never be used on
Linux if the page size is smaller than 44KB.

To achieve the support of the extra ring request, the block queue size
is divided by two. Therefore, the ring will always contain enough space
to accommodate 2 ring requests. While this will reduce the overall
performance, it will make the implementation more contained. The way
forward to get better performance is to implement in the backend either
indirect descriptor or multiple grants ring.

Note that the parameters blk_queue_max_* helpers haven't been updated.
The block code will set the mimimum size supported and we may be able
to support directly any change in the block framework that lower down
the minimal size of a request.

[1] http://lists.xen.org/archives/html/xen-devel/2015-08/msg02200.html

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 6cc5683390472c450fd69975d1283db79202667f)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen-blkfront: Introduce blkif_ring_get_request

The code to get a request is always the same. Therefore we can factorize
it in a single function.

Signed-off-by: Julien Grall <julien.grall@citrix.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 2e073969d57f60fc0b863985779657624cbd4886)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen-blkback: clear PF_NOFREEZE for xen_blkif_schedule()

xen_blkif_schedule() kthread calls try_to_freeze() at the beginning of
every attempt to purge the LRU. This operation can't ever succeed though,
as the kthread hasn't marked itself as freezable.

Before (hopefully eventually) kthread freezing gets converted to fileystem
freezing, we'd rather mark xen_blkif_schedule() freezable (as it can
generate I/O during suspend).

Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a6e7af1288eeb7fca8361356998d31a92a291531)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkback: Free resources if connect_ring failed.

With the multi-queue support we could fail at setting up
some of the rings and fail the connection. That meant that
all resources tied to rings[0..n-1] (where n is the ring
that failed to be setup). Eventually the frontend will switch
to the states and we will call xen_blkif_disconnect.

However we do not want to be at the mercy of the frontend
deciding when to change states. This allows us to do the
cleanup right away and freeing resources.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 2d0382fac17cef20d507a0211b82e0942b2ab271)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blocks: Return -EXX instead of -1

Lets return sensible values instead of -1.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit bde21f73b9be146fda0c689f2724cda9d7737565)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkback: make pool of persistent grants and free pages per-queue

Make pool of persistent grants and free pages per-queue/ring instead of
per-device to get better scalability.

Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Results:
iops1: After patch "xen/blkfront: make persistent grants per-queue".
iops2: After this patch.

Queues:   1    4      8 16
Iops orig(k): 810 1064 780 700
Iops1(k): 810     1230(~20%) 1024(~20%) 850(~20%)
Iops2(k): 810     1410(~35%) 1354(~75%)      1440(~100%)

With 4 queues after this commit we can get ~75% increase in IOPS, and
performance won't drop if increasing queue numbers.

Please find the respective chart in this link:
https://www.dropbox.com/s/agrcy2pbzbsvmwv/iops.png?dl=0

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit d4bf0065b7251afb723a29b2fd58f7c38f8ce297)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkback: get the number of hardware queues/rings from blkfront

Backend advertises "multi-queue-max-queues" to front, also get the negotiated
number from "multi-queue-num-queues" written by blkfront.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit d62d86000316d7ef38e1c2e9602c3ce6d1cb57bd)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkback: pseudo support for multi hardware queues/rings

Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
"xen/blkback: get the number of hardware queues/rings from blkfront".

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align variables in the structures.
(cherry picked from commit 2fb1ef4f1226ea6d6d3481036cabe01a4415b68c)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkback: separate ring information out of struct xen_blkif

Split per ring information to an new structure "xen_blkif_ring", so that one vbd
device can be associated with one or more rings/hardware queues.

Introduce 'pers_gnts_lock' to protect the pool of persistent grants since we
may have multi backend threads.

This patch is a preparation for supporting multi hardware queues/rings.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
v2: Align the variables in the structure.
(cherry picked from commit 597957000ab5b1b38085c20868f3f7b9c305bae5)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: correct setting for xen_blkif_max_ring_order

According to this piece code:
"
pr_info("Invalid max_ring_order (%d), will use default max: %d.\n",
xen_blkif_max_ring_order, XENBUS_MAX_RING_GRANT_ORDER);
"
if xen_blkif_max_ring_order is bigger that XENBUS_MAX_RING_GRANT_ORDER,
need to set xen_blkif_max_ring_order using XENBUS_MAX_RING_GRANT_ORDER,
but not 0.

Signed-off-by: Peng Fan <van.freenix@gmail.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: "Roger Pau Monné" <roger.pau@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 45fc82642e54018740a25444d1165901501b601b)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: make persistent grants pool per-queue

Make persistent grants per-queue/ring instead of per-device, so that we can
drop the 'dev_lock' and get better scalability.

Test was done based on null_blk driver:
dom0: v4.2-rc8 16vcpus 10GB "modprobe null_blk"
domu: v4.2-rc8 16vcpus 10GB

[test]
rw=read
direct=1
ioengine=libaio
bs=4k
time_based
runtime=30
filename=/dev/xvdb
numjobs=16
iodepth=64
iodepth_batch=64
iodepth_batch_complete=64
group_reporting

Queues: 1 4 8 16
Iops orig(k): 810 1064 780 700
Iops patched(k): 810 1230(~20%) 1024(~20%) 850(~20%)

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 73716df7da4f60dd2d59a9302227d0394f1b8fcc)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: Remove duplicate setting of ->xbdev.

We do the same exact operations a bit earlier in the
function.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 75f070b3967b0c3bf0e1bc43411b06bab6c2c2cd)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: Cleanup of comments, fix unaligned variables, and syntax errors.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 6f03a7ff89485f0a7a559bf5c7631d2986c4ecfa)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: negotiate number of queues/rings to be used with backend

The max number of hardware queues for xen/blkfront is set by parameter
'max_queues'(default 4), while it is also capped by the max value that the
xen/blkback exposes through XenStore key 'multi-queue-max-queues'.

The negotiated number is the smaller one and would be written back to xenstore
as "multi-queue-num-queues", blkback needs to read this negotiated number.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 28d949bcc28bbc2d206f9c3f69b892575e81c040)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: split per device io_lock

After patch "xen/blkfront: separate per ring information out of device
info", per-ring data is protected by a per-device lock ('io_lock').

This is not a good way and will effect the scalability, so introduce a
per-ring lock ('ring_lock').

The old 'io_lock' is renamed to 'dev_lock' which protects the ->grants list and
->persistent_gnts_c which are shared by all rings.

Note that in 'blkfront_probe' the 'blkfront_info' is setup via kzalloc
so setting ->persistent_gnts_c to zero is not needed.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 11659569f7202d0cb6553e81f9b8aa04dfeb94ce)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: pseudo support for multi hardware queues/rings

Preparatory patch for multiple hardware queues (rings). The number of
rings is unconditionally set to 1, larger number will be enabled in
patch "xen/blkfront: negotiate number of queues/rings to be used with backend"
so as to make review easier.

Note that blkfront_gather_backend_features does not call
blkfront_setup_indirect anymore (as that needs to be done per ring).
That means that in blkif_recover/blkif_connect we have to do it in a loop
(bounded by nr_rings).

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 3df0e5059908b8fdba351c4b5dd77caadd95a949)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/blkfront: separate per ring information out of device info

Split per ring information to a new structure "blkfront_ring_info".

A ring is the representation of a hardware queue, every vbd device can associate
with one or more rings depending on how many hardware queues/rings to be used.

This patch is a preparation for supporting real multi hardware queues/rings.

We also add a backpointer to 'struct blkfront_info' (dev_info) which
is not needed (we could use containers_of) but further patch
("xen/blkfront: pseudo support for multi hardware queues/rings")
will make allocation of 'blkfront_ring_info' dynamic.

Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com>
Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 81f351615772365d46ceeac3e50c9dd4e8f9dc89)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Conflicts:
drivers/block/xen-blkfront.c

xen/blkif: document blkif multi-queue/ring extension

Document the multi-queue/ring feature in terms of XenStore keys to be written by
the backend and by the frontend.

Signed-off-by: Bob Liu <bob.liu@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit eb5df87fab0ae7114b83dc7f338b27d039374767)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

x86/xen: don't reset vcpu_info on a cancelled suspend

On a cancelled suspend the vcpu_info location does not change (it's
still in the per-cpu area registered by xen_vcpu_setup()). So do not
call xen_hvm_init_shared_info() which would make the kernel think its
back in the shared info. With the wrong vcpu_info, events cannot be
received and the domain will hang after a cancelled suspend.

Signed-off-by: Charles Ouyang <ouyangzhaowei@huawei.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 6a1f513776b78c994045287073e55bae44ed9f8c)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Conflicts:
arch/x86/xen/suspend.c

xen/gntdev: constify mmu_notifier_ops structures

This mmu_notifier_ops structure is never modified, so declare it as
const, like the other mmu_notifier_ops structures.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit b9c0a92a9aa953e5a98f2af2098c747d4358c7bb)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/grant-table: constify gnttab_ops structure

The gnttab_ops structure is never modified, so declare it as const.

Done with the help of Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 86fc2136736d2767bf797e6d2b1f80b49f52953c)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/time: use READ_ONCE

Use READ_ONCE through the code, rather than explicit barriers.

Suggested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 2dd887e32175b624375570a0361083eb2cd64a07)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/x86: convert remaining timespec to timespec64 in xen_pvclock_gtod_notify

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit 187b26a97244b1083d573175650f41b2267ac635)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/x86: support XENPF_settime64

Try XENPF_settime64 first, if it is not available fall back to
XENPF_settime32.

No need to call __current_kernel_time() when all the info needed are
already passed via the struct timekeeper * argument.

Return NOTIFY_BAD in case of errors.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit 760968631323f710ea0824369bbd65f812c82f08)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen: introduce XENPF_settime64

Rename the current XENPF_settime hypercall and related struct to
XENPF_settime32.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit f3d6027ee0568b5442077120beeb5d9d17c2d0da)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen: rename dom0_op to platform_op

The dom0_op hypercall has been renamed to platform_op since Xen 3.2,
which is ancient, and modern upstream Linux kernels cannot run as dom0
and it anymore anyway.

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit cfafae940381207d48b11a73a211142dba5947d3)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen: move xen_setup_runstate_info and get_runstate_snapshot to drivers/xen/time.c

Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4ccefbe597392d2914cf7ad904e33c734972681d)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>
Conflicts:
arch/x86/xen/time.c

x86/paravirt: Remove paravirt ops pmd_update[_defer] and pte_update_defer

pte_update_defer can be removed as it is always set to the same
function as pte_update. So any usage of pte_update_defer() can be
replaced by pte_update().

pmd_update and pmd_update_defer are always set to paravirt_nop, so they
can just be nuked.

Signed-off-by: Juergen Gross <jgross@suse.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: jeremy@goop.org
Cc: chrisw@sous-sol.org
Cc: akataria@vmware.com
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xen.org
Cc: konrad.wilk@oracle.com
Cc: david.vrabel@citrix.com
Cc: boris.ostrovsky@oracle.com
Link: http://lkml.kernel.org/r/1447771879-1806-1-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit d6ccc3ec95251d8d3276f2900b59cbc468dd74f4)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

x86/paravirt: Remove unused pv_apic_ops structure

The only member of that structure is startup_ipi_hook which is always
set to paravirt_nop.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: jeremy@goop.org
Cc: chrisw@sous-sol.org
Cc: akataria@vmware.com
Cc: rusty@rustcorp.com.au
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xen.org
Cc: konrad.wilk@oracle.com
Cc: boris.ostrovsky@oracle.com
Link: http://lkml.kernel.org/r/1447767872-16730-1-git-send-email-jgross@suse.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit 460958659270b7d750d4ccfe052171cb6f655cbb)
Signed-off-by: Daniel Kiper <daniel.kiper@oracle.com>

xen/events: Mask a moving irq

Moving an unmasked irq may result in irq handler being invoked on both
source and target CPUs.

With 2-level this can happen as follows:

On source CPU:
        evtchn_2l_handle_events() ->
            generic_handle_irq() ->
                handle_edge_irq() ->
                   eoi_pirq():
                       irq_move_irq(data);

                       /***** WE ARE HERE *****/

                       if (VALID_EVTCHN(evtchn))
                           clear_evtchn(evtchn);

If at this moment target processor is handling an unrelated event in
evtchn_2l_handle_events()'s loop it may pick up our event since target's
cpu_evtchn_mask claims that this event belongs to it *and* the event is
unmasked and still pending. At the same time, source CPU will continue
executing its own handle_edge_irq().

With FIFO interrupt the scenario is similar: irq_move_irq() may result
in a EVTCHNOP_unmask hypercall which, in turn, may make the event
pending on the target CPU.

We can avoid this situation by moving and clearing the event while
keeping event masked.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit ff1e22e7a638a0782f54f81a6c9cb139aca2da35)
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Orabug: 23214472

xen-blkback: read from indirect descriptors only once

Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.

When parsing indirect descriptors, only read first_sect and last_sect
once.

This is part of XSA155.

(cherry-pick from 18779149101c0dd43ded43669ae2a92d21b6f9cb)
CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
----
v2: This is against v4.3

xen/pcifront: Report the errors better.

The messages should be different depending on the type of error.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 2cfec6a2f989d5c921ba11a329ff8ea986702b9b)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23017418 - Backport Linux v4.4 Xen patches

xen/pcifront: Fix mysterious crashes when NUMA locality information was extracted.

Occasionaly PV guests would crash with:

pciback 0000:00:00.1: Xen PCI mapped GSI0 to IRQ16
BUG: unable to handle kernel paging request at 0000000d1a8c0be0
.. snip..
  <ffffffff8139ce1b>] find_next_bit+0xb/0x10
  [<ffffffff81387f22>] cpumask_next_and+0x22/0x40
  [<ffffffff813c1ef8>] pci_device_probe+0xb8/0x120
  [<ffffffff81529097>] ? driver_sysfs_add+0x77/0xa0
  [<ffffffff815293e4>] driver_probe_device+0x1a4/0x2d0
  [<ffffffff813c1ddd>] ? pci_match_device+0xdd/0x110
  [<ffffffff81529657>] __device_attach_driver+0xa7/0xb0
  [<ffffffff815295b0>] ? __driver_attach+0xa0/0xa0
  [<ffffffff81527622>] bus_for_each_drv+0x62/0x90
  [<ffffffff8152978d>] __device_attach+0xbd/0x110
  [<ffffffff815297fb>] device_attach+0xb/0x10
  [<ffffffff813b75ac>] pci_bus_add_device+0x3c/0x70
  [<ffffffff813b7618>] pci_bus_add_devices+0x38/0x80
  [<ffffffff813dc34e>] pcifront_scan_root+0x13e/0x1a0
  [<ffffffff817a0692>] pcifront_backend_changed+0x262/0x60b
  [<ffffffff814644c6>] ? xenbus_gather+0xd6/0x160
  [<ffffffff8120900f>] ? put_object+0x2f/0x50
  [<ffffffff81465c1d>] xenbus_otherend_changed+0x9d/0xa0
  [<ffffffff814678ee>] backend_changed+0xe/0x10
  [<ffffffff81463a28>] xenwatch_thread+0xc8/0x190
  [<ffffffff810f22f0>] ? woken_wake_function+0x10/0x10

which was the result of two things:

When we call pci_scan_root_bus we would pass in 'sd' (sysdata)
pointer which was an 'pcifront_sd' structure. However in the
pci_device_add it expects that the 'sd' is 'struct sysdata' and
sets the dev->node to what is in sd->node (offset 4):

set_dev_node(&dev->dev, pcibus_to_node(bus));

__pcibus_to_node(const struct pci_bus *bus)
{
        const struct pci_sysdata *sd = bus->sysdata;

        return sd->node;
}

However our structure was pcifront_sd which had nothing at that
offset:

struct pcifront_sd {
        int                        domain;    /*     0     4 */
        /* XXX 4 bytes hole, try to pack */
        struct pcifront_device *   pdev;      /*     8     8 */
}

That is an hole - filled with garbage as we used kmalloc instead of
kzalloc (the second problem).

This patch fixes the issue by:
1) Use kzalloc to initialize to a well known state.
2) Put 'struct pci_sysdata' at the start of 'pcifront_sd'. That
    way access to the 'node' will access the right offset.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 4d8c8bd6f2062c9988817183a91fe2e623c8aa5e)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug:  23017418 - Backport Linux v4.4 Xen patches

xen/pciback: Save the number of MSI-X entries to be copied later.

Commit 8135cf8b092723dbfcc611fe6fdcb3a36c9951c5 (xen/pciback: Save
xen_pci_op commands before processing it) broke enabling MSI-X because
it would never copy the resulting vectors into the response. The
number of vectors requested was being overwritten by the return value
(typically zero for success).

Save the number of vectors before processing the op, so the correct
number of vectors are copied afterwards.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit d159457b84395927b5a52adb72f748dd089ad5e5)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23017418 - Backport Linux v4.4 Xen patches

xen/pciback: Check PF instead of VF for PCI_COMMAND_MEMORY

Commit 408fb0e5aa7fda0059db282ff58c3b2a4278baa0 (xen/pciback: Don't
allow MSI-X ops if PCI_COMMAND_MEMORY is not set) prevented enabling
MSI-X on passed-through virtual functions, because it checked the VF
for PCI_COMMAND_MEMORY but this is not a valid bit for VFs.

Instead, check the physical function for PCI_COMMAND_MEMORY.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 8d47065f7d1980dde52abb874b301054f3013602)
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
OraBug: 23017418 - Backport Linux v4.4 Xen patches

Merge branch 'linux-4.1/4.4-xen-backport' of git://ca-git.us.oracle.com/linux-joaomart-public into uek4/4.4-xen-backport

* 'linux-4.1/4.4-xen-backport' of git://ca-git.us.oracle.com/linux-joaomart-public: (113 commits)
  arch/x86/xen/suspend.c: include xen/xen.h
  x86/paravirt: Prevent rtc_cmos platform device init on PV guests
  xen-pciback: fix up cleanup path when alloc fails
  xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.
  xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.
  xen/pciback: Do not install an IRQ handler for MSI interrupts.
  xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled
  xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled
  xen/pciback: Save xen_pci_op commands before processing it
  xen-scsiback: safely copy requests
  xen-blkback: read from indirect descriptors only once
  xen-blkback: only read request operation from shared ring once
  xen-netback: use RING_COPY_REQUEST() throughout
  xen-netback: don't use last request to determine minimum Tx credit
  xen: Add RING_COPY_REQUEST()
  xen/x86/pvh: Use HVM's flush_tlb_others op
  xen: Resume PMU from non-atomic context
  xen/events/fifo: Consume unprocessed events when a CPU dies
  xen/evtchn: dynamically grow pending event channel ring
  xen/gntdev: Grant maps should not be subject to NUMA balancing
  ...

Backport from Linux v4.4

OraBug: 23017418 - Backport Linux v4.4 Xen patches

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>

arch/x86/xen/suspend.c: include xen/xen.h

Fix the build warning:

  arch/x86/xen/suspend.c: In function 'xen_arch_pre_suspend':
  arch/x86/xen/suspend.c:70:9: error: implicit declaration of function 'xen_pv_domain' [-Werror=implicit-function-declaration]
          if (xen_pv_domain())
              ^

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit facca61683f937f31f90307cc64851436c8a3e21)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

x86/paravirt: Prevent rtc_cmos platform device init on PV guests

Adding the rtc platform device in non-privileged Xen PV guests causes
an IRQ conflict because these guests do not have legacy PIC and may
allocate irqs in the legacy range.

In a single VCPU Xen PV guest we should have:

/proc/interrupts:
           CPU0
  0:       4934  xen-percpu-virq      timer0
  1:          0  xen-percpu-ipi       spinlock0
  2:          0  xen-percpu-ipi       resched0
  3:          0  xen-percpu-ipi       callfunc0
  4:          0  xen-percpu-virq      debug0
  5:          0  xen-percpu-ipi       callfuncsingle0
  6:          0  xen-percpu-ipi       irqwork0
  7:        321   xen-dyn-event     xenbus
  8:         90   xen-dyn-event     hvc_console
  ...

But hvc_console cannot get its interrupt because it is already in use
by rtc0 and the console does not work.

  genirq: Flags mismatch irq 8. 00000000 (hvc_console) vs. 00000000 (rtc0)

We can avoid this problem by realizing that unprivileged PV guests (both
Xen and lguests) are not supposed to have rtc_cmos device and so
adding it is not necessary.

Privileged guests (i.e. Xen's dom0) do use it but they should not have
irq conflicts since they allocate irqs above legacy range (above
gsi_top, in fact).

Instead of explicitly testing whether the guest is privileged we can
extend pv_info structure to include information about guest's RTC
support.

Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: vkuznets@redhat.com
Cc: xen-devel@lists.xenproject.org
Cc: konrad.wilk@oracle.com
Cc: stable@vger.kernel.org # 4.2+
Link: http://lkml.kernel.org/r/1449842873-2613-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
(cherry picked from commit d8c98a1d1488747625ad6044d423406e17e99b7a)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-pciback: fix up cleanup path when alloc fails

When allocating a pciback device fails, clear the private
field. This could lead to an use-after free, however
the 'really_probe' takes care of setting
dev_set_drvdata(dev, NULL) in its failure path (which we would
exercise if the ->probe function failed), so we we
are OK. However lets be defensive as the code can change.

Going forward we should clean up the pci_set_drvdata(dev, NULL)
in the various code-base. That will be for another day.

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Jonathan Creekmore <jonathan.creekmore@gmail.com>
Signed-off-by: Doug Goldstein <cardoe@cardoe.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 584a561a6fee0d258f9ca644f58b73d9a41b8a46)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Don't allow MSI-X ops if PCI_COMMAND_MEMORY is not set.

commit f598282f51 ("PCI: Fix the NIU MSI-X problem in a better way")
teaches us that dealing with MSI-X can be troublesome.

Further checks in the MSI-X architecture shows that if the
PCI_COMMAND_MEMORY bit is turned of in the PCI_COMMAND we
may not be able to access the BAR (since they are memory regions).

Since the MSI-X tables are located in there.. that can lead
to us causing PCIe errors. Inhibit us performing any
operation on the MSI-X unless the MEMORY bit is set.

Note that Xen hypervisor with:
"x86/MSI-X: access MSI-X table only after having enabled MSI-X"
will return:
xen_pciback: 0000:0a:00.1: error -6 enabling MSI-X for guest 3!

When the generic MSI code tries to setup the PIRQ without
MEMORY bit set. Which means with later versions of Xen
(4.6) this patch is not neccessary.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 408fb0e5aa7fda0059db282ff58c3b2a4278baa0)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: For XEN_PCI_OP_disable_msi[|x] only disable if device has MSI(X) enabled.

Otherwise just continue on, returning the same values as
previously (return of 0, and op->result has the PIRQ value).

This does not change the behavior of XEN_PCI_OP_disable_msi[|x].

The pci_disable_msi or pci_disable_msix have the checks for
msi_enabled or msix_enabled so they will error out immediately.

However the guest can still call these operations and cause
us to disable the 'ack_intr'. That means the backend IRQ handler
for the legacy interrupt will not respond to interrupts anymore.

This will lead to (if the device is causing an interrupt storm)
for the Linux generic code to disable the interrupt line.

Naturally this will only happen if the device in question
is plugged in on the motherboard on shared level interrupt GSI.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 7cfb905b9638982862f0331b36ccaaca5d383b49)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Do not install an IRQ handler for MSI interrupts.

Otherwise an guest can subvert the generic MSI code to trigger
an BUG_ON condition during MSI interrupt freeing:

for (i = 0; i < entry->nvec_used; i++)
BUG_ON(irq_has_action(entry->irq + i));

Xen PCI backed installs an IRQ handler (request_irq) for
the dev->irq whenever the guest writes PCI_COMMAND_MEMORY
(or PCI_COMMAND_IO) to the PCI_COMMAND register. This is
done in case the device has legacy interrupts the GSI line
is shared by the backend devices.

To subvert the backend the guest needs to make the backend
to change the dev->irq from the GSI to the MSI interrupt line,
make the backend allocate an interrupt handler, and then command
the backend to free the MSI interrupt and hit the BUG_ON.

Since the backend only calls 'request_irq' when the guest
writes to the PCI_COMMAND register the guest needs to call
XEN_PCI_OP_enable_msi before any other operation. This will
cause the generic MSI code to setup an MSI entry and
populate dev->irq with the new PIRQ value.

Then the guest can write to PCI_COMMAND PCI_COMMAND_MEMORY
and cause the backend to setup an IRQ handler for dev->irq
(which instead of the GSI value has the MSI pirq). See
'xen_pcibk_control_isr'.

Then the guest disables the MSI: XEN_PCI_OP_disable_msi
which ends up triggering the BUG_ON condition in 'free_msi_irqs'
as there is an IRQ handler for the entry->irq (dev->irq).

Note that this cannot be done using MSI-X as the generic
code does not over-write dev->irq with the MSI-X PIRQ values.

The patch inhibits setting up the IRQ handler if MSI or
MSI-X (for symmetry reasons) code had been called successfully.

P.S.
Xen PCIBack when it sets up the device for the guest consumption
ends up writting 0 to the PCI_COMMAND (see xen_pcibk_reset_device).
XSA-120 addendum patch removed that - however when upstreaming said
addendum we found that it caused issues with qemu upstream. That
has now been fixed in qemu upstream.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit a396f3a210c3a61e94d6b87ec05a75d0be2a60d0)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Return error on XEN_PCI_OP_enable_msix when device has MSI or MSI-X enabled

The guest sequence of:

a) XEN_PCI_OP_enable_msix
b) XEN_PCI_OP_enable_msix

results in hitting an NULL pointer due to using freed pointers.

The device passed in the guest MUST have MSI-X capability.

The a) constructs and SysFS representation of MSI and MSI groups.
The b) adds a second set of them but adding in to SysFS fails (duplicate entry).
'populate_msi_sysfs' frees the newly allocated msi_irq_groups (note that
in a) pdev->msi_irq_groups is still set) and also free's ALL of the
MSI-X entries of the device (the ones allocated in step a) and b)).

The unwind code: 'free_msi_irqs' deletes all the entries and tries to
delete the pdev->msi_irq_groups (which hasn't been set to NULL).
However the pointers in the SysFS are already freed and we hit an
NULL pointer further on when 'strlen' is attempted on a freed pointer.

The patch adds a simple check in the XEN_PCI_OP_enable_msix to guard
against that. The check for msi_enabled is not stricly neccessary.

This is part of XSA-157

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 5e0ce1455c09dd61d029b8ad45d82e1ac0b6c4c9)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Return error on XEN_PCI_OP_enable_msi when device has MSI or MSI-X enabled

The guest sequence of:

a) XEN_PCI_OP_enable_msi
b) XEN_PCI_OP_enable_msi
c) XEN_PCI_OP_disable_msi

results in hitting an BUG_ON condition in the msi.c code.

The MSI code uses an dev->msi_list to which it adds MSI entries.
Under the above conditions an BUG_ON() can be hit. The device
passed in the guest MUST have MSI capability.

The a) adds the entry to the dev->msi_list and sets msi_enabled.
The b) adds a second entry but adding in to SysFS fails (duplicate entry)
and deletes all of the entries from msi_list and returns (with msi_enabled
is still set). c) pci_disable_msi passes the msi_enabled checks and hits:

BUG_ON(list_empty(dev_to_msi_list(&dev->dev)));

and blows up.

The patch adds a simple check in the XEN_PCI_OP_enable_msi to guard
against that. The check for msix_enabled is not stricly neccessary.

This is part of XSA-157.

CC: stable@vger.kernel.org
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 56441f3c8e5bd45aab10dd9f8c505dd4bec03b0d)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/pciback: Save xen_pci_op commands before processing it

Double fetch vulnerabilities that happen when a variable is
fetched twice from shared memory but a security check is only
performed the first time.

The xen_pcibk_do_op function performs a switch statements on the op->cmd
value which is stored in shared memory. Interestingly this can result
in a double fetch vulnerability depending on the performed compiler
optimization.

This patch fixes it by saving the xen_pci_op command before
processing it. We also use 'barrier' to make sure that the
compiler does not perform any optimization.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 8135cf8b092723dbfcc611fe6fdcb3a36c9951c5)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-scsiback: safely copy requests

The copy of the ring request was lacking a following barrier(),
potentially allowing the compiler to optimize the copy away.

Use RING_COPY_REQUEST() to ensure the request is copied to local
memory.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit be69746ec12f35b484707da505c6c76ff06f97dc)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-blkback: read from indirect descriptors only once

Since indirect descriptors are in memory shared with the frontend, the
frontend could alter the first_sect and last_sect values after they have
been validated but before they are recorded in the request. This may
result in I/O requests that overflow the foreign page, possibly
overwriting local pages when the I/O request is executed.

When parsing indirect descriptors, only read first_sect and last_sect
once.

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 18779149101c0dd43ded43669ae2a92d21b6f9cb)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-blkback: only read request operation from shared ring once

A compiler may load a switch statement value multiple times, which could
be bad when the value is in memory shared with the frontend.

When converting a non-native request to a native one, ensure that
src->operation is only loaded once by using READ_ONCE().

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: Roger Pau Monné <roger.pau@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 1f13d75ccb806260079e0679d55d9253e370ec8a)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-netback: use RING_COPY_REQUEST() throughout

Instead of open-coding memcpy()s and directly accessing Tx and Rx
requests, use the new RING_COPY_REQUEST() that ensures the local copy
is correct.

This is more than is strictly necessary for guest Rx requests since
only the id and gref fields are used and it is harmless if the
frontend modifies these.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 68a33bfd8403e4e22847165d149823a2e0e67c9c)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen-netback: don't use last request to determine minimum Tx credit

The last from guest transmitted request gives no indication about the
minimum amount of credit that the guest might need to send a packet
since the last packet might have been a small one.

Instead allow for the worst case 128 KiB packet.

This is part of XSA155.

CC: stable@vger.kernel.org
Reviewed-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 0f589967a73f1f30ab4ac4dd9ce0bb399b4d6357)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen: Add RING_COPY_REQUEST()

Using RING_GET_REQUEST() on a shared ring is easy to use incorrectly
(i.e., by not considering that the other end may alter the data in the
shared ring while it is being inspected). Safe usage of a request
generally requires taking a local copy.

Provide a RING_COPY_REQUEST() macro to use instead of
RING_GET_REQUEST() and an open-coded memcpy(). This takes care of
ensuring that the copy is done correctly regardless of any possible
compiler optimizations.

Use a volatile source to prevent the compiler from reordering or
omitting the copy.

This is part of XSA155.

CC: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 454d5d882c7e412b840e3c99010fe81a9862f6fb)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/x86/pvh: Use HVM's flush_tlb_others op

Using MMUEXT_TLB_FLUSH_MULTI doesn't buy us much since the hypervisor
will likely perform same IPIs as would have the guest.

More importantly, using MMUEXT_INVLPG_MULTI may not to invalidate the
guest's address on remote CPU (when, for example, VCPU from another guest
is running there).

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Suggested-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 20f36e0380a7e871a711d5e4e59d04d4948326b4)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen: Resume PMU from non-atomic context

Resuming PMU currently triggers a warning from ___might_sleep() (assuming
CONFIG_DEBUG_ATOMIC_SLEEP is set) when xen_pmu_init() allocates GFP_KERNEL
page because we are in state resembling atomic context.

Move resuming PMU to xen_arch_resume() which is called in regular context.
For symmetry move suspending PMU to xen_arch_suspend() as well.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: <stable@vger.kernel.org> # 4.3
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit de0afc9bdeeadaa998797d2333c754bf9f4d5dcf)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/events/fifo: Consume unprocessed events when a CPU dies

When a CPU is offlined, there may be unprocessed events on a port for
that CPU. If the port is subsequently reused on a different CPU, it
could be in an unexpected state with the link bit set, resulting in
interrupts being missed. Fix this by consuming any unprocessed events
for a particular CPU when that CPU dies.

Signed-off-by: Ross Lagerwall <ross.lagerwall@citrix.com>
Cc: <stable@vger.kernel.org> # 3.14+
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 3de88d622fd68bd4dbee0f80168218b23f798fd0)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/evtchn: dynamically grow pending event channel ring

If more than 1024 event channels are bound to a evtchn device then it
possible (even with well behaved applications) for the ring to
overflow and events to be lost (reported as an -EFBIG error).

Dynamically increase the size of the ring so there is always enough
space for all bound events.  Well behaved applicables that only unmask
events after draining them from the ring can thus no longer lose
events.

However, an application could unmask an event before draining it,
allowing multiple entries per port to accumulate in the ring, and a
overflow could still occur.  So the overflow detection and reporting
is retained.

The ring size is initially only 64 entries so the common use case of
an application only binding a few events will use less memory than
before.  The ring size may grow to 512 KiB (enough for all 2^17
possible channels).  This order 7 kmalloc() may fail due to memory
fragmentation, so we fall back to trying vmalloc().

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
(cherry picked from commit 8620015499101090ae275bf11e9bc2f9febfdf08)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen/gntdev: Grant maps should not be subject to NUMA balancing

Doing so will cause the grant to be unmapped and then, during
fault handling, the fault to be mistakenly treated as NUMA hint
fault.

In addition, even if those maps could partcipate in NUMA
balancing, it wouldn't provide any benefit since we are unable
to determine physical page's node (even if/when VNUMA is
implemented).

Marking grant maps' VMAs as VM_IO will exclude them from being
part of NUMA balancing.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit 9c17d96500f78d7ecdb71ca6942830158bc75a2b)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

xen: fix the check of e_pfn in xen_find_pfn_range

On some NUMA system, after dom0 up, we see below warning even if there are
enough pfn ranges that could be used for remapping:
"Unable to find available pfn range, not remapping identity pages"

Fix it to avoid getting a memory region of zero size in xen_find_pfn_range.

Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
(cherry picked from commit abed7d0710e8f892c267932a9492ccf447674fb8)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

x86/xen: add reschedule point when mapping foreign GFNs

Mapping a large range of foreign GFNs can take a long time, add a
reschedule point after each batch of 16 GFNs.

Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
(cherry picked from commit 914beb9fc26d6225295b8315ab54026f8f22755c)
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>