Padmanabh Ratnakar [Fri, 24 Aug 2012 13:26:33 +0000 (18:56 +0530)]
be2net: Fix driver load for VFs for Lancer
Permanent MAC is wrongly supplied in create iface command. Call the
command with no MAC address and then MAC address should be later queried
and applied.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Fri, 24 Aug 2012 13:05:10 +0000 (18:35 +0530)]
be2net: do not use SCRATCHPAD register
The CUST_SCRATCHPAD_CSR register is used for marking if FW cleanup is
needed. This is used in a crash kernel scenario. Do no use this register as
it is not available for some functions. Instead, always issue an FLR when
a function is probed *except* when VFs are preset (enabled in the previous
PF load).
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Tue, 5 Jun 2012 19:37:19 +0000 (19:37 +0000)]
be2net: do not modify PCI MaxReadReq size
Setting the PCI MRRS to a value of 4096 (overriding the system decided
value) had provided perf improvement in TX.
But, IBM has provided feedback that on POWER platforms, this value is set
by the system firmware, and drivers modifying this value can cause
unpredictable results (like EEH errors.) So, backing off this change.
On POWER7 platforms most slots, it seems, do get a MRRS of 4096.
Suggested-by: Brian King <bjking1@us.ibm.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Somnath Kotur [Fri, 24 Aug 2012 13:00:51 +0000 (18:30 +0530)]
be2net: Fix to allow get/set of debug levels in the firmware.
Patch re-spin.
Incorporated review comments by Ben Hutchings.
Signed-off-by: Suresh Reddy <suresh.reddy@emulex.com> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com> Acked-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Fri, 24 Aug 2012 12:08:10 +0000 (17:38 +0530)]
be2net: avoid disabling sriov while VFs are assigned
Calling pci_disable_sriov() while VFs are assigned to VMs causes
kernel panic. This patch uses PCI_DEV_FLAGS_ASSIGNED bit state of the
VF's pci_dev to avoid this. Also, the unconditional function reset cmd
issued on a PF probe can delete the VF configuration for the
previously enabled VFs. A scratchpad register is now used to issue a
function reset only when needed (i.e., in a crash dump scenario.)
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Parav Pandit [Fri, 24 Aug 2012 10:01:27 +0000 (15:31 +0530)]
be2net: Add functionality to support RoCE driver
- Increase MSI-X vectors by 5 for RoCE traffic.
- Add macro to check roce support on a device.
- Add device-specific doorbell and MSI-X vector fields shared with NIC
functionality.
- Provide RoCE driver registration and deregistration functions.
- Add support functions which will be invoked on adapter add/remove
and port up/down events.
- Traverse through the list of adapters to invoke callback functions.
Signed-off-by: Parav Pandit <parav.pandit@emulex.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Roland Dreier <roland@purestorage.com>
Parav Pandit [Mon, 26 Mar 2012 14:27:12 +0000 (14:27 +0000)]
be2net: Add function to issue mailbox cmd on MQ
- Add generic function to issue mailbox cmd on MQ as export function.
- RoCE driver will use this before it setups its own MQ.
Signed-off-by: Parav Pandit <parav.pandit@emulex.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Roland Dreier <roland@purestorage.com>
Saurav Kashyap [Thu, 2 Aug 2012 06:04:17 +0000 (11:34 +0530)]
qla2xxx: Restrict nic core reset to one function for mctp.
In case of mctp enable board both functions receive 8200 AEN, both captures
the dump and both ends up restarting the nic core. This patch prevents allow
only function to perform nic core reset.
Saurav Kashyap [Mon, 27 Aug 2012 07:28:00 +0000 (12:58 +0530)]
qla2xxx: Update to Implementation of the mctp.
- Changes the size to 0x86064.
- The fw expects dword len instead of bytes.
- removed family version, its removed from requirements.
- Do nic core reset even on dump failure.
- mctp dump failure used to return failure even in case of success.
- memset mailbox regs and correctly set them.
Chad Dupuis [Mon, 27 Aug 2012 07:24:57 +0000 (12:54 +0530)]
qla2xxx: Fix for handling some error conditions in loopback.
Fixes the bug where in case we do not receive DCBX completion aen after a
set-port mbx Or when we get a bad status for IDC completion (mbx 8100) AEN
(i.e. bit 15 of mb2 is set) then we need to return error status and fail the
loopback operation.
Andrew Vasquez [Tue, 3 Jul 2012 16:51:56 +0000 (09:51 -0700)]
qla2xxx: Ensure PLOGI is sent to Fabric Management-Server upon request.
The internal firmware state for this 'well known port' may
be out-of-sync after a link-flop, causing a follow-on
CT-request to be dropped due to the requestor not having
been 'logged in'. Correct the code by not passing the
'conditional' directive for the PLOGI request.
Cathy Avery [Fri, 17 Aug 2012 19:15:28 +0000 (15:15 -0400)]
[ovmapi] fix memcpy overrun, leaks and mutex unlock
Added bug fixes:
mempy overrun of name and value buffer when strings are too long.
Fixed memory leaks.
Fixed not unlocking mutex on some error returns.
Konrad Rzeszutek Wilk [Fri, 17 Aug 2012 14:26:21 +0000 (10:26 -0400)]
Merge branch 'stable/for-linus-3.7.rebased' into uek2-merge
* stable/for-linus-3.7.rebased:
xen/mmu: If the revector fails, don't attempt to revector anything else.
xen/p2m: When revectoring deal with holes in the P2M array.
xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.
Revert "xen PVonHVM: move shared_info to MMIO before kexec"
xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
Konrad Rzeszutek Wilk [Fri, 17 Aug 2012 13:35:31 +0000 (09:35 -0400)]
xen/mmu: If the revector fails, don't attempt to revector anything else.
If the P2M revectoring would fail, we would try to continue on by
cleaning the PMD for L1 (PTE) page-tables. The xen_cleanhighmap
is greedy and erases the PMD on both boundaries. Since the P2M
array can share the PMD, we would wipe out part of the __ka
that is still used in the P2M tree to point to P2M leafs.
This fixes it by bypassing the revectoring and continuing on.
If the revector fails, a nice WARN is printed so we can still
troubleshoot this.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Thu, 16 Aug 2012 20:38:55 +0000 (16:38 -0400)]
xen/p2m: When revectoring deal with holes in the P2M array.
When we free the PFNs and then subsequently populate them back
during bootup:
Freeing 20000-20200 pfn range: 512 pages freed
1-1 mapping on 20000->20200
Freeing 40000-40200 pfn range: 512 pages freed
1-1 mapping on 40000->40200
Freeing bad80-badf4 pfn range: 116 pages freed
1-1 mapping on bad80->badf4
Freeing badf6-bae7f pfn range: 137 pages freed
1-1 mapping on badf6->bae7f
Freeing bb000-100000 pfn range: 282624 pages freed
1-1 mapping on bb000->100000
Released 283999 pages of unused memory
Set 283999 page(s) to 1-1 mapping
Populating 1acb8a-1f20e9 pfn range: 283999 pages added
We end up having the P2M array (that is the one that was
grafted on the P2M tree) filled with IDENTITY_FRAME or
INVALID_P2M_ENTRY) entries. The patch titled
"xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID."
recycles said slots and replaces the P2M tree leaf's with
&mfn_list[xx] with p2m_identity or p2m_missing.
And re-uses the P2M array sections for other P2M tree leaf's.
For the above mentioned bootup excerpt, the PFNs at
0x20000->0x20200 are going to be IDENTITY based:
P2M[0][256][0] -> P2M[0][257][0] get turned in IDENTITY_FRAME.
We can re-use that and replace P2M[0][256] to point to p2m_identity.
The "old" page (the grafted P2M array provided by Xen) that was at
P2M[0][256] gets put somewhere else. Specifically at P2M[6][358],
b/c when we populate back:
we fill P2M[6][358][0] (and P2M[6][358], P2M[6][359], ...) with
the new MFNs.
That is all OK, except when we revector we assume that the PFN
count would be the same in the grafted P2M array and in the
newly allocated. Since that is no longer the case, as we have
holes in the P2M that point to p2m_missing or p2m_identity we
have to take that into account.
[v2: Check for overflow] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Fri, 17 Aug 2012 13:27:35 +0000 (09:27 -0400)]
xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.
If P2M leaf is completly packed with INVALID_P2M_ENTRY or with
1:1 PFNs (so IDENTITY_FRAME type PFNs), we can swap the P2M leaf
with either a p2m_missing or p2m_identity respectively. The old
page (which was created via extend_brk or was grafted on from the
mfn_list) can be re-used for setting new PFNs.
This also means we can remove git commit: 5bc6f9888db5739abfa0cae279b4b442e4db8049
xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back
which tried to fix this.
and make the amount that is required to be reserved much smaller.
CC: stable@vger.kernel.org # for 3.5 only. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Tue, 14 Aug 2012 20:37:31 +0000 (16:37 -0400)]
xen/mmu: Release just the MFN list, not MFN list and part of pagetables.
We call memblock_reserve for [start of mfn list] -> [PMD aligned end
of mfn list] instead of <start of mfn list> -> <page aligned end of mfn list].
This has the disastrous effect that if at bootup the end of mfn_list is
not PMD aligned we end up returning to memblock parts of the region
past the mfn_list array. And those parts are the PTE tables with
the disastrous effect of seeing this at bootup:
Write protecting the kernel read-only data: 10240k
Freeing unused kernel memory: 1860k freed
Freeing unused kernel memory: 200k freed
(XEN) mm.c:2429:d0 Bad type (saw 1400000000000002 != exp 7000000000000000) for mfn 116a80 (pfn 14e26)
...
(XEN) mm.c:908:d0 Error getting mfn 116a83 (pfn 14e2a) from L1 entry 8000000116a83067 for l1e_owner=0, pg_owner=0
(XEN) mm.c:908:d0 Error getting mfn 4040 (pfn 5555555555555555) from L1 entry 0000000004040601 for l1e_owner=0, pg_owner=0
.. and so on.
Maxim Uvarov [Thu, 9 Aug 2012 15:14:24 +0000 (08:14 -0700)]
x86/nmi: Add new NMI queues to deal with IO_CHK and SERR
In discussions with Thomas Mingarelli about hpwdt, he explained
to me some issues they were some when using their virtual NMI
button to test the hpwdt driver.
It turns out the virtual NMI button used on HP's machines do no
send unknown NMIs but instead send IO_CHK NMIs. The way the
kernel code is written, the hpwdt driver can not register itself
against that type of NMI and therefore can not successfully
capture system information before panic'ing.
To solve this I created two new NMI queues to allow driver to
register against the IO_CHK and SERR NMIs. Or in the hpwdt all
three (if you include unknown NMIs too).
The change is straightforward and just mimics what the unknown
NMI does.
Reported-and-tested-by: Thomas Mingarelli <thomas.mingarelli@hp.com> Signed-off-by: Don Zickus <dzickus@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1333051877-15755-3-git-send-email-dzickus@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
Don Zickus [Fri, 30 Sep 2011 19:06:20 +0000 (15:06 -0400)]
x86, nmi: Create new NMI handler routines
The NMI handlers used to rely on the notifier infrastructure. This worked
great until we wanted to support handling multiple events better.
One of the key ideas to the nmi handling is to process _all_ the handlers for
each NMI. The reason behind this switch is because NMIs are edge triggered.
If enough NMIs are triggered, then they could be lost because the cpu can
only latch at most one NMI (besides the one currently being processed).
In order to deal with this we have decided to process all the NMI handlers
for each NMI. This allows the handlers to determine if they recieved an
event or not (the ones that can not determine this will be left to fend
for themselves on the unknown NMI list).
As a result of this change it is now possible to have an extra NMI that
was destined to be received for an already processed event. Because the
event was processed in the previous NMI, this NMI gets dropped and becomes
an 'unknown' NMI. This of course will cause printks that scare people.
However, we prefer to have extra NMIs as opposed to losing NMIs and as such
are have developed a basic mechanism to catch most of them. That will be
a later patch.
To accomplish this idea, I unhooked the nmi handlers from the notifier
routines and created a new mechanism loosely based on doIRQ. The reason
for this is the notifier routines have a couple of shortcomings. One we
could't guarantee all future NMI handlers used NOTIFY_OK instead of
NOTIFY_STOP. Second, we couldn't keep track of the number of events being
handled in each routine (most only handle one, perf can handle more than one).
Third, I wanted to eventually display which nmi handlers are registered in
the system in /proc/interrupts to help see who is generating NMIs.
The patch below just implements the new infrastructure but doesn't wire it up
yet (that is the next patch). Its design is based on doIRQ structs and the
atomic notifier routines. So the rcu stuff in the patch isn't entirely untested
(as the notifier routines have soaked it) but it should be double checked in
case I copied the code wrong.
"Historically, Linux has tried to make the regular timer tick on the
various CPUs not happen at the same time, to avoid contention on
xtime_lock.
Nowadays, with the tickless kernel, this contention no longer happens
since time keeping and updating are done differently. In addition,
this skew is actually hurting power consumption in a measurable way on
many-core systems."
Problems:
- Contrary to the above, systems do encounter contention on both
xtime_lock and RCU structure locks when the tick is synchronized.
- Moderate sized RT systems suffer intolerable jitter due to the tick
being synchronized.
- SGI reports the same for their large systems.
- Fully utilized systems reap no power saving benefit from skew removal,
but do suffer from resulting induced lock contention.
- 0209f649 rcu: limit rcu_node leaf-level fanout
This patch was born to combat lock contention which testing showed
to have been _induced by_ skew removal. Skew the tick, contention
disappeared virtually completely.
Dimitri Sivanich [Fri, 10 Aug 2012 11:46:56 +0000 (04:46 -0700)]
vfs: fix panic in __d_lookup() with high dentry hashtable counts
When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup(). Fix this in dcache_init() and similar areas.
Signed-off-by: Dimitri Sivanich <sivanich@sgi.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
Jack Steiner [Fri, 10 Aug 2012 11:45:33 +0000 (04:45 -0700)]
cpusets: randomize node rotor used in cpuset_mem_spread_node()
Some workloads that create a large number of small files tend to assign
too many pages to node 0 (multi-node systems). Part of the reason is that
the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
node 0 for newly created tasks.
This patch changes the rotor to be initialized to a random node number of
the cpuset.
[akpm@linux-foundation.org: fix layout]
[Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration] Signed-off-by: Jack Steiner <steiner@sgi.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
Jack Steiner [Fri, 10 Aug 2012 11:44:37 +0000 (04:44 -0700)]
x86: Reduce clock calibration time during slave cpu startup
Reduce the startup time for slave cpus.
Adds hooks for an arch-specific function for clock calibration.
These hooks are used on x86. If a newly started cpu has the
same phys_proc_id as a core already active, uses the TSC for the
delay loop and has a CONSTANT_TSC, use the already-calculated
value of loops_per_jiffy.
This patch reduces the time required to start slave cpus on a
4096 cpu system from: 465 sec OLD 62 sec NEW
This reduces boot time on a 4096p system by almost 7 minutes.
Nice...
Signed-off-by: Jack Steiner <steiner@sgi.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: John Stultz <john.stultz@linaro.org>
[fix CONFIG_SMP=n build] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu>
Conflicts:
Mike Travis [Fri, 10 Aug 2012 11:40:36 +0000 (04:40 -0700)]
x86, pci: Increase the number of iommus supported to be MAX_IO_APICS
The number of IOMMUs supported should be the same as the number of IO APICS.
This limit comes into play when the IOMMUs are identity mapped, thus the
number of possible IOMMUs in the "static identity" (si) domain should be
this same number.
Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Jack Steiner <steiner@sgi.com>
Mike Travis [Fri, 10 Aug 2012 11:37:48 +0000 (04:37 -0700)]
x86 PCI: Fix identity mapping for sandy bridge
With SandyBridge, Intel has changed these Socket PCI devices to have a class
type of "System Peripheral" & "Performance counter", rather than "HostBridge".
So instead of using a "special" case to detect which devices will not be
doing DMA, use the fact that a device that is not associated with an IOMMU,
will not need an identity map.
Signed-off-by: Mike Travis <travis@sgi.com> Signed-off-by: Mike Habeck <habeck@sgi.com>
Joe Jin [Fri, 20 Jul 2012 13:30:51 +0000 (21:30 +0800)]
[scsi] hpsa: add all support devices for ol5
Orabug: 14106006
To support uek2 on ol5, commit 29a8828 disable some devices from support list,
this made ovm3 upgrade from 3.0.3 to 3.1.1 failed to addressed local disk for
disk device name changed.
If kernel run as ovm3.1.1 dom0 kernel, please pass cciss_allow_hpsa=1 when
load cciss driver, for ol5.
Adnan Misherfi [Thu, 2 Aug 2012 20:17:44 +0000 (16:17 -0400)]
Disable VLAN 0 tagging for none VLAN traffic
Orabug: 14406424
Cisco enic driver on UCS blades tags a None VLAN traffic with VLAN 0, this causes VMs
that do not have the kernel patch " VLAN 0 should be treated as no vlan tag" to drop all
receive traffic as these VMs do not know how to deal with the VLAN 0 tag.
This is also a problem for older VMs that can not take the mentioned patch.
This fix disables the enic driver from tagging a None VLAN traffic with VLAN 0.This
fix is controlled by a driver parameters " disable_vlan0". the default value is disable_vlan0=1
which to disable the driver from tagging traffic with VLAN 0. To revert to original behavior
add "options enic disable_vlan0=0" to /etc/modprobe.con
Jeff Mahoney [Thu, 2 Aug 2012 12:04:00 +0000 (05:04 -0700)]
dl2k: Clean up rio_ioctl
Orabug: 14126896
The dl2k driver's rio_ioctl call has a few issues:
- No permissions checking
- Implements SIOCGMIIREG and SIOCGMIIREG using the SIOCDEVPRIVATE numbers
- Has a few ioctls that may have been used for debugging at one point
but have no place in the kernel proper.
This patch removes all but the MII ioctls, renumbers them to use the
standard ones, and adds the proper permission check for SIOCSMIIREG.
We can also get rid of the dl2k-specific struct mii_data in favor of
the generic struct mii_ioctl_data.
Since we have the phyid on hand, we can add the SIOCGMIIPHY ioctl too.
Most of the MII code for the driver could probably be converted to use
the generic MII library but I don't have a device to test the results.
This fixes: CVE-2012-2313
Reported-by: Stephan Mueller <stephan.mueller@atsec.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com> Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Jeff Mahoney [Thu, 2 Aug 2012 12:04:00 +0000 (05:04 -0700)]
dl2k: Clean up rio_ioctl
Orabug: 14126896
The dl2k driver's rio_ioctl call has a few issues:
- No permissions checking
- Implements SIOCGMIIREG and SIOCGMIIREG using the SIOCDEVPRIVATE numbers
- Has a few ioctls that may have been used for debugging at one point
but have no place in the kernel proper.
This patch removes all but the MII ioctls, renumbers them to use the
standard ones, and adds the proper permission check for SIOCSMIIREG.
We can also get rid of the dl2k-specific struct mii_data in favor of
the generic struct mii_ioctl_data.
Since we have the phyid on hand, we can add the SIOCGMIIPHY ioctl too.
Most of the MII code for the driver could probably be converted to use
the generic MII library but I don't have a device to test the results.
This fixes: CVE-2012-2313
Reported-by: Stephan Mueller <stephan.mueller@atsec.com> Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Maxim Uvarov <maxim.uvarov@oracle.com>
This is not for upstream as it memblock_x86_reserve_range is not
used upstream anymore.
When I back-ported the patches:
xen/x86: Use memblock_reserve for sensitive areas.
xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
I simply used sed s/memblock_reserve/memblock_x86_reserve_range/.
That was incorrect as the parameters are different - memblock_reserve
as second expects the size, while memblock_x86_reserve_range expects
the physical address. This patch fixes those bugs.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Merge branch 'stable/for-linus-3.7.rebased' into uek2-merge
* stable/for-linus-3.7.rebased:
xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back.
xen/mmu: Remove from __ka space PMD entries for pagetables.
xen/mmu: Copy and revector the P2M tree.
xen/p2m: Add logic to revector a P2M tree to use __va leafs.
xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
xen/mmu: For 64-bit do not call xen_map_identity_early
xen/mmu: use copy_page instead of memcpy.
xen/mmu: Provide comments describing the _ka and _va aliasing issue
xen/mmu: The xen_setup_kernel_pagetable doesn't need to return anything.
xen/x86: Use memblock_reserve for sensitive areas.
xen/p2m: Fix the comment describing the P2M tree.
xen/perf: Define .glob for the different hypercalls.
We then try to populate those pages back. In the P2M tree however
the space for those leafs must be reserved - as such we use extend_brk.
We reserve 8MB of _brk space, which means we can fit over 1048576 PFNs - which is more than we should ever need.
[v1: Made it 8MB of _brk space instead of 4MB per Jan's suggestion] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 99266871de5006ba7ad0bfece6bb283ede4094b9)
xen/mmu: Remove from __ka space PMD entries for pagetables.
Please first read the description in "xen/mmu: Copy and revector the
P2M tree."
At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).
The xen_remove_p2m_tree and code around has ripped out the __ka for
the old P2M array.
Here we continue on doing it to where the Xen page-tables were.
It is safe to do it, as the page-tables are addressed using __va.
For good measure we delete anything that is within MODULES_VADDR
and up to the end of the PMD.
At this point the __ka only contains PMD entries for the start
of the kernel up to __brk.
[v1: Per Stefano's suggestion wrapped the MODULES_VADDR in debug] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit 4e928e1a48b6b76e0b8384160213a32d03197e4b)
Please first read the description in "xen/p2m: Add logic to revector a
P2M tree to use __va leafs" patch.
The 'xen_revector_p2m_tree()' function allocates a new P2M tree
copies the contents of the old one in it, and returns the new one.
At this stage, the __ka address space (which is what the old
P2M tree was using) is partially disassembled. The cleanup_highmap
has removed the PMD entries from 0-16MB and anything past _brk_end
up to the max_pfn_mapped (which is the end of the ramdisk).
We have revectored the P2M tree (and the one for save/restore as well)
to use new shiny __va address to new MFNs. The xen_start_info
has been taken care of already in 'xen_setup_kernel_pagetable()' and
xen_start_info->shared_info in 'xen_setup_shared_info()', so
we are free to roam and delete PMD entries - which is exactly what
we are going to do. We rip out the __ka for the old P2M array.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
As can be seen, the ramdisk, P2M and pagetables are taking
a bit of __ka addresses space. Which is a problem since the
MODULES_VADDR starts at 0xffffffffa0000000 - and P2M sits
right in there! This results during bootup with the inability to
load modules, with this error:
Since the __va and __ka are 1:1 up to MODULES_VADDR and
cleanup_highmap rids __ka of the ramdisk mapping, what
we want to do is similar - get rid of the P2M in the __ka
address space. There are two ways of fixing this:
1) All P2M lookups instead of using the __ka address would
use the __va address. This means we can safely erase from
__ka space the PMD pointers that point to the PFNs for
P2M array and be OK.
2). Allocate a new array, copy the existing P2M into it,
revector the P2M tree to use that, and return the old
P2M to the memory allocate. This has the advantage that
it sets the stage for using XEN_ELF_NOTE_INIT_P2M
feature. That feature allows us to set the exact virtual
address space we want for the P2M - and allows us to
boot as initial domain on large machines.
So we pick option 2).
This patch only lays the groundwork in the P2M code. The patch
that modifies the MMU is called "xen/mmu: Copy and revector the P2M tree."
xen/mmu: Recycle the Xen provided L4, L3, and L2 pages
As we are not using them. We end up only using the L1 pagetables
and grafting those to our page-tables.
[v1: Per Stefano's suggestion squashed two commits]
[v2: Per Stefano's suggestion simplified loop] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts:
Acked-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
(cherry picked from commit cbc09be35990fb3d15671507f11c3e90479ef816)
xen/mmu: Provide comments describing the _ka and _va aliasing issue
Which is that the level2_kernel_pgt (__ka virtual addresses)
and level2_ident_pgt (__va virtual address) contain the same
PMD entries. So if you modify a PTE in __ka, it will be reflected
in __va (and vice-versa).
xen/x86: Use memblock_reserve for sensitive areas.
instead of a big memblock_reserve. This way we can be more
selective in freeing regions (and it also makes it easier
to understand where is what).
[v1: Move the auto_translate_physmap to proper line]
[v2: Per Stefano suggestion add more comments] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[upstream git commit 91addbf07abfdd109a9da4e02061e6ed3728b298]
Conflicts: