Merge branch 'topic/uek-4.1/sparc' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/sparc' of git://ca-git.us.oracle.com/linux-uek: (25 commits)
lib/iommu-common.c: do not use 0xffffffffffffffffl for computing align_mask
sparc64: use ENTRY/ENDPROC in VISsave
SPARC64: PORT LDOMS TO UEK4
Fix incorrect ASI_ST_BLKINIT_MRU_S value
sparc64: perf: Use UREG_FP rather than UREG_I6
sparc64: perf: Add sanity checking on addresses in user stack
sparc64: Convert BUG_ON to warning
sparc: perf: Disable pagefaults while walking userspace stacks
sparc: time: Replace update_persistent_clock() with CONFIG_RTC_SYSTOHC
PCI: Set under_pref for mem64 resource of pcie device
sparc/PCI: Add mem64 resource parsing for root bus
PCI: Add pci_bus_addr_t
sparc64: Fix userspace FPU register corruptions.
sparc64: using 2048 as default for number of CPUS (cherry picked from commit 578ddb2512a5c908cd17ef8cbc43ff78dd399afd)
sparc64: iommu-common build error fix (cherry picked from commit accb4c6276793b991c6382bf57a58b40ea17eb11)
sparc64: fix Setup sysfs to mark LDOM sockets build error (cherry picked from commit 59be02427bfcac6c904ddd1374c35d63155b82d4)
sparc64: mmap fixed and shared
sparc64: restore TIF_FREEZE flag for sparc
sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly
sparc: Revert generic IOMMU allocator.
...
Merge branch 'topic/uek-4.1/dtrace' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/dtrace' of git://ca-git.us.oracle.com/linux-uek:
kallsyms: unbreak kallmodsyms after CONFIG_KALLMODSYMS addition
kallsyms: de-ifdef kallmodsyms
dtrace: use syscall_get_nr() to obtain syscall number
Merge branch 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek:
add OCFS2_LOCK_RECURSIVE arg_flags to ocfs2_cluster_lock() to prevent hang
ocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dio
ocfs2_iop_set/get_acl() are also called from the VFS so we must take inode lock
BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed) tripped in ocfs2_ci_checkpointed
Merge branch 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
NVMe: Setup max hardware sector count to 512KB
intel_pstate: enable HWP per CPU
Merge branch '4.1_sparc' of git://ca-git.us.oracle.com/linux-uek-apais into topic/uek-4.1/sparc
* '4.1_sparc' of git://ca-git.us.oracle.com/linux-uek-apais: (25 commits)
lib/iommu-common.c: do not use 0xffffffffffffffffl for computing align_mask
sparc64: use ENTRY/ENDPROC in VISsave
SPARC64: PORT LDOMS TO UEK4
Fix incorrect ASI_ST_BLKINIT_MRU_S value
sparc64: perf: Use UREG_FP rather than UREG_I6
sparc64: perf: Add sanity checking on addresses in user stack
sparc64: Convert BUG_ON to warning
sparc: perf: Disable pagefaults while walking userspace stacks
sparc: time: Replace update_persistent_clock() with CONFIG_RTC_SYSTOHC
PCI: Set under_pref for mem64 resource of pcie device
sparc/PCI: Add mem64 resource parsing for root bus
PCI: Add pci_bus_addr_t
sparc64: Fix userspace FPU register corruptions.
sparc64: using 2048 as default for number of CPUS (cherry picked from commit 578ddb2512a5c908cd17ef8cbc43ff78dd399afd)
sparc64: iommu-common build error fix (cherry picked from commit accb4c6276793b991c6382bf57a58b40ea17eb11)
sparc64: fix Setup sysfs to mark LDOM sockets build error (cherry picked from commit 59be02427bfcac6c904ddd1374c35d63155b82d4)
sparc64: mmap fixed and shared
sparc64: restore TIF_FREEZE flag for sparc
sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly
sparc: Revert generic IOMMU allocator.
...
Sowmini Varadhan [Thu, 6 Aug 2015 22:46:39 +0000 (15:46 -0700)]
lib/iommu-common.c: do not use 0xffffffffffffffffl for computing align_mask
Using a 64 bit constant generates "warning: integer constant is too
large for 'long' type" on 32 bit platforms. Instead use ~0ul and
BITS_PER_LONG.
Detected by Andrew Morton on ARMD.
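A minimal userspace sketch of the idea, assuming the mask is derived from an alignment order. The names align_mask_for_order and SKETCH_BITS_PER_LONG are illustrative stand-ins, not the identifiers used in lib/iommu-common.c:

```c
#include <limits.h>

/* Stand-in for the kernel's BITS_PER_LONG (assumption for this sketch). */
#define SKETCH_BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/*
 * Build a mask with the low align_order bits set, using only
 * unsigned long arithmetic so that no 64-bit literal is needed.
 */
unsigned long align_mask_for_order(unsigned int align_order)
{
	if (align_order == 0)
		return 0;
	if (align_order >= SKETCH_BITS_PER_LONG)
		return ~0ul;	/* avoid shifting by a negative amount */
	return ~0ul >> (SKETCH_BITS_PER_LONG - align_order);
}
```

Because ~0ul always has the width of unsigned long, the same expression compiles cleanly on 32-bit and 64-bit targets.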
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David S. Miller <davem@davemloft.net> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 447f6a95a9c80da7faaec3e66e656eab8f262640) Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 73958c651fbf70d8d8bf2a60b871af5f7a2e3199) Signed-off-by: Allen Pais <allen.pais@oracle.com>
The Linux in-box NVMe driver does not handle an MDTS of 0 as expected:
• An MDTS of 0 means the drive can accept any request size.
• The device driver sets the max hardware sector size to
BLK_SAFE_MAX_SECTORS, or 124KB.
• Every IO larger than 124KB is split into 124KB requests plus a remainder.
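To make the cost of the small limit concrete, here is a rough sketch of the splitting arithmetic, using the 124KB figure from the text and the 512KB figure from the commit title. The helper names are hypothetical and this is not the block layer's actual splitting code:

```c
/*
 * Number of requests a single I/O turns into when request size is
 * capped at limit_kb. Returns 0 for invalid input.
 */
unsigned int split_count(unsigned int io_kb, unsigned int limit_kb)
{
	if (io_kb == 0 || limit_kb == 0)
		return 0;
	return (io_kb + limit_kb - 1) / limit_kb;	/* round up */
}

/* Size of the trailing request, or 0 when the I/O divides evenly. */
unsigned int split_remainder_kb(unsigned int io_kb, unsigned int limit_kb)
{
	if (limit_kb == 0)
		return 0;
	return io_kb % limit_kb;
}
```

Under these assumptions a 512KB I/O becomes five requests (four of 124KB plus a 16KB remainder) with the old limit, and a single request with a 512KB limit.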
David Ahern [Mon, 15 Jun 2015 20:15:45 +0000 (16:15 -0400)]
sparc64: perf: Add sanity checking on addresses in user stack
Processes are getting killed (sigbus or segv) while walking userspace
callchains when using perf. In some instances I have seen ufp = 0x7ff
which does not seem like a proper stack address.
This patch adds a function to run validity checks against the address
before attempting the copy_from_user. The checks are copied from the
x86 version as a start point with the addition of a 4-byte alignment
check.
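The kind of check described above might look like the following userspace sketch. The function name and the user_limit parameter are assumptions for illustration; the kernel patch uses its own helpers and address-space limit:

```c
#include <stdint.h>

/*
 * Return 1 if [fp, fp + size) looks like a plausible user stack range,
 * 0 otherwise. user_limit is a hypothetical upper bound on user addresses.
 */
int valid_user_frame_sketch(uintptr_t fp, uintptr_t size, uintptr_t user_limit)
{
	/* stack frame addresses should be at least 4-byte aligned */
	if (fp & 3)
		return 0;
	/* reject ranges that wrap around or extend past the user limit */
	if (size > user_limit || fp > user_limit - size)
		return 0;
	return 1;
}
```

With this check, the ufp = 0x7ff value mentioned above is rejected on alignment alone, before any copy is attempted.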
David Ahern [Mon, 15 Jun 2015 20:15:44 +0000 (16:15 -0400)]
sparc64: Convert BUG_ON to warning
Pagefault handling has a BUG_ON path that panics the system. Convert it to
a warning instead. There is no need to bring down the system for this kind
of failure.
The following was hit while running:
perf sched record -g -- make -j 16
David Ahern [Mon, 15 Jun 2015 20:15:43 +0000 (16:15 -0400)]
sparc: perf: Disable pagefaults while walking userspace stacks
Page faults generated walking userspace stacks can call schedule to switch
out the task. When collecting callchains for scheduler tracepoints this
causes a deadlock as the tracepoints can be hit with the runqueue lock held:
[ 8138.159054] WARNING: CPU: 758 PID: 12488 at /opt/dahern/linux.git/arch/sparc/kernel/nmi.c:80 perfctr_irq+0x1f8/0x2b4()
[ 8138.203152] Watchdog detected hard LOCKUP on cpu 758
All of the bridges' 64-bit resources have the pref bit set, but the device
resource does not, so we cannot find a parent for the device resource,
because non-pref mem cannot be placed under pref mem.
According to the PCIe spec errata
https://www.pcisig.com/specifications/pciexpress/base2/PCIe_Base_r2.1_Errata_08Jun10.pdf
page 13, in some cases it is OK to mark such a resource as pref.
Only set pref for 64-bit MMIO when the entire path from the host to the adapter is
over PCI Express.
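The "entire path is PCI Express" condition can be sketched as a walk up the bridge chain. The struct and function names below are hypothetical simplifications, not the PCI core's data structures:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device node: only what the sketch needs. */
struct sketch_dev {
	struct sketch_dev *parent;	/* upstream bridge, NULL at the host */
	bool is_pcie;
};

/* True only if the device and every bridge above it is PCI Express. */
bool path_is_all_pcie(const struct sketch_dev *dev)
{
	for (; dev; dev = dev->parent)
		if (!dev->is_pcie)
			return false;
	return true;
}

/* Mark a 64-bit MMIO resource prefetchable only on an all-PCIe path. */
bool may_set_pref(const struct sketch_dev *dev, bool res_is_mem64)
{
	return res_is_mem64 && dev != NULL && path_is_all_pcie(dev);
}
```

A single conventional PCI bridge anywhere between the host and the adapter is enough to leave the resource non-prefetchable in this sketch.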
The problem is that sparc64 assumed that dma_addr_t only needed to hold DMA
addresses, i.e., bus addresses returned via the DMA API (dma_map_single(),
etc.), while the PCI core assumed dma_addr_t could hold *any* bus address,
including raw BAR values. On sparc64, all DMA addresses fit in 32 bits, so
dma_addr_t is a 32-bit type. However, BAR values can be 64 bits wide, so
they don't fit in a dma_addr_t. d63e2e1f3df9 added new checking that
tripped over this mismatch.
Add pci_bus_addr_t, which is wide enough to hold any PCI bus address,
including both raw BAR values and DMA addresses. This will be 64 bits
on 64-bit platforms and on platforms with a 64-bit dma_addr_t. Then
dma_addr_t only needs to be wide enough to hold addresses from the DMA API.
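The width mismatch can be illustrated in userspace with two stand-in typedefs. These are assumptions for the sketch, not the kernel's definitions, which depend on the configuration:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t sketch_dma_addr_t;	/* 32-bit, as described for sparc64 */
typedef uint64_t sketch_pci_bus_addr_t;	/* wide enough for any BAR value */

/* Does a raw BAR value survive a round trip through the 32-bit type? */
bool bar_fits_dma_addr(sketch_pci_bus_addr_t bar)
{
	return (sketch_pci_bus_addr_t)(sketch_dma_addr_t)bar == bar;
}
```

Any BAR value at or above 4GB is truncated by the narrower type, which is the class of mismatch the new checking tripped over.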
If we have a series of events from userspace, with %fprs=FPRS_FEF,
as follows:
ETRAP
ETRAP
VIS_ENTRY(fprs=0x4)
VIS_EXIT
RTRAP (kernel FPU restore with fpu_saved=0x4)
RTRAP
We will not restore the user registers that were clobbered by the FPU
using kernel code in the inner-most trap.
Traps allocate FPU save slots in the thread struct, and FPU using
sequences save the "dirty" FPU registers only.
This works at the initial trap level because all of the registers
get recorded into the top-level FPU save area, and we'll return
to userspace with the FPU disabled so that any FPU use by the user
will take an FPU disabled trap wherein we'll load the registers
back up properly.
But this is not how trap returns from kernel to kernel operate.
The simplest fix for this bug is to always save all FPU register state
for anything other than the top-most FPU save area.
Getting rid of the optimized inner-slot FPU saving code ends up
making VISEntryHalf degenerate into plain VISEntry.
Longer term we need to do something smarter to reinstate the partial
save optimizations. Perhaps the fundamental error is having trap entry
and exit allocate FPU save slots and restore register state. Instead,
the VISEntry et al. calls should be doing that work.
This bug is about two decades old.
Reported-by: James Y Knight <jyknight@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b75513b0f1c734b1e084a6e9952ea6260d4724e3)
bob picco [Thu, 25 Jun 2015 00:10:18 +0000 (17:10 -0700)]
sparc64: mmap fixed and shared
Older sparc64 must have a VAC because there is concern that mmapping fixed
and shared with incorrect align would cause cache aliases. To my knowledge
this is not an issue for sun4v. I will eventually research this.
The patch appears required for uek4 too.
We will enforce the rigid alignment condition only for tlb_type != hypervisor.
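A sketch of the resulting rule, assuming 8KB base pages and a 16KB aliasing boundary. Both constants, the enum, and the function name are illustrative assumptions rather than the values from the patch:

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 13UL		/* assumed 8KB base pages */
#define SKETCH_SHMLBA     (1UL << 14)	/* hypothetical alias boundary */

enum sketch_tlb_type { SKETCH_CHEETAH, SKETCH_HYPERVISOR };

/*
 * For a MAP_FIXED | MAP_SHARED request, decide whether addr is acceptable.
 * The alignment rule is applied only when not running on the hypervisor
 * (sun4v) TLB type.
 */
bool fixed_shared_mapping_ok(enum sketch_tlb_type tlb_type,
			     unsigned long addr, unsigned long pgoff)
{
	if (tlb_type == SKETCH_HYPERVISOR)
		return true;
	return ((addr - (pgoff << SKETCH_PAGE_SHIFT)) & (SKETCH_SHMLBA - 1)) == 0;
}
```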
sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly
The current sparc kernel has no representation for sockets though tools
like lscpu can pull this from sysfs. This patch walks the machine
description cache and socket hierarchy and marks sockets as well as cores
and threads such that a representative sysfs is created by
drivers/base/topology.c.
Before this patch:
$ lscpu
Architecture: sparc64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Big Endian
CPU(s): 1024
On-line CPU(s) list: 0-1023
Thread(s) per core: 8
Core(s) per socket: 1 <--- wrong
Socket(s): 128 <--- wrong
NUMA node(s): 4
NUMA node0 CPU(s): 0-255
NUMA node1 CPU(s): 256-511
NUMA node2 CPU(s): 512-767
NUMA node3 CPU(s): 768-1023
After this patch:
$ lscpu
Architecture: sparc64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Big Endian
CPU(s): 1024
On-line CPU(s) list: 0-1023
Thread(s) per core: 8
Core(s) per socket: 32
Socket(s): 4
NUMA node(s): 4
NUMA node0 CPU(s): 0-255
NUMA node1 CPU(s): 256-511
NUMA node2 CPU(s): 512-767
NUMA node3 CPU(s): 768-1023
Most of this patch was done by Chris with updates by David.
Signed-off-by: Chris Hyser <chris.hyser@oracle.com> Signed-off-by: David Ahern <david.ahern@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit acc455cffa75070d55e74fc7802b49edbc080e92)
Conflicts:
arch/sparc/include/asm/cpudata_64.h
arch/sparc/kernel/mdesc.c
arch/sparc/kernel/smp_64.c Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit bd1039234cf41d0afd35f8e9a302eac9c344d18d)
Allen Pais [Wed, 7 Jan 2015 12:36:22 +0000 (18:06 +0530)]
sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly.
The current sparc kernel has no representation for sockets (i.e. a 3rd level
cache shared by cores) though tools like lscpu can pull this from sysfs. This
patch walks the LDOM MD (machine description) cache hierarchy structure and
marks sockets as well as cores and threads such that a representative sysfs is
created by drivers/base/topology.c.
Allen Pais [Fri, 2 Jan 2015 05:47:00 +0000 (11:17 +0530)]
sparc64: prevent solaris control domain warnings about Domain Service handles
Solaris created its own protocol on top of domain service registration. This
matters because the control domain that linux is talking to is Solaris. The
hypervisor specs say that the handle used for service identification is simply
an opaque 64 bit number. The only constraint is that a handle never be used
twice (within a reasonable time frame) to prevent connection to a prior stale
registered handle. Solaris on the other hand reserves the bit 0x80000000 to
indicate what it calls client registration requests. These registration requests
are sent to the guest domain to prod it to send its own registration requests to
the control domain.
When a guest (linux in this case) sends its own registration requests with this
bit set, Solaris assumes that these come from clients running in the guest that
should not do this since there can only be one control domain. Linux not
knowing this uses the top 32 bits as a quick lookup index and sets the bottom 32
bits based off jiffies. Of course there are times when a handle is constructed
with the Solaris client bit not set and everything appears to work correctly
with no errors or warnings and times when the client bit is set and everything
works except the Solaris kernel puts a bunch of warnings into its dmesg buffer.
The fix is literally 1 character, changing the mask used to grab the bottom 32
bits of sched_clock() (jiffy based) to use only the bottom 31 bits. Halving the
roll-over time should not be an issue. Worst case, additional jiffy bits can be
shifted into the upper 32 bits of the handle.
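The handle layout described above can be sketched as follows. The function name and the macro are illustrative; only the bit positions come from the text:

```c
#include <stdint.h>

/* Bit that Solaris reserves for client registration requests, per the text. */
#define SOLARIS_CLIENT_BIT 0x80000000ull

/*
 * Upper 32 bits: lookup index. Lower 32 bits: clock value masked to
 * 31 bits so the Solaris client bit is never set.
 */
uint64_t make_ds_handle(uint32_t index, uint64_t clock)
{
	return ((uint64_t)index << 32) | (clock & 0x7fffffffull);
}
```

With the previous 32-bit mask (0xffffffff), bit 31 of the clock value would leak into the handle roughly half of the time, which matches the intermittent warnings described.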
Domain service registration intermittently fails. Though it uses "reliable"
LDC communication, this only guarantees the data, not delivery. Analysis
indicated a timing issue that varies between boots. The LDOM domain service
architecture is now sufficiently complicated that packets (domain service
registration requests in this case) do apparently get lost, the symptom
being that neither an ACK nor a NACK is received for the initial service
registration request.
This patch uses a timer to retry, with a delay, up to N (currently 5) times
any requests that went unacknowledged, positively or negatively, before
reporting a failed registration attempt. Using a timer with a callback allows
early boot to progress as normal instead of spinning in a loop. It also cleans
up ./scripts/checkpatch.pl warnings and errors in ds.c.
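The retry policy can be sketched as a small state machine driven from a timer callback. All names here are hypothetical and the real ds.c code is organized differently:

```c
#include <stdbool.h>

#define SKETCH_MAX_RETRIES 5	/* N from the text */

enum reg_action { REG_DONE, REG_RESEND, REG_FAILED };

struct reg_state {
	int  retries;	/* resends issued so far */
	bool answered;	/* an ACK or a NACK has been received */
};

/* Called from the (hypothetical) timer callback each time the delay expires. */
enum reg_action reg_timer_tick(struct reg_state *st)
{
	if (st->answered)
		return REG_DONE;	/* nothing left to do */
	if (st->retries >= SKETCH_MAX_RETRIES)
		return REG_FAILED;	/* report the failed registration */
	st->retries++;
	return REG_RESEND;		/* resend and re-arm the timer */
}
```

Because each tick returns immediately, the caller never busy-waits, which is the property that lets early boot continue.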
Allen Pais [Fri, 2 Jan 2015 05:18:41 +0000 (10:48 +0530)]
sparc64: __init code no longer called during non __init
mdesc_update was calling __init memory free code through a pointer at
non-init time. Since the text page had already been given back and reused,
this resulted in an illegal instruction trap. It was not caught by
linker section mismatch checks due to pointer indirection.
This patch NULLs out the mops pointer after __init time and then
checks for non-NULL before calling mops->free.
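The pattern can be sketched in plain C as below. The struct, the counter, and the function names are stand-ins for illustration; only the name mops and the free member come from the text:

```c
#include <stddef.h>

struct sketch_mem_ops {
	void (*free)(void *p);
};

int init_free_calls;	/* counts calls into the init-only routine */

/* Stand-in for a routine that lives in init memory. */
void init_only_free(void *p)
{
	(void)p;
	init_free_calls++;
}

struct sketch_mem_ops boot_ops = { .free = init_only_free };
struct sketch_mem_ops *mops = &boot_ops;

/* Run once, just before init memory is released. */
void sketch_init_done(void)
{
	mops = NULL;
}

/* Returns 1 if the free routine was invoked, 0 if it was skipped. */
int sketch_release(void *p)
{
	if (mops && mops->free) {
		mops->free(p);
		return 1;
	}
	return 0;
}
```

After sketch_init_done() runs, later callers skip the call instead of jumping into memory that no longer holds the routine.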
Signed-off-by: Chris Hyser <chris.hyser@oracle.com> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Acked-by: Bob Picco <bob.picco@oracle.com> Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 6dbae4a0137d7855472c4845b5db11cffa32efc1)
(cherry picked from commit f0673a413f04de21963ab7f3912eb9a84c52c66e)
ocfs2_setattr called by chmod command holds cluster wide inode lock
(Orabug 21685187) when calling posix_acl_chmod. This
latter function in turn calls ocfs2_iop_get_acl and ocfs2_iop_set_acl.
These two are also called directly from vfs layer for getfacl/setfacl
commands and therefore acquire the cluster wide inode lock. If a remote
conversion request comes after the first inode lock in ocfs2_setattr,
OCFS2_LOCK_BLOCKED will be set in l_flags. This will cause the second
call to inode lock from the ocfs2_iop_get|set_acl() to block indefinitely.
The new flag OCFS2_LOCK_RECURSIVE will be used to prevent this blocking.
HWP previously was only enabled at driver load time, on the boot
CPU. However, HWP must be enabled per package. Move the code to
enable HWP to the cpufreq driver init path so that it will be
called per CPU.
Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com> Tested-by: David Zhuang <david.zhuang@oracle.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ba88d4338f226766f510e207911dde8c1875e072) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/cpufreq/intel_pstate.c Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Ryan Ding [Mon, 7 Sep 2015 05:38:00 +0000 (13:38 +0800)]
ocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dio
ocfs2_file_write_iter() is using the wrong return value ('written'). This
will cause ocfs2_rw_unlock() to be called both in write_iter & end_io,
triggering a BUG_ON.
This issue was introduced by commit 7da839c47589 ("ocfs2: use
__generic_file_write_iter()").
Orabug: 21612107 Fixes: 7da839c47589 ("ocfs2: use __generic_file_write_iter()") Signed-off-by: Ryan Ding <ryan.ding@oracle.com> Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit aa1057b3dec478b20c77bad07442318ae36d893c)
This bug in mainline code was pointed out by Mark Fasheh. When ocfs2_iop_set_acl
and ocfs2_iop_get_acl are entered from the VFS layer, the inode lock is not held.
This seems to be a regression from older kernels. The patch fixes that.
Some time ago, the chmod command had been executed. As a result, the following call
chain left the inode cluster lock in PR state, later on causing the assert.
system_call_fastpath
-> my_chmod
-> sys_chmod
-> sys_fchmodat
-> notify_change
-> ocfs2_setattr
-> posix_acl_chmod
-> ocfs2_iop_set_acl
-> ocfs2_set_acl
-> ocfs2_acl_set_mode
Here is how.
1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
1120 {
1247 ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
..
1258 if (!status && attr->ia_valid & ATTR_MODE) {
1259 status = posix_acl_chmod(inode, inode->i_mode);
224 int ocfs2_set_acl(handle_t *handle,
225 struct inode *inode, ...
231 {
..
252 ret = ocfs2_acl_set_mode(inode, di_bh,
253 handle, mode);
168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
170 {
183 if (handle == NULL) {
>>> BUG: inode lock not held in ex at this point <<<
184 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
185 OCFS2_INODE_UPDATE_CREDITS);
At ocfs2_setattr.#1247 we unlock and at #1259 call posix_acl_chmod. When we reach
ocfs2_acl_set_mode.#181 and do trans, the inode cluster lock is not held in EX
mode (it should be). How could this have happened?
We are the lock master, were holding the lock in EX and have released it in
ocfs2_setattr.#1247. Note that there are no holders of this lock at
this point. Another node needs the lock in PR, and we downconvert from
EX to PR. So the inode lock is PR when we do the trans in
ocfs2_acl_set_mode.#184. The trans stays in core (not flushed to disk).
Now another node wants the lock in EX, the downconvert thread gets kicked (the
one that tripped the assert above), finds an unflushed trans but the lock is
not EX (it is PR). If the lock were at EX, it would have flushed the trans
ocfs2_ci_checkpointed -> ocfs2_start_checkpoint before downconverting (to NULL)
for the request.
ocfs2_setattr must not drop the inode lock in EX in this code path. If it does, and
takes it again before the trans, say in ocfs2_set_acl, another cluster node can
get in between, execute another setattr, overwriting the one in progress
on this node, resulting in a mode acl size combo that is a mix of the two.
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek:
IB/rds_rdma: unloading of ofed stack causes page fault panic
RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
net: Modify sk_alloc to not reference count the netns of kernel sockets.
net: Pass kern from net_proto_family.create to sk_alloc
net: Add a struct net parameter to sock_create_kern
Nick Alcock [Thu, 3 Sep 2015 15:42:09 +0000 (16:42 +0100)]
kallsyms: unbreak kallmodsyms after CONFIG_KALLMODSYMS addition
The recent addition of CONFIG_KALLMODSYMS in 28df3b99a7 had the effect
of entirely disabling all module info in /proc/kallmodsyms, thus
breaking all module-specific symbol lookups from DTrace.
This is because you can't use a CONFIG_ symbol in a HOSTCC-compiled
program without including autoconf.h by hand, and we weren't, so
scripts/kallsyms.c always acted as if CONFIG_KALLMODSYMS was turned
off and didn't populate the kallsyms_modules or kallsyms_symbol_modules
tables.
(Including autoconf.h in this context is safe, because kallsyms.c never
gets compiled until after some *config target has run. Other build
tools in a similar position, such as modpost, already do this.)
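The failure mode is easy to reproduce in any host-compiled C file. In the sketch below, CONFIG_SKETCH_FEATURE is a made-up symbol used only for illustration:

```c
/*
 * In a kernel build, CONFIG_* macros come from the generated autoconf.h.
 * A host tool that does not include that header sees them as undefined,
 * so the #ifdef below silently takes the #else branch with no warning.
 */
#ifdef CONFIG_SKETCH_FEATURE
#define SKETCH_FEATURE_ENABLED 1
#else
#define SKETCH_FEATURE_ENABLED 0
#endif

/* Reports which branch the preprocessor actually selected. */
int sketch_feature_enabled(void)
{
	return SKETCH_FEATURE_ENABLED;
}
```

Nothing in the build fails when the macro is missing, which is why this kind of breakage tends to show up only at run time.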
Orabug: 21539840 Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
Nick Alcock [Sat, 15 Aug 2015 11:18:14 +0000 (12:18 +0100)]
kallsyms: de-ifdef kallmodsyms
CONFIG_KALLMODSYMS is a bit ugly because of the burden of ifdefs. It's
hard to remove them from scripts/kallsyms.c, but kernel/kallsyms.c doesn't
need any, since even when CONFIG_KALLMODSYMS is on it does not pull in any
extra build dependencies in and of itself: it just needs to arrange to not
create the kallmodsyms /proc node when the config option is turned off.
This will have the effect of disabling /proc/kallmodsyms when
CONFIG_KALLMODSYMS=n, without cluttering up the code with so many
ifdefs. (We still need one to populate the node in the first place.)
We also reverse the code motion we did earlier to make the other ifdefs
easier to insert.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Orabug: 21539840
dtrace: use syscall_get_nr() to obtain syscall number
Rather than trying to get the syscall number directly from %rax on
x86_64, which is error prone due to compiler changes causing that
register to get clobbered, we use the syscall_get_nr() function to
get the same information.
There is no need to use WARN_TAINT_ONCE to generate such a big noise if this is
not a critical error for the kernel. The DCA driver could print a debug
message and then quit quietly.
If this is a real BIOS bug, please ignore this patch. Let's transfer
this issue to the BIOS guys.
Merge branch 'topic/uek-4.1/ofed.rds-p2' into topic/uek-4.1/ofed
* topic/uek-4.1/ofed.rds-p2:
IB/rds_rdma: unloading of ofed stack causes page fault panic
RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
Rama Nichanamatlu [Thu, 11 Jun 2015 17:43:54 +0000 (10:43 -0700)]
IB/rds_rdma: unloading of ofed stack causes page fault panic
This issue surfaced at the tail end of OFED functional automatic test suite
while unloading ofed modules resulting in following stack trace:
BUG: unable to handle kernel paging request at ffffffffa0abd1a0
IP: [<ffffffffa0abd1a0>] 0xffffffffa0abd1a0
Sowmini Varadhan [Fri, 28 Aug 2015 14:09:04 +0000 (10:09 -0400)]
RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
Register pernet subsys init/stop functions that will set up
and tear down per-net RDS-TCP listen endpoints. Unregister
pernet subsys functions on 'modprobe -r' to clean up these
end points.
Enable keepalive on both accept and connect socket endpoints.
The keepalive timer expiration will ensure that client socket
endpoints will be removed as appropriate from the netns when
an interface is removed from a namespace.
Register a device notifier callback that will clean up all
sockets (and thus avoid the need to wait for keepalive timeout)
when the loopback device is unregistered from the netns indicating
that the netns is getting deleted.
Sowmini Varadhan [Fri, 28 Aug 2015 11:16:01 +0000 (07:16 -0400)]
RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
Open the sockets calling sock_create_kern() with the correct struct net
pointer, and use that struct net pointer when verifying the
address passed to rds_bind().
Sowmini Varadhan [Fri, 28 Aug 2015 00:57:24 +0000 (20:57 -0400)]
net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).
E.g., for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.
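The reference-counting rule can be sketched with two toy structures. These are simplified stand-ins, not the kernel's struct sock or struct net, and the function names are hypothetical:

```c
#include <stdbool.h>

struct sketch_net {
	int refs;		/* namespace reference count */
};

struct sketch_sock {
	struct sketch_net *net;
	bool net_refcnt;	/* plays the role of sk_net_refcnt */
};

/* Allocation path: kernel sockets do not pin the namespace. */
void sketch_sock_init(struct sketch_sock *sk, struct sketch_net *net, bool kern)
{
	sk->net = net;
	sk->net_refcnt = !kern;
	if (sk->net_refcnt)
		net->refs++;
}

/* Clone path: take a reference if, and only if, the parent holds one. */
void sketch_sock_clone(struct sketch_sock *newsk, const struct sketch_sock *parent)
{
	*newsk = *parent;
	if (newsk->net_refcnt)
		newsk->net->refs++;
}
```

Keeping the clone path consistent with the allocation path means a child of a kernel listen socket never holds a namespace reference that its own flag says it does not have.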
Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the
netns of kernel sockets.")
Acked-by: Eric Dumazet <edumazet@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sowmini Varadhan [Thu, 27 Aug 2015 23:23:26 +0000 (19:23 -0400)]
net: Modify sk_alloc to not reference count the netns of kernel sockets.
Now that sk_alloc knows when a kernel socket is being allocated modify
it to not reference count the network namespace of kernel sockets.
Keep track of if a socket needs reference counting by adding a flag to
struct sock called sk_net_refcnt.
Update all of the callers of sock_create_kern to stop using
sk_change_net and sk_release_kernel as those hacks are no longer
needed, to avoid reference counting a kernel socket.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Sowmini Varadhan [Thu, 27 Aug 2015 21:22:00 +0000 (17:22 -0400)]
net: Pass kern from net_proto_family.create to sk_alloc
In preparation for changing how struct net is refcounted
on kernel sockets pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Santosh Shilimkar [Fri, 28 Aug 2015 17:56:32 +0000 (10:56 -0700)]
Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek: (61 commits)
i40e/i40evf: Bump version to 1.3.6 for i40e and 1.3.2 for i40evf
i40e: Refine an error message to avoid confusion
i40e/i40evf: Add support for pre-allocated pages for PD
i40evf: add MAC address filter in open, not init
i40evf: don't delete all the filters
i40e: un-disable VF after reset
i40e: do a proper reset when disabling a VF
i40e: correctly program filters for VFs
i40e/i40evf: Update the admin queue command header
i40e: Remove incorrect #ifdef's
i40e: ignore duplicate port VLAN requests
i40evf: Allow for an abundance of vectors
i40e/i40evf: improve Tx performance with a small tweak
i40e/i40evf: Update Flex-10 related device/function capabilities
i40e/i40evf: Add stats to track FD ATR and SB dynamic enable state
i40e: Implement ndo_features_check()
i40evf: don't configure unused RSS queues
i40evf: fix panic during MTU change
i40e: Bump version to 1.3.4
i40e/i40evf: remove time_stamp member
...
Santosh Shilimkar [Fri, 28 Aug 2015 17:56:10 +0000 (10:56 -0700)]
Merge branch 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1
* 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
nfs: take extra reference to fl->fl_file when running a LOCKU operation
mm: madvise allow remove operation for hugetlbfs
mmotm: build fix hugetlbfs fallocate if not CONFIG_NUMA
hugetlbfs: add hugetlbfs_fallocate()
hugetlbfs: New huge_add_to_page_cache helper routine
mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate
mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch
mm/hugetlb.c: make vma_has_reserves() return bool
hugetlbfs: truncate_hugepages() takes a range of pages
hugetlbfs: hugetlb_vmtruncate_list() needs to take a range to delete
mm/hugetlb: expose hugetlb fault mutex for use by fallocate
mm/hugetlb: add region_del() to delete a specific range of entries
mm-hugetlb-add-cache-of-descriptors-to-resv_map-for-region_add-fix
mm/hugetlb: add cache of descriptors to resv_map for region_add
mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
mm/hugetlb: compute/return the number of regions added by region_add()
mm/hugetlb: document the reserve map/region tracking routines
The problem is almost exactly the same as the one fixed by feaff8e5b2cf.
We must take a reference to the struct file when running the LOCKU
compound to prevent the final fput from running until the operation is
complete.
Reported-by: Jean Spector <jean@primarydata.com> Signed-off-by: Jeff Layton <jeff.layton@primarydata.com> Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
Orabug: 21687670
(cherry picked from mainline commit db2efec0caba4f81a22d95a34da640b86c313c8e) Acked-by: Chuck Anderson <chuck.anderson@oracle.com>
NFS on a 2 node ocfs2 cluster each node exporting dir. The lock causing
the hang is the global bit map inode lock. Node 1 is master, has
the lock granted in PR mode; Node 2 is in the converting list (PR ->
EX). There are no holders of the lock on the master node so it should
downconvert to NL and grant EX to node 2 but that does not happen.
BLOCKED + QUEUED in lock res are set and it is on osb blocked list.
Threads are waiting in __ocfs2_cluster_lock on BLOCKED. One thread wants
EX, rest want PR. So it is as though the downconvert thread needs to be
kicked to complete the conv.
The hang is caused by an EX req coming into __ocfs2_cluster_lock on
the heels of a PR req after it sets BUSY (drops l_lock, releasing EX
thread), forcing the incoming EX to wait on BUSY without doing anything.
PR has called ocfs2_dlm_lock, which sets the node 1 lock from NL ->
PR, queues ast.
At this time, upconvert (PR ->EX) arrives from node 2, finds conflict with
node 1 lock in PR, so the lock res is put on dlm thread's dirty list.
After ret from ocfs2_dlm_lock, PR thread now waits behind EX on BUSY till
awoken by ast.
Now it is dlm_thread that serially runs dlm_shuffle_lists, ast, bast,
in that order. dlm_shuffle_lists queues a bast on behalf of node 2
(which will be run by dlm_thread right after the ast). ast does its
part, sets UPCONVERT_FINISHING, clears BUSY and wakes its waiters. Next,
dlm_thread runs bast. It sets BLOCKED and kicks dc thread. dc thread
runs ocfs2_unblock_lock, but since UPCONVERT_FINISHING is set, skips doing
anything and requeues.
Inside of __ocfs2_cluster_lock, since EX has been waiting on BUSY ahead
of PR, it wakes up first, finds BLOCKED set and skips doing anything
but clearing UPCONVERT_FINISHING (which was actually "meant" for the
PR thread), and this time waits on BLOCKED. Next, the PR thread comes
out of wait but since UPCONVERT_FINISHING is not set, it skips updating
the l_ro_holders and goes straight to wait on BLOCKED. So there, we
have a hang! Threads in __ocfs2_cluster_lock wait on BLOCKED, lock
res in osb blocked list. Only when dc thread is awoken, it will run
ocfs2_unblock_lock and things will unhang.
One way to fix this is to wake the dc thread on the flag after clearing
UPCONVERT_FINISHING.
Change-ID: I84573d9fa51effc5b29bf5b8c74e3cc8b2673f48 Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 76945bf9ff8a2433f1efb777ec64475c1eec08ab) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Change a warning message to indicate what may have really happened when
the init_shared_code call fails.
Change-ID: I616ace40fed120d0dec86dfc91ab2d7cde466904 Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit b2a75c5819ec910f430a2ff12fec6cce202899a0) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
The i40e_add_pd_table_entry() routine is being modified to handle both
cases where a backing page is passed and where backing page is allocated
in i40e_add_pd_table_entry().
For PBLE resource management, it is more efficient for it to manage its
backing pages. For VF, PBLE backing page addresses will be sent to PF
driver for PBLE resource.
The i40e_remove_pd_bp() is also modified to not free pre-allocated pages and
free only ones which were allocated in i40e_add_pd_table_entry().
Change-ID: Ie673f0403f22979e9406f5a94048dceb91bcf9a8 Signed-off-by: Faisal Latif <faisal.latif@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 3bbf0faa90cb8d541d8b2ce01610dcec6828bd00) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
During close, all of the MAC filters are cleared, so the driver would be
unable to receive unicast packets after being closed and reopened.
Add the adapter's "hardware" MAC address filter in open, not init. This
ensures that the correct filter is present each time.
Change-ID: I51a11e9c1200139dab6f66a5353bd38c7d26f875 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 44151cd32deb1074530f3beba51d535fa0887d9a) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Due to an inverted conditional, the driver was marking all of its MAC
filters for deletion every time set_rx_mode was called. Depending upon
the timing of the calls to set_rx_mode and the processing of the admin
queue, the driver would (accidentally) end up with a varying number of
functional filters.
Correct this logic so that MAC filters are added and removed correctly.
Add a check for the driver's "hardware" MAC address so that this filter
doesn't get removed incorrectly.
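The corrected marking logic can be sketched roughly as below. This is only an illustration under assumptions: `struct mac_filter`, `addr_in_list()` and `mark_stale_filters()` are hypothetical names, and flat arrays stand in for the driver's filter list and the address list seen by set_rx_mode.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ETH_ALEN 6

/* Hypothetical filter entry; the real driver keeps its own list type. */
struct mac_filter {
    unsigned char addr[ETH_ALEN];
    bool remove;    /* set when the filter should be deleted */
};

/* wanted is a flat array of nwanted addresses, ETH_ALEN bytes each. */
static bool addr_in_list(const unsigned char *addr,
                         const unsigned char *wanted, size_t nwanted)
{
    for (size_t i = 0; i < nwanted; i++)
        if (memcmp(addr, wanted + i * ETH_ALEN, ETH_ALEN) == 0)
            return true;
    return false;
}

/*
 * Mark for deletion only the filters that are NOT in the wanted list,
 * and never the filter for the adapter's own hardware MAC address.
 * Returns the number of filters marked.
 */
static size_t mark_stale_filters(struct mac_filter *filters, size_t nfilters,
                                 const unsigned char *wanted, size_t nwanted,
                                 const unsigned char *hw_addr)
{
    size_t marked = 0;

    for (size_t i = 0; i < nfilters; i++) {
        bool is_hw = memcmp(filters[i].addr, hw_addr, ETH_ALEN) == 0;
        bool found = addr_in_list(filters[i].addr, wanted, nwanted);

        /* With the test inverted, every wanted filter would be marked. */
        filters[i].remove = !found && !is_hw;
        if (filters[i].remove)
            marked++;
    }
    return marked;
}
```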
Change-ID: Ib3e7c4a5b53df6835f164fe44cb778cb71f8aff8 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 68ef169204e3a88ea4823645038d5496f66200f6) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
When a VF is disabled, there is no way for it to recover until either
the PF driver is reloaded or SR-IOV is disabled and enabled. To correct
this, enable the VF after a successful reset.
Change-ID: I9e0788476c4d53d5407961b503febdfff2b8a7c6 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 5b8f8505d37c63d492391e5fafcd43332671b36b) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
The VF disable code was just whanging on the reset bit without properly
cleaning up the VF, which would leave the VF in an indeterminate state
from which it could not recover. Fix this by notifying the VF and then
by calling the normal VF reset routine.
Change-ID: I862b9dfa919368773cbdc212b805b520db2f7430 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 54f455eeb56c0ab92db87bed6bd767d206d9e743) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
MAC filters for VFs were being programmed with 0 for the VLAN value when
there was no VLAN assigned. This is incorrect and actually assigns the
VF to VLAN 0. Instead, we must use -1 to indicate that no VLAN is in
use. This change programs the filters correctly and gets rid of a bogus
error message when setting a port VLAN on an active VF.
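As a minimal sketch of the distinction, assuming a hypothetical `struct vf_state` in which a `port_vlan_id` of 0 means "no port VLAN assigned" (the names and the `VLAN_ANY` macro are illustrative, not taken from the driver):

```c
#include <stdint.h>

/* Illustrative sentinel for "no VLAN"; the commit only states that -1 is used. */
#define VLAN_ANY (-1)

/* Hypothetical per-VF state; 0 is assumed to mean no port VLAN is set. */
struct vf_state {
    uint16_t port_vlan_id;
};

/* Return the VLAN value to program into the VF's MAC filter. */
static int16_t vf_filter_vlan(const struct vf_state *vf)
{
    /* Passing 0 through would place the VF on VLAN 0, not on "no VLAN". */
    return vf->port_vlan_id ? (int16_t)vf->port_vlan_id : VLAN_ANY;
}
```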
Change-ID: Ica9a9906d768405377ff3308e27f7d0b5b2ea96e Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit e995163cdcf9b70c7840a8d6a7ea7c0ce81c761b) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Change-ID: Ib031c86cc6cab78e5aa44c64d8ce5474be8d7e42 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit cb2f65bc0c64015e8fa45fe1065ad241bf31a994) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
This patch removes some #ifdef's that should not be there. They
were stopping code that is needed from being compiled in.
With these #ifdef's removed, changes are needed in the driver
to fix some compile errors: adding missing parameters to
the definition of ndo_bridge_setlink and a ndo_dflt_bridge_getlink call.
Change-ID: I5516614e1bc50b6bca0647cef971bc96161ba2de Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com> Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 9df70b66418e284dc1e7f272ac445c1d1e990b97) Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/net/ethernet/intel/i40e/i40e_main.c Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
If user attempts to set a port VLAN on a VF that already has the same
port VLAN configured, the driver will go through a completely
unnecessary flurry of filter removals and filter adds. Just check for
this condition and return success instead of doing a bunch of busywork.
Change-ID: Ia1a9e83e6ed48b3f4658bc20dfc6af0cf525d54a Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 85927ec1b369c880407aa82eba70d49c04c35062) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
The driver currently only maps TX and RX queues to a single MSI-X vector
per queue pair if there are exactly enough vectors for this.
Unfortunately, if we have too many vectors it will fail and allocate
queues to vectors in a suboptimal manner. Change the condition check to
allow for excess vectors. In this case, the extras just won't be used.
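The relaxed condition might look like the sketch below. The function names are hypothetical, `vectors` is assumed to be positive, and the round-robin fallback is an assumption made for illustration; the commit does not describe how the suboptimal path assigns queues.

```c
#include <stdbool.h>

/*
 * Decide whether each Tx/Rx queue pair can get its own MSI-X vector.
 * The old check was effectively (vectors == queue_pairs), which
 * rejected the case where spare vectors are available.
 */
static bool one_vector_per_pair(int vectors, int queue_pairs)
{
    return vectors >= queue_pairs;
}

/* Pick the vector index for a given queue pair. */
static int vector_for_pair(int pair, int vectors, int queue_pairs)
{
    if (one_vector_per_pair(vectors, queue_pairs))
        return pair;        /* 1:1 mapping; surplus vectors stay unused */
    return pair % vectors;  /* assumed fallback: share vectors round-robin */
}
```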
Change-ID: I23e1e2955c64739c86612db88a25583e6a7e0b17 Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 973371da4d66b96736143bd3f2b2ff2331faae8f) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Add a prefetch for the next Tx descriptor to be used when we know
there are more coming.
Change-ID: Ibb9acab11d508eec2db7da795df74debc16eeacb Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 489ce7a46306052ab4ef26c6305051c5f1b24bb4) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
The Flex10 device/function capability has been upgraded to include
information needed to support Flex-10 configurations. This patch adds new
fields to the i40e_hw_capabilities structure and updates
i40e_parse_discover_capabilities functions to extract them from the AQ
response. Naming convention has changed to use flex10 mode instead of
existing mfp_mode_1.
Change-ID: I305dd888866985a30293acb3fb14fa43ca6b79ea Signed-off-by: Pawel Orlowski <pawel.orlowski@intel.com> Signed-off-by: Akeem G Abodunrin <akeem.g.abodunrin@intel.com> Signed-off-by: Shannon Nelson <shannon.nelson@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit c78b953e0f189824f5eaa2d60123cfd12ea6db0d) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Since the driver can dynamically enable/disable FD ATR and SB features,
these stats help keep track of the current state and along with
fd_flush count provide a means to debug what could be going on
with the flow director filters. This will take away the need for
being verbose in our debug logs with respect to FD.
Change-ID: I29224f750fe6602391043655d18996570720377d Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit d0389e51fc9b3c74e7935ded5d22eab4ea004589) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
The driver will only configure as many queues as there are available
CPUs, up to the maximum number of queues. However, it always configures
RSS as though it is using the maximum number of queues. This can cause
the device to drop a lot of RX traffic, as the packets get assigned to
nonfunctional queues.
Fix this by only configuring RSS with the number of active queues.
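A rough sketch of the idea, assuming a byte-wide lookup table and a nonzero active queue count (`fill_rss_lut()` is a hypothetical helper, not the driver's routine):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill an RSS lookup table so that every hash bucket points at a queue
 * that is actually in use. Using the maximum queue count as the modulus
 * instead would steer some buckets to queues that were never set up.
 */
static void fill_rss_lut(uint8_t *lut, size_t lut_size, unsigned active_queues)
{
    for (size_t i = 0; i < lut_size; i++)
        lut[i] = (uint8_t)(i % active_queues);
}
```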
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 40746eb14c6b44f4d635c2f4cf8c67550db9b3ab) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Down was requesting queue disables, but then exited immediately
without waiting for the queues to actually disable. This could
allow any function called after i40evf_down to run immediately,
including i40evf_up, and causes a memory leak.
Removing the whole reinit_locked function is the best way
to go about this, and allows for the driver to handle the
state changes by requesting reset from the periodic timer.
Also, add a couple WARN_ONs in slow path to help us recognize
if we re-introduce this issue or missed any cases.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 67c818a1d58c7897b8a6f531684516f9c236fe1b) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Change-ID: I54ec2787a9fead5e18447078f26e5dd27f01da44 Signed-off-by: Catherine Sullivan <catherine.sullivan@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit f029094e49814b56fdb3261a694c8890983b7a2d) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
The driver doesn't use the time_stamp member to determine if there is a
tx_hang any more. There really isn't any point to the variable at all
so just remove it. It was left over from a previous tx_hang design.
Change-ID: I4c814827e1bcb46e45118fe37acdcfa814fb62a0 Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 335075989fbb3c3fffc3ba238b893fa92508a6f1) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Eric added support for skb->xmit_more in i40e; this ports that into
i40evf as well.
Supporting skb->xmit_more in i40evf is straightforward; we need to move
around i40e_maybe_stop_tx() call to correctly test netif_xmit_stopped()
before taking the decision to not kick the NIC.
Change-ID: Idddda6a2e4a7ab335631c91ced51f55b25eb8468 Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 8f6a2b05c67d915cef66b2c9636404e0d531def2) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
These are not useful unless SV is happening as there is a FD flush counter
that tracks this.
Change-ID: If2655b5a29687247d03a51d35f69854bbeb711ce Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 2e4875e38c288702c2002c7bcf527d8aa0083979) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Because i40e_fcoe_ctxt_eof should never be called without
i40e_fcoe_eof_is_supported being called first, the EOF in fcoe_ctxt_eof
should always be valid and therefore we do not need to print an error
if it is not valid.
However, a WARN_ON to easily catch any calls to i40e_fcoe_ctxt_eof that
aren't preceded with a call to i40e_fcoe_eof_is_supported is helpful.
Change-ID: I3b536b1981ec0bce80576a74440b7dea3908bdb9 Signed-off-by: Vasu Dev <vasu.dev@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 41837cad54fe22d29f021f6cb0e9d151acb104a0) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
There's no need for a counter so remove the TODO comment.
Change-ID: I3321dda04934c4f5fda9b279ab666192bda44214 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 6b02a174c1542486eeaa1de94e6c38e9271b89d8) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
We can use the stat index macro directly, a variable is not required.
Change-ID: I19f08ac16353dc0cd87a1a8248d714e15a54aa8a Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 0bf4b1b0c3fda4dd72910cba3c40b3273a2de756) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Add a 3rd dynamic filter counter to track Tunneled ATR hits separately.
Ethtool port stat "fdir_atr_tunnel_match"
Change-ID: Idd978b6db2a462b5722397cd2ffd04ef055f8655 Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 60ccd45cbabdc058061b860c43c48877558cc176) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Without this, RSS would have done inner header load balancing. Now we can
get the benefits of ATR for tunneled packets to better align TX and RX
queues with the right core/interrupt.
Change-ID: I07d0e0a192faf28fdd33b2f04c32b2a82ff97ddd Signed-off-by: Anjali Singhai Jain <anjali.singhai@intel.com> Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit 89232c3bf78b3799699e48201f60892283564b78) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Require the user to disable virtual functions before running the device
offline diagnostics. The offline diagnostics are intended to ensure
basic operation of the device - it is beyond the scope of the diagnostic
test to handle the additional complexity of bringing all the virtual
functions offline and then back online for each test run.
Change-ID: Ic0b854851a09fc85df0c9e82c220e45885457c30 Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit e17bc411aea8fbebc51857037f104ab09f765120) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
When PFC is enabled for any UP in single TC configuration the driver didn't
collect the PFC XOFF RX stats. Though a single TC with PFC enabled is not a
common scenario do not prevent the driver from collecting stats if firmware
indicates that PFC is enabled.
Change-ID: Ie20bd58b07608b528f3c6d95894c9ae56b00077a Signed-off-by: Neerav Parikh <neerav.parikh@intel.com> Tested-by: Jim Young <james.m.young@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
(cherry picked from commit e120814d74bc805769d18ed7177f43a17a88fd40) Signed-off-by: Brian Maly <brian.maly@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Commit 56bb4d795 introduced a build error if CONFIG_NUMA is not
defined. When fallocate preallocation allocates pages, it will
use the defined numa policy. However, if numa is not defined
there is no such policy and no code should reference numa policy.
Create wrappers to isolate policy manipulation code that are a
NOOP in the non-NUMA case.
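The wrapper pattern can be illustrated as follows. Only `CONFIG_NUMA` is a real symbol here; `struct fake_vma`, `set_alloc_policy()`, `drop_alloc_policy()` and `preallocate_one()` are hypothetical stand-ins, not the names used in the patch.

```c
/* Hypothetical stand-in for the structure that carries the policy. */
struct fake_vma {
    int policy_set;
};

#ifdef CONFIG_NUMA
/* NUMA build: the wrappers really manipulate the policy. */
static inline void set_alloc_policy(struct fake_vma *vma)
{
    vma->policy_set = 1;
}

static inline void drop_alloc_policy(struct fake_vma *vma)
{
    vma->policy_set = 0;
}
#else
/* Non-NUMA build: same call sites, but the wrappers do nothing. */
static inline void set_alloc_policy(struct fake_vma *vma)
{
    (void)vma;
}

static inline void drop_alloc_policy(struct fake_vma *vma)
{
    (void)vma;
}
#endif

/* The caller is written once and contains no #ifdef of its own. */
static int preallocate_one(struct fake_vma *vma)
{
    set_alloc_policy(vma);
    int had_policy = vma->policy_set;   /* the allocation would happen here */
    drop_alloc_policy(vma);
    return had_policy;
}
```

Keeping the conditional compilation inside the wrappers means the preallocation path itself never references policy code directly, which is what avoids the build error.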
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 0ce9057732d1dd94ef2bd32c8acb68ae68b08a2f) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
This is based on the shmem version, but it has diverged quite a bit. We
have no swap to worry about, nor the new file sealing. Add
synchronization via the fault mutex table to coordinate page faults,
fallocate allocation and fallocate hole punch.
What this allows us to do is move physical memory in and out of a
hugetlbfs file without having it mapped. This also gives us the ability
to support MADV_REMOVE since it is currently implemented using
fallocate(). MADV_REMOVE lets madvise() remove pages from the middle of a
hugetlbfs file, which wasn't possible before.
hugetlbfs fallocate only operates on whole huge pages.
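One plausible way to restrict a byte range to whole huge pages is sketched below. The rounding directions are an assumption, not something this message states: hole punch shrinks the range to the huge pages lying fully inside it, while preallocation grows it to cover every huge page it touches. All names are hypothetical, and `hpage_size` is assumed to be a power of two.

```c
#include <stdint.h>

struct hpage_range {
    uint64_t start;   /* first byte affected */
    uint64_t end;     /* one past the last byte affected */
};

static uint64_t round_down_hp(uint64_t x, uint64_t hpage_size)
{
    return x & ~(hpage_size - 1);
}

static uint64_t round_up_hp(uint64_t x, uint64_t hpage_size)
{
    return round_down_hp(x + hpage_size - 1, hpage_size);
}

/* Hole punch: keep only huge pages fully inside [offset, offset + len). */
static struct hpage_range punch_range(uint64_t offset, uint64_t len,
                                      uint64_t hpage_size)
{
    struct hpage_range r = {
        round_up_hp(offset, hpage_size),
        round_down_hp(offset + len, hpage_size),
    };

    if (r.end < r.start)
        r.end = r.start;    /* no whole huge page in the range */
    return r;
}

/* Preallocation: cover every huge page touched by the byte range. */
static struct hpage_range prealloc_range(uint64_t offset, uint64_t len,
                                         uint64_t hpage_size)
{
    struct hpage_range r = {
        round_down_hp(offset, hpage_size),
        round_up_hp(offset + len, hpage_size),
    };
    return r;
}
```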
Based on code by Dave Hansen.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 79925b07fd22d7c4e4e77cdc26edb26dc4ff2701) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Currently, there is only a single place where hugetlbfs pages are added to
the page cache. The new fallocate code will be adding a second one, so break
the functionality out into its own helper.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 4b153581930e8c61250078efcdcce3e19bc2a45b) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Areas hole punched by fallocate will not have entries in the
region/reserve map. However, shared mappings with min_size subpool
reservations may still have reserved pages. alloc_huge_page needs to
handle this special case and do the proper accounting.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit b21aa74b8077ce5b9c5fea566fe37af866934746) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
In vma_has_reserves(), the current assumption is that reserves are always
present for shared mappings. However, this will not be the case with
fallocate hole punch. When punching a hole, the present page will be
deleted as well as the region/reserve map entry (and hence any
reservation). vma_has_reserves is passed "chg" which indicates whether or
not a region/reserve map is present. Use this to determine if reserves
are actually present or were removed via hole punch.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 320fb799e7cedc1edb96fd69c686547c731d5fc8) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Modify truncate_hugepages() to take a range of pages (start, end) instead
of simply start. If an end value of LLONG_MAX is passed, the current
"truncate" functionality is maintained. Existing callers are modified to
pass LLONG_MAX as end of range. By keying off end == LLONG_MAX, the
routine behaves differently for truncate and hole punch. Page removal is
now synchronized with page allocation via faults by using the fault mutex
table. The hole punch case can experience the rare region_del error and
must handle accordingly.
Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in the
case where region_del returns an error.
Since the routine handles more than just the truncate case, it is renamed
to remove_inode_hugepages(). To be consistent, the routine
truncate_huge_page() is renamed remove_huge_page().
Downstream of remove_inode_hugepages(), the routine
hugetlb_unreserve_pages() is also modified to take a range of pages.
hugetlb_unreserve_pages is modified to detect an error from region_del and
pass it back to the caller.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 6a57804ccdfb77b8f333b736a3ee7cb1bf8732e1) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
fallocate hole punch will want to unmap a specific range of pages. Modify
the existing hugetlb_vmtruncate_list() routine to take a start/end range.
If end is 0, this indicates all pages after start should be unmapped.
This is the same as the existing truncate functionality. Modify existing
callers to add 0 as end of range.
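The "end of 0 means unbounded" convention can be sketched as a small overlap computation. `unmap_overlap()` and its parameters are hypothetical and work on plain page indices; this is not the kernel routine.

```c
#include <stdbool.h>

/*
 * Compute which part of one mapping, covering pages [map_start, map_end),
 * falls inside the range to unmap, [start, end). An end of 0 means
 * "everything after start", which matches the old truncate behaviour.
 * Returns false when there is nothing to unmap in this mapping.
 */
static bool unmap_overlap(unsigned long map_start, unsigned long map_end,
                          unsigned long start, unsigned long end,
                          unsigned long *out_start, unsigned long *out_end)
{
    unsigned long lo = map_start > start ? map_start : start;
    unsigned long hi = map_end;

    if (end != 0 && end < hi)
        hi = end;           /* bounded range: clip at end */
    if (lo >= hi)
        return false;

    *out_start = lo;
    *out_end = hi;
    return true;
}
```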
Since the routine will be used in hole punch as well as truncate
operations, it is more appropriately renamed to hugetlb_vmdelete_list().
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit dea23e2a9e811e5fba895a134f701455908aa0d3) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
hugetlb page faults are currently synchronized by the table of mutexes
(htlb_fault_mutex_table). fallocate code will need to synchronize with
the page fault code when it allocates or deletes pages. Expose interfaces
so that fallocate operations can be synchronized with page faults. Minor
name changes to be more consistent with other global hugetlb symbols.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit fec73245c33b067c60f520908a93c971003664c8) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
fallocate hole punch will want to remove a specific range of pages. The
existing region_truncate() routine deletes all region/reserve map entries
after a specified offset. region_del() will provide this same
functionality if the end of region is specified as LONG_MAX. Hence,
region_del() can replace region_truncate().
Unlike region_truncate(), region_del() can return an error in the rare
case where it can not allocate memory for a region descriptor. This ONLY
happens in the case where an existing region must be split. Current
callers passing LONG_MAX as end of range will never experience this error
and do not need to deal with error handling. Future callers of
region_del() (such as fallocate hole punch) will need to handle this
error.
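Why LONG_MAX callers never see the error can be shown with a small predicate. This is a sketch under the assumption that regions are half-open intervals [from, to); `del_needs_split()` is a hypothetical helper, not part of the patch.

```c
#include <stdbool.h>
#include <limits.h>

/*
 * Deleting [f, t) from an existing region [from, to) leaves two pieces,
 * and therefore needs a freshly allocated descriptor, only when the
 * deleted range lies strictly inside the region. With t == LONG_MAX the
 * right-hand piece can never exist, so no allocation is ever required.
 */
static bool del_needs_split(long from, long to, long f, long t)
{
    return f > from && t < to;
}
```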
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit c2cfad5701106f8ddb0607b9e09d524ef55ef0ec) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
hugetlbfs is used today by applications that want a high degree of control
over huge page usage. Often, large hugetlbfs files are used to map a
large number of huge pages into the application processes. The applications
know when page ranges within these large files will no longer be used, and
ideally would like to release them back to the subpool or global pools for
other uses. The fallocate() system call provides an interface for
preallocation and hole punching within files. This patch set adds
fallocate functionality to hugetlbfs.
fallocate hole punch will want to remove a specific range of pages. When
pages are removed, their associated entries in the region/reserve map will
also be removed. This will break an assumption in the
region_chg/region_add calling sequence. If a new region descriptor must
be allocated, it is done as part of the region_chg processing. In this
way, region_add can not fail because it does not need to attempt an
allocation.
To prepare for fallocate hole punch, create a "cache" of descriptors that
can be used by region_add if necessary. region_chg will ensure there are
sufficient entries in the cache. It will be necessary to track the number
of in progress add operations to know a sufficient number of descriptors
reside in the cache. A new routine region_abort is added to adjust this
in progress count when add operations are aborted. vma_abort_reservation
is also added for callers creating reservations with
vma_needs_reservation/vma_commit_reservation.
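The bookkeeping described above can be modelled with two counters. This is a simplified sketch only: `struct resv_map_sketch` and the `sketch_*` functions are hypothetical, real descriptor allocation is replaced by a counter, and the `alloc_ok` flag merely simulates whether an allocation would succeed.

```c
#include <stdbool.h>

struct resv_map_sketch {
    long adds_in_progress;  /* chg calls not yet matched by add or abort */
    long cache_count;       /* preallocated descriptors on hand */
};

/*
 * chg: announce an upcoming add and top up the cache so that every
 * in-progress add is guaranteed a descriptor. This is the only step
 * that may fail. Returns 0 on success, -1 on allocation failure.
 */
static int sketch_chg(struct resv_map_sketch *m, bool alloc_ok)
{
    m->adds_in_progress++;
    while (m->cache_count < m->adds_in_progress) {
        if (!alloc_ok) {
            m->adds_in_progress--;  /* undo the announcement */
            return -1;
        }
        m->cache_count++;           /* stands in for one allocation */
    }
    return 0;
}

/* add: cannot fail; takes a descriptor from the cache if it needs one. */
static void sketch_add(struct resv_map_sketch *m, bool needs_descriptor)
{
    if (needs_descriptor)
        m->cache_count--;
    m->adds_in_progress--;
}

/* abort: the announced add will not happen; only the count is adjusted. */
static void sketch_abort(struct resv_map_sketch *m)
{
    m->adds_in_progress--;
}
```

In this model the cache never holds fewer descriptors than there are adds in progress, so an add can always be satisfied without allocating.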
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
(cherry picked from commit 27af163113310a86b6d19bb5693c1a08eb89b0f7) Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>