www.infradead.org Git - users/jedix/linux-maple.git/log
9 years ago  Merge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into...
Santosh Shilimkar [Thu, 17 Sep 2015 21:18:10 +0000 (14:18 -0700)]
Merge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek:
  uek-rpm: build: sparc: Build sparc headers
  uek-rpm: configs: Adjust config for new rpcrdma.ko module
  uek-rpm: builds: sparc64: enable dtrace support

9 years ago  Merge branch 'uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux...
Santosh Shilimkar [Thu, 17 Sep 2015 21:17:58 +0000 (14:17 -0700)]
Merge branch 'uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
  rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats

9 years ago  Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek...
Santosh Shilimkar [Thu, 17 Sep 2015 21:17:46 +0000 (14:17 -0700)]
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek:
  RDS: change spin_lock to spin_lock_bh
  rds: add busy_list only when fmr allocated successfully
  rds: free ib_device related resource
  rds: srq initialization and cleanup

9 years ago  Merge branch 'topic/uek-4.1/nfs-rdma' of git://ca-git.us.oracle.com/linux-uek into...
Santosh Shilimkar [Thu, 17 Sep 2015 21:17:13 +0000 (14:17 -0700)]
Merge branch 'topic/uek-4.1/nfs-rdma' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/nfs-rdma' of git://ca-git.us.oracle.com/linux-uek: (71 commits)
  xprtrdma: Add class for RDMA backwards direction transport
  svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies
  svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls
  svcrdma: Add svc_rdma_get_context() API that is allowed to fail
  svcrdma: Define maximum number of backchannel requests
  NFS: Enable client side NFSv4.1 backchannel to use other transports
  svcrdma: Add backward direction service for RPC/RDMA transport
  xprtrdma: Handle incoming backward direction RPC calls
  xprtrdma: Add support for sending backward direction RPC replies
  xprtrdma: Pre-allocate Work Requests for backchannel
  xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers
  SUNRPC: Abstract backchannel operations
  SUNRPC: xprt_complete_bc_request must also decrement the free slot count
  SUNRPC: Fix a backchannel deadlock
  SUNRPC: Fix a backchannel race
  SUNRPC: Clean up allocation and freeing of back channel requests
  xprtrdma: Replace send and receive arrays
  xprtrdma: Refactor reply handler error handling
  xprtrdma: Wait before destroying transport's queue pair
  xprtrdma: Remove completion polling budgets
  ...

9 years ago  Merge branch 'topic/uek-4.1/sparc' of git://ca-git.us.oracle.com/linux-uek into uek...
Santosh Shilimkar [Thu, 17 Sep 2015 21:16:51 +0000 (14:16 -0700)]
Merge branch 'topic/uek-4.1/sparc' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/sparc' of git://ca-git.us.oracle.com/linux-uek:
  sparc64: vdso: simplify cpu_relax
  vdso: replace current_thread_info when building vDSO rather than diking it out
  sparc64, vdso: Add gettimeofday() and clock_gettime().
  sparc64, vdso: sparc64 vDSO implementation.

9 years ago  Merge branch 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek into...
Santosh Shilimkar [Thu, 17 Sep 2015 21:16:40 +0000 (14:16 -0700)]
Merge branch 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek:
  bonding: If IP route look-up to send an ARP fails, mark in bonding structure as no ARP sent.
  xen/fpu: stts() before the local_irq_enable(), and clts() after the local_irq_disable().
  Revert "x86, fpu: Avoid possible error in math_state_restore()"

9 years ago  rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats
Sowmini Varadhan [Fri, 11 Sep 2015 20:48:48 +0000 (16:48 -0400)]
rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats

Many commonly used functions like getifaddrs() invoke RTM_GETLINK
to dump the interface information, and do not need the
AF_INET6 statistics that are always returned by default
from rtnl_fill_ifinfo().

Computing the statistics can be an expensive operation that impacts
scaling, so it is desirable to avoid this if the information is
not needed.

This patch adds the RTEXT_FILTER_SKIP_STATS extended info flag that
can be passed with netlink_request() to avoid statistics computation
for the ifinfo path.
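
As a rough illustration of how a caller might ask for a link dump without
statistics, the sketch below packs an RTM_GETLINK dump request whose
IFLA_EXT_MASK attribute carries RTEXT_FILTER_SKIP_STATS. It only builds the
bytes and does not open a netlink socket. The helper name
build_getlink_request() is hypothetical, and the numeric constants are the
values found in the mainline uapi headers; check them against your tree.

```python
import struct

# Values from linux/netlink.h, linux/rtnetlink.h and linux/if_link.h
# (assumed to match the tree in use).
RTM_GETLINK = 18
NLM_F_REQUEST = 0x01
NLM_F_DUMP = 0x300
IFLA_EXT_MASK = 29
RTEXT_FILTER_SKIP_STATS = 1 << 3
AF_UNSPEC = 0


def build_getlink_request(ext_mask, seq=1):
    """Return the bytes of an RTM_GETLINK dump request with IFLA_EXT_MASK."""
    # struct ifinfomsg: family, pad, type, index, flags, change (16 bytes)
    ifinfo = struct.pack("=BBHiII", AF_UNSPEC, 0, 0, 0, 0, 0)
    # struct rtattr (len, type) followed by the 32-bit mask (8 bytes)
    attr = struct.pack("=HHI", 8, IFLA_EXT_MASK, ext_mask)
    payload = ifinfo + attr
    # struct nlmsghdr: len, type, flags, seq, pid (16 bytes)
    header = struct.pack("=IHHII", 16 + len(payload), RTM_GETLINK,
                         NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + payload


request = build_getlink_request(RTEXT_FILTER_SKIP_STATS)
```

The resulting buffer could be written to an AF_NETLINK/NETLINK_ROUTE socket;
a kernel without this patch would be expected to ignore the unknown mask bit
and return statistics as before.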

Orabug: 21857538

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years ago  bonding: If IP route look-up to send an ARP fails, mark in bonding structure as no...
Rama Nichanamatlu [Tue, 15 Sep 2015 19:34:46 +0000 (12:34 -0700)]
bonding: If IP route look-up to send an ARP fails, mark in bonding structure as no ARP sent.

During the creation of VLANs atop bonding, the underlying interfaces are
made part of the VLANs, and at the same time the bonding driver becomes
aware that VLANs exist above it. It therefore consults IP routing for every
ARP to be sent, to determine the route that tells the bonding driver the
correct VLAN tag to attach to the outgoing ARP packet. However, during VLAN
creation there is a window between the vlan driver putting the underlying
interface into the default vlan and putting it into the actual vlan. If the
bonding driver consults IP for a route in that window, IP fails to provide
a correct route and the bonding driver drops the ARP packet. When the ARP
monitor comes around next time, it sees no ARP response and fails over to
the next available slave. To prevent this false fail-over, when the bonding
driver fails to send an ARP it records in its private structure, bonding{},
that no ARP response is to be expected; when the ARP monitor comes around
next time, ARP sending is tried again.
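
The decision described above can be modelled in a few lines. This is a toy
sketch, not the bonding driver's code: the Slave class, the arp_sent marker
and both function names are hypothetical stand-ins for whatever the driver
keeps in its private structure.

```python
class Slave:
    def __init__(self):
        self.arp_sent = False    # hypothetical marker: did the last ARP go out?
        self.reply_seen = False  # did an ARP reply arrive since then?


def send_arp(slave, route_ok):
    """Try to send an ARP; a failed route look-up means nothing is sent."""
    slave.arp_sent = route_ok
    slave.reply_seen = False


def monitor_should_fail_over(slave):
    """ARP monitor tick: only a missing reply to a sent ARP counts."""
    if not slave.arp_sent:
        return False  # nothing was sent, so a missing reply proves nothing
    return not slave.reply_seen
```

With this rule a dropped ARP during the VLAN set-up window leaves the slave
active, and the next monitor interval simply tries to send again.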

(this is same as commit 7cdd940ee8d9e25c942f5479410a7d2d6ac38d09)

Orabug: 21844825

Signed-off-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Signed-off-by: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years ago  uek-rpm: build: sparc: Build sparc headers
Natalya Naumova [Thu, 17 Sep 2015 15:31:30 +0000 (08:31 -0700)]
uek-rpm: build: sparc: Build sparc headers

Also drop the accidental x86 firmware build which sparc carried
forward from the previous port.

Signed-off-by: Natalya Naumova <natalya.naumova@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years ago  Merge branch 'topic/uek-4.1/ofed.rds-p2' into topic/uek-4.1/ofed
Mukesh Kacker [Thu, 17 Sep 2015 00:08:31 +0000 (17:08 -0700)]
Merge branch 'topic/uek-4.1/ofed.rds-p2' into topic/uek-4.1/ofed

* topic/uek-4.1/ofed.rds-p2:
  RDS: change spin_lock to spin_lock_bh
  rds: add busy_list only when fmr allocated successfully
  rds: free ib_device related resource
  rds: srq initialization and cleanup

9 years ago  RDS: change spin_lock to spin_lock_bh
Wengang Wang [Tue, 8 Sep 2015 02:01:40 +0000 (10:01 +0800)]
RDS: change spin_lock to spin_lock_bh

softirq can occur when holding rds_ib_mr_pool.busy_lock

Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff815167e3>] xen_hvm_callback_vector+0x13/0x20
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff81040fe9>] ? __ticket_spin_lock+0x19/0x20
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff8150cfae>] ? _raw_spin_lock+0xe/0x20
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03d05fa>] ? rds_ib_free_mr+0x3a/0x180 [rds_rdma]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03898f0>] ? rds_destroy_mr+0xb0/0xc0 [rds]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa0389a08>] ? rds_rdma_unuse+0xd8/0x100 [rds]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa0384660>] ? rds_recv_local+0x180/0x310 [rds]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03850ed>] ? rds_recv_incoming+0x7d/0x290 [rds]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03cb194>] ? rds_ib_process_recv+0x2b4/0x340 [rds_rdma]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03cccb2>] ? rds_ib_recv_cqe_handler+0x152/0x220 [rds_rdma]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03c7fc6>] ? poll_cq+0x66/0xe0 [rds_rdma]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03c80d9>] ? rds_ib_rx+0x99/0x210 [rds_rdma]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03c82da>] ? rds_ib_tasklet_fn_recv+0x3a/0x50 [rds_rdma]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff8107568d>] ? tasklet_action+0xcd/0x110
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff81075117>] ? __do_softirq+0xb7/0x210
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff815166bc>] ? call_softirq+0x1c/0x30
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff810173b5>] ? do_softirq+0x65/0xa0
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff81074f1d>] ? irq_exit+0xbd/0xe0
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff812fa735>] ? xen_evtchn_do_upcall+0x35/0x50
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff815167e3>] ? xen_hvm_callback_vector+0x13/0x20
Mar 29 21:53:11 scac10db01vm03 kernel: <EOI>  [<ffffffff81040fdd>] ? __ticket_spin_lock+0xd/0x20
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff8150cfae>] ? _raw_spin_lock+0xe/0x20
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03d096f>] ? rds_ib_alloc_fmr+0xff/0x530 [rds_rdma]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa03d0e52>] ? rds_ib_get_mr+0xb2/0x190 [rds_rdma]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa038a0f1>] ? __rds_rdma_map+0x241/0x360 [rds]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa038a377>] ? rds_get_mr+0x57/0x60 [rds]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffffa0380360>] ? rds_setsockopt+0x160/0x250 [rds]
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff812048fb>] ? selinux_socket_setsockopt+0x4b/0x60
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff81429e8f>] ? sys_setsockopt+0x7f/0xe0
Mar 29 21:53:11 scac10db01vm03 kernel: [<ffffffff81515482>] ? system_call_fastpath+0x16/0x1b

We need to prevent softirqs from running while busy_lock is held in rds_ib_alloc_fmr().
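
The trace shows process context (rds_ib_alloc_fmr) holding busy_lock when a
softirq on the same CPU (rds_ib_free_mr) tries to take it. The toy model
below walks through that sequence on a single CPU, with and without softirqs
disabled. It is only an illustration of the ordering; simulate() and the
event strings are hypothetical and no real locking is performed.

```python
def simulate(use_spin_lock_bh):
    """One CPU: process context takes busy_lock, then an interrupt arrives."""
    events = ["process: takes busy_lock"]
    # spin_lock_bh() takes the lock and also disables softirqs on this CPU.
    softirqs_disabled = use_spin_lock_bh

    # On interrupt exit the kernel runs pending softirqs unless they are
    # disabled on this CPU.
    if not softirqs_disabled:
        events.append("softirq: spins on busy_lock, which this CPU already holds")
        return events, "deadlock"

    events.append("softirq: deferred until softirqs are re-enabled")
    events.append("process: releases busy_lock")
    events.append("softirq: takes busy_lock and completes")
    return events, "ok"
```

simulate(False) ends in the self-deadlock seen in the trace, while
simulate(True) lets the process-context holder finish first.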

Orabug: 21795851

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Chien-Hua Yen <chien.yen@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years ago  rds: add busy_list only when fmr allocated successfully
Wengang Wang [Mon, 7 Sep 2015 08:42:44 +0000 (16:42 +0800)]
rds: add busy_list only when fmr allocated successfully

The rdma layer ibmr is always added to the busy list of the pool after
its memory is allocated. If the lower layer fmr allocation then fails,
the ibmr should be removed from the busy list before its memory is
freed, but it was not. The freed ibmr is thus left on the busy list,
and the busy list ends up in an inconsistent state.

The fix is to add the ibmr to busy_list only when the fmr is allocated
successfully.
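
A minimal sketch of the ordering the fix describes, using plain Python
objects in place of the kernel structures. Pool, FmrAllocError, alloc_ibmr()
and failing_fmr_alloc() are hypothetical names for illustration only.

```python
class FmrAllocError(Exception):
    """Stands in for a failed lower-layer fmr allocation."""


class Pool:
    def __init__(self):
        self.busy_list = []


def alloc_ibmr(pool, fmr_alloc):
    """Allocate an ibmr; publish it on busy_list only once the fmr exists."""
    ibmr = {"fmr": None}
    try:
        ibmr["fmr"] = fmr_alloc()
    except FmrAllocError:
        return None  # ibmr is dropped and was never visible on busy_list
    pool.busy_list.append(ibmr)
    return ibmr


def failing_fmr_alloc():
    raise FmrAllocError("no fmr available")
```

Because the entry is published last, the error path has nothing to unlink
and cannot leave a freed object on the list.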

Orabug: 21795840

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years ago  rds: free ib_device related resource
Wengang Wang [Mon, 7 Sep 2015 07:35:29 +0000 (15:35 +0800)]
rds: free ib_device related resource

There is a (rare) case where an ib_device gets removed (driver unload)
while the upper layer (RDS) still holds references to resources
allocated from this ib_device.

The result is either a memory leak or a crash when the freed memory is
accessed.

The resources are mainly rds_ib_mr objects; in-use rds_ib_mr (rds_mr)
objects are stored in rds_sock.rs_rdma_keys.

The fix is to
1) link all in-use rds_ib_mr objects to the pool
2) link the rds_sock to the rds_ib_mr
3) have the destruction of the rds_ib_mr_pool free the rds_ib_mrs
   by calling rds_rdma_drop_keys()
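
The three steps can be pictured with a small bookkeeping model. This is a
sketch under simplifying assumptions, not the RDS code: Sock, Pool, get_mr()
and destroy_pool() are hypothetical, and the dictionary removal merely
stands in for what rds_rdma_drop_keys() would do.

```python
class Sock:
    def __init__(self):
        self.rs_rdma_keys = {}  # in-use MRs, keyed as in rds_sock.rs_rdma_keys


class Pool:
    def __init__(self):
        self.in_use = []  # step 1: the pool knows every in-use MR


def get_mr(pool, sock, key):
    mr = {"key": key, "sock": sock, "freed": False}  # step 2: MR points at its sock
    sock.rs_rdma_keys[key] = mr
    pool.in_use.append(mr)
    return mr


def destroy_pool(pool):
    """Step 3: pool destruction releases every MR still referenced by a sock."""
    for mr in list(pool.in_use):
        del mr["sock"].rs_rdma_keys[mr["key"]]  # stand-in for rds_rdma_drop_keys()
        mr["freed"] = True
    pool.in_use.clear()
```

After destroy_pool() no socket holds a key that points into the removed
device, which is the property the leak and the crash both violate.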

Orabug: 21795824

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Acked-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years ago  rds: srq initialization and cleanup
Wengang Wang [Mon, 7 Sep 2015 06:12:42 +0000 (14:12 +0800)]
rds: srq initialization and cleanup

RDS has the following two problems related to shared receive queues:

1) srq initialization:
  When a new IB dev is registered to device_list, the .add methods
of clients in client_list are called to do some initialization work.

For RDS, rds_ib_add_one() is called. srq related things should be
fully initialized here since this is the last chance before the srq is used.

However, the code only allocates memory and seems to rely on
rds_ib_srqs_init() to initialize it later. But in fact,
rds_ib_srqs_init() is not called unless the call path is the insmod
of rds_rdma.

2) srq cleanup:
  When removing the rds_rdma module, srqs for all rds_ib_device should
be cleaned up. However, the code only frees the rds_ib_device.srq memory
and does not clean up the memory pointed to by the pointers embedded
inside it. This leads to a resource leak.

This patch fixes the above two problems.

Orabug: 21795815

Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
Reviewed-by: Chien Yen <chien.yen@oracle.com>
Signed-off-by: Guangyu Sun <guangyu.sun@oracle.com>
Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years ago  uek-rpm: configs: Adjust config for new rpcrdma.ko module
Chuck Lever [Mon, 31 Aug 2015 21:54:04 +0000 (15:54 -0600)]
uek-rpm: configs: Adjust config for new rpcrdma.ko module

Upstream merged svcrdma.ko and xprtrdma.ko into a single module,
rpcrdma.ko, in order to support bi-directional RPC/RDMA. The
old modules were controlled by separate Kconfig options, which
have been replaced by a single config option controlling both.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years ago  Merge branch 'uek4-ga-nfs-rdma' of git://ca-git.us.oracle.com/linux-cel-public into...
Santosh Shilimkar [Wed, 16 Sep 2015 21:11:14 +0000 (14:11 -0700)]
Merge branch 'uek4-ga-nfs-rdma' of git://ca-git.us.oracle.com/linux-cel-public into topic/uek-4.1/nfs-rdma

* 'uek4-ga-nfs-rdma' of git://ca-git.us.oracle.com/linux-cel-public: (71 commits)
  xprtrdma: Add class for RDMA backwards direction transport
  svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies
  svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls
  svcrdma: Add svc_rdma_get_context() API that is allowed to fail
  svcrdma: Define maximum number of backchannel requests
  NFS: Enable client side NFSv4.1 backchannel to use other transports
  svcrdma: Add backward direction service for RPC/RDMA transport
  xprtrdma: Handle incoming backward direction RPC calls
  xprtrdma: Add support for sending backward direction RPC replies
  xprtrdma: Pre-allocate Work Requests for backchannel
  xprtrdma: Pre-allocate backward rpc_rqst and send/receive buffers
  SUNRPC: Abstract backchannel operations
  SUNRPC: xprt_complete_bc_request must also decrement the free slot count
  SUNRPC: Fix a backchannel deadlock
  SUNRPC: Fix a backchannel race
  SUNRPC: Clean up allocation and freeing of back channel requests
  xprtrdma: Replace send and receive arrays
  xprtrdma: Refactor reply handler error handling
  xprtrdma: Wait before destroying transport's queue pair
  xprtrdma: Remove completion polling budgets
  ...

9 years ago  xen/fpu: stts() before the local_irq_enable(), and clts() after the local_irq_disable().
Konrad Rzeszutek Wilk [Wed, 16 Sep 2015 18:23:35 +0000 (11:23 -0700)]
xen/fpu: stts() before the local_irq_enable(), and clts() after the local_irq_disable().

The Linux scheduler FPU allocation for a new process is a two-stage
mechanism prior to Linux v4.2. When a task is scheduled that hasn't
demonstrated a need for the FPU, the kernel sets CR0.TS=1. With
CR0.TS=1, any FPU operation that the task attempts will trap (the CPU
won't execute it). This allows the OS to lazily allocate, for the
'struct task', the memory where the FPU registers will be saved/restored.

When the task performs an FPU operation (MMX/SSE/etc) for the first time
with CR0.TS=1 set, the hardware triggers an #NM exception
(do_device_not_available) and the exception handler
(math_state_restore) sets up the memory for the task's FPU
registers. It then returns to the application, allowing it to
execute the FPU operation (now with CR0.TS=0). And so on.

Thereafter, whenever a task that has used the FPU is loaded, CR0.TS
is cleared (0) so that the task can execute FPU operations unhindered.
Any tasks that are scheduled that haven't used the FPU get
CR0.TS set (1). The kernel uses the PF_USED_MATH flag to figure
this out.

The below example should help in cementing this knowledge.

For simplicity we assume the guest/baremetal uses the lazy mechanism,
not the eager one. That makes 'switch_fpu_prepare' (called by schedule()) effectively:

if (previous task had PF_USED_MATH set)
   stts (CR0.TS=1)
else
   ;

We also ignore the case where the task has used the FPU more than
five times, where things are done a bit differently.

The time diagram looks great at 132x42.

Let's assume that we have two tasks, A and B. Neither has used
the FPU. This is on PVHVM (or baremetal):

CR0.TS=1                       CR0.TS=1                 CR0.TS=0                   CR0.TS=1                       CR0.TS=0
------------------------------------------------------------------------------------+--------+-------------------+-------+
task A | #NM                     |task B|                    |taskB |               | task A |                   |taskA  |
MMX    |math_state_restore       |      |                    |      |               |        |                   |       |
op     |  \- fpu_init            |      |                    |      |               |        |                   |       |
       |       \- .. schedule()  |      |                    |      |               |        |                   |       |
       |           [swap task B] |      |                    |      |               |        |                   |       |
       |           [since task A |      |                    |      |               |        |                   |       |
       |            hadn't set   |      |                    |      |               |        |                   |       |
       |            PF_USED_MATH |      |                    |      |               |        |                   |       |
       |            we don't muck|      |                    |      |               |        |                   |       |
       |            with CR0.TS] |      |                    |      |               |        |                   |       |
       |                         |MMX op|                    |      |               |        |                   |       |
       |                         |      |#NM                 |      |               |        |                   |       |
       |                         |      |math_state_restore  |      |               |        |                   |       |
       |                         |      | fpu_init worked    |      |               |        |                   |       |
       |                         |      |  clts()            |      |               |        |                   |       |
       |                         |      |task_B->flags |=    |      |               |        |                   |       |
       |                         |      |  PF_USED_MATH      |      |               |        |                   |       |
       |                         |      |  return;           |      |               |        |                   |       |
       |                         |      |                    |syscall|              |        |                   |       |
       |                         |      |                    |      |schedule()     |        |                   |       |
       |                         |      |                    |      |[swap task A]  |        |                   |       |
       |                         |      |                    |      |[taskB has     |        |                   |       |
       |                         |      |                    |      | PF_USED_MATH] |        |                   |       |
       |                         |      |                    |      |[so CR0.TS=1]  |        |                   |       |
       |                         |      |                    |      |  task A runs  |        |                   |       |
       |                         |      |                    |      |               |MMX op  |                   |       |
       |                         |      |                    |      |               |        |#NM                |       |
       |                         |      |                    |      |               |        | fpu_init works    |       |
       |                         |      |                    |      |               |        | clts()            |       |
       |                         |      |                    |      |               |        |  taskA->flags |=  |       |
       |                         |      |                    |      |               |        |  PF_USED_MATH     |       |
       |                         |      |                    |      |               |        |  return           |       |
       |                         |      |                    |      |               |        |                   |MMX op |

However, the Xen PV ABI chose to take a shortcut. When the Xen hypervisor
receives an #NM it immediately clears the CR0.TS bit and executes the PV
kernel's do_device_not_available handler. That would be OK if the exception
handler immediately did 'clts' (CR0.TS=0), which it does 99% of the time,
except for that one time when:

                 * does a slab alloc which can sleep
                 */
                if (init_fpu(tsk)) {

which can end up calling 'schedule()' (and swapping to another task)
with the CR0.TS bit cleared.

The scheduler can schedule in an application that uses the FPU and,
since nobody has marked the task with PF_USED_MATH, we end up
reusing the FPU registers across all the tasks. Ouch.
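
To make the difference concrete, here is a toy state model of the sequence:
task A faults on its first FPU operation, the handler sleeps in the
allocator before doing clts() or setting PF_USED_MATH, and task B then
issues an FPU operation. It is a sketch of the lazy case only; simulate()
and its variable names are hypothetical.

```python
def simulate(xen_pv):
    """Return CR0.TS as task B sees it, and whether B's FPU op traps."""
    cr0_ts = 1
    task_a_used_math = False  # PF_USED_MATH is not yet set on task A

    # Task A: FPU op with CR0.TS=1 raises #NM.
    if xen_pv:
        cr0_ts = 0  # the hypervisor clears TS before calling the guest handler

    # The handler's allocation sleeps, so schedule() runs here.
    # switch_fpu_prepare(), lazy case: only set TS if the previous task
    # had PF_USED_MATH, which task A does not have yet.
    if task_a_used_math:
        cr0_ts = 1

    # Task B: FPU op traps only if TS is set.
    return {"cr0_ts_when_b_runs": cr0_ts, "b_traps": cr0_ts == 1}
```

On baremetal/PVHVM task B still traps and gets its own state set up; on Xen
PV it does not trap and runs on whatever is in the FPU registers.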

CR0.TS=1                       CR0.TS=0                 CR0.TS=0                   CR0.TS=0                       CR0.TS=0
[but Xen sets it to
CR0.TS=0 and calls
Linux #NM:]
------------------------------------------------------------------------------------+--------+-------------------+-------+
task A | #NM                     |task B|                    |taskB |               | task A |                   |taskA  |
MMX    |math_state_restore       |      |                    |      |               |        |                   |       |
op     |  \- fpu_init            |      |                    |      |               |        |                   |       |
       |       \- .. schedule()  |      |                    |      |               |        |                   |       |
       |           [swap task B] |      |                    |      |               |        |                   |       |
       |           [since task A |      |                    |      |               |        |                   |       |
       |            hadn't set   |      |                    |      |               |        |                   |       |
       |            PF_USED_MATH |      |                    |      |               |        |                   |       |
       |            we don't muck|      |                    |      |               |        |                   |       |
       |            with CR0.TS] |      |                    |      |               |        |                   |       |
       |                         |MMX op|                    |      |               |        |                   |       |
       |                         |      |[no trap to Linux or|      |               |        |                   |       |
       |                         |      |Xen as CR0.TS=0]    |      |               |        |                   |       |
       |                         |      |                    |      |               |        |                   |       |
       |                         |      |And task B clobbers |      |               |        |                   |       |
       |                         |      |task A FPU registers|      |               |        |                   |       |
       |                         |      |(or in the generic  |      |               |        |                   |       |
       |                         |      |case whoever ran    |      |               |        |                   |       |
       |                         |      |before task B).     |syscall|              |        |                   |       |
       |                         |      |                    |      |schedule()     |        |                   |       |
       |                         |      |                    |      |[swap task A]  |        |                   |       |
       |                         |      |                    |      |[with task B]  |        |                   |       |
       |                         |      |                    |      |  task A runs  |        |                   |       |
       |                         |      |                    |      |               |MMX op  |                   |       |
       |                         |      |                    |      |               |        |[again, no trap to |       |
       |                         |      |                    |      |               |        | Xen or Linux b/c  |       |
       |                         |      |                    |      |               |        | CR0.TS=0 *1]      |       |
       |                         |      |                    |      |               |        |                   |MMX op |

The [*1] refers to the Xen scheduler. If any of the syscalls that the
user application called ended in the Linux kernel halt (xen_safe_halt)
routine, we would deschedule the guest VCPU.

When that VCPU is re-scheduled, Xen would set CR0.TS=1 again
so the #NM would function again.

Not pretty - and again - only happening if fpu_alloc() ends
up calling schedule().

Upstream (v4.2), Ingo's FPU rewrite (~296 patches) fixed this.

(Tests ran for 2 weeks, whereas before they would have failed within
two hours.)

Digging in, it was due to:

commit 0c8c0f03e3a292e031596484275c14cf39c0ab7a
Author: Dave Hansen <dave@sr71.net>
Date:   Fri Jul 17 12:28:11 2015 +0200

    x86/fpu, sched: Dynamically allocate 'struct fpu'

    The FPU rewrite removed the dynamic allocations of 'struct fpu'.
    But, this potentially wastes massive amounts of memory (2k per
    task on systems that do not have AVX-512 for instance).

    Instead of having a separate slab, this patch just appends the
    space that we need to the 'task_struct' which we dynamically
    allocate already.  This saves from doing an extra slab
    allocation at fork().

When the Xen hypervisor calls the PV guest's #NM handler
('do_device_not_available'), it does:

 fpu__restore(&current->thread.fpu); /* interrupts still off */
   |+- fpu__activate_curr (which just inits the already allocated space)
   |     \- memset(state, 0, xstate_size);
   |+- fpregs_activate
         \- stts()

So there is no call to 'schedule()' and no leaking of the FPU across
different tasks.

This patch changes (only for Xen PV guests) the state of
CR0.TS so that it is set whenever 'schedule()' may be called. If
'schedule()' is not called (fpu_alloc had no trouble getting
memory), we set CR0.TS back to zero (which may actually
not even be needed as we do that later as well).

Due to the wonder of paravirt and multicall batching, the
'stts' and 'clts' are not dispatched until arch_end_context_switch
is called (which is done in __switch_next, which 'schedule()' calls).

What that means is:
 - If fpu_alloc() (well, SLAB) ends up calling 'schedule()',
   CR0.TS will get set when 'schedule()' is ready to start
   the new thread.
 - If fpu_alloc() had no trouble and there was no need for
   'schedule()', then we flush out the multicall, effectively
   doing CR0.TS=1 followed by CR0.TS=0, followed by CR0.TS=0 again.
   The end result is the same.
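
The two cases can be sketched with a toy batch queue. This models only the
ordering described here, under the assumption that queued stts/clts calls
take effect when the batch is flushed; the Multicall class and handle_nm()
are hypothetical and do not correspond to kernel or Xen interfaces.

```python
class Multicall:
    def __init__(self):
        self.queue = []
        self.cr0_ts = 0  # Xen already cleared TS before calling the #NM handler

    def stts(self):
        self.queue.append(1)

    def clts(self):
        self.queue.append(0)

    def flush(self):
        # Queued operations take effect in order, only when flushed.
        for value in self.queue:
            self.cr0_ts = value
        self.queue.clear()


def handle_nm(alloc_sleeps):
    """Return (TS seen by the task switched to, or None; final TS)."""
    mc = Multicall()
    mc.stts()  # queued before interrupts are enabled
    ts_seen_by_next_task = None
    if alloc_sleeps:
        mc.flush()  # schedule() reaches arch_end_context_switch
        ts_seen_by_next_task = mc.cr0_ts
    mc.clts()  # queued after interrupts are disabled again
    mc.clts()  # the handler's usual clts later on
    mc.flush()
    return ts_seen_by_next_task, mc.cr0_ts
```

handle_nm(True) shows the next task starting with CR0.TS=1, so its first
FPU operation traps as intended; handle_nm(False) flushes 1, 0, 0 and ends
with CR0.TS=0 for the faulting task.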

P.S.
A multicall is a mechanism to put a bunch of hypercalls in one
hypercall. It can execute up to 32 hypercalls.

Oracle-Bug: 14768
Orabug: 20318090
Reported-and-Tested-by: Saar Maoz <Saar.Maoz@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years ago  Revert "x86, fpu: Avoid possible error in math_state_restore()"
Konrad Rzeszutek Wilk [Wed, 16 Sep 2015 18:23:29 +0000 (11:23 -0700)]
Revert "x86, fpu: Avoid possible error in math_state_restore()"

This reverts commit 5b5e1763859439f4733aafc7585b15b28ab94209.

The patch does not fix the underlying problem. The
patch "xen/fpu: stts() before the local_irq_enable(), and clts()
after the local_irq_disable()" fixes the issue.

Acked-by: Annie Li <annie.li@oracle.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years ago  uek-rpm: builds: sparc64: enable dtrace support
Allen Pais [Tue, 15 Sep 2015 12:15:13 +0000 (17:45 +0530)]
uek-rpm: builds: sparc64: enable dtrace support

  Enables dtrace support for sparc64 and also
  enables building with headers.

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years ago  Merge branch '4.1_sparc' of git://ca-git.us.oracle.com/linux-uek-apais into topic...
Santosh Shilimkar [Tue, 15 Sep 2015 15:36:47 +0000 (08:36 -0700)]
Merge branch '4.1_sparc' of git://ca-git.us.oracle.com/linux-uek-apais into topic/uek-4.1/sparc

* '4.1_sparc' of git://ca-git.us.oracle.com/linux-uek-apais:
  sparc64: vdso: simplify cpu_relax
  vdso: replace current_thread_info when building vDSO rather than diking it out
  sparc64, vdso: Add gettimeofday() and clock_gettime().
  sparc64, vdso: sparc64 vDSO implementation.

9 years ago  sparc64: vdso: simplify cpu_relax
Dave Kleikamp [Fri, 7 Aug 2015 21:34:03 +0000 (16:34 -0500)]
sparc64: vdso: simplify cpu_relax

Orabug: 20861959

Create a simpler cpu_relax for the VDSO object. The pause_3insn_patch
section introduces relocatable code that prevents the vdso from building.

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit edb2ed24d0e4a250ef432cf1fb7c0532817728d3)

9 years ago  vdso: replace current_thread_info when building vDSO rather than diking it out
Nick Alcock [Thu, 30 Jul 2015 17:06:29 +0000 (12:06 -0500)]
vdso: replace current_thread_info when building vDSO rather than diking it out

Orabug: 20861959

When building the userspace parts of the vDSO code, we have to dike out things
from the various kernel headers we use that generate register relocations,
since we cannot handle relocations in the vDSO.  The principal such thing is
current_thread_info(), which we used to dike out entirely -- but in the -rt
patchset, a lot of things in the headers reference this.  So, instead,
simply have current_thread_info() generate nonsense code that doesn't emit
a relocation, so that its users still compile (though they would never
work -- but that's not important, since they are never used).

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 4f9b3c6e7fe105ea04f89794007f97b43a63c897)

9 years ago  sparc64, vdso: Add gettimeofday() and clock_gettime().
Nick Alcock [Mon, 8 Dec 2014 13:42:37 +0000 (13:42 +0000)]
sparc64, vdso: Add gettimeofday() and clock_gettime().

This commit adds gettimeofday() and clock_gettime() entry points to the SPARC64
vDSO: in conjunction with a suitably-modified glibc this provides a speedup to
gettimeofday(), time() and some clock_gettime() calls on the order of
10--15x (the higher figure is for the coarse clock_gettime() clocks).

gettimeofday() and clock_gettime() use largely separate code paths: all
other approaches with less code duplication (e.g. doing all the work in
a struct timespec and only shifting it into a struct timeval before return)
turned out slower.
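
For illustration only, the rejected lower-duplication design (do the work in a struct timespec, then narrow it to a struct timeval before returning) could look roughly like the userspace sketch below. The helper name is hypothetical and this is not the vDSO code from this commit.

```c
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper: narrow a nanosecond-resolution timespec to a
 * microsecond-resolution timeval, as the rejected design would have
 * done just before returning from gettimeofday(). */
static struct timeval timespec_to_timeval_sketch(struct timespec ts)
{
	struct timeval tv;

	tv.tv_sec = ts.tv_sec;
	tv.tv_usec = ts.tv_nsec / 1000;	/* nanoseconds -> microseconds */
	return tv;
}
```

The extra division on every call is one plausible cost of sharing the code path, although the commit message does not say where the slowdown came from.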

Tested on %stick-capable machines only: %tick codepaths untested.

Orabug: 20861959
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit e8f9e5cfc9d297c7ebbc459b677ee0b8a3e45154)

9 years agosparc64, vdso: sparc64 vDSO implementation.
Nick Alcock [Mon, 8 Dec 2014 13:32:19 +0000 (13:32 +0000)]
sparc64, vdso: sparc64 vDSO implementation.

This commit adds a vDSO similar to that used on x86: in this commit, that vDSO
is empty bar the ELF note used by glibc to verify that it knows about this vDSO.
The vDSO's location is somewhat randomized, so, as a consequence, tends to
randomize the locations of other shared libraries too.  (The randomization
respects /proc/sys/kernel/randomize_va_space.)

It is derived from the implementation in recent kernels, in that it uses a C
generator to translate the vDSO shared library into C code and validate that it
contains no relocations and the like.

Notes for future improvement:

 - There is no support for a vDSO in 32-bit userspace yet.  This is just because
   I want to get the sparc64 version working first: the compat vDSO
   implementation adds significant complexity.

 - The vDSO randomization process is ugly: we are calling get_unmapped_area()
   twice, with a randomization in the middle.  Eventually,
   arch_get_unmapped_area() on SPARC64 should learn about PF_RANDOMIZE, as it
   has on other arches.

Orabug: 20861959
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
(cherry picked from commit 2da875e6f5781dd196e9f055cd53a3ac0d80aaaa)

9 years agoMerge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek...
Santosh Shilimkar [Fri, 11 Sep 2015 21:34:52 +0000 (14:34 -0700)]
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek:
  ib_core: Usermode FMR config params
  ib_core: User mode FMR fixes 2012-06-11
  ib/srp: Enable usermode FMR
  ib/iser: Enable usermode FMR
  ib/mlx4: Enable usermode FMR
  ib/core: Enable usermode FMR
  ib/core: init shared-pd ref count to 1, and add cleanup
  IB/Shared PD support from Oracle

9 years agoMerge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into...
Santosh Shilimkar [Fri, 11 Sep 2015 21:34:34 +0000 (14:34 -0700)]
Merge branch 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/rpm-build' of git://ca-git.us.oracle.com/linux-uek:
  sparc64: enable firmware build in kernel spec
  sparc64: enable usb xhci/ehci pci configs
  sparc64: enable a few configs required for proxyt
  sparc64:perf: fix perf build crash
  sparc64: enable dtrace support for sparc64 in the spec file
  sparc64: kernel-uek.spec update to support sparc.
  sparc64: uek4 debug config for sparc64
  sparc64: uek4 config for sparc64
  uek-rpm: config: add turbostat into kernel pakackage for OL6 and OL7
  uek-rom: config: Unset CONFIG_NFS_USE_LEGACY_DNS for OL7
  uek-rpm: configs: Enbale X86_SYSFB on OL7 too

9 years agoMerge branch 'topic/uek-4.1/sparc' of git://ca-git.us.oracle.com/linux-uek into uek...
Santosh Shilimkar [Fri, 11 Sep 2015 21:34:13 +0000 (14:34 -0700)]
Merge branch 'topic/uek-4.1/sparc' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/sparc' of git://ca-git.us.oracle.com/linux-uek: (25 commits)
  lib/iommu-common.c: do not use 0xffffffffffffffffl for computing align_mask
  sparc64: use ENTRY/ENDPROC in VISsave
  SPARC64: PORT LDOMS TO UEK4
  Fix incorrect ASI_ST_BLKINIT_MRU_S value
  sparc64: perf: Use UREG_FP rather than UREG_I6
  sparc64: perf: Add sanity checking on addresses in user stack
  sparc64: Convert BUG_ON to warning
  sparc: perf: Disable pagefaults while walking userspace stacks
  sparc: time: Replace update_persistent_clock() with CONFIG_RTC_SYSTOHC
  PCI: Set under_pref for mem64 resource of pcie device
  sparc/PCI: Add mem64 resource parsing for root bus
  PCI: Add pci_bus_addr_t
  sparc64: Fix userspace FPU register corruptions.
  sparc64: using 2048 as default for number of CPUS (cherry picked from commit 578ddb2512a5c908cd17ef8cbc43ff78dd399afd)
  sparc64: iommu-common build error fix (cherry picked from commit accb4c6276793b991c6382bf57a58b40ea17eb11)
  sparc64: fix Setup sysfs to mark LDOM sockets build error (cherry picked from commit 59be02427bfcac6c904ddd1374c35d63155b82d4)
  sparc64: mmap fixed and shared
  sparc64: restore TIF_FREEZE flag for sparc
  sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly
  sparc: Revert generic IOMMU allocator.
  ...

Conflicts:
arch/sparc/lib/VISsave.S
drivers/block/Kconfig

9 years agoMerge branch 'topic/uek-4.1/dtrace' of git://ca-git.us.oracle.com/linux-uek into...
Santosh Shilimkar [Fri, 11 Sep 2015 21:32:59 +0000 (14:32 -0700)]
Merge branch 'topic/uek-4.1/dtrace' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/dtrace' of git://ca-git.us.oracle.com/linux-uek:
  kallsyms: unbreak kallmodsyms after CONFIG_KALLMODSYMS addition
  kallsyms: de-ifdef kallmodsyms
  dtrace: use syscall_get_nr() to obtain syscall number

9 years agoMerge branch 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek into uek...
Santosh Shilimkar [Fri, 11 Sep 2015 21:32:52 +0000 (14:32 -0700)]
Merge branch 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/ocfs2' of git://ca-git.us.oracle.com/linux-uek:
  add OCFS2_LOCK_RECURSIVE arg_flags to ocfs2_cluster_lock() to prevent hang
  ocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dio
  ocfs2_iop_set/get_acl() are also called from the VFS so we must take inode lock
  BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed) tripped in ocfs2_ci_checkpointed

9 years agoMerge branch 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com...
Santosh Shilimkar [Fri, 11 Sep 2015 21:32:37 +0000 (14:32 -0700)]
Merge branch 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/upstream-cherry-picks' of git://ca-git.us.oracle.com/linux-uek:
  NVMe: Setup max hardware sector count to 512KB
  intel_pstate: enable HWP per CPU

9 years agoMerge branch 'topic/uek-4.1/ofed.mlnx2.4-p3.orclFixes' into topic/uek-4.1/ofed
Mukesh Kacker [Fri, 11 Sep 2015 17:15:28 +0000 (10:15 -0700)]
Merge branch 'topic/uek-4.1/ofed.mlnx2.4-p3.orclFixes' into topic/uek-4.1/ofed

* topic/uek-4.1/ofed.mlnx2.4-p3.orclFixes:
  ib_core: Usermode FMR config params
  ib_core: User mode FMR fixes 2012-06-11
  ib/srp: Enable usermode FMR
  ib/iser: Enable usermode FMR
  ib/mlx4: Enable usermode FMR
  ib/core: Enable usermode FMR
  ib/core: init shared-pd ref count to 1, and add cleanup
  IB/Shared PD support from Oracle

9 years agoib_core: Usermode FMR config params
Dotan Barak [Wed, 11 Jul 2012 12:14:40 +0000 (15:14 +0300)]
ib_core: Usermode FMR config params

Orabug: 21517998

Signed-off-by: Arun Kaimalettu <gotoarunk@gmail.com>
(Ported from UEK2/Mellanox OFED1.5.5R2)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoib_core: User mode FMR fixes 2012-06-11
Dotan Barak [Wed, 11 Jul 2012 09:45:12 +0000 (03:45 -0600)]
ib_core: User mode FMR fixes 2012-06-11

Orabug: 21517998

Signed-off-by: Arun Kaimalettu <gotoarunk@gmail.com>
(Ported from UEK2/Mellanox OFED 1.5.5R2)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoib/srp: Enable usermode FMR
Dotan Barak [Wed, 22 Feb 2012 13:00:21 +0000 (15:00 +0200)]
ib/srp: Enable usermode FMR

Orabug: 21517998

Signed-off-by: Arun Kaimalettu <arun.kaimalettu@oracle.com>
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
(Ported from UEK2/Mellanox OFED1.5.5R2)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoib/iser: Enable usermode FMR
Dotan Barak [Wed, 22 Feb 2012 12:59:18 +0000 (14:59 +0200)]
ib/iser: Enable usermode FMR

Orabug: 21517998

Signed-off-by: Arun Kaimalettu <arun.kaimalettu@oracle.com>
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
(Ported from UEK2/Mellanox OFED 1.5.5R2)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoib/mlx4: Enable usermode FMR
Dotan Barak [Wed, 22 Feb 2012 12:27:55 +0000 (14:27 +0200)]
ib/mlx4: Enable usermode FMR

Orabug: 21517998

Signed-off-by: Arun Kaimalettu <arun.kaimalettu@oracle.com>
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
(Ported from UEK2/Mellanox OFED 1.5.5R2)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoib/core: Enable usermode FMR
Dotan Barak [Wed, 22 Feb 2012 12:23:21 +0000 (14:23 +0200)]
ib/core: Enable usermode FMR

Signed-off-by: Arun Kaimalettu <arun.kaimalettu@oracle.com>
Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il>
(Ported from UEK2/OFED 1.5.5)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoib/core: init shared-pd ref count to 1, and add cleanup
Arun Kaimalettu [Mon, 18 Jul 2011 12:21:34 +0000 (15:21 +0300)]
ib/core: init shared-pd ref count to 1, and add cleanup

When shpd is created it is already referred to by parent 'pd',
so shpd->shared should be '1' initially (and not '0');
otherwise, the 'shpd' memory may get freed/reallocated
while it is still being referred to by one last pd.
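
A minimal userspace sketch of the reference-counting rule described above, assuming a simplified stand-in structure (struct shpd_sketch and its helpers are hypothetical, not the ib_core types):

```c
#include <stdlib.h>

/* Hypothetical stand-in for the shared PD object; not the ib_core struct. */
struct shpd_sketch {
	int shared;	/* number of PDs currently referring to this object */
};

/* Create the object already holding the parent's reference. */
static struct shpd_sketch *shpd_create(void)
{
	struct shpd_sketch *s = malloc(sizeof(*s));

	if (s)
		s->shared = 1;	/* the creating PD counts as the first user */
	return s;
}

/* Take an additional reference for another PD sharing the object. */
static void shpd_get(struct shpd_sketch *s)
{
	s->shared++;
}

/* Drop one reference; free on the last one. Returns 1 if freed. */
static int shpd_put(struct shpd_sketch *s)
{
	if (--s->shared == 0) {
		free(s);
		return 1;
	}
	return 0;
}
```

Had the count started at 0, a second PD taking and then dropping its reference would bring the count back to 0 and free the object while the parent PD still pointed at it.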

Additionally, add shared-pd cleanup to ucontext cleanup flow.

Orabug: 21496696

Signed-off-by: Arun Kaimalettu <arun.kaimalettu@oracle.com>
Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
(Ported from UEK2/OFED 1.5.5)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoIB/Shared PD support from Oracle
Eli Cohen [Sun, 5 Jun 2011 12:36:46 +0000 (15:36 +0300)]
IB/Shared PD support from Oracle

Orabug: 21496696

Signed-off-by: Arun Kaimalettu <arun.kaimalettu@oracle.com>
Signed-off-by: Eli Cohen <eli@mellanox.co.il>
(Ported from UEK2/OFED 1.5.5)

Signed-off-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agosparc64: enable firmware build in kernel spec
Allen Pais [Thu, 10 Sep 2015 15:10:37 +0000 (20:40 +0530)]
sparc64: enable firmware build in kernel spec

(cherry picked from commit 7d7e426ca7af65f21e91aacf6eceaff4ccb946bb)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosparc64: enable usb xhci/ehci pci configs
Allen Pais [Thu, 6 Aug 2015 13:30:42 +0000 (19:00 +0530)]
sparc64: enable usb xhci/ehci pci configs

(cherry picked from commit 180ab995c09a50782ae41969b4af22adadf3687d)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosparc64: enable a few configs required for proxyt
Allen Pais [Thu, 6 Aug 2015 12:27:33 +0000 (17:57 +0530)]
sparc64: enable a few configs required for proxyt

Enabled the following:
(cherry picked from commit 7196e2382fb95198803de686bb04bc25f2ce9075)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosparc64:perf: fix perf build crash
Allen Pais [Mon, 18 May 2015 13:36:51 +0000 (19:06 +0530)]
sparc64:perf: fix perf build crash

fix "the `-j' option requires a positive integral argument"
crash for sparc build.

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit c96761cf53294b05c6c8e855e2ec7be6afed0f86)

Conflicts:
tools/perf/Makefile
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosparc64: enable dtrace support for sparc64 in the spec file
Allen Pais [Thu, 14 May 2015 17:30:21 +0000 (23:00 +0530)]
sparc64: enable dtrace support for sparc64 in the spec file

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 701b65eba862955a458ab6b1ebb4f82125d12d44)
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosparc64: kernel-uek.spec update to support sparc.
Allen Pais [Wed, 6 May 2015 14:32:43 +0000 (20:02 +0530)]
sparc64: kernel-uek.spec update to support sparc.

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 4ef9d973f49d3e451c56b32f2c7bdf1473d77d84)
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosparc64: uek4 debug config for sparc64
Allen Pais [Fri, 8 May 2015 13:32:58 +0000 (19:02 +0530)]
sparc64: uek4 debug config for sparc64

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 3a9940a9ebadb5e3c0e9722658a47ac8438acdb5)
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosparc64: uek4 config for sparc64
Allen Pais [Wed, 6 May 2015 14:29:24 +0000 (19:59 +0530)]
sparc64: uek4 config for sparc64

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 025a9097b06f5ef7a3a0a9333699c2623496dbce)
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoMerge branch '4.1_sparc' of git://ca-git.us.oracle.com/linux-uek-apais into topic...
Santosh Shilimkar [Fri, 11 Sep 2015 16:34:52 +0000 (09:34 -0700)]
Merge branch '4.1_sparc' of git://ca-git.us.oracle.com/linux-uek-apais into topic/uek-4.1/sparc

* '4.1_sparc' of git://ca-git.us.oracle.com/linux-uek-apais: (25 commits)
  lib/iommu-common.c: do not use 0xffffffffffffffffl for computing align_mask
  sparc64: use ENTRY/ENDPROC in VISsave
  SPARC64: PORT LDOMS TO UEK4
  Fix incorrect ASI_ST_BLKINIT_MRU_S value
  sparc64: perf: Use UREG_FP rather than UREG_I6
  sparc64: perf: Add sanity checking on addresses in user stack
  sparc64: Convert BUG_ON to warning
  sparc: perf: Disable pagefaults while walking userspace stacks
  sparc: time: Replace update_persistent_clock() with CONFIG_RTC_SYSTOHC
  PCI: Set under_pref for mem64 resource of pcie device
  sparc/PCI: Add mem64 resource parsing for root bus
  PCI: Add pci_bus_addr_t
  sparc64: Fix userspace FPU register corruptions.
  sparc64: using 2048 as default for number of CPUS (cherry picked from commit 578ddb2512a5c908cd17ef8cbc43ff78dd399afd)
  sparc64: iommu-common build error fix (cherry picked from commit accb4c6276793b991c6382bf57a58b40ea17eb11)
  sparc64: fix Setup sysfs to mark LDOM sockets build error (cherry picked from commit 59be02427bfcac6c904ddd1374c35d63155b82d4)
  sparc64: mmap fixed and shared
  sparc64: restore TIF_FREEZE flag for sparc
  sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly
  sparc: Revert generic IOMMU allocator.
  ...

9 years agolib/iommu-common.c: do not use 0xffffffffffffffffl for computing align_mask
Sowmini Varadhan [Thu, 6 Aug 2015 22:46:39 +0000 (15:46 -0700)]
lib/iommu-common.c: do not use 0xffffffffffffffffl for computing align_mask

Using a 64 bit constant generates "warning: integer constant is too
large for 'long' type" on 32 bit platforms.  Instead use ~0ul and
BITS_PER_LONG.
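
As a hedged illustration of the idea (not the exact expression used in lib/iommu-common.c, which is not shown here), a mask of the low align_order bits can be built from ~0ul and the width of long. BITS_PER_LONG_SKETCH and align_mask_sketch() are names made up for this userspace sketch.

```c
#include <limits.h>

/* Userspace stand-in for the kernel's BITS_PER_LONG. */
#define BITS_PER_LONG_SKETCH ((unsigned)(sizeof(unsigned long) * CHAR_BIT))

/* Build a mask with the low align_order bits set.
 * Valid for align_order <= BITS_PER_LONG_SKETCH. A shift by the full
 * width would be undefined, so align_order == 0 is special-cased. */
static unsigned long align_mask_sketch(unsigned int align_order)
{
	if (align_order == 0)
		return 0;
	return ~0ul >> (BITS_PER_LONG_SKETCH - align_order);
}
```

Because both the constant and the shift width follow the size of long, the same source compiles without the warning on 32-bit and 64-bit targets.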

Detected by Andrew Morton on ARMD.

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 447f6a95a9c80da7faaec3e66e656eab8f262640)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
9 years agosparc64: use ENTRY/ENDPROC in VISsave
Sam Ravnborg [Fri, 7 Aug 2015 18:26:12 +0000 (20:26 +0200)]
sparc64: use ENTRY/ENDPROC in VISsave

Commit 44922150d87cef616fd183220d43d8fde4d41390
("sparc64: Fix userspace FPU register corruptions") left a
stale globl symbol which was not used.

Fix this and introduce use of ENTRY/ENDPROC

Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 73958c651fbf70d8d8bf2a60b871af5f7a2e3199)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
9 years agoSPARC64: PORT LDOMS TO UEK4
Aaron Young [Tue, 18 Aug 2015 19:10:23 +0000 (12:10 -0700)]
SPARC64: PORT LDOMS TO UEK4

    Initial port of LDoms code to UEK4.

    NOTE: due to UEK4 kernel issue(s) encountered during testing,
    this port has NOT been fully tested.

Signed-off-by: Aaron Young <aaron.young@oracle.com>
    Orabug: 21644721
(cherry picked from commit 6dfe4cf1cc02dbea298480804d030850bfef1ab3)

Conflicts:
arch/sparc/kernel/ds.c
drivers/tty/Kconfig
drivers/tty/Makefile
(cherry picked from commit c398fd2a3c18f6385eb4db80305ab693027a58d5)

Conflicts:
drivers/tty/Kconfig
drivers/tty/Makefile
Signed-off-by: Allen Pais <allen.pais@oracle.com>
9 years agoFix incorrect ASI_ST_BLKINIT_MRU_S value
Rob Gardner [Thu, 6 Aug 2015 20:12:52 +0000 (14:12 -0600)]
Fix incorrect ASI_ST_BLKINIT_MRU_S value

ASI_ST_BLKINIT_MRU_S is incorrectly defined at F2, but it
should be F3.
(cherry picked from commit cfbf92f064067fffbc447fc6b094da77cbe75f57)
(cherry picked from commit 7ab80ef701c0b6afe7bb8988372c45ea0f67f0f3)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
9 years agouek-rpm: config: add turbostat into kernel pakackage for OL6 and OL7
Ethan Zhao [Fri, 11 Sep 2015 15:51:49 +0000 (08:51 -0700)]
uek-rpm: config: add turbostat into kernel pakackage for OL6 and OL7

Orabug: 21613769

Create a shell wrapper for turbostat and add the turbostat tool to the
kernel package.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agouek-rom: config: Unset CONFIG_NFS_USE_LEGACY_DNS for OL7
Todd Vierling [Thu, 10 Sep 2015 20:36:14 +0000 (16:36 -0400)]
uek-rom: config: Unset CONFIG_NFS_USE_LEGACY_DNS for OL7

The userland nfs-utils needs the kernel to do resolver queries.
In order to enable this functionality, this config option needs to
match what RHCK uses (which is enabled on OL6, disabled on OL7).

Orabug: 21483381

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoNVMe: Setup max hardware sector count to 512KB
Santosh Shilimkar [Thu, 10 Sep 2015 15:10:32 +0000 (08:10 -0700)]
NVMe: Setup max hardware sector count to 512KB

The Linux in-box NVMe driver does not handle an MDTS of 0 as expected:
• An MDTS of 0 means the drive can accept any request size.
• The device driver sets the max hardware sector size to
  BLK_SAFE_MAX_SECTORS, or 124KB.
• Every I/O larger than 124KB is split into 124KB requests plus a remainder.

Hence the performance drop at the 128KB I/O size.
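
The arithmetic behind the 128KB case can be sketched as follows, using the 124KB and 512KB limits quoted in this commit. The helper names are hypothetical and this is not driver code.

```c
#include <stddef.h>

#define KB 1024u

/* Count the requests needed for one I/O of io_bytes when each request
 * is capped at max_bytes. Assumes max_bytes > 0. */
static size_t split_count(size_t io_bytes, size_t max_bytes)
{
	return (io_bytes + max_bytes - 1) / max_bytes;	/* ceiling division */
}

/* Size of the final (possibly short) request; 0 if the I/O is empty. */
static size_t last_chunk(size_t io_bytes, size_t max_bytes)
{
	size_t rem = io_bytes % max_bytes;

	if (io_bytes == 0)
		return 0;
	return rem ? rem : max_bytes;
}
```

With a 124KB cap, a 128KB I/O becomes two requests, the second only 4KB long. With a 512KB cap it stays a single request.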

Orabug: 21818316

Acked-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agosparc64: perf: Use UREG_FP rather than UREG_I6
David Ahern [Mon, 15 Jun 2015 20:15:46 +0000 (16:15 -0400)]
sparc64: perf: Use UREG_FP rather than UREG_I6

perf walks userspace callchains by following frame pointers. Use the
UREG_FP macro to make it clearer that the %fp is being used.

Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2d89cd8625c4af01a2683b18c3c8194cc3b3067c)
(cherry picked from commit 96746184672da481e38be6c30967538127bb9e33)

9 years agosparc64: perf: Add sanity checking on addresses in user stack
David Ahern [Mon, 15 Jun 2015 20:15:45 +0000 (16:15 -0400)]
sparc64: perf: Add sanity checking on addresses in user stack

Processes are getting killed (sigbus or segv) while walking userspace
callchains when using perf. In some instances I have seen ufp = 0x7ff
which does not seem like a proper stack address.

This patch adds a function to run validity checks against the address
before attempting the copy_from_user. The checks are copied from the
x86 version as a start point with the addition of a 4-byte alignment
check.
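
A rough userspace sketch of such a validity check, assuming a made-up user address limit (USER_LIMIT_SKETCH) and a hypothetical function name. The in-kernel version works on __user pointers and the real task size.

```c
#include <stdint.h>

/* Hypothetical upper bound on user addresses, for this sketch only. */
#define USER_LIMIT_SKETCH UINT64_C(0x0000800000000000)

/* Return 1 if [fp, fp + size) looks like a plausible user stack frame. */
static int valid_user_frame_sketch(uint64_t fp, uint64_t size)
{
	if (fp & 3)				/* must be 4-byte aligned */
		return 0;
	if (size > USER_LIMIT_SKETCH)		/* avoid wrap in the check below */
		return 0;
	if (fp > USER_LIMIT_SKETCH - size)	/* range must end in user space */
		return 0;
	return 1;
}
```

Under these assumptions the ufp = 0x7ff value mentioned above is rejected by the alignment test alone, before any copy is attempted.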

Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit b69fb7699c92f85991672fc144b0adb7c717fbc8)
(cherry picked from commit 64ff44be3eb1044b7ce000dc409c785810f9d1f0)

9 years agosparc64: Convert BUG_ON to warning
David Ahern [Mon, 15 Jun 2015 20:15:44 +0000 (16:15 -0400)]
sparc64: Convert BUG_ON to warning

Pagefault handling has a BUG_ON path that panics the system. Convert it to
a warning instead. There is no need to bring down the system for this kind
of failure.

The following was hit while running:
    perf sched record -g -- make -j 16

[3609412.782801] kernel BUG at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:416!
[3609412.782833]               \|/ ____ \|/
[3609412.782833]               "@'/ .. \`@"
[3609412.782833]               /_| \__/ |_\
[3609412.782833]                  \__U_/
[3609412.782870] cat(4516): Kernel bad sw trap 5 [#1]
[3609412.782889] CPU: 0 PID: 4516 Comm: cat Tainted: G            E   4.1.0-rc8+ #6
[3609412.782909] task: fff8000126e31f80 ti: fff8000110d90000 task.ti: fff8000110d90000
[3609412.782931] TSTATE: 0000004411001603 TPC: 000000000096b164 TNPC: 000000000096b168 Y: 0000004e    Tainted: G            E
[3609412.782964] TPC: <do_sparc64_fault+0x5e4/0x6a0>
[3609412.782979] g0: 000000000096abe0 g1: 0000000000d314c4 g2: 0000000000000000 g3: 0000000000000001
[3609412.783009] g4: fff8000126e31f80 g5: fff80001302d2000 g6: fff8000110d90000 g7: 00000000000000ff
[3609412.783045] o0: 0000000000aff6a8 o1: 00000000000001a0 o2: 0000000000000001 o3: 0000000000000054
[3609412.783080] o4: fff8000100026820 o5: 0000000000000001 sp: fff8000110d935f1 ret_pc: 000000000096b15c
[3609412.783117] RPC: <do_sparc64_fault+0x5dc/0x6a0>
[3609412.783137] l0: 000007feff996000 l1: 0000000000030001 l2: 0000000000000004 l3: fff8000127bd0120
[3609412.783174] l4: 0000000000000054 l5: fff8000127bd0188 l6: 0000000000000000 l7: fff8000110d9dba8
[3609412.783210] i0: fff8000110d93f60 i1: fff8000110ca5530 i2: 000000000000003f i3: 0000000000000054
[3609412.783244] i4: fff800010000081a i5: fff8000100000398 i6: fff8000110d936a1 i7: 0000000000407c6c
[3609412.783286] I7: <sparc64_realfault_common+0x10/0x20>
[3609412.783308] Call Trace:
[3609412.783329]  [0000000000407c6c] sparc64_realfault_common+0x10/0x20
[3609412.783353] Disabling lock debugging due to kernel taint
[3609412.783379] Caller[0000000000407c6c]: sparc64_realfault_common+0x10/0x20
[3609412.783449] Caller[fff80001002283e4]: 0xfff80001002283e4
[3609412.783471] Instruction DUMP: 921021a0  7feaff91  901222a8 <91d0200582086100  02f87f7b  808a2873  81cfe008  01000000
[3609412.783542] Kernel panic - not syncing: Fatal exception
[3609412.784605] Press Stop-A (L1-A) to return to the boot prom
[3609412.784615] ---[ end Kernel panic - not syncing: Fatal exception

With this patch rather than a panic I occasionally get something like this:
    perf sched record -g -m 1024  -- make -j N

where N is based on number of cpus (128 to 1024 for a T7-4 and 8 for an 8 cpu
VM on a T5-2).

WARNING: CPU: 211 PID: 52565 at /opt/dahern/linux.git/arch/sparc/mm/fault_64.c:417 do_sparc64_fault+0x340/0x70c()
address (7feffcd6000) != regs->tpc (fff80001004873c0)
Modules linked in: ipt_REJECT nf_reject_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables x_tables ipv6 cdc_ether usbnet mii ixgbe mdio igb i2c_algo_bit i2c_core ptp crc32c_sparc64 camellia_sparc64 des_sparc64 des_generic md5_sparc64 sha512_sparc64 sha1_sparc64 uio_pdrv_genirq uio usb_storage mpt3sas scsi_transport_sas raid_class aes_sparc64 sunvnet sunvdc sha256_sparc64(E) sha256_generic(E)
CPU: 211 PID: 52565 Comm: ld Tainted: G        W   E   4.1.0-rc8+ #19
Call Trace:
 [000000000045ce30] warn_slowpath_common+0x7c/0xa0
 [000000000045ceec] warn_slowpath_fmt+0x30/0x40
 [000000000098ad64] do_sparc64_fault+0x340/0x70c
 [0000000000407c2c] sparc64_realfault_common+0x10/0x20
---[ end trace 62ee02065a01a049 ]---
ld[52565]: segfault at fff80001004873c0 ip fff80001004873c0 (rpc fff8000100158868) sp 000007feffcd70e1 error 30002 in libc-2.12.so[fff8000100410000+184000]

The segfault is horrible, but better than a system panic.

An 8-cpu VM on a T5-2 also showed the above traces from time to time,
so it is a general problem and not specific to the T7 or baremetal.

Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2bf7c3efc393937d1e5f92681501a914dbfbae07)
(cherry picked from commit 50c390fd136d37513536422f5a6a44207ad4fed0)

9 years agosparc: perf: Disable pagefaults while walking userspace stacks
David Ahern [Mon, 15 Jun 2015 20:15:43 +0000 (16:15 -0400)]
sparc: perf: Disable pagefaults while walking userspace stacks

Page faults generated walking userspace stacks can call schedule to switch
out the task. When collecting callchains for scheduler tracepoints this
causes a deadlock as the tracepoints can be hit with the runqueue lock held:

[ 8138.159054] WARNING: CPU: 758 PID: 12488 at /opt/dahern/linux.git/arch/sparc/kernel/nmi.c:80 perfctr_irq+0x1f8/0x2b4()

[ 8138.203152] Watchdog detected hard LOCKUP on cpu 758

[ 8138.410969] CPU: 758 PID: 12488 Comm: perf Not tainted 4.0.0-rc6+ #6
[ 8138.437146] Call Trace:
[ 8138.447193]  [000000000045cdd4] warn_slowpath_common+0x7c/0xa0
[ 8138.471238]  [000000000045ce90] warn_slowpath_fmt+0x30/0x40
[ 8138.494189]  [0000000000983e38] perfctr_irq+0x1f8/0x2b4
[ 8138.515716]  [00000000004209f4] tl0_irq15+0x14/0x20
[ 8138.535791]  [00000000009839ec] _raw_spin_trylock_bh+0x68/0x108
[ 8138.560180]  [0000000000980018] __schedule+0xcc/0x710
[ 8138.580981]  [00000000009806dc] preempt_schedule_common+0x10/0x3c
[ 8138.606082]  [000000000098077c] _cond_resched+0x34/0x44
[ 8138.627603]  [0000000000565990] kmem_cache_alloc_node+0x24/0x1a0
[ 8138.652345]  [0000000000450b60] tsb_grow+0xac/0x488
[ 8138.672429]  [0000000000985040] do_sparc64_fault+0x4dc/0x6e4
[ 8138.695736]  [0000000000407c2c] sparc64_realfault_common+0x10/0x20
[ 8138.721202]  [00000000006f2e24] NG4copy_from_user+0xa4/0x3c0
[ 8138.744510]  [000000000044f900] perf_callchain_user+0x5c/0x6c
[ 8138.768182]  [0000000000517b5c] perf_callchain+0x16c/0x19c
[ 8138.790774]  [0000000000515f84] perf_prepare_sample+0x68/0x218
[ 8138.814801] ---[ end trace 42ca6294b1ff7573 ]---

As with PowerPC (b59a1bfcc240, "powerpc/perf: Disable pagefaults during
callchain stack read") disable pagefaults while walking userspace stacks.

Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c17af4dd96aa99e6e58b5d715a7c66db63a15106)
(cherry picked from commit bd9d88e90fd32a4871a5794d7859d2bdd390d0b6)

9 years agosparc: time: Replace update_persistent_clock() with CONFIG_RTC_SYSTOHC
Xunlei Pang [Fri, 12 Jun 2015 03:10:17 +0000 (11:10 +0800)]
sparc: time: Replace update_persistent_clock() with CONFIG_RTC_SYSTOHC

On Sparc systems, update_persistent_clock() uses RTC drivers to do
the job, so it makes more sense to hand it over to CONFIG_RTC_SYSTOHC.

In the long run, all the update_persistent_clock() should migrate to
proper class RTC drivers if any and use CONFIG_RTC_SYSTOHC instead.

Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
(cherry picked from commit 460ea8d70db1ffd9a5d6996c240c34458473334f)
(cherry picked from commit aed860c3f7bf93c467374a7e69b44a40dd90aa92)

9 years agoPCI: Set under_pref for mem64 resource of pcie device
Yinghai Lu [Thu, 28 May 2015 00:23:51 +0000 (17:23 -0700)]
PCI: Set under_pref for mem64 resource of pcie device

We still get "no compatible bridge window" warning on sparc T5-8
after we add support for 64bit resource parsing for root bus.

 PCI: scan_bus[/pci@300/pci@1/pci@0/pci@6] bus no 8
 PCI: Claiming 0000:00:01.0: Resource 15: 0000800100000000..00008004afffffff [220c]
 PCI: Claiming 0000:01:00.0: Resource 15: 0000800100000000..00008004afffffff [220c]
 PCI: Claiming 0000:02:04.0: Resource 15: 0000800100000000..000080012fffffff [220c]
 PCI: Claiming 0000:03:00.0: Resource 15: 0000800100000000..000080012fffffff [220c]
 PCI: Claiming 0000:04:06.0: Resource 14: 0000800100000000..000080010fffffff [220c]
 PCI: Claiming 0000:05:00.0: Resource 0: 0000800100000000..0000800100001fff [204]
 pci 0000:05:00.0: can't claim BAR 0 [mem 0x800100000000-0x800100001fff]: no compatible bridge window

All of the bridges' 64-bit resources have the pref bit set, but the device
resource does not, so we cannot find a parent for the device resource:
non-pref mem cannot be placed under pref mem.

According to the PCIe spec errata
https://www.pcisig.com/specifications/pciexpress/base2/PCIe_Base_r2.1_Errata_08Jun10.pdf
page 13, in some cases it is OK to mark such a resource as pref.

Only set pref for 64bit mmio when the entire path from the host to the adapter is
over PCI Express.
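
The rule above can be sketched with a simplified, hypothetical device tree node. The struct and function names below are illustrative assumptions, not the PCI core's data structures or the helpers this patch adds.

```c
#include <stddef.h>

/* Hypothetical, simplified device node: just enough to walk upstream. */
struct pci_node_sketch {
	int is_pcie;				/* 1 if this hop is PCI Express */
	const struct pci_node_sketch *parent;	/* upstream bridge, NULL at host */
};

/* Return 1 only if the device and every upstream bridge are PCI Express. */
static int path_all_pcie(const struct pci_node_sketch *dev)
{
	for (; dev != NULL; dev = dev->parent)
		if (!dev->is_pcie)
			return 0;
	return 1;
}

/* Decide whether a 64-bit MMIO resource may be treated as prefetchable. */
static int may_mark_pref(const struct pci_node_sketch *dev, int is_mem64)
{
	return is_mem64 && path_all_pcie(dev);
}
```

A single conventional PCI hop anywhere between the host and the adapter makes the check fail, so the resource keeps its non-pref marking in that case.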

Fixes: commit d63e2e1f3df9 ("sparc/PCI: Clip bridge windows to fit in upstream windows")
Link: http://lkml.kernel.org/r/CAE9FiQU1gJY1LYrxs+ma5LCTEEe4xmtjRG0aXJ9K_Tsu+m9Wuw@mail.gmail.com
Reported-by: David Ahern <david.ahern@oracle.com>
Tested-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@vger.kernel.org> #3.19
(cherry picked from commit ab88cbba3f034f6b2da122280a8000b02fc841dd)

9 years agosparc/PCI: Add mem64 resource parsing for root bus
Yinghai Lu [Wed, 1 Apr 2015 02:57:48 +0000 (19:57 -0700)]
sparc/PCI: Add mem64 resource parsing for root bus

Found "no compatible bridge window" warning in boot log from T5-8.

pci 0000:00:01.0: can't claim BAR 15 [mem 0x100000000-0x4afffffff pref]: no compatible bridge window

That resource is above 4G, but does not get offset correctly, as
the root bus only reports io and mem32.

pci_sun4v f02dbcfc: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0x804000000000-0x80400fffffff] (bus address [0x0000-0xfffffff])
pci_bus 0000:00: root bus resource [mem 0x800000000000-0x80007effffff] (bus address [0x00000000-0x7effffff])
pci_bus 0000:00: root bus resource [bus 00-77]

Add mem64 handling in pci_common for sparc, so that the 64-bit resource
is registered for the root bus first.

After the patch, we will have:
pci_sun4v f02dbcfc: PCI host bridge to bus 0000:00
pci_bus 0000:00: root bus resource [io  0x804000000000-0x80400fffffff] (bus address [0x0000-0xfffffff])
pci_bus 0000:00: root bus resource [mem 0x800000000000-0x80007effffff] (bus address [0x00000000-0x7effffff])
pci_bus 0000:00: root bus resource [mem 0x800100000000-0x8007ffffffff] (bus address [0x100000000-0x7ffffffff])
pci_bus 0000:00: root bus resource [bus 00-77]
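
As a worked example of the offset involved, the mem64 log line above maps bus address 0x100000000 to CPU address 0x800100000000, which implies an offset of 0x800000000000 on this machine. The macro and function below are hypothetical names for a userspace sketch, with the offset read off that log rather than taken from the code.

```c
#include <stdint.h>

/* Offset between bus and CPU addresses, inferred from the T5-8 log:
 * CPU 0x800100000000 corresponds to bus address 0x100000000. */
#define MEM64_OFFSET_SKETCH UINT64_C(0x800000000000)

/* Translate a PCI bus address in the mem64 window to a CPU address. */
static uint64_t bus_to_cpu_sketch(uint64_t bus_addr)
{
	return bus_addr + MEM64_OFFSET_SKETCH;
}
```

Applied to the failing BAR 15 range 0x100000000-0x4afffffff, this gives 0x800100000000-0x8004afffffff, which falls inside the newly registered mem64 root bus resource.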

Fixes: commit d63e2e1f3df9 ("sparc/PCI: Clip bridge windows to fit in upstream windows")
Link: http://lkml.kernel.org/r/CAE9FiQU1gJY1LYrxs+ma5LCTEEe4xmtjRG0aXJ9K_Tsu+m9Wuw@mail.gmail.com
Reported-by: David Ahern <david.ahern@oracle.com>
Tested-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@vger.kernel.org> #3.19
(cherry picked from commit cd252f7298ec848ec23745938a84259999bdbe25)
(cherry picked from commit 4be8ce4e960931c18d3d5098bb9e2c6e0509b80e)

9 years agoPCI: Add pci_bus_addr_t
Yinghai Lu [Thu, 28 May 2015 00:23:51 +0000 (17:23 -0700)]
PCI: Add pci_bus_addr_t

David Ahern reported that d63e2e1f3df9 ("sparc/PCI: Clip bridge windows
to fit in upstream windows") fails to boot on sparc/T5-8:

  pci 0000:06:00.0: reg 0x184: can't handle BAR above 4GB (bus address 0x110204000)

The problem is that sparc64 assumed that dma_addr_t only needed to hold DMA
addresses, i.e., bus addresses returned via the DMA API (dma_map_single(),
etc.), while the PCI core assumed dma_addr_t could hold *any* bus address,
including raw BAR values.  On sparc64, all DMA addresses fit in 32 bits, so
dma_addr_t is a 32-bit type.  However, BAR values can be 64 bits wide, so
they don't fit in a dma_addr_t.  d63e2e1f3df9 added new checking that
tripped over this mismatch.

Add pci_bus_addr_t, which is wide enough to hold any PCI bus address,
including both raw BAR values and DMA addresses.  This will be 64 bits
on 64-bit platforms and on platforms with a 64-bit dma_addr_t.  Then
dma_addr_t only needs to be wide enough to hold addresses from the DMA API.
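
The width mismatch described above can be illustrated with a small userspace sketch. The type names demo_dma_addr_t and demo_pci_bus_addr_t and both helper functions are hypothetical stand-ins, not the kernel's definitions; the only value taken from the report is the bus address 0x110204000.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only: a 32-bit DMA address
 * type (as described for sparc64) and a bus address type wide enough
 * to hold any raw BAR value. */
typedef uint32_t demo_dma_addr_t;
typedef uint64_t demo_pci_bus_addr_t;

/* Returns true if a 64-bit BAR value survives a round trip through the
 * 32-bit type, i.e. no high bits are lost by the narrowing cast. */
bool demo_fits_in_dma_addr(uint64_t bar)
{
	demo_dma_addr_t narrow = (demo_dma_addr_t)bar;

	return (uint64_t)narrow == bar;
}

/* Storing the BAR value in the wide type never truncates it. */
demo_pci_bus_addr_t demo_store_bus_addr(uint64_t bar)
{
	return (demo_pci_bus_addr_t)bar;
}
```

With these definitions, the address from the boot failure does not fit in the 32-bit type, while it is preserved intact by the 64-bit one.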

[bhelgaas: changelog, bugzilla, Kconfig to ensure pci_bus_addr_t is at
least as wide as dma_addr_t, documentation]
Fixes: d63e2e1f3df9 ("sparc/PCI: Clip bridge windows to fit in upstream windows")
Fixes: 23b13bc76f35 ("PCI: Fail safely if we can't handle BARs larger than 4GB")
Link: http://lkml.kernel.org/r/CAE9FiQU1gJY1LYrxs+ma5LCTEEe4xmtjRG0aXJ9K_Tsu+m9Wuw@mail.gmail.com
Link: http://lkml.kernel.org/r/1427857069-6789-1-git-send-email-yinghai@kernel.org
Link: https://bugzilla.kernel.org/show_bug.cgi?id=96231
Reported-by: David Ahern <david.ahern@oracle.com>
Tested-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: David S. Miller <davem@davemloft.net>
CC: stable@vger.kernel.org # v3.19+
(cherry picked from commit f733876e61c0b9dea87afa804c48a0f56e0b8426)
(cherry picked from commit e0d38fb78f961295476c2b9e0b3cc96ecc290e7a)

9 years agosparc64: Fix userspace FPU register corruptions.
David S. Miller [Fri, 7 Aug 2015 02:13:25 +0000 (19:13 -0700)]
sparc64: Fix userspace FPU register corruptions.

[ Upstream commit 44922150d87cef616fd183220d43d8fde4d41390 ]

If we have a series of events from userspace, with %fprs=FPRS_FEF,
as follows:

ETRAP
ETRAP
VIS_ENTRY(fprs=0x4)
VIS_EXIT
RTRAP (kernel FPU restore with fpu_saved=0x4)
RTRAP

We will not restore the user registers that were clobbered by the FPU
using kernel code in the inner-most trap.

Traps allocate FPU save slots in the thread struct, and FPU using
sequences save the "dirty" FPU registers only.

This works at the initial trap level because all of the registers
get recorded into the top-level FPU save area, and we'll return
to userspace with the FPU disabled so that any FPU use by the user
will take an FPU disabled trap wherein we'll load the registers
back up properly.

But this is not how trap returns from kernel to kernel operate.

The simplest fix for this bug is to always save all FPU register state
for anything other than the top-most FPU save area.

Getting rid of the optimized inner-slot FPU saving code ends up
making VISEntryHalf degenerate into plain VISEntry.

Longer term we need to do something smarter to reinstate the partial
save optimizations.  Perhaps the fundamental error is having trap entry
and exit allocate FPU save slots and restore register state.  Instead,
the VISEntry et al. calls should be doing that work.

This bug is about two decades old.

Reported-by: James Y Knight <jyknight@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
(cherry picked from commit b75513b0f1c734b1e084a6e9952ea6260d4724e3)

9 years agosparc64: using 2048 as default for number of CPUS
Allen Pais [Fri, 17 Jul 2015 16:30:25 +0000 (22:00 +0530)]
sparc64: using 2048 as default for number of CPUS
(cherry picked from commit 578ddb2512a5c908cd17ef8cbc43ff78dd399afd)

9 years agosparc64: iommu-common build error fix
Allen Pais [Tue, 14 Jul 2015 17:51:17 +0000 (23:21 +0530)]
sparc64: iommu-common build error fix
(cherry picked from commit accb4c6276793b991c6382bf57a58b40ea17eb11)

9 years agosparc64: fix Setup sysfs to mark LDOM sockets build error
Allen Pais [Tue, 14 Jul 2015 15:43:11 +0000 (21:13 +0530)]
sparc64: fix Setup sysfs to mark LDOM sockets build error
(cherry picked from commit 59be02427bfcac6c904ddd1374c35d63155b82d4)

9 years agosparc64: mmap fixed and shared
bob picco [Thu, 25 Jun 2015 00:10:18 +0000 (17:10 -0700)]
sparc64: mmap fixed and shared

Older sparc64 must have a VAC because there is concern that mmapping fixed
and shared with incorrect align would cause cache aliases. To my knowledge
this is not an issue for sun4v. I will eventually research this.

The patch appears required for uek4 too.

We will enforce the rigid alignment condition only for tlb_type != hypervisor.

Orabug: 20426304

Signed-off-by: Bob Picco <bob.picco@oracle.com>
(cherry picked from commit 88b6df6b74de358992b2b58dab018672606975c7)
(cherry picked from commit 3daf2db176d2f874446d74839635e9bbaffccc7f)

9 years agosparc64: restore TIF_FREEZE flag for sparc
Allen Pais [Sun, 17 May 2015 14:43:59 +0000 (20:13 +0530)]
sparc64: restore TIF_FREEZE flag for sparc

Re-add TIF_FREEZE to allow Ksplice to freeze threads.

Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 5f6738f567fa45a8a4083e34c29740015eb2b084)

9 years agosparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly
chris hyser [Wed, 22 Apr 2015 16:28:31 +0000 (12:28 -0400)]
sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly

commit 5f4826a362405748bbf73957027b77993e61e1af
Author: chris hyser <chris.hyser@oracle.com>
Date:   Tue Apr 21 10:31:38 2015 -0400

    sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly

    The current sparc kernel has no representation for sockets though tools
    like lscpu can pull this from sysfs. This patch walks the machine
    description cache and socket hierarchy and marks sockets as well as cores
    and threads such that a representative sysfs is created by
    drivers/base/topology.c.

    Before this patch:
        $ lscpu
        Architecture:          sparc64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Big Endian
        CPU(s):                1024
        On-line CPU(s) list:   0-1023
        Thread(s) per core:    8
        Core(s) per socket:    1     <--- wrong
        Socket(s):             128   <--- wrong
        NUMA node(s):          4
        NUMA node0 CPU(s):     0-255
        NUMA node1 CPU(s):     256-511
        NUMA node2 CPU(s):     512-767
        NUMA node3 CPU(s):     768-1023

        After this patch:
        $ lscpu
        Architecture:          sparc64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Big Endian
        CPU(s):                1024
        On-line CPU(s) list:   0-1023
        Thread(s) per core:    8
        Core(s) per socket:    32
        Socket(s):             4
        NUMA node(s):          4
        NUMA node0 CPU(s):     0-255
        NUMA node1 CPU(s):     256-511
        NUMA node2 CPU(s):     512-767
        NUMA node3 CPU(s):     768-1023

    Most of this patch was done by Chris with updates by David.

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: David Ahern <david.ahern@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit acc455cffa75070d55e74fc7802b49edbc080e92)

Conflicts:
arch/sparc/include/asm/cpudata_64.h
arch/sparc/kernel/mdesc.c
arch/sparc/kernel/smp_64.c
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit bd1039234cf41d0afd35f8e9a302eac9c344d18d)

9 years agosparc: Revert generic IOMMU allocator.
David S. Miller [Sat, 18 Apr 2015 19:31:25 +0000 (12:31 -0700)]
sparc: Revert generic IOMMU allocator.

I applied the wrong version of this patch series, V4 instead
of V10, due to a patchwork bundling snafu.

Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit c12f048ffdf3a5802239426dc290290929268dc9)

Conflicts:
lib/iommu-common.c
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 065168b6a8d64cc9deacc6fb62e2d2a6181e1019)

9 years agosparc: report correct hw capabilities for athena
Allen Pais [Fri, 2 Jan 2015 06:31:55 +0000 (12:01 +0530)]
sparc: report correct hw capabilities for athena

Orabug: 18314966

Signed-off-by: Jose Marchesi <jose.marchesi@oracle.com>
Reviewed-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 9efa3c18ad85222ad49bc0a58250b9801176e734)
(cherry picked from commit db7b900f9c04d1e3886639dac073c645f534b1db)

9 years agosparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly.
Allen Pais [Wed, 7 Jan 2015 12:36:22 +0000 (18:06 +0530)]
sparc64: Setup sysfs to mark LDOM sockets, cores and threads correctly.

The current sparc kernel has no representation for sockets (i.e. a 3rd level
cache shared by cores) though tools like lscpu can pull this from sysfs. This
patch walks the LDOM MD (machine description) cache hierarchy structure and
marks sockets as well as cores and threads such that a representative sysfs is
created by drivers/base/topology.c.

Addresses LDOM part of Oracle Bug 17423360.

Orabug: 17423360

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 83f5da7f075af677aa310a83e90f43a81cb0b5a5)
(cherry picked from commit 761c43f0261e201c0148cdd807bf50b19aa0a297)

9 years agosparc64: prevent solaris control domain warnings about Domain Service handles
Allen Pais [Fri, 2 Jan 2015 05:47:00 +0000 (11:17 +0530)]
sparc64: prevent solaris control domain warnings about Domain Service handles

Solaris created its own protocol on top of domain service registration. This
matters because the control domain that linux is talking to is Solaris. The
hypervisor specs say that the handle used for service identification is simply
an opaque 64 bit number. The only constraint is that a handle never be used
twice (within a reasonable time frame) to prevent connection to a prior stale
registered handle. Solaris on the other hand reserves the bit 0x80000000 to
indicate what it calls client registration requests. These registration requests
are sent to the guest domain to prod it to send its own registration requests to
the control domain.

When a guest (linux in this case) sends its own registration requests with this
bit set, Solaris assumes that these come from clients running in the guest that
should not do this since there can only be one control domain.  Linux not
knowing this uses the top 32 bits as a quick lookup index and sets the bottom 32
bits based off jiffies.  Of course there are times when a handle is constructed
with the Solaris client bit not set and everything appears to work correctly
with no errors or warnings and times when the client bit is set and everything
works except the Solaris kernel puts a bunch of warnings into its dmesg buffer.

The fix is literally 1 character, changing the mask used to grab the bottom 32
bits of sched_clock() (jiffy based) to use only the bottom 31 bits.  Halving the
roll-over time should not be an issue. Worst case, additional jiffy bits can be
shifted into the upper 32 bits of the handle.
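
A minimal sketch of the handle layout described above, assuming the lookup index goes in the upper 32 bits and the clock-derived value in the lower bits. The function and macro names are hypothetical; this is not the code in ds.c.

```c
#include <stdint.h>

/* Bit that Solaris reserves for client registration requests, per the
 * description above. */
#define DEMO_SOLARIS_CLIENT_BIT 0x80000000ULL
/* Mask keeping only the bottom 31 bits of the clock value, so that the
 * reserved bit can never be set in a handle. */
#define DEMO_CLOCK_MASK         0x7fffffffULL

/* Hypothetical helper: build a 64-bit handle from a lookup index
 * (upper 32 bits) and a clock value (lower 31 bits). */
uint64_t demo_make_handle(uint32_t index, uint64_t clock)
{
	return ((uint64_t)index << 32) | (clock & DEMO_CLOCK_MASK);
}
```

Whatever the clock value is, the result ANDed with DEMO_SOLARIS_CLIENT_BIT is zero, which is the property the one-character mask change is meant to guarantee.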

Addresses: BZ 15161

Orabug: 18038829

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Acked-by: Karl Volz <karl.volz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 01b84806a126706ed5b725ae716608019eda24c8)
(cherry picked from commit 29965550ad60982c510435a7afbba338446986c9)

9 years agosparc64: retry domain service registration
Allen Pais [Fri, 2 Jan 2015 05:44:49 +0000 (11:14 +0530)]
sparc64: retry domain service registration

Domain service registration intermittently fails. Though using "reliable"
LDC communication, this only guarantees the data, not delivery. Analysis
indicated a timing issue that varies between boots. LDOM domain service
architecture is now sufficiently complicated that packets (domain service
registration requests in this case) do apparently get lost, the symptoms
being receiving neither an ACK nor a NACK on the initial service registration
request.

This patch uses a timer and retries with delay up to N (currently 5) times
any requests that went unacknowledged, positively or negatively, before
reporting a failed registration attempt. Using a timer with a callback allows early
boot to progress as normal versus spinning in a loop. Also cleans up
./scripts/checkpatch.pl warnings and errors in ds.c.
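
The retry policy described above can be sketched as a small state machine driven by a timer callback. This is a userspace illustration only: the struct, the function names and the state names are hypothetical, and the real timer and LDC send calls are replaced by a counter.

```c
#define DEMO_MAX_RETRIES 5	/* "N (currently 5)" in the description */

enum demo_reg_state {
	DEMO_REG_PENDING,	/* request sent, no ACK or NACK yet */
	DEMO_REG_ANSWERED,	/* an ACK or NACK arrived */
	DEMO_REG_FAILED		/* retries exhausted without any reply */
};

struct demo_registration {
	enum demo_reg_state state;
	int retries;		/* retransmissions so far */
	int sent;		/* total requests sent, including the first */
};

/* Send the initial registration request. */
void demo_reg_start(struct demo_registration *r)
{
	r->state = DEMO_REG_PENDING;
	r->retries = 0;
	r->sent = 1;
}

/* Called when a reply (positive or negative) arrives. */
void demo_reg_reply(struct demo_registration *r)
{
	if (r->state == DEMO_REG_PENDING)
		r->state = DEMO_REG_ANSWERED;
}

/* Timer callback: resend while retries remain, otherwise give up.
 * Nothing spins here, so the caller keeps running between timeouts. */
void demo_reg_timeout(struct demo_registration *r)
{
	if (r->state != DEMO_REG_PENDING)
		return;
	if (r->retries >= DEMO_MAX_RETRIES) {
		r->state = DEMO_REG_FAILED;
		return;
	}
	r->retries++;
	r->sent++;
}
```

In this sketch a request that is never answered is sent six times in total (the original plus five retries) before the failure is reported on the next timeout.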

Orabug: 17375532

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Signed-off-by: Jerry Snitselaar <jerry.snitselaar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit e7fd5c877a8f05f90ef243d24ce77228099e1f8f)
(cherry picked from commit b8a5edc0c4e25089d83343b27ab5d24807427ad1)

9 years agosparc64: __init code no longer called during non __init
Allen Pais [Fri, 2 Jan 2015 05:18:41 +0000 (10:48 +0530)]
sparc64: __init code no longer called during non __init

mdesc_update was calling __init memory free code through a pointer at
non-init time. Since the text page was already given back and reused,
this results in an illegal instruction trap. It was not caught by
linker section mismatch checks due to the pointer indirection.

This patch NULLs out the mops pointer after __init time and then
checks for non-NULL before calling mops->free.
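
A userspace sketch of the guard described above, under the assumption that the ops pointer is a global cleared once init is over. All names are hypothetical, and what the real code does in place of the skipped call is not shown here.

```c
#include <stddef.h>

/* Hypothetical ops table with a free callback, standing in for the
 * init-time memory ops mentioned above. */
struct demo_mem_ops {
	void (*free)(void *p);
};

int demo_init_free_calls;	/* counts calls, for illustration */

/* Stand-in for a free routine that is only valid during init. */
void demo_init_free(void *p)
{
	(void)p;
	demo_init_free_calls++;
}

struct demo_mem_ops demo_init_ops = { .free = demo_init_free };
struct demo_mem_ops *demo_mops = &demo_init_ops;

/* Called once init is finished: the init-only code may be gone after
 * this point, so drop the pointer that leads to it. */
void demo_end_of_init(void)
{
	demo_mops = NULL;
}

/* Later callers check the pointer before using it. */
void demo_release(void *p)
{
	if (demo_mops)
		demo_mops->free(p);
}
```

Because the call goes through a pointer, a static section-mismatch check has nothing to flag, which is why the guard has to be a runtime test.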

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Acked-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
(cherry picked from commit 6dbae4a0137d7855472c4845b5db11cffa32efc1)
(cherry picked from commit f0673a413f04de21963ab7f3912eb9a84c52c66e)

9 years agoadd OCFS2_LOCK_RECURSIVE arg_flags to ocfs2_cluster_lock() to prevent hang
Tariq Saeed [Fri, 4 Sep 2015 22:39:03 +0000 (15:39 -0700)]
add OCFS2_LOCK_RECURSIVE arg_flags to ocfs2_cluster_lock() to prevent hang

Orabug: 21793017

ocfs2_setattr, called by the chmod command, holds the cluster wide inode lock
(Orabug 21685187) when calling posix_acl_chmod. This
latter function in turn calls ocfs2_iop_get_acl and ocfs2_iop_set_acl.
These two are also called directly from the vfs layer for getfacl/setfacl
commands and therefore acquire the cluster wide inode lock. If a remote
conversion request comes after the first inode lock in ocfs2_setattr,
OCFS2_LOCK_BLOCKED will be set in l_flags. This will cause the second
call to inode lock from ocfs2_iop_get|set_acl() to block indefinitely.
The new flag OCFS2_LOCK_RECURSIVE will be used to prevent this blocking.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agointel_pstate: enable HWP per CPU
Kristen Carlson Accardi [Tue, 14 Jul 2015 23:46:23 +0000 (16:46 -0700)]
intel_pstate: enable HWP per CPU

Orabug: 21325983

HWP previously was only enabled at driver load time, on the boot
CPU. However, HWP must be enabled per package. Move the code to
enable HWP to the cpufreq driver init path so that it will be
called per CPU.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Tested-by: David Zhuang <david.zhuang@oracle.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
(cherry picked from commit ba88d4338f226766f510e207911dde8c1875e072)
Signed-off-by: Brian Maly <brian.maly@oracle.com>
Conflicts:
drivers/cpufreq/intel_pstate.c
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dio
Ryan Ding [Mon, 7 Sep 2015 05:38:00 +0000 (13:38 +0800)]
ocfs2: direct write will call ocfs2_rw_unlock() twice when doing aio+dio

ocfs2_file_write_iter() is using the wrong return value ('written').  This
will cause ocfs2_rw_unlock() to be called both in write_iter & end_io,
triggering a BUG_ON.

This issue was introduced by commit 7da839c47589 ("ocfs2: use
__generic_file_write_iter()").

Orabug: 21612107
Fixes: 7da839c47589 ("ocfs2: use __generic_file_write_iter()")
Signed-off-by: Ryan Ding <ryan.ding@oracle.com>
Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mark Fasheh <mfasheh@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit aa1057b3dec478b20c77bad07442318ae36d893c)

Conflicts:
fs/ocfs2/file.c
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agouek-rpm: configs: Enable X86_SYSFB on OL7 too
Santosh Shilimkar [Tue, 8 Sep 2015 15:14:45 +0000 (08:14 -0700)]
uek-rpm: configs: Enable X86_SYSFB on OL7 too

Orabug: 21802188

Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoocfs2_iop_set/get_acl() are also called from the VFS so we must take inode lock
Tariq Saeed [Thu, 3 Sep 2015 04:55:40 +0000 (21:55 -0700)]
ocfs2_iop_set/get_acl() are also called from the VFS so we must take inode lock

Orabug: 20189959

This bug in mainline code was pointed out by Mark Fasheh. When ocfs2_iop_set_acl
and ocfs2_iop_get_acl are entered from the VFS layer, the inode lock is not held. This
seems to be a regression from older kernels. The patch is to fix that.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoBUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed) tripped in ocfs2_ci_checkpointed
Tariq Saeed [Wed, 2 Sep 2015 21:37:41 +0000 (14:37 -0700)]
BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed) tripped in ocfs2_ci_checkpointed

Orabug: 20189959

PID: 614    TASK: ffff882a739da580  CPU: 3   COMMAND: "ocfs2dc"
 #0 [ffff882ecc3759b0] machine_kexec at ffffffff8103b35d
 #1 [ffff882ecc375a20] crash_kexec at ffffffff810b95b5
 #2 [ffff882ecc375af0] oops_end at ffffffff815091d8
 #3 [ffff882ecc375b20] die at ffffffff8101868b
 #4 [ffff882ecc375b50] do_trap at ffffffff81508bb0
 #5 [ffff882ecc375ba0] do_invalid_op at ffffffff810165e5
 #6 [ffff882ecc375c40] invalid_op at ffffffff815116fb
    [exception RIP: ocfs2_ci_checkpointed+208]
    RIP: ffffffffa0a7e940  RSP: ffff882ecc375cf0  RFLAGS: 00010002
    RAX: 0000000000000001  RBX: 000000000000654b  RCX: ffff8812dc83f1f8
    RDX: 00000000000017d9  RSI: ffff8812dc83f1f8  RDI: ffffffffa0b2c318
    RBP: ffff882ecc375d20   R8: ffff882ef6ecfa60   R9: ffff88301f272200
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffffffffffffff
    R13: ffff8812dc83f4f0  R14: 0000000000000000  R15: ffff8812dc83f1f8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff882ecc375d28] ocfs2_check_meta_downconvert at ffffffffa0a7edbd [ocfs2]
 #8 [ffff882ecc375d38] ocfs2_unblock_lock at ffffffffa0a84af8 [ocfs2]
 #9 [ffff882ecc375dc8] ocfs2_process_blocked_lock at ffffffffa0a85285 [ocfs2]
#10 [ffff882ecc375e18] ocfs2_downconvert_thread_do_work at ffffffffa0a85445 [ocfs2]
#11 [ffff882ecc375e68] ocfs2_downconvert_thread at ffffffffa0a854de [ocfs2]
#12 [ffff882ecc375ee8] kthread at ffffffff81090da7
#13 [ffff882ecc375f48] kernel_thread_helper at ffffffff81511884
The assert is tripped because the transaction is not checkpointed and the lock level is PR.

Some time ago, the chmod command had been executed. As a result, the following call
chain left the inode cluster lock in PR state, later on causing the assert.
system_call_fastpath
 -> my_chmod
  -> sys_chmod
   -> sys_fchmodat
    -> notify_change
     -> ocfs2_setattr
      -> posix_acl_chmod
       -> ocfs2_iop_set_acl
        -> ocfs2_set_acl
         -> ocfs2_acl_set_mode
Here is how.
1119 int ocfs2_setattr(struct dentry *dentry, struct iattr *attr)
1120 {
1247         ocfs2_inode_unlock(inode, 1); <<< WRONG thing to do.
..
1258         if (!status && attr->ia_valid & ATTR_MODE) {
1259                 status =  posix_acl_chmod(inode, inode->i_mode);

519 posix_acl_chmod(struct inode *inode, umode_t mode)
520 {
..
539         ret = inode->i_op->set_acl(inode, acl, ACL_TYPE_ACCESS);

287 int ocfs2_iop_set_acl(struct inode *inode, struct posix_acl *acl, ...
288 {
289         return ocfs2_set_acl(NULL, inode, NULL, type, acl, NULL, NULL);

224 int ocfs2_set_acl(handle_t *handle,
225                          struct inode *inode, ...
231 {
..
252                                 ret = ocfs2_acl_set_mode(inode, di_bh,
253                                                          handle, mode);

168 static int ocfs2_acl_set_mode(struct inode *inode, struct buffer_head ...
170 {
183         if (handle == NULL) {
                   >>> BUG: inode lock not held in ex at this point <<<
184                 handle = ocfs2_start_trans(OCFS2_SB(inode->i_sb),
185                                            OCFS2_INODE_UPDATE_CREDITS);

In ocfs2_setattr.#1247 we unlock and at #1259 call posix_acl_chmod. When we reach
ocfs2_acl_set_mode.#181 and do trans, the inode cluster lock is not held in EX
mode (it should be). How could this have happened?

We are the lock master, were holding the lock EX and have released it in
ocfs2_setattr.#1247. Note that there are no holders of this lock at
this point. Another node needs the lock in PR, and we downconvert from
EX to PR. So the inode lock is PR when we do the trans in
ocfs2_acl_set_mode.#184. The trans stays in core (not flushed to disk).
Now another node wants the lock in EX, the downconvert thread gets kicked (the
one that tripped the assert above), and finds an unflushed trans but the lock is
not EX (it is PR). If the lock was at EX, it would have flushed the trans
ocfs2_ci_checkpointed -> ocfs2_start_checkpoint before downconverting (to NULL)
for the request.

ocfs2_setattr must not drop the inode lock EX in this code path. If it does, and
takes it again before the trans, say in ocfs2_set_acl, another cluster node can
get in between and execute another setattr, overwriting the one in progress
on this node, resulting in a mode acl size combo that is a mix of the two.

Signed-off-by: Tariq Saeed <tariq.x.saeed@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoMerge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek...
Santosh Shilimkar [Fri, 4 Sep 2015 23:49:25 +0000 (16:49 -0700)]
Merge branch 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/ofed' of git://ca-git.us.oracle.com/linux-uek:
  IB/rds_rdma: unloading of ofed stack causes page fault panic
  RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
  RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
  net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
  net: Modify sk_alloc to not reference count the netns of kernel sockets.
  net: Pass kern from net_proto_family.create to sk_alloc
  net: Add a struct net parameter to sock_create_kern

9 years agoMerge branch 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek into...
Santosh Shilimkar [Fri, 4 Sep 2015 23:49:18 +0000 (16:49 -0700)]
Merge branch 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* 'topic/uek-4.1/uek-carry' of git://ca-git.us.oracle.com/linux-uek:
  DCA: fix over-warning in ioat3_dca_init

9 years agokallsyms: unbreak kallmodsyms after CONFIG_KALLMODSYMS addition
Nick Alcock [Thu, 3 Sep 2015 15:42:09 +0000 (16:42 +0100)]
kallsyms: unbreak kallmodsyms after CONFIG_KALLMODSYMS addition

The recent addition of CONFIG_KALLMODSYMS in 28df3b99a7 had the effect
of entirely disabling all module info in /proc/kallmodsyms, thus
breaking all module-specific symbol lookups from DTrace.

This is because you can't use a CONFIG_ symbol in a HOSTCC-compiled
program without including autoconf.h by hand, and we weren't, so
scripts/kallsyms.c always acted as if CONFIG_KALLMODSYMS was turned
off and didn't populate the kallsyms_modules or kallsyms_symbol_modules
tables.

(Including autoconf.h in this context is safe, because kallsyms.c never
gets compiled until after some *config target has run.  Other build
tools in a similar position, such as modpost, already do this.)
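
The failure mode described above is easy to reproduce in any host-compiled C program: an #ifdef on a symbol that no included header defines quietly selects the disabled branch. The sketch below uses a hypothetical symbol, CONFIG_DEMO_KALLMODSYMS, and deliberately includes no configuration header.

```c
/* No configuration header is included here, which mirrors the
 * situation described above for a HOSTCC-compiled tool. */

/* Returns 1 if the module tables would be populated, 0 otherwise. */
int demo_module_tables_enabled(void)
{
#ifdef CONFIG_DEMO_KALLMODSYMS	/* hypothetical symbol, never defined here */
	return 1;
#else
	return 0;	/* taken silently: no warning, no error */
#endif
}
```

The compiler accepts this without complaint, so the only symptom is missing output at run time, as was the case for /proc/kallmodsyms.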

Orabug: 21539840
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Kris Van Hees <kris.van.hees@oracle.com>
9 years agokallsyms: de-ifdef kallmodsyms
Nick Alcock [Sat, 15 Aug 2015 11:18:14 +0000 (12:18 +0100)]
kallsyms: de-ifdef kallmodsyms

CONFIG_KALLMODSYMS is a bit ugly because of the burden of ifdefs.  It's
hard to remove them from scripts/kallsyms.c, but kernel/kallsyms.c doesn't
need any, since even when CONFIG_KALLMODSYMS is on it does not pull in any
extra build dependencies in and of itself: it just needs to arrange to not
create the kallmodsyms /proc node when the config option is turned off.
This will have the effect of disabling /proc/kallmodsyms when
CONFIG_KALLMODSYMS=n, without cluttering up the code with so many
ifdefs. (We still need one to populate the node in the first place.)

We also reverse the code motion we did earlier to make the other ifdefs
easier to insert.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Orabug: 21539840

9 years agodtrace: use syscall_get_nr() to obtain syscall number
Kris Van Hees [Wed, 2 Sep 2015 23:21:20 +0000 (19:21 -0400)]
dtrace: use syscall_get_nr() to obtain syscall number

Rather than trying to get the syscall number directly from %rax on
x86_64, which is error prone due to compiler changes causing that
register to get clobbered, we use the syscall_get_nr() function to
get the same information.

Orabug: 21630345

Signed-off-by: Kris Van Hees <kris.van.hees@oracle.com>
Acked-by: Nick Alcock <nick.alcock@oracle.com>
9 years agoDCA: fix over-warning in ioat3_dca_init
Jet Chen [Wed, 2 Sep 2015 16:20:52 +0000 (09:20 -0700)]
DCA: fix over-warning in ioat3_dca_init

 We keep seeing such dmesg messages on boxes

 WARNING: CPU: 0 PID: 457 at drivers/dma/ioat/dca.c:697
 ioat3_dca_init+0x19c/0x1b0 [ioatdma]()
 [   16.609614] ioatdma 0000:00:04.0: APICID_TAG_MAP set incorrectly by
 BIOS, disabling DCA
 ...
 [<ffffffff8172807e>] dump_stack+0x4d/0x66
 [<ffffffff81067f7d>] warn_slowpath_common+0x7d/0xa0
 [<ffffffff81068034>] warn_slowpath_fmt_taint+0x44/0x50
 [<ffffffffa00228bc>] ioat3_dca_init+0x19c/0x1b0
 [ioatdma]
 [<ffffffffa0021cd6>] ioat3_dma_probe+0x386/0x3e0
 [ioatdma]
 [<ffffffffa001a192>] ioat_pci_probe+0x122/0x1b0
 [ioatdma]
 [<ffffffff81329385>] local_pci_probe+0x45/0xa0
 [<ffffffff81080d34>] work_for_cpu_fn+0x14/0x20
 [<ffffffff81083c33>] process_one_work+0x183/0x490
 [<ffffffff81084bd3>] worker_thread+0x2a3/0x410
 [<ffffffff81084930>] ? rescuer_thread+0x410/0x410
 [<ffffffff8108b852>] kthread+0xd2/0xf0
 [<ffffffff8108b780>] ?

No need to use WARN_TAINT_ONCE to generate such big noise if this is
not a critical error for the kernel. The DCA driver could print out a debug
message and then quit quietly.

If this is a real BIOS bug, please ignore this patch. Let's transfer
this issue to BIOS guys.

Thread: https://lkml.org/lkml/2014/5/8/446

Orabug: 21666295

Signed-off-by: Jet Chen <jet.chen@intel.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
9 years agoMerge branch 'topic/uek-4.1/ofed.rds-p2' into topic/uek-4.1/ofed
Mukesh Kacker [Fri, 4 Sep 2015 02:08:36 +0000 (19:08 -0700)]
Merge branch 'topic/uek-4.1/ofed.rds-p2' into topic/uek-4.1/ofed

* topic/uek-4.1/ofed.rds-p2:
  IB/rds_rdma: unloading of ofed stack causes page fault panic
  RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
  RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net

9 years agoIB/rds_rdma: unloading of ofed stack causes page fault panic
Rama Nichanamatlu [Thu, 11 Jun 2015 17:43:54 +0000 (10:43 -0700)]
IB/rds_rdma: unloading of ofed stack causes page fault panic

This issue surfaced at the tail end of the OFED functional automatic test suite
while unloading ofed modules, resulting in the following stack trace:
 BUG: unable to handle kernel paging request at ffffffffa0abd1a0
 IP: [<ffffffffa0abd1a0>] 0xffffffffa0abd1a0

 Modules linked in: rds(-) ib_ipoib ... dm_mod [last unloaded: rds_rdma]

 Workqueue: krdsd 0xffffffffa0abd1a0
 task: ffff880670ac8df0 ti: ffff880666654000 task.ti: ffff880666654000
 RIP: 0010:[<ffffffffa0abd1a0>]  [<ffffffffa0abd1a0>] 0xffffffffa0abd1a0
 RSP: 0018:ffff880666657de0  EFLAGS: 00010286
 RAX: 0000000000000600 RBX: ffff880664a03380 RCX: dead000000200200
 RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880664a03380
 RBP: ffff880666657e38 R08: ffff880664a03388 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffff880674279c80
 R13: ffff880675169800 R14: ffff880671a5dd00 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffff88067fc00000(0000) GS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: ffffffffa0abd1a0 CR3: 0000000001a56000 CR4: 00000000000007e0
 Stack:
  ffffffff810962d6 000000000000000b ffff880664a03388 ffff880675169800
  ffff880671a5dd15 ffff880674279cb0 ffff880674279c80 ffff880675169800
  ffff880675169bc0 ffff880674279cb0 ffff880675169818 ffff880666657eb8
 Call Trace:
  [<ffffffff810962d6>] ? process_one_work+0x146/0x450

The root cause of the panic is a failure to purge an active delayed work
request for active bonding initial failover work.

The fix is to cancel active bonding initial failover delayed work if
still active at module unload.

Orabug: 20861212

Signed-off-by: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
Acked-by: Mukesh Kacker <mukesh.kacker@oracle.com>
9 years agoRDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.
Sowmini Varadhan [Fri, 28 Aug 2015 14:09:04 +0000 (10:09 -0400)]
RDS-TCP: Support multiple RDS-TCP listen endpoints, one per netns.

Register pernet subsys init/stop functions that will set up
and tear down per-net RDS-TCP listen endpoints. Unregister
pernet subsys functions on 'modprobe -r' to clean up these
end points.

Enable keepalive on both accept and connect socket endpoints.
The keepalive timer expiration will ensure that client socket
endpoints will be removed as appropriate from the netns when
an interface is removed from a namespace.

Register a device notifier callback that will clean up all
sockets (and thus avoid the need to wait for keepalive timeout)
when the loopback device is unregistered from the netns indicating
that the netns is getting deleted.

Backport of upstream commit: 467fa15356acfb7b2efa38839c3e76caa4e6e0ea

Orabug: 21437445

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agoRDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net
Sowmini Varadhan [Fri, 28 Aug 2015 11:16:01 +0000 (07:16 -0400)]
RDS-TCP: Make RDS-TCP work correctly when it is set up in a netns other than init_net

Open the sockets calling sock_create_kern() with the correct struct net
pointer, and use that struct net pointer when verifying the
address passed to rds_bind().

Backport of upstream commit: d5a8ac28a7ff2f250d1bedbb6008dd2f6f6f1638

Orabug: 21437445

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years agonet: sk_clone_lock() should only do get_net() if the parent is not a kernel socket
Sowmini Varadhan [Fri, 28 Aug 2015 00:57:24 +0000 (20:57 -0400)]
net: sk_clone_lock() should only do get_net() if the parent is not a kernel socket

The newsk returned by sk_clone_lock should hold a get_net()
reference if, and only if, the parent is not a kernel socket
(making this similar to sk_alloc()).

E.g., for the SYN_RECV path, tcp_v4_syn_recv_sock->..inet_csk_clone_lock
sets up the syn_recv newsk from sk_clone_lock. When the parent (listen)
socket is a kernel socket (defined in sk_alloc() as having
sk_net_refcnt == 0), then the newsk should also have a 0 sk_net_refcnt
and should not hold a get_net() reference.
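
The rule can be illustrated with a small user-space model. The types netns_ref and sk_model and the function sk_clone_model() are hypothetical stand-ins for struct net, struct sock and sk_clone_lock(); only the inherit-then-conditionally-reference logic is meant to match the description above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for struct net: just a reference counter. */
struct netns_ref {
    int count;
};

/* Toy stand-in for struct sock. */
struct sk_model {
    struct netns_ref *net;
    bool net_refcnt;    /* mirrors sk_net_refcnt: does this socket pin its netns? */
};

/*
 * Clone a socket. The child inherits the parent's net_refcnt flag and
 * takes a namespace reference only when that flag is set, so the child
 * of a kernel socket does not pin the namespace either.
 */
struct sk_model *sk_clone_model(const struct sk_model *parent)
{
    struct sk_model *newsk = malloc(sizeof(*newsk));

    if (!newsk)
        return NULL;
    *newsk = *parent;
    if (newsk->net_refcnt)
        newsk->net->count++;    /* models get_net() */
    return newsk;
}
```

With an unconditional reference in the clone path, a child of a kernel listener would hold a count that the kernel-socket teardown path never drops, which is the imbalance the commit removes.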

Fixes: 26abe14379f8 ("net: Modify sk_alloc to not reference count the
           netns of kernel sockets.")

Backport of upstream commit: 8a68173691f036613e3d4e6bf8dc129d4a7bf383

Orabug: 21437445

Acked-by: Eric Dumazet <edumazet@google.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
9 years ago net: Modify sk_alloc to not reference count the netns of kernel sockets.
Sowmini Varadhan [Thu, 27 Aug 2015 23:23:26 +0000 (19:23 -0400)]
net: Modify sk_alloc to not reference count the netns of kernel sockets.

Now that sk_alloc knows when a kernel socket is being allocated, modify
it to not reference count the network namespace of kernel sockets.

Track whether a socket needs reference counting by adding a flag to
struct sock called sk_net_refcnt.

Update all of the callers of sock_create_kern to stop using
sk_change_net and sk_release_kernel as those hacks are no longer
needed, to avoid reference counting a kernel socket.
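
A minimal user-space model of the allocate/free pairing, assuming the flag is simply the inverse of the kern argument. net_model, sock_model, sk_alloc_model() and sk_free_model() are hypothetical names for illustration and do not reproduce the kernel signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for struct net: just a reference counter. */
struct net_model {
    int count;
};

/* Toy stand-in for struct sock. */
struct sock_model {
    struct net_model *net;
    bool net_refcnt;    /* mirrors sk_net_refcnt */
};

/* Allocate a socket; only non-kernel sockets pin the namespace. */
struct sock_model *sk_alloc_model(struct net_model *net, bool kern)
{
    struct sock_model *sk = malloc(sizeof(*sk));

    if (!sk)
        return NULL;
    sk->net = net;
    sk->net_refcnt = !kern;
    if (sk->net_refcnt)
        net->count++;           /* models get_net() */
    return sk;
}

/* Free a socket, dropping the namespace reference only if one was taken. */
void sk_free_model(struct sock_model *sk)
{
    if (sk->net_refcnt) {
        assert(sk->net->count > 0);
        sk->net->count--;       /* models put_net() */
    }
    free(sk);
}
```

Recording the decision in the socket itself lets the free path stay symmetric with the allocation path without the caller having to remember how the socket was created.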

Backport of upstream commit: 26abe14379f8e2fa3fd1bcf97c9a7ad9364886fe

Orabug: 21437445

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
9 years ago net: Pass kern from net_proto_family.create to sk_alloc
Sowmini Varadhan [Thu, 27 Aug 2015 21:22:00 +0000 (17:22 -0400)]
net: Pass kern from net_proto_family.create to sk_alloc

In preparation for changing how struct net is refcounted
on kernel sockets, pass the knowledge that we are creating
a kernel socket from sock_create_kern through to sk_alloc.

Backport of upstream commit: 11aa9c28b4209242a9de0a661a7b3405adb568a0

Orabug: 21437445

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
9 years ago net: Add a struct net parameter to sock_create_kern
Sowmini Varadhan [Thu, 27 Aug 2015 19:54:39 +0000 (15:54 -0400)]
net: Add a struct net parameter to sock_create_kern

This is long overdue, and is part of cleaning up how we allocate
kernel sockets that don't reference count struct net.

Backport of upstream commit: eeb1bd5c40edb0e2fd925c8535e2fdebdbc5cef2

Orabug: 21437445

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
9 years ago xprtrdma: Add class for RDMA backwards direction transport
Chuck Lever [Wed, 26 Aug 2015 20:33:24 +0000 (14:33 -0600)]
xprtrdma: Add class for RDMA backwards direction transport

[ Proposed for v4.4 ]

To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class for backwards direction
operation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
9 years ago svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies
Chuck Lever [Wed, 26 Aug 2015 20:31:09 +0000 (14:31 -0600)]
svcrdma: Add infrastructure to receive backwards direction RPC/RDMA replies

[ Proposed for v4.4 ]

To support the NFSv4.1 backchannel on RDMA connections, add a
capability for receiving an RPC/RDMA reply on a connection
established by a client.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
9 years ago svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls
Chuck Lever [Wed, 26 Aug 2015 20:30:08 +0000 (14:30 -0600)]
svcrdma: Add infrastructure to send backwards direction RPC/RDMA calls

[ Proposed for v4.4 ]

To support the NFSv4.1 backchannel on RDMA connections, add a
mechanism for sending a backwards-direction RPC/RDMA call on a
connection established by a client.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
9 years ago svcrdma: Add svc_rdma_get_context() API that is allowed to fail
Chuck Lever [Wed, 26 Aug 2015 20:27:00 +0000 (14:27 -0600)]
svcrdma: Add svc_rdma_get_context() API that is allowed to fail

[ Proposed for v4.4 ]

To support backward direction calls, I'm going to add an
svc_rdma_get_context() call in the client RDMA transport.

When called from ->buf_alloc(), we can't sleep waiting for memory.
So add an API that can get a server op_ctxt but won't sleep and
may therefore fail.
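
The "may fail instead of sleeping" contract can be sketched in user space with a fixed-size pool. This is only an illustration of the calling convention: ctxt_pool, op_ctxt_model, ctxt_get_nowait() and the pool size are hypothetical and say nothing about how the real svc_rdma_get_context() variant obtains memory.

```c
#include <stddef.h>

#define CTXT_POOL_SIZE 2        /* illustrative size, not from the commit */

/* Toy stand-in for a server op_ctxt. */
struct op_ctxt_model {
    int in_use;
};

struct ctxt_pool {
    struct op_ctxt_model slots[CTXT_POOL_SIZE];
};

/*
 * Return a free context, or NULL right away when none is available.
 * The function never waits, so it is safe to call from a path that
 * must not sleep; the caller has to handle the NULL case.
 */
struct op_ctxt_model *ctxt_get_nowait(struct ctxt_pool *pool)
{
    for (size_t i = 0; i < CTXT_POOL_SIZE; i++) {
        if (!pool->slots[i].in_use) {
            pool->slots[i].in_use = 1;
            return &pool->slots[i];
        }
    }
    return NULL;
}

/* Hand a context back to the pool. */
void ctxt_put(struct op_ctxt_model *ctxt)
{
    ctxt->in_use = 0;
}
```

The trade-off is pushed onto the caller: a sleeping getter can always succeed eventually, while a non-sleeping one needs an error path for the exhausted case.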

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>