www.infradead.org Git - users/jedix/linux-maple.git/log

Satish Kharat [Tue, 7 Mar 2017 19:32:03 +0000 (11:32 -0800)]
fnic: Avoid false out-of-order detection for aborted command

Orabug: 25638880

If SCSI-ML has already issued an abort on a command, i.e.
FNIC_IOREQ_ABTS_PENDING is set, and we get an IO completion, avoid
flagging it as an out-of-order completion by setting the
FNIC_IO_DONE flag in fnic_fcpio_icmnd_cmpl_handler.
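
A minimal sketch of the idea in the completion handler (the flag names
are from this message; the surrounding code is illustrative, not the
actual fnic source):

    if (CMD_FLAGS(sc) & FNIC_IOREQ_ABTS_PENDING) {
            /* An abort is already in flight; record that the IO
             * itself completed so the abort-completion path does
             * not treat this as an out-of-order completion. */
            CMD_FLAGS(sc) |= FNIC_IO_DONE;
    }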

Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Satish Kharat [Tue, 7 Mar 2017 18:29:35 +0000 (10:29 -0800)]
scsi: fnic: Correcting rport check location in fnic_queuecommand_lck

Orabug: 25638880

In fnic_queuecommand_lck, clean up the location of the rport check
and handle the case when rport is null.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
[ Upstream commit 6008e96b8107a4e30a97de947bd0fac239836b58 ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Satish Kharat [Tue, 7 Mar 2017 18:26:18 +0000 (10:26 -0800)]
fnic: minor white space changes

Orabug: 25638880

Signed-off-by: Satish Kharat <satishkh@cisco.com>
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Satish Kharat [Tue, 7 Mar 2017 13:39:02 +0000 (05:39 -0800)]
scsi: fnic: Avoid sending reset to firmware when another reset is in progress

Orabug: 25638880

The fix avoids calling an internal firmware reset when a fw reset
is already in progress.

Signed-off-by: Satish Kharat <satishkh@cisco.com>
[ Upstream commit 9698b6f473555a722bf81a3371998427d5d27bde ]
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Reviewed-by: Brian Maly <brian.maly@oracle.com>
Vivek Goyal [Fri, 20 May 2016 13:04:26 +0000 (09:04 -0400)]
ovl: Do d_type check only if work dir creation was successful

The d_type check requires successful creation of the workdir, as it
iterates through the workdir and expects the workdir to be present in
it. If that's not the case, this check will always report d_type as
unsupported, even if the underlying filesystem supports it.

So don't do this check if workdir creation failed in the previous step.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 21765194cecf2e4514ad75244df459f188140a0f)
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Orabug: 25802620

Vivek Goyal [Mon, 22 Feb 2016 14:28:34 +0000 (09:28 -0500)]
ovl: Ensure upper filesystem supports d_type

In some instances xfs has been created with ftype=0, and there, if a
file on the lower fs is removed, overlayfs leaves a whiteout in the
upper fs, but that whiteout does not get filtered out and is visible to
overlayfs users.

The reason it does not get filtered out is that the upper filesystem
does not report the file type of the whiteout as DT_CHR during
iterate_dir().

So it seems to be a requirement that the upper filesystem support d_type
for overlayfs to work properly. Do this check during mount and fail if
d_type is not supported.

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 45aebeaf4f67468f76bedf62923a576a519a9b68)
Signed-off-by: Divya Indi <divya.indi@oracle.com>
Orabug: 25802620

Dave Aldridge [Mon, 20 Feb 2017 17:47:33 +0000 (09:47 -0800)]
sparc64: Add hardware capabilities for M8

This commit adds definitions for hardware
capabilities from both the Machine Descriptor and the
Compatibility Feature Register for M8 devices.

Orabug: 25555746

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Fri, 3 Feb 2017 10:24:54 +0000 (02:24 -0800)]
sparc64: Stop performance counter before updating

This commit takes the fix as implemented in commit
b36dd4d8040c and applies it to M8 devices. Original commit
message:

In order to reliably clear the PCRx.ov bit when updating a
performance counter value, we need to stop it counting first.
If we do not do this, then we can miss performance counter
overflow events.
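
A hedged sketch of the pattern (sparc_pmu_stop/sparc_pmu_start and the
PERF_EF_* flags exist in the sparc perf code; the call site shown is
illustrative):

    /* Stop the counter before touching it so a racing overflow
     * cannot set PCRx.ov behind our back, then restart it. */
    sparc_pmu_stop(event, PERF_EF_UPDATE);
    /* ... write the new counter value, with PCRx.ov cleared ... */
    sparc_pmu_start(event, PERF_EF_RELOAD);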

Orabug: 25441707

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Fri, 3 Feb 2017 14:38:46 +0000 (06:38 -0800)]
sparc64: Fix a race condition when stopping performance counters

This commit takes the fix as implemented in commit
e5b7619e1de2 and applies it to M8 devices. Original commit
message:

When stopping a performance counter that is close to overflowing,
there is a race condition that can occur between writing to the
PCRx register to stop the counter (and also clearing the PCRx.ov
bit at the same time) vs the performance counter overflowing and
setting the PCRx.ov bit in the PCRx register.
The result of this race condition is that we occasionally miss
a performance counter overflow interrupt, which in turn leads
to incorrect event counting.
This race condition has been observed when counting cpu cycles.
To fix this issue when stopping a performance counter,
we simply allow it to continue counting and overflow before
stopping it. This allows the performance counter overflow
interrupt to be generated and acted upon.

Orabug: 25441707

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Mon, 15 May 2017 13:46:23 +0000 (19:16 +0530)]
arch/sparc: Use new misaligned load instructions for memcpy and copy_from_user

Use the new instructions for Load Misaligned Integer and Load Misaligned
Integer Alternate space for M8 architecture.

Decide when to use FP or ldm based on the following.
In the case of the FP load/alignaddr logic, there is a fixed overhead
of FP save/restore regardless of memcpy length, while the overhead of
the ldm instructions grows with the size of the memcpy. Our tests
showed that up to lengths of about 4096 bytes, the ldm instructions
perform significantly better than the FP alignaddr/load logic. Taking
that into consideration, use the new ldm instructions for lengths of
4096 bytes or less; for lengths above 4096, continue to use the FP
alignaddr/load logic.
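
In C-like form, the selection the assembly implements (a sketch with
hypothetical helper names):

    if (len <= 4096)
            copy_with_ldm(dst, src, len);           /* misaligned loads */
    else
            copy_with_fp_alignaddr(dst, src, len);  /* FP save/restore */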

Also added a fix for crypto key corruption noticed while running AES
crypto tests. This is the same problem reported in NG4memcpy; commit
f4da3628dc7c ("sparc64: Fix FPU register corruption with AES crypto
offload.") fixes it. Ported these changes to M8memcpy and verified the
fix.

TODO: The ldmx and ldmxa instructions are hand-encoded for now, since
our build servers are not yet updated with the latest M8 instruction
set. We need to convert them back to assembly mnemonics once the
toolchain supports these instructions.

Orabug: 25381567

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Mon, 15 May 2017 12:48:34 +0000 (18:18 +0530)]
arch/sparc: Add a separate kernel memcpy functions for M8

Add dedicated kernel memory copy functions for the M8 architecture.
Start from the M7memcpy functions of the M7 architecture and update the
affected functions to take advantage of the new Load Misaligned
load/store instructions.

The following functions are affected: memcpy, copy_from_user and
copy_to_user.

The following functions do not change and remain the same as on M7:
clear_page, clear_user_page, memset and bzero.

Orabug: 25381567

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Fri, 2 Dec 2016 16:50:44 +0000 (08:50 -0800)]
sparc64: perf: make sure we do not set the 'picnht' bit in the PCR

Testing with the perf_fuzzer tool revealed an issue when the 'picnht' bit
was set in the Performance Control Register. Setting this bit means that
privileged software cannot access the PIC counter and we eventually end up
with messages of the form shown below being reported.

NMI watchdog: BUG: soft lockup - CPU#788 stuck for 23s! [perf_fuzzer:336548]

This commit fixes the issue by making sure that the 'picnht' bit
is always cleared when writing to the PCR register.
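
A hedged sketch of the fix (pcr_ops is the real sparc64 PCR accessor;
the PCR_PICNHT mask name is assumed for illustration):

    /* Never allow privileged PIC access to be disabled. */
    val &= ~PCR_PICNHT;
    pcr_ops->write_pcr(idx, val);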

Orabug: 24926097

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Tue, 29 Nov 2016 17:03:52 +0000 (09:03 -0800)]
sparc64: perf: move M7 pmu event definitions to separate file

Originally the M7 pmu event definitions were in perf_event.c.
This commit moves them to a separate header file just like the
M8 pmu event definitions.

Orabug: 23333572

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Tue, 11 Oct 2016 13:41:48 +0000 (06:41 -0700)]
sparc64: perf: add perf support for M8 devices

This commit adds perf support for hardware events,
hardware cache events and kernel pmu events for M8
devices.

Orabug: 23333572

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Dave Aldridge [Fri, 7 Oct 2016 14:57:50 +0000 (07:57 -0700)]
sparc64: perf: Fix the mapping between perf events and perf counters

Up until now, perf event[0] has been counted on perf counter[0],
perf event[1] on perf counter[1] etc. However, there is no reason why
this has to be the case. In fact, on SPARC-M8 some events can only be
counted on a subset of the available perf counters, rather than all
the available perf counters. Whilst adding these perf counter
constraints for the SPARC-M8 it became apparent that some confusion
had crept into the existing code.

Some functions were making the assumption that the perf event index was
equivalent to the perf counter index. This commit fixes these assumptions,
and allows for the case where perf event[0] is not counted on
perf counter[0] etc.

Orabug: 23333572

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Tue, 16 May 2017 07:14:47 +0000 (12:44 +0530)]
SPARC64: Enable IOMMU bypass for IB

This change enables IOMMU bypass for Infiniband devices to work around
the rds-stress performance issue. The performance slowdown is only
seen with the rds-stress test, which generates an enormous number of
DMA map/unmap calls. Investigation shows that the IOMMU/ATU code does
not scale and appears to be a bottleneck.

This is for UEK4 only and not for upstream. When the IOMMU/ATU issue
is resolved, this workaround will be reverted.

Orabug: 25573557

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Tue, 16 May 2017 07:08:38 +0000 (12:38 +0530)]
SPARC64: Introduce IOMMU BYPASS method

This change adds an IOMMU BYPASS HV API that will allow PCIe devices to
bypass IOMMU during DMA map/unmap.

Orabug: 25573557

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Tushar Dave [Fri, 21 Apr 2017 17:37:36 +0000 (10:37 -0700)]
PCI: Add PCI IDs for Infiniband

This change adds PCI IDs for Infiniband.

Orabug: 25573557

Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Wed, 2 Dec 2015 18:41:50 +0000 (13:41 -0500)]
sched/fair: Disable the task group load_avg update for the root_task_group

Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes even for root_task_group
where the load_avg value will never be used. This patch will disable
the load_avg update when the given task group is the root_task_group.
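
The change boils down to an early return (a sketch that follows the
upstream commit; UEK4 details may differ slightly):

    static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
    {
            long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

            /* The root task group's load_avg is never read, so skip
             * the contended atomic update for it. */
            if (cfs_rq->tg == &root_task_group)
                    return;

            if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
                    atomic_long_add(delta, &cfs_rq->tg->load_avg);
                    cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
            }
    }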

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair() which includes update_tg_load_avg()
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
results also increased slightly from 983015 to 986449.

Orabug: 25544560

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit aa0b7ae06387d40a988ce16a189082dee6e570bc)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Atish Patra [Wed, 22 Mar 2017 22:22:01 +0000 (16:22 -0600)]
sched/fair: Move the cache-hot 'load_avg' variable into its own cacheline

If a system with large number of sockets was driven to full
utilization, it was found that the clock tick handling occupied a
rather significant proportion of CPU time when fair group scheduling
and autogroup were enabled.

Running a java benchmark on a 16-socket IvyBridge-EX system, the perf
profile looked like:

  10.52%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   9.66%   0.05%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   8.65%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   8.56%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   8.07%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   6.91%   1.78%  java   [kernel.vmlinux]  [k] task_tick_fair
   5.24%   5.04%  java   [kernel.vmlinux]  [k] update_cfs_shares

In particular, the high CPU time consumed by update_cfs_shares()
was mostly due to contention on the cacheline that contained the
task_group's load_avg statistical counter. This cacheline may also
contain variables like shares, cfs_rq and se, which are accessed rather
frequently during clock tick processing.

This patch moves the load_avg variable into another cacheline
separated from the other frequently accessed variables. It also
creates a cacheline aligned kmemcache for task_group to make sure
that all the allocated task_group's are cacheline aligned.
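
A sketch of the layout change (mirroring the upstream commit; the full
struct contents differ by kernel version):

    struct task_group {
            struct cgroup_subsys_state css;
            /* ... */
    #ifdef CONFIG_SMP
            /*
             * load_avg is read/written on every tick; keep it on its
             * own cacheline so it does not bounce together with the
             * shares, cfs_rq and se fields above.
             */
            atomic_long_t load_avg ____cacheline_aligned;
    #endif
            /* ... */
    };

    /* Allocate task_group from a cacheline-aligned kmem cache so the
     * annotation holds for every instance. */
    task_group_cache = KMEM_CACHE(task_group, 0);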

By doing so, the perf profile became:

   9.44%   0.00%  java   [kernel.vmlinux]  [k] smp_apic_timer_interrupt
   8.74%   0.01%  java   [kernel.vmlinux]  [k] hrtimer_interrupt
   7.83%   0.03%  java   [kernel.vmlinux]  [k] tick_sched_timer
   7.74%   0.00%  java   [kernel.vmlinux]  [k] update_process_times
   7.27%   0.03%  java   [kernel.vmlinux]  [k] scheduler_tick
   5.94%   1.74%  java   [kernel.vmlinux]  [k] task_tick_fair
   4.15%   3.92%  java   [kernel.vmlinux]  [k] update_cfs_shares

The %cpu time is still pretty high, but it is better than before. The
benchmark results before and after the patch were as follows:

  Before patch - Max-jOPs: 907533    Critical-jOps: 134877
  After patch  - Max-jOPs: 916011    Critical-jOps: 142366

Orabug: 25544560

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yuyang Du <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/1449081710-20185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit b0367629acf62a78404c467cd09df447c2fea804)

Conflicts:

kernel/sched/sched.h
Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Waiman Long [Wed, 25 Nov 2015 19:09:38 +0000 (14:09 -0500)]
sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().
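
A sketch of the resulting check in update_sg_lb_stats(), consistent
with the upstream commit:

    nr_running = rq->nr_running;
    if (nr_running > 1)
            *overload = true;

    /* A CPU with queued tasks cannot be idle, so only pay for
     * idle_cpu()'s extra cacheline load when nr_running is zero. */
    if (!nr_running && idle_cpu(i))
            sgs->idle_cpus++;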

Orabug: 25544560

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Douglas Hatch <doug.hatch@hpe.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit a426f99c91d1036767a7819aaaba6bd3191b7f06)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Atish Patra [Wed, 22 Mar 2017 21:55:24 +0000 (15:55 -0600)]
sched/fair: Clean up load average references

For cfs_rq, we have load.weight, runnable_load_avg, and load_avg.
Clean up how they are used:

  - First, as group sched_entity already largely uses load_avg, we now expand
    to use load_avg in all cases.

  - Second, for CPU-wide load balancing, we choose to use runnable_load_avg
    in all cases, which is the same as before this series.

[Atish]:
Here is the conflict in this patch.
<<<<<<< HEAD
        unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
        unsigned long load_avg = rq->cfs.avg.load_avg;
=======
        unsigned long nr_running = READ_ONCE(rq->cfs.h_nr_running);
        unsigned long load_avg = weighted_cpuload(cpu);
>>>>>>> 7ea241a... sched/fair: Clean up load average references

ACCESS_ONCE was replaced with READ_ONCE by upstream patch 316c1608d15c7.
As that patch is not in UEK4, ACCESS_ONCE is kept as it is.

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-8-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 7ea241afbf4924c58d41078599f7a32ba49fb985)

Conflicts:

kernel/sched/fair.c

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Yuyang Du [Wed, 15 Jul 2015 00:04:41 +0000 (08:04 +0800)]
sched/fair: Provide runnable_load_avg back to cfs_rq

The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg.
Before this series, sometimes the runnable_load_avg was used, and sometimes
the load_avg. Completely replacing all uses of runnable_load_avg with
load_avg may be too big a leap; i.e., there is a concern that including
blocked_load_avg would result in an overrated load. Therefore, we bring
runnable_load_avg back.

The new cfs_rq runnable_load_avg is improved to be updated with all of the
runnable sched_entities at the same time, so the problem of one sched_entity
being up to date while the others are stale is solved.

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-7-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 139622343ef31941effc6de6a5a9320371a00e62)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Yuyang Du [Wed, 15 Jul 2015 00:04:40 +0000 (08:04 +0800)]
sched/fair: Remove task and group entity load when they are dead

When a task exits or a group is destroyed, the entity's load should be
removed from its parent cfs_rq's load. Otherwise, it will take time
for the parent cfs_rq to decay the dead entity's load to 0, which
is not desired.
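
A simplified sketch modeled on the upstream commit (the
last_update_time synchronization is omitted here):

    void remove_entity_load_avg(struct sched_entity *se)
    {
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /* Park the dead entity's contribution; the cfs_rq
             * subtracts it the next time its average is updated. */
            atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
            atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
    }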

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-6-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 1269557889b477e3e43ab99a21035ddf8f7cea4d)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Yuyang Du [Wed, 15 Jul 2015 00:04:39 +0000 (08:04 +0800)]
sched/fair: Init cfs_rq's sched_entity load average

The runnable load and utilization averages of a cfs_rq's sched_entity
were not initialized. As is done for a task, give a new cfs_rq's
sched_entity start values that weight its load heavily during its
infancy.
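
A sketch of the initialization, following the upstream commit:

    void init_entity_runnable_average(struct sched_entity *se)
    {
            struct sched_avg *sa = &se->avg;

            sa->last_update_time = 0;
            /* period_contrib must stay below 1024; 1023 means the
             * average gets updated almost immediately after enqueue. */
            sa->period_contrib = 1023;
            /* Start fully loaded so a new entity is not underrated
             * while it accumulates history. */
            sa->load_avg = scale_load_down(se->load.weight);
            sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
            sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
            sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
    }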

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 540247fb5ddf6d2364f90387fa1f8f428d15e683)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Vincent Guittot [Wed, 15 Jul 2015 00:04:38 +0000 (08:04 +0800)]
sched/fair: Implement update_blocked_averages() for CONFIG_FAIR_GROUP_SCHED=n

The load and the utilization of idle CPUs must be updated periodically in
order to decay the blocked part.

If CONFIG_FAIR_GROUP_SCHED is not set, the load and util of idle cpus
are not decayed and stay at the values set before becoming idle.
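
A sketch of the !CONFIG_FAIR_GROUP_SCHED variant, after the upstream
commit (helper names as in mainline of that era):

    static void update_blocked_averages(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            struct cfs_rq *cfs_rq = &rq->cfs;
            unsigned long flags;

            raw_spin_lock_irqsave(&rq->lock, flags);
            update_rq_clock(rq);
            /* Decay the root cfs_rq's blocked load/util while idle. */
            update_cfs_rq_load_avg(cfs_rq_clock_task(cfs_rq), cfs_rq);
            raw_spin_unlock_irqrestore(&rq->lock, flags);
    }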

Orabug: 25544560

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/1436918682-4971-4-git-send-email-yuyang.du@intel.com
[ Fixed up the SOB chain. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 6c1d47c0827304949e0eb9479f4d587f226fac8b)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Atish Patra [Wed, 22 Mar 2017 20:51:40 +0000 (14:51 -0600)]
sched/fair: Rewrite runnable load and utilization average tracking

The idea of runnable load average (let runnable time contribute to weight)
was proposed by Paul Turner and Ben Segall, and it is still followed by
this rewrite. This rewrite aims to solve the following issues:

1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
   updated at the granularity of one entity at a time, which results in the
   cfs_rq's load average being stale or partially updated: at any time, only
   one entity is up to date, all other entities are effectively lagging
   behind. This is undesirable.

   To illustrate, if we have n runnable entities in the cfs_rq, as time
   elapses, they certainly become outdated:

     t0: cfs_rq { e1_old, e2_old, ..., en_old }

   and when we update:

     t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }

     t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }

     ...

   We solve this by combining all runnable entities' load averages together
   in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
   on the fact that if we regard the update as a function, then:

   w * update(e) = update(w * e) and

   update(e1) + update(e2) = update(e1 + e2), then

   w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)

   therefore, by this rewrite, we have an entirely updated cfs_rq at the
   time we update it:

     t1: update cfs_rq { e1_new, e2_new, ..., en_new }

     t2: update cfs_rq { e1_new, e2_new, ..., en_new }

     ...

2. cfs_rq's load average is different between top rq->cfs_rq and other
   task_group's per CPU cfs_rqs in whether or not blocked_load_average
   contributes to the load.

   The basic idea behind runnable load average (the same for utilization)
   is that the blocked state is taken into account as opposed to only
   accounting for the currently runnable state. Therefore, the average
   should include both the runnable/running and blocked load averages.
   This rewrite does that.

   In addition, we also combine runnable/running and blocked averages
   of all entities into the cfs_rq's average, and update it together at
   once. This is based on the fact that:

     update(runnable) + update(blocked) = update(runnable + blocked)

   This significantly reduces the code as we don't need to separately
   maintain/update runnable/running load and blocked load.

3. How task_group entities' share is calculated is complex and imprecise.

   We reduce the complexity in this rewrite to allow a very simple rule:
   the task_group's load_avg is aggregated from its per CPU cfs_rqs's
   load_avgs. Then group entity's weight is simply proportional to its
   own cfs_rq's load_avg / task_group's load_avg. To illustrate,

   if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,

   task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then

   cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
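
   For example, with a task_group share of 1024 and two cfs_rqs whose
   load_avgs are 300 and 100, task_group_avg = 400, so the two group
   entities get shares of 300/400 * 1024 = 768 and 100/400 * 1024 = 256.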

To sum up, this rewrite in principle is equivalent to the current one, but
fixes the issues described above. Turns out, it significantly reduces the
code complexity and hence increases clarity and efficiency. In addition,
the new averages are more smooth/continuous (no spurious spikes and valleys)
and updated more consistently and quickly to reflect the load dynamics.

As a result, we have less load tracking overhead, better performance,
and especially better power efficiency due to more balanced load.

[Atish:]
This patch moved some of the conflicting functions from proc.c to
fair.c and removed the corresponding declarations from sched.h.
The upstream patch 3289bdb42988 moved these functions from proc.c to
fair.c in addition to a lot of other code restructuring. Porting
3289bdb42988 breaks KABI for UEK4, which is why I had to hack around it.

Orabug: 25544560

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit 9d89c257dfb9c51a532d69397f6eed75e5168c35)

Conflicts:

kernel/sched/fair.c

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Yuyang Du [Wed, 15 Jul 2015 00:04:36 +0000 (08:04 +0800)]
sched/fair: Remove rq's runnable avg

The current rq->avg has not been used at all since its merge into the kernel,
and the code is in the scheduler's hot path, so remove it.

Orabug: 25544560

Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Yuyang Du <yuyang.du@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arjan@linux.intel.com
Cc: bsegall@google.com
Cc: fengguang.wu@intel.com
Cc: len.brown@intel.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: rafael.j.wysocki@intel.com
Cc: umgwanakikbuti@gmail.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/1436918682-4971-2-git-send-email-yuyang.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
(cherry picked from commit cd126afe838d7ea9b971cdea087fd498a7293c7f)

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Acked-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Khalid Aziz [Mon, 8 May 2017 22:17:43 +0000 (16:17 -0600)]
sparc64: Allow enabling ADI on hugepages only

Since the kernel provides only limited support for ADI on regular pages,
mprotect() should not allow enabling ADI on any pages other than
hugepages until the rest of the support is complete.
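
A hedged sketch of the check in the mprotect path (PROT_ADI comes from
the sparc ADI series; the surrounding code is illustrative):

    /* ADI is only fully supported on hugepage-backed mappings for
     * now; refuse PROT_ADI anywhere else. */
    if ((prot & PROT_ADI) && !is_vm_hugetlb_page(vma))
            return -EINVAL;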

Orabug: 25969377
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Khalid Aziz [Thu, 27 Apr 2017 17:40:04 +0000 (11:40 -0600)]
sparc64: Save ADI tags on ADI enabled platforms only

When a page is ready to be swapped out or migrated, the kernel saves
the ADI tags associated with it. To determine if ADI tags need to be
saved for a page, it checks whether the TTE.MCD bit is set for the
page. M7 and newer processors reuse the TTE.CV bit from older
processors as TTE.MCD. When the kernel is running on non-ADI capable
processors, the TTE.MCD bit is actually TTE.CV, which the kernel sets
by default on those older processors. This causes the kernel to attempt
to save the ADI tags for the page using MCD ASIs which are not
available on older processors, resulting in a data access exception.
This patch adds a conditional to save ADI tags only on ADI capable
platforms.
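
A hedged sketch of the conditional (adi_capable() and the MCD page bit
appear in the sparc ADI series; the tag-saving helper name here is
hypothetical):

    /* Only trust TTE.MCD on ADI capable hardware; on older CPUs
     * the same bit position is TTE.CV. */
    if (adi_capable() && (pte_val(pte) & _PAGE_MCD_4V))
            save_adi_tags(mm, addr, pte);   /* hypothetical helper */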

Orabug: 25961592

Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Mon, 15 May 2017 11:02:47 +0000 (16:32 +0530)]
sparc64: increase FORCE_MAX_ZONEORDER to 16

Orabug: 25448108

This change enables large tsb sizes of up to 256MB. The performance
improvement is substantial for those programs with an enormous tsb rss.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
bob picco [Tue, 18 Apr 2017 15:34:04 +0000 (11:34 -0400)]
sparc64: tsb size expansion

Orabug: 25448108

Some larger applications require a TSB size of magnitude not required by a
vast majority of processes. This commit enables the TSB to be expanded
to a size up to MAX_ORDER - 1, limited by TSB size order encoding and
finally limited by MMU hardware. The large TSB page order allocations are
not included in a kmem cache like current TSB sizes. The improvement is
done for tlb_type hypervisor and limited to recent sun4v. This patch should
not impact performance of other sparc64 core chip types.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
bob picco [Tue, 18 Apr 2017 15:34:03 +0000 (11:34 -0400)]
sparc64: make tsb pointer computation symbolic

Orabug: 25448108

Define some symbolic names for tsb/tlb miss trap tsb pointer computation.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Thomas Tai [Thu, 6 Apr 2017 17:29:03 +0000 (10:29 -0700)]
sparc64: fix intermittent LDom hang waiting for vdc_port_up

When an LDom boots, sunvdc probes the disk using the LDC channel.
If the channel was previously configured, we need to wait until
we have received a packet from the other end to ensure that the
LDC reset was fully completed; otherwise, the primary domain
will send out a NACK when vdc tries to connect. The NACK causes
an ldc_reset, which flushes the receive queue that may contain the
version info. The unexpected loss of the version info causes
the vdc handshake to wait forever.

This patch works around ldc_reset not handling reconnection after
receiving a NACK. It can be reverted after ldc_reset is properly
implemented.

Orabug: 25814698

Signed-off-by: Thomas Tai <thomas.tai@oracle.com>
Reviewed-by: Tom Saeger <tom.saeger@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Reviewed-by: Alexandre Chartre <alexandre.chartre@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Vijay Kumar [Mon, 23 Jan 2017 17:53:03 +0000 (09:53 -0800)]
sparc64:block/sunvdc: Renamed bio variable name from req to bio

Changed bio variable name from req to bio for better readability.
This is just a cosmetic change.

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 25128265
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Vijay Kumar [Wed, 18 Jan 2017 16:10:12 +0000 (08:10 -0800)]
sparc64:block/sunvdc: Added io stats accounting for bio based vdisk

As vdisk now bypasses the block layer and submits IO directly, iostats
do not get accounted (which normally happens at the block layer). Added
IO accounting at the bio layer for the vdisk block device.

Signed-off-by: Vijay Kumar <vijay.ac.kumar@oracle.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Orabug: 25128265
Signed-off-by: Allen Pais <allen.pais@oracle.com>
chris hyser [Wed, 15 Mar 2017 17:51:34 +0000 (10:51 -0700)]
sparc64: Remove node restriction from PRIQ MSI assignments

The current design of PRIQ only allows MSI routing to CPUs within the same
I/O node. At best this prevents certain types of CPU add/removals as well
as various affinity assignments and at worst can result in dropped
interrupts with CPU DR.

This prior design used a per-node hash table with collision chain (and RCU
synchronization) shared among the root complexes. While it would be
possible to support moving these records between the node hash tables this
is needlessly complicated when a single global table could be used instead
assuming the hashing mechanism was modified to limit the potential for
performance degradation through potentially longer collision chains and
more lookups.

The new design for MSI uses the bottom N bits of the device handle as a
hash versus hashing device handle and msidata together as before. This
exploits the existing numbering scheme to result in zero collisions (even
in an exactly sized hash table). If the numbering scheme is changed on a
future system, that system will still be functional though presumably
with some performance loss. The MSI data is then used as an index into a
array as has been done for prior linux sun4v interrupt model support. Given
the very small number of INTX interrupts, their handling remains
essentially unchanged (now uses a single global hash table, versus per I/O
node as before).

MSI Interrupts from Hypervisor data to ISR:

                                     struct pci_pbm_info *pbms[MAX_PBMS];
                                             +-----------+
                                             |           |
  From Hypervisor:                           |           |
                                             +-----------+
  - root complex devhandle  +---/HASH/------>|           |
  - msidata +--+                             |           |-------+
               |                             +-----------+       |
               |                             |           |       |
               |                             |           |       |
               |                             +-----------+       |
               |                             |           |       |
               |                             |           |       |
               |                             +-----------+       |
               |                                                 |
               |           struct pci_pbm_info                   |
               |              +-----------+<---------------------+
               |              |           |
               |              |    PBM    |
               |              |           |
               |              |           |---+
               |              |           |   |
               |              |           |   |
+--------------+              +-----------+   |
|                                             |
| struct priq_irq **msi_priq                  |
|    +-------------+<-------------------------+
|    |             |
|    |             |
+--->+-------------+       struct priq_irq
     |             |-------->+---------+
     |             |         |         |
     +-------------+         |         |
     |             |         +---------+
     |             |         |   irq   | generic_handle_irq(irq)
     +-------------+         +---------+
     |             |
     |             |
     +-------------+
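
In code form, the lookup sketched above might read (names taken from
the diagram; a sketch, not the actual driver source):

    /* The hypervisor supplies the root complex devhandle and the
     * MSI data word. */
    struct pci_pbm_info *pbm = pbms[devhandle & (MAX_PBMS - 1)];
    struct priq_irq *p = pbm->msi_priq[msidata];

    generic_handle_irq(p->irq);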

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by: Thomas Tai <thomas.tai@oracle.com>

Orabug: 25110748
Signed-off-by: Allen Pais <allen.pais@oracle.com>
chris hyser [Fri, 3 Mar 2017 21:24:22 +0000 (13:24 -0800)]
blk-mq: Clean up all_q_list on request_queue deletion

Entries were not being removed from all_q_list when the corresponding
request_queue structs were freed, so the embedded list pointers were
subsequently trashed when the queue memory was re-allocated.
Additionally, due to the use of a cached allocation pool
(kmem_cache_alloc_node), the same data structure could be reinserted
into the same list twice, which cannot work given the kernel
linked-list implementation.

Signed-off-by: Chris Hyser <chris.hyser@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Orabug: 25569331
Signed-off-by: Allen Pais <allen.pais@oracle.com>
bob picco [Fri, 10 Mar 2017 19:31:19 +0000 (14:31 -0500)]
sparc64: kern_addr_valid regression

Orabug: 25860542

I encountered this bug when using /proc/kcore to examine the kernel. Plus a
coworker inquired about debugging tools. We computed pa but did
not use it during the maximum physical address bits test. Instead we used
the identity-mapped virtual address, which will always fail this test.

I believe the defect came in here:
[bpicco@zareason linus.git]$ git describe --contains bb4e6e85daa52
v3.18-rc1~87^2~4
.

Signed-off-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Chuck Anderson [Fri, 5 May 2017 16:57:23 +0000 (09:57 -0700)]
Merge branch topic/uek-4.1/sparc of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/sparc: (32 commits)
  sparc64: Detect DAX ra+pgsz when hvapi minor doesn't indicate it
  sparc64: DAX memory will use RA+PGSZ feature in HV
  sparc64: Disable DAX flow control
  sparc64: Add DAX hypervisor services
  sparc64: DAX memory needs persistent mappings
  sparc64: Fix incorrect error print in DAX driver when validating ccb
  sparc64: DAX request for non 4MB memory should return with unique errno
  Revert "sparc64: DAX request for non 4MB memory should return with unique errno"
  sparc64: DAX request to mmap non 4MB memory should fail with a debug print
  sparc64: DAX request for non 4MB memory should return with unique errno
  sparc64: Incorrect print by DAX driver when old driver API is used
  sparc64: DAX request to dequeue half of a long CCB should not succeed
  sparc64: dax_overflow_check reports incorrect data
  sparc64: Ignored DAX ref count causes lockup
  sparc64: disable dax page range checking on RA
  sparc64: Oracle Data Analytics Accelerator (DAX) driver
  sparc64: fix an issue when trying to bring hotplug cpus online
  sparc64: Fix memory corruption when THP is enabled
  sparc64: Fix address range for page table free Orabug: 25704426
  sparc64: Add support for 2G hugepages
  ...

Rob Gardner [Tue, 25 Apr 2017 18:04:00 +0000 (12:04 -0600)]
sparc64: Detect DAX ra+pgsz when hvapi minor doesn't indicate it

Orabug: 25911008

The RA+PGSZ HV API feature is detected via a controlled experiment.
The experiment constructs a small DAX request and places the output
buffer at the very end of an 8k page. Then it checks to see if the 8k
page boundary was honored, and if so, then we've got the ability to pass
the page size along with a real address. Once the HV API minor number
is bumped to 1, this will be the primary method of detection.

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
(cherry picked from commit 013d5b9909e804817dcd939f50f242f09237feac)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Rob Gardner [Fri, 21 Apr 2017 23:10:34 +0000 (17:10 -0600)]
sparc64: DAX memory will use RA+PGSZ feature in HV

Orabug: 25911008
Orabug: 25931417

The reported kernel panics and "other oddities" are caused by
corruption of kernel memory by DAX output. This is happening due to an
apparent change between UEK2 and UEK4, whereby the underlying h/w page
size for memory acquired via kmalloc has changed. UEK2 used 4mb pages,
which the dax driver used to limit the output from the coprocessor,
which refuses to cross "page boundaries". But in UEK4 it appears that a
more intelligent approach to memory is used, and kernel memory may be
backed by a variety of huge h/w page sizes, ie, 256mb and 2gb. This
now allows DAX to produce output up to this much larger page size,
thus going beyond the actual allocation.  We do not have any way to
kmalloc memory with a certain backing page size, and we cannot feed
DAX a virtual address if we are not certain of its page size.

Recent hypervisor f/w has provided a powerful new feature: the ability
to convey page size bits along with a real address (RA). This gives us
the opportunity to avoid using the TLB/TSB as a parameter passing
mechanism and we can use this to avoid using virtual addresses at all
in a DAX ccb. We now use this mechanism to set the page size for
dax_alloc memory to 4mb even if the underlying h/w page size for the
memory is much larger. Memory allocated by the application via
conventional methods is not affected.

This HV feature is available on M7 f/w with minor number 1, so this is
used to determine if the driver can provide the memory allocation
service. If the feature is not available, DAX will still work, but all
the responsibility for memory allocation falls back to the
application.

The sudden ENOACCESS errors are a result of another hypervisor change.
Newest HV firmware has begun to enforce the privileged flag (bit 14)
in arg2 of the ccb_submit hypercall. This flag is described in the API
wiki as "CCB virtual addresses are treated as privileged" and in the
VM spec as "Virtual addresses within CCBs are translated in privileged
context".  The explanation given later in the VM spec is that if a CCB
contains any virtual address whose TTE has the priv bit set
(_PAGE_P_4V), then the priv flag in the ccb_submit api must also be
set, or else the hypervisor will refuse to perform the translation,
and an ENOACCESS error will be thrown. Since privileged virtual
addresses are no longer used as a result of this very commit, this
problem simply disappears.

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Rob Gardner [Wed, 19 Apr 2017 17:21:13 +0000 (11:21 -0600)]
sparc64: Disable DAX flow control

Orabug: 25997202

The flow control feature of DAX only works for output buffers
and it doesn't appear to be possible to calculate the exact
size in bytes of an input buffer. So we have no way to bound
the amount of data that DAX reads as input. This has correctness
and security implications, so until we figure out something better
to do, the temporary workaround is to disable flow control completely
and fall back to 4Mb virtual page backed dax_alloc memory, which
will allow page boundaries to limit the size of all buffers.  A new
module parameter "flow_enable" is provided to allow this decision
to be reverted at module load time if needed.
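
A sketch of the new module parameter (the name comes from this message;
the default is assumed):

    #include <linux/module.h>

    static bool flow_enable;        /* flow control disabled by default */
    module_param(flow_enable, bool, 0444);
    MODULE_PARM_DESC(flow_enable,
                     "Re-enable DAX flow control at module load");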

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Fri, 5 May 2017 05:51:32 +0000 (11:21 +0530)]
sparc64: Add DAX hypervisor services

This provides the HV API for coprocessor services which
is needed to support a device driver for the DAX.

Orabug: 25996411

Reviewed-by: Bob Picco <bob.picco@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Chuck Anderson [Wed, 26 Apr 2017 22:56:11 +0000 (15:56 -0700)]
Merge branch topic/uek-4.1/upstream-cherry-picks of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/upstream-cherry-picks:
  KVM: VMX: fix vmwrite to invalid VMCS

Radim Krčmář [Fri, 3 Jul 2015 13:49:28 +0000 (15:49 +0200)]
KVM: VMX: fix vmwrite to invalid VMCS

fpu_activate is called outside of vcpu_load(), which means it should not
touch VMCS, but fpu_activate needs to.  Avoid the call by moving it to a
point where we know that the guest needs eager FPU and VMCS is loaded.

This will get rid of the following trace

 vmwrite error: reg 6800 value 0 (err 1)
  [<ffffffff8162035b>] dump_stack+0x19/0x1b
  [<ffffffffa046c701>] vmwrite_error+0x2c/0x2e [kvm_intel]
  [<ffffffffa045f26f>] vmcs_writel+0x1f/0x30 [kvm_intel]
  [<ffffffffa04617e5>] vmx_fpu_activate.part.61+0x45/0xb0 [kvm_intel]
  [<ffffffffa0461865>] vmx_fpu_activate+0x15/0x20 [kvm_intel]
  [<ffffffffa0560b91>] kvm_arch_vcpu_create+0x51/0x70 [kvm]
  [<ffffffffa0548011>] kvm_vm_ioctl+0x1c1/0x760 [kvm]
  [<ffffffff8118b55a>] ? handle_mm_fault+0x49a/0xec0
  [<ffffffff811e47d5>] do_vfs_ioctl+0x2e5/0x4c0
  [<ffffffff8127abbe>] ? file_has_perm+0xae/0xc0
  [<ffffffff811e4a51>] SyS_ioctl+0xa1/0xc0
  [<ffffffff81630949>] system_call_fastpath+0x16/0x1b

(Note: we also unconditionally activate FPU in vmx_vcpu_reset(), so the
 removed code added nothing.)

Fixes: c447e76b4cab ("kvm/fpu: Enable eager restore kvm FPU for MPX")
Cc: <stable@vger.kernel.org>
Reported-by: Vlastimil Holer <vlastimil.holer@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 370777daab3f024f1645177039955088e2e9ae73)
OraBug: 25956028
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Chuck Anderson [Tue, 25 Apr 2017 21:56:34 +0000 (14:56 -0700)]
Merge branch topic/uek-4.1/dtrace of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/dtrace:
  dtrace: fix handling of save_stack_trace sentinel (x86 only)
  dtrace: DTrace walltime lock-free implementation

Chuck Anderson [Tue, 25 Apr 2017 21:55:47 +0000 (14:55 -0700)]
Merge branch 'topic/uek-4.1/drivers' of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/drivers:
  Revert "i40e: enable VSI broadcast promiscuous mode instead of adding broadcast filter"

Brian Maly [Mon, 10 Apr 2017 23:04:40 +0000 (19:04 -0400)]
Revert "i40e: enable VSI broadcast promiscuous mode instead of adding broadcast filter"

Orabug: 25877447

This reverts commit 690e746888bf9a307845d00fbac02bcfd2157fa2.
This fixes dhcp on OL7 under KVM.

Signed-off-by: Brian Maly <brian.maly@oracle.com>
Rob Gardner [Thu, 13 Apr 2017 18:07:59 +0000 (12:07 -0600)]
sparc64: DAX memory needs persistent mappings

Orabug: 25888596

Memory allocated on behalf of dax_alloc() is mapped by two
distinct MMU translations, one set for userland, which is backed by
regular 8k virtual pages, and another set for feeding to the
hypervisor, and these are kernel virtual addresses backed
by 4Mb pages. These latter translations are only used when
the memory is allocated/deallocated and when a dax transaction
is submitted. So the translations are unlikely to be in the TLB,
and eventually may be evicted from the kernel TSB after some time.
This leads to ENOMAP errors reported by the hypervisor. In order
to avoid this, we "touch" such pages just before dax_submit
which will fault translations into the tlb/tsb if necessary.
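
A hedged sketch of the "touch" (the stride constant and loop shape are
illustrative; buf/len stand for the ccb buffer):

    char *va;

    /* Fault the kernel 4Mb mappings into the TLB/TSB just before
     * dax_submit so the hypervisor can translate them. */
    for (va = buf; va < buf + len; va += SZ_4M)
            (void)*(volatile char *)va;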

A future performance optimization is to take advantage of the "real
address has pagesize" feature which is available in very recent
versions of hypervisor f/w. In this case we can convert the address
in the ccb to a real address with the correct bits to specify
a 4Mb pagesize, and change the address type to real. Then the
memory touch would be unnecessary and the HV would not need to
probe the tlb/tsb at all.

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Sanath Kumar [Fri, 14 Apr 2017 03:00:04 +0000 (22:00 -0500)]
sparc64: Fix incorrect error print in DAX driver when validating ccb

Orabug: 25835254

This fixes an incorrect stringification in a macro that prints the
invalid address type in a CCB.

Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Sanath Kumar [Mon, 10 Apr 2017 21:52:32 +0000 (16:52 -0500)]
sparc64: DAX request for non 4MB memory should return with unique errno

Orabug: 25852910

With this change libdax can detect that mmap failed due to lack of flow
control in DAX HW and it can proceed with dax_subpage allocation.

Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Allen Pais [Fri, 14 Apr 2017 10:38:55 +0000 (16:08 +0530)]
Revert "sparc64: DAX request for non 4MB memory should return with unique errno"

This reverts commit c835e602d654df4753cc25bacfd2e3ae023fbc23.

Signed-off-by: Allen Pais <allen.pais@oracle.com>
Sanath Kumar [Thu, 6 Apr 2017 20:27:34 +0000 (15:27 -0500)]
sparc64: DAX request to mmap non 4MB memory should fail with a debug print

Orabug: 25852910

When the dax flow control is disabled, dax_alloc_mem(...) only allows 4MB
allocation in the driver. Any other requests are reported as an error
with an error print to the kernel log. We now have a dax subpage
allocator in libdax which gets triggered when the driver reports the
above said failure. In other words it is a normal use case for the driver
to report this. So make this a debug print.

Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
Sanath Kumar [Mon, 10 Apr 2017 21:52:32 +0000 (16:52 -0500)]
sparc64: DAX request for non 4MB memory should return with unique errno

Orabug: 25852910

With this change, libdax can detect that mmap failed due to lack of
flow control in the DAX HW and can proceed with dax_subpage allocation.

Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Incorrect print by DAX driver when old driver API is used
Sanath Kumar [Mon, 10 Apr 2017 20:42:38 +0000 (15:42 -0500)]
sparc64: Incorrect print by DAX driver when old driver API is used

Orabug: 25835133

If an old dax driver API is used, it is the driver API version that
is old, irrespective of the libdax version, and the print should say
so.

Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: DAX request to dequeue half of a long CCB should not succeed
Sanath Kumar [Wed, 5 Apr 2017 04:56:49 +0000 (23:56 -0500)]
sparc64: DAX request to dequeue half of a long CCB should not succeed

Orabug: 25827254

When a dequeue call is made to the driver such that the last CCB in
the request is the first half of a long CCB, the driver does not
report an error, yet it releases the BIP buffer for the CCBs dequeued
up to that point. This was detected by ioctl_test23, written by
Stanislav specifically to validate the above scenario.

This fix addresses both of the above issues.

Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: dax_overflow_check reports incorrect data
Sanath Kumar [Tue, 4 Apr 2017 05:14:42 +0000 (00:14 -0500)]
sparc64: dax_overflow_check reports incorrect data

Orabug: 25820395

The range reported for a page overflow is incorrect when
the address is page_size aligned. Add 1 to the address so
that it always reports the next page_size boundary.
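
Roughly, the boundary computation becomes (function name illustrative):

    /* Adding 1 first keeps an already-aligned address from being
     * reported as its own boundary; the result is always the next
     * page_size boundary.
     */
    unsigned long next_boundary(unsigned long addr, unsigned long page_size)
    {
            return (addr + 1 + page_size - 1) & ~(page_size - 1);
    }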

Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Ignored DAX ref count causes lockup
Rob Gardner [Mon, 10 Apr 2017 19:24:48 +0000 (13:24 -0600)]
sparc64: Ignored DAX ref count causes lockup

Orabug: 25870705

The dax_mm structure has a reference count that represents the number
of dax_vma structures that point to it. The reference count is duly
incremented and decremented each time memory is allocated via the
dax_alloc/mmap path. However, the reference count is never used for
its intended purpose, which is to prevent the dax_mm structure from
being freed while there are references to it. The result is that if
dax_free is called after the process's dax_mm is cleaned up, the
dax_vma will hold a reference to the freed object, leading to panics
due to null pointers and/or lockups due to invalid spinlock state.
The code is changed to actually check the reference count before
freeing a dax_mm.
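
The intended pattern, sketched with assumed field and helper names:

    /* Drop one dax_vma reference; free the dax_mm only when the
     * last reference goes away.
     */
    static void dax_mm_put(struct dax_mm *dm)
    {
            if (atomic_dec_and_test(&dm->vma_count))
                    kfree(dm);
    }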

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: disable dax page range checking on RA
Rob Gardner [Wed, 5 Apr 2017 18:33:01 +0000 (12:33 -0600)]
sparc64: disable dax page range checking on RA

Orabug: 25820812

This fix addresses a hypervisor f/w change which affects T7 machines
with the 3.0 cpu and T8 machines. Newer HV f/w pays attention to the
page size field in real addresses. This is not a bug; I asked them to
implement this long ago. A value of zero in this field means 8K, so
RA buffers are restricted to one page when running on newer firmware.
The PRM defines a page size of 0xF to mean "disable page range
checking", so we set this value for all real addresses. Old f/w will
overwrite this field anyway, so it is safe for us to always set it.
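
Schematically (the field position shown is an assumption for
illustration; the PRM defines the actual layout):

    #define RA_PGSZ_SHIFT   60                      /* position assumed */
    #define RA_PGSZ_MASK    (0xfUL << RA_PGSZ_SHIFT)

    /* 0xF disables page range checking; old f/w overwrites it. */
    ccb->ra = (ra & ~RA_PGSZ_MASK) | RA_PGSZ_MASK;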

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Oracle Data Analytics Accelerator (DAX) driver
Sanath Kumar [Wed, 29 Mar 2017 17:46:42 +0000 (12:46 -0500)]
sparc64: Oracle Data Analytics Accelerator (DAX) driver

Orabug: 23072809

DAX is a coprocessor which resides on the SPARC M7 processor chip, and
has direct access to the CPU's L3 caches as well as physical
memory. It can perform several operations on data streams with
various input and output formats.  The driver is merely a transport
mechanism and does not have knowledge of the various opcodes and data
formats. A user space library provides high level services and
translates these into low level commands which are then passed into
the driver and subsequently the hypervisor and the coprocessor.

See Documentation/sparc/dax.txt for more details.

Reviewed-by: Jonathan Helman <jonathan.helman@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: David Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Stanislav Kholmanskikh <stanislav.kholmanskikh@oracle.com>
Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: fix an issue when trying to bring hotplug cpus online
Dave Aldridge [Fri, 24 Mar 2017 12:01:00 +0000 (06:01 -0600)]
sparc64: fix an issue when trying to bring hotplug cpus online

When booting the kernel with maxcpus= on the command line and
subsequently trying to bring cpus online using:

echo 1 > /sys/devices/system/cpu/cpu[n]/online

messages of the form:

ldom_startcpu_cpuid: sun4v_cpu_start() gives error 6
Processor[n] is stuck

were being reported on the console.

This commit fixes this issue by ensuring that any cpus not
booted during initial kernel boot are explicitly stopped.
This then allows them to be successfully brought online later
if required.
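
A sketch of the idea, assuming the sun4v cpu_stop hypervisor call is
what is used to park the unbooted cpus:

    /* Explicitly stop any cpu that was never booted so a later
     * online attempt starts it from a clean state.
     */
    for_each_present_cpu(cpu) {
            if (!cpu_online(cpu))
                    sun4v_cpu_stop(cpu);
    }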

Orabug: 25667277

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Fix memory corruption when THP is enabled
Nitin Gupta [Mon, 3 Apr 2017 22:32:28 +0000 (15:32 -0700)]
sparc64: Fix memory corruption when THP is enabled

The memory corruption was happening due to incorrect
TLB/TSB flushing of hugepages.

Orabug: 25704426

Reported-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Fix address range for page table free Orabug: 25704426
Nitin Gupta [Fri, 24 Mar 2017 20:06:22 +0000 (13:06 -0700)]
sparc64: Fix address range for page table free

Orabug: 25704426

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Add support for 2G hugepages
Nitin Gupta [Fri, 24 Feb 2017 00:54:20 +0000 (16:54 -0800)]
sparc64: Add support for 2G hugepages

Orabug: 25704426

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Fix size check in huge_pte_alloc
Nitin Gupta [Fri, 3 Mar 2017 19:29:20 +0000 (11:29 -0800)]
sparc64: Fix size check in huge_pte_alloc

Orabug: 25704426

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Fix build error in flush_tsb_user_page
Nitin Gupta [Fri, 3 Mar 2017 19:04:00 +0000 (11:04 -0800)]
sparc64: Fix build error in flush_tsb_user_page

Patch "sparc64: Add 64K page size support"
unconditionally used __flush_huge_tsb_one_entry()
which is available only when hugetlb support is
enabled.

Another issue was incorrect TSB flushing for 64K
pages in flush_tsb_user().

Orabug: 25704426

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Add 64K page size support
Nitin Gupta [Wed, 1 Mar 2017 20:23:38 +0000 (12:23 -0800)]
sparc64: Add 64K page size support

Orabug: 25704426

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Remove xl-hugepages and add multi-page size
Allen Pais [Tue, 18 Apr 2017 08:10:44 +0000 (13:40 +0530)]
sparc64: Remove xl-hugepages and add multi-page size support

With several fixes from Bob Picco.

Orabug: 25704426

Signed-off-by: Nitin Gupta <nitin.m.gupta@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: do not dequeue stale VDS IO work entries
Jag Raman [Wed, 22 Mar 2017 19:36:01 +0000 (15:36 -0400)]
sparc64: do not dequeue stale VDS IO work entries

This change ensures that stale VDS IO work entries (ones that are
marked to be dropped) are not dequeued.

Orabug: 25455138

Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Jack Schwartz <jack.schwartz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoSPARC64: Virtual Disk Device (vdsdev) Read-Only Option (options=ro) not working
George Kennedy [Wed, 22 Mar 2017 17:39:32 +0000 (13:39 -0400)]
SPARC64: Virtual Disk Device (vdsdev) Read-Only Option (options=ro) not working

Add read-only (options=ro) support to virtual disk server (VDS).

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-by: Alexandre Chartre <Alexandre.Chartre@oracle.com>
Orabug: 23623853
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoarch/sparc: Fix FPU register corruption with AES crypto test on M7
Babu Moger [Tue, 7 Mar 2017 22:19:55 +0000 (14:19 -0800)]
arch/sparc: Fix FPU register corruption with AES crypto test on M7

We noticed crypto key corruption while running AES crypto tests
with M7 memcpy changes. Investigating further, we found that this
was the same problem reported previously with NG4memcpy. The commit
f4da3628dc7c ("sparc64: Fix FPU register corruption with AES crypto
offload") fixes the problem. Ported these changes to M7memcpy.

Orabug: 25265878

Signed-off-by: Babu Moger <babu.moger@oracle.com>
Tested-by: Stanislav Kholmanskikh <stanislav.kholmanskikh@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: xoff not needed when removing port link
Shannon Nelson [Tue, 14 Mar 2017 17:24:43 +0000 (10:24 -0700)]
sunvnet: xoff not needed when removing port link

The sunvnet netdev is connected to the controlling ldom's vswitch
for network bridging.  However, for higher performance between ldoms,
there is also a channel between each client ldom.  These connections
are represented in the sunvnet driver by a queue for each ldom.  The
driver uses select_queue to tell the stack which queue to use by
tracking the mac addresses on the other end of each port.  When a
connected ldom shuts down, the driver receives an LDC_EVENT_RESET and
the port is removed from the driver, so a queue with no ldom on the
other end will never be selected for Tx.

The driver was trying to reinforce the "don't use this queue" notion with
netif_tx_stop_queue() and netif_tx_wake_queue(), which really should only
be used to signal a Tx queue is full (aka XOFF).  This misuse of queue
state resulted in NETDEV WATCHDOG messages and lots of unnecessary calls
into the driver's tx_timeout handler.  Simply removing these takes care
of the problem.

Orabug: 25190537

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit 9c5a3a1f9388100d4b03e85faf0cce8264985302)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: count multicast packets
Shannon Nelson [Tue, 14 Mar 2017 17:24:42 +0000 (10:24 -0700)]
sunvnet: count multicast packets

Make sure multicast packets get counted in the device.

Orabug: 25190537

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit b12a96f5cd04583f45a1b6554b8f3786b26db913)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: track port queues correctly
Shannon Nelson [Tue, 14 Mar 2017 17:24:41 +0000 (10:24 -0700)]
sunvnet: track port queues correctly

Track our used and unused queue indices correctly.  Otherwise, as
ports dropped out and returned, they all eventually ended up with the
same queue index.
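
One way to do this, sketched with assumed names, is a bitmap of
in-use indices rather than a counter that drifts as ports come and go:

    int i = find_first_zero_bit(vp->used_qidx, VNET_MAX_TXQS);

    if (i < VNET_MAX_TXQS) {
            __set_bit(i, vp->used_qidx);
            port->q_index = i;
    }

    /* ... and on port removal: */
    __clear_bit(port->q_index, vp->used_qidx);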

Orabug: 25190537

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit e1f1e5f711265ee9d881afd12ff252b2d01e1174)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: add stats to track ldom to ldom packets and bytes
Shannon Nelson [Tue, 14 Mar 2017 17:24:40 +0000 (10:24 -0700)]
sunvnet: add stats to track ldom to ldom packets and bytes

In this driver, there is a "port" created for the connection to each of
the other ldoms; a netdev queue is mapped to each port, and they are
collected under a single netdev.  The generic netdev statistics show
us all the traffic in and out of our network device, but don't show
individual queue/port stats.  This patch breaks out the traffic counts
for the individual ports and gives us a little view into the state of
those connections.

Orabug: 25190537

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit 0f512c84544b9a8f8de53b6f4bc0c372c45d8693)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoldmvsw: better use of link up and down on ldom vswitch
Shannon Nelson [Tue, 14 Mar 2017 17:24:39 +0000 (10:24 -0700)]
ldmvsw: better use of link up and down on ldom vswitch

When an ldom VM is bound, the network vswitch infrastructure is set up for
it, but was being forced 'UP' by the userland switch configuration script.
When 'UP' but not actually connected to a running VM, the ipv6 neighbor
probes fail (not a horrible thing) and start cluttering up the kernel logs.
Funny thing: these are debug messages that never actually show up, but
we do see the net_ratelimited messages that say N callbacks were
suppressed.

This patch defers the netif_carrier_on() until an actual link has been
established with the VM, as indicated by receiving an LDC_EVENT_UP from
the underlying LDC protocol.  Similarly, we take the link down when we
see the LDC_EVENT_RESET.  Now when we see the ndo_open(), we reset the
link to get things talking again.

Orabug: 25525312

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry-picked from commit 867fa150f8f7ee6e9e5a9ab768e2d0dc675a968b)

Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoMerge branch topic/uek-4.1/sparc of git://ca-git.us.oracle.com/linux-uek into uek...
Chuck Anderson [Thu, 6 Apr 2017 07:17:44 +0000 (00:17 -0700)]
Merge branch topic/uek-4.1/sparc of git://ca-git.us.oracle.com/linux-uek into uek/uek-4.1

* topic/uek-4.1/sparc: (28 commits)
  megaraid: Fix unaligned warning
  sparc64: Restrict number of processes
  SPARC64: vds_blk_rw() does not handle drives with q->limits.chunk_sectors > 0
  sparc64: Improve boot time by per cpu map update
  arch/sparc: memblock resizes are not handled properly
  SPARC64: LDOM vnet "Got unexpected MCAST reply"
  ldmvsw: disable tso and gso for bridge operations
  ldmvsw: update and simplify version string
  sunvnet: remove extra rcu_read_unlocks
  sunvnet: straighten up message event handling logic
  sunvnet: add memory barrier before check for tx enable
  sunvnet: update version and version printing
  sunvnet: remove unused variable in maybe_tx_wakeup
  sunvnet: make sunvnet common code dynamically loadable
  hwrng: n2 - update version info
  hwrng: n2 - support new hardware register layout
  hwrng: n2 - add device data descriptions
  hwrng: n2 - limit error spewage when self-test fails
  hwrng: n2 - Attach on T5/M5, T7/M7 SPARC CPUs
  tcp: fix tcp_fastopen unaligned access complaints on sparc
  ...

8 years agomegaraid: Fix unaligned warning
Allen Pais [Wed, 29 Mar 2017 08:51:58 +0000 (14:21 +0530)]
megaraid: Fix unaligned warning

The MegaRAID userland descriptor structures do not properly align
pointers on their natural boundaries. This causes warnings to be issued
when storcli or the SNMP daemon are in use.

Quiesce the warning until the user-kernel interface has been fixed.

Orabug: 24817799

Signed-off-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
(cherry picked from commit 069af368ac74dc0130f91836b9f85f7cd5b18749)
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Restrict number of processes
Sanath Kumar [Fri, 9 Dec 2016 20:26:09 +0000 (14:26 -0600)]
sparc64: Restrict number of processes

Orabug: 24523680

If the number of processes exceeds the number of supported context
IDs, then random segfaults are seen in user programs.

The data collected when debugging this bug showed that two processes
with the same context ID were present in the TLB shared between them,
resulting in incorrect translations. Since the bug occurs after the
kernel hits the max context ID supported by the processor, the
context wraparound code found in get_new_mmu_context(...) and
smp_new_mmu_context_version_client(...) is under suspicion.

The plan is that this will get fixed when we implement "context domain"
feature for sparc in the kernel.  For now this patch temporarily
restricts the number of processes allowed by the kernel based on the
number of context IDs supported by the processor. This way we never reach
that condition of having incorrect translations.

For non-root users, fork will fail if the number of existing
processes is greater than (max_user_nctx - 100), where max_user_nctx
is the maximum number of context IDs supported by the processor. For
the root user, fork will fail only when the number of existing
processes reaches max_user_nctx. The extra context IDs are reserved
for root to recover the system if users reach their limit and cannot
recover it themselves (i.e. cannot even execute 'kill' to reduce the
number of processes).
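
A sketch of the check, with max_user_nctx and the helper name assumed
and the reserve size taken from the description above:

    /* Fail fork when the process count nears the number of hardware
     * context IDs; root keeps a 100-context reserve for recovery.
     */
    static bool sparc_ctx_fork_allowed(void)
    {
            unsigned long nproc = nr_processes();

            if (capable(CAP_SYS_ADMIN))
                    return nproc < max_user_nctx;
            return nproc < max_user_nctx - 100;
    }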

Signed-off-by: Sanath Kumar <sanath.s.kumar@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoSPARC64: vds_blk_rw() does not handle drives with q->limits.chunk_sectors > 0
George Kennedy [Wed, 15 Feb 2017 20:17:58 +0000 (15:17 -0500)]
SPARC64: vds_blk_rw() does not handle drives with q->limits.chunk_sectors > 0

Drives with q->limits.chunk_sectors > 0 are not properly handled by
vds_blk_rw(). Drives such as NVMe set chunk_sectors to indicate a
performance boundary (see the call to blk_queue_chunk_sectors() in
nvme_alloc_ns()). Currently, when vds_blk_rw() calls bio_add_page()
and the chunk_sectors boundary would be crossed, bio_add_page()
returns zero and vds_blk_rw() fails with -EIO.

The fix adds a check for when bio_add_page() returns zero: if
bio->bi_iter.bi_size != 0, a page or pages have already been added
successfully by bio_add_page(). When this condition is hit, exit the
for loop in vds_blk_rw(), submit the outstanding IOs, and continue.
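
The added condition, roughly (simplified from the description above):

    if (bio_add_page(bio, page, len, off) == 0) {
            if (bio->bi_iter.bi_size != 0) {
                    /* Pages were already added: submit the
                     * outstanding IO and continue with a new bio.
                     */
                    break;
            }
            return -EIO;    /* nothing could be added at all */
    }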

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-By: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-By: Liam Merwick <Liam.Merwick@oracle.com>
Orabug: 25373818
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Improve boot time by per cpu map update
Atish Patra [Mon, 13 Feb 2017 02:32:58 +0000 (19:32 -0700)]
sparc64: Improve boot time by per cpu map update

Currently, smp_fill_in_sib_core_maps is invoked during cpu_up to set
up all the core/sibling maps correctly. This takes O(n^2) time, as it
iterates over all the online cpus twice each time a cpu comes online,
increasing smp_init() execution time quadratically and leading to a
higher boot time.

Optimize the code path by comparing only the current cpu with the
online cpus and setting the maps for both cpus simultaneously. Take
this opportunity to merge all three for loops into one as well. Here
is the smp_init() time before and after the fix.

Number of cpus    Before fix    After fix
512               2.30s         0.283s
1024              14.23s        0.493s
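
A sketch of the O(n) version, with names borrowed loosely from the
sparc64 code (not the literal patch):

    /* Compare only the incoming cpu against each online cpu and
     * update the sibling maps in both directions at once.
     */
    for_each_online_cpu(i) {
            if (cpu_data(i).proc_id == cpu_data(cpu).proc_id) {
                    cpumask_set_cpu(cpu, &cpu_core_sib_map[i]);
                    cpumask_set_cpu(i, &cpu_core_sib_map[cpu]);
            }
    }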

Orabug: 25496463

Signed-off-by: Atish Patra <atish.patra@oracle.com>
Reviewed-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoarch/sparc: memblock resizes are not handled properly
Pavel Tatashin [Wed, 8 Feb 2017 21:02:39 +0000 (16:02 -0500)]
arch/sparc: memblock resizes are not handled properly

In add_node_ranges(), when a memblock resize happens, the iterator
keeps using the previously freed array. This bug causes hangs on
machines with more than 128 memory blocks during boot, for example
machines where memory interleaving is small. The problem is seen on
T4-4 because it can have 2T of memory, and memory is interleaved at
8G, so we have 2T/8G = 256 regions in which to set node IDs. The
starting size of the regions array is 128, so we have to double it at
least once (actually twice, because some memory is already reserved
and thus we need more than 256 regions). We start using an incorrect
pointer to the array after the first doubling.
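
The safe pattern is to re-derive the element pointer on every pass
instead of caching one across a possible resize; a sketch:

    /* memblock_double_array() may free and reallocate regions[],
     * so index the array afresh each iteration.
     */
    for (i = 0; i < memblock.memory.cnt; i++) {
            struct memblock_region *r = &memblock.memory.regions[i];

            /* ... assign node id; this step may resize the array ... */
    }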

Orabug: 25415396

Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Babu Moger <babu.moger@oracle.com>
Reviewed-by: Babu Moger <babu.moger@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoSPARC64: LDOM vnet "Got unexpected MCAST reply"
George Kennedy [Wed, 8 Feb 2017 01:37:37 +0000 (20:37 -0500)]
SPARC64: LDOM vnet "Got unexpected MCAST reply"

Handle an unexpected MCAST reply as a debug warning, the same as is
done in Solaris 12.  Please see bug 24954702 for details.

Signed-off-by: George Kennedy <george.kennedy@oracle.com>
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Orabug: 24954702
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoldmvsw: disable tso and gso for bridge operations
Shannon Nelson [Mon, 13 Feb 2017 18:57:04 +0000 (10:57 -0800)]
ldmvsw: disable tso and gso for bridge operations

The ldmvsw driver specifically supports ldom virtual networking by
running in the primary ldom and using the LDC to connect the
remaining ldoms to the outside world via a bridge.  With TSO and GSO
supported while connected to the bridge, things tend to misbehave, as
seen in our case by delayed packets, enough to begin triggering
retransmits and affecting overall throughput.  By turning off
advertised support for TSO and GSO we restore stable traffic flow
through the bridge.

Orabug: 23293104

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bc221a34ac473b444a7cfdd0c152b4c71f79326b)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agoldmvsw: update and simplify version string
Shannon Nelson [Mon, 13 Feb 2017 18:57:03 +0000 (10:57 -0800)]
ldmvsw: update and simplify version string

New version and simplify the print code.

Orabug: 23293104

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 7602011f59cc32ebc3a5f9058d6ba11b096c8c50)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: remove extra rcu_read_unlocks
Shannon Nelson [Mon, 13 Feb 2017 18:57:02 +0000 (10:57 -0800)]
sunvnet: remove extra rcu_read_unlocks

The RCU read lock is grabbed first thing in sunvnet_start_xmit_common()
so it always needs to be released.  This removes the conditional release
in the dropped packet error path and removes a couple of superfluous
calls in the middle of the code.
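
The resulting shape of the exit paths, sketched (the port-lookup name
is assumed):

    rcu_read_lock();
    port = tx_port_find(vp, skb);           /* name assumed */
    if (unlikely(!port))
            goto out_drop;
    /* ... queue the packet ... */
    rcu_read_unlock();
    return NETDEV_TX_OK;

    out_drop:
    dev->stats.tx_dropped++;
    rcu_read_unlock();      /* the one unconditional release */
    return NETDEV_TX_OK;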

Orabug: 23293104

Reported-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit daa86e50f649fccadafc53994ddc4254d75a008b)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: straighten up message event handling logic
Shannon Nelson [Mon, 13 Feb 2017 18:57:01 +0000 (10:57 -0800)]
sunvnet: straighten up message event handling logic

The use of gotos for handling the incoming events made this code
harder to read and support than it should be.  This patch straightens
out and clears up the logic.

Orabug: 23293104

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit bf091f3f362b3c562a18bbf7a2d3e2f3a36eba1d)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: add memory barrier before check for tx enable
Shannon Nelson [Mon, 13 Feb 2017 18:57:00 +0000 (10:57 -0800)]
sunvnet: add memory barrier before check for tx enable

In order to allow the underlying LDC and outstanding memory operations
to potentially catch up with the driver's Tx requests, add a memory
barrier before checking again for available tx descriptors.
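
Schematically (the barrier flavor and the dring-availability helper
shown here are illustrative):

    /* Order prior descriptor reads/writes against the re-check
     * for free tx slots.
     */
    smp_mb();
    if (vnet_tx_dring_avail(dr))
            netif_tx_wake_queue(txq);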

Orabug: 23293104

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit fd263fb6e718c5bdf35cbc1de4f781c71794d2a4)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: update version and version printing
Shannon Nelson [Mon, 13 Feb 2017 18:56:59 +0000 (10:56 -0800)]
sunvnet: update version and version printing

There have been several changes since the first version of this code, so
we bump the version number.  While we're at it, we can simplify the
version printing a bit and drop a couple lines of code.

Orabug: 23293104

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit f2f3e210bffe5c8f8b30d0b0c7b0f733ff5db334)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: remove unused variable in maybe_tx_wakeup
Sowmini Varadhan [Mon, 13 Feb 2017 18:56:58 +0000 (10:56 -0800)]
sunvnet: remove unused variable in maybe_tx_wakeup

The vio_dring_state *dr variable is unused in maybe_tx_wakeup().
As the comments indicate, we call maybe_tx_wakeup() whenever we
get a STOPPED LDC message on the port. If the queue is stopped,
we want to wake it up so that we will send another START message
at the next TX and trigger the consumer to drain the dring.

Orabug: 23293104

Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit d4aa89cc2bbe021722c946eb11b21ebb0f13c825)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosunvnet: make sunvnet common code dynamically loadable
Shannon Nelson [Mon, 13 Feb 2017 18:56:57 +0000 (10:56 -0800)]
sunvnet: make sunvnet common code dynamically loadable

When the sunvnet_common code was split out for use by both sunvnet
and the newer ldmvsw, it was made into a static kernel library, which
limits the usefulness of sunvnet and ldmvsw as loadables, since most
of the real work is being done in the shared code.  Also, this is
simply dead code in kernels that aren't running LDoms.

This patch makes the sunvnet_common into a dynamically loadable
module and makes sunvnet and ldmvsw dependent on sunvnet_common.

Orabug: 23293104

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 2493b842f258e14938f278e44ecc26970dfabbf0)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agohwrng: n2 - update version info
Shannon Nelson [Thu, 12 Jan 2017 18:52:49 +0000 (10:52 -0800)]
hwrng: n2 - update version info

Orabug: 25127795

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 0ff1436fb2e3da085f7177d03ce4362c45b75d57)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agohwrng: n2 - support new hardware register layout
Shannon Nelson [Thu, 12 Jan 2017 18:52:48 +0000 (10:52 -0800)]
hwrng: n2 - support new hardware register layout

Add the new register layout constants and the requisite logic
for using them.

Orabug: 25127795

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit 07e25d43be8502bd8ab6122c4f6449ebf30e98f7)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agohwrng: n2 - add device data descriptions
Shannon Nelson [Thu, 12 Jan 2017 18:52:47 +0000 (10:52 -0800)]
hwrng: n2 - add device data descriptions

Since we're going to need to keep track of more than just one
attribute of the hardware, we'll change the use of the data field
from the match struct from a single flag to a struct pointer.
This patch adds the struct template and initial descriptions.

Orabug: 25127795

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit becbc4940ad8e8ff560e1ceee33d9bb4fe4c9225)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agohwrng: n2 - limit error spewage when self-test fails
Shannon Nelson [Thu, 12 Jan 2017 18:52:46 +0000 (10:52 -0800)]
hwrng: n2 - limit error spewage when self-test fails

If the self-test fails, it probably won't actually suddenly
start working.  Currently, this causes an endless spew of
error messages on the console and in the logs, so this patch
adds a limiter to the test.

Orabug: 25127795

Reported-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit db602a7f940a71870c17e39bcbe4e4d7a4a8273e)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agohwrng: n2 - Attach on T5/M5, T7/M7 SPARC CPUs
Anatoly Pugachev [Mon, 25 Jan 2016 21:19:02 +0000 (00:19 +0300)]
hwrng: n2 - Attach on T5/M5, T7/M7 SPARC CPUs

n2rng: Attach on T5/M5, T7/M7 SPARC CPUs

(space to tab fixes after variable names)

Orabug: 25127795

Signed-off-by: Anatoly Pugachev <matorola@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
(cherry picked from commit c1e9b3b0eea12899b7749571af21cc60822cf2b6)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agotcp: fix tcp_fastopen unaligned access complaints on sparc
Shannon Nelson [Thu, 12 Jan 2017 22:24:58 +0000 (14:24 -0800)]
tcp: fix tcp_fastopen unaligned access complaints on sparc

Fix up a data alignment issue on sparc by swapping the order
of the cookie byte array field with the length field in
struct tcp_fastopen_cookie, and making it a proper union
to clean up the typecasting.

This addresses log complaints like these:
    log_unaligned: 113 callbacks suppressed
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
    Kernel unaligned access at TPC[9764ac] tcp_try_fastopen+0x2ec/0x360
    Kernel unaligned access at TPC[9764c8] tcp_try_fastopen+0x308/0x360
    Kernel unaligned access at TPC[9764e4] tcp_try_fastopen+0x324/0x360
    Kernel unaligned access at TPC[976490] tcp_try_fastopen+0x2d0/0x360
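
After the change, the structure looks roughly like this (reconstructed
from the description; see the upstream commit for the exact layout):

    struct tcp_fastopen_cookie {
            union {
                    u8              val[TCP_FASTOPEN_COOKIE_MAX];
    #if IS_ENABLED(CONFIG_IPV6)
                    struct in6_addr addr;
    #endif
            };
            s8      len;    /* length now follows the aligned array */
            bool    exp;    /* experimental option format */
    };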

Orabug: 25163405

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(cherry picked from commit 003c941057eaa868ca6fedd29a274c863167230d)
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agovds: Add physical block support
Liam R. Howlett [Tue, 31 Jan 2017 20:34:08 +0000 (12:34 -0800)]
vds: Add physical block support

Version 1.2 of the virtual IO device protocol added physical block
support.  Start sending the underlying physical block device size.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Rob Gardner <rob.gardner@oracle.com>
Orabug: 19420123
Signed-off-by: Allen Pais <allen.pais@oracle.com>
8 years agosparc64: Add missing hardware capabilities for M7
Dave Aldridge [Fri, 17 Feb 2017 15:00:38 +0000 (07:00 -0800)]
sparc64: Add missing hardware capabilities for M7

Some M7 hardware capabilities were not being reported
correctly. This commit fixes the issue by adding definitions
for all the missing capabilities from both the Machine
Descriptor and the Compatibility Feature Register.

Orabug: 25555746

Signed-off-by: Dave Aldridge <david.j.aldridge@oracle.com>
Reviewed-by: Shannon Nelson <shannon.nelson@oracle.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Signed-off-by: Allen Pais <allen.pais@oracle.com>