Sunil Mushran [Wed, 22 Jun 2011 23:56:27 +0000 (16:56 -0700)]
ocfs2/cluster: Cluster up now includes network connections too
The cluster up check only checks to see if the node is heartbeating or not.
If yes it continues assuming that the node is connected to all the nodes. But
if that is not the case, the cluster join aborts with a stack of errors that
are not easy to comprehend.
This patch adds the network connect check upfront and prints the nodes that
the node is not yet connected to, before aborting.
Sunil Mushran [Mon, 20 Jun 2011 17:07:08 +0000 (10:07 -0700)]
ocfs2/cluster: Abort heartbeat start on hard-ro devices
Currently if the heartbeat device is hard-ro, the o2hb thread keeps chugging
along and dumping errors along the way. The user needs to manually stop the
heartbeat.
The patch addresses this shortcoming by adding a limit to the number of times
the hb thread will iterate in an unsteady state. If the hb thread does not
ready steady state in that many interation, the start is aborted.
Maxim Uvarov [Tue, 27 Mar 2012 23:44:21 +0000 (16:44 -0700)]
remove unused mutex hpidebuglock
Remove unused mutex and clean up following warnings
drivers/net/hxge/hpi/hpi.c:35: warning: data definition has no type or storage class
drivers/net/hxge/hpi/hpi.c:35: warning: type defaults to "int" in declaration of "DECLARE_MUTEX"
drivers/net/hxge/hpi/hpi.c:35: warning: parameter names (without types) in function declaration
This patch is part 1 of 3 patches that rolls back changes that appear
to have broken delete. Those patches were originally added to fix a deadlock
with quotas enabled. Considering we are not enabling quotas in OVM3, it is
ok to rollback these patches. We will have to solve the deadlock in another
way.
Sunil Mushran [Wed, 22 Jun 2011 23:56:20 +0000 (16:56 -0700)]
ocfs2/cluster: Add new function o2net_fill_node_map()
Patch adds function o2net_fill_node_map() to return the bitmap of nodes that
it is connected to. This bitmap is also accessible by the user via the debugfs
file, /sys/kernel/debug/o2net/connected_nodes.
Sunil Mushran [Tue, 8 Nov 2011 21:00:19 +0000 (13:00 -0800)]
ocfs2: Tighten free bit calculation in the global bitmap
When clearing bits in the global bitmap, we do not test the current bit value.
This patch tightens the code by considering the possiblity that the bit being
cleared was already cleared.
Now this should not happen. But we are seeing stray instances in which free
bit count in the global bitmap exceeds the total bit count. In each instance
the bitmap is correct. Only the free bit count is incorrect.
This patch checks the current bit value and increments the free bit count
only if the bit was previously set. It also prints information to allow
us to debug further.
Dave Kleikamp [Thu, 22 Mar 2012 22:19:44 +0000 (17:19 -0500)]
btrfs: btrfs_direct_IO_bvec() needs to check for sector alignment
I left out a vital piece of btrfs_direct_IO_bvec(). I didn't think
check_direct_IO() was needed, but it is. This patch creates its
equivalent, check_direct_IO_bvec().
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
Merge branch 'stable/not-upstreamed' into uek2-merge
* stable/not-upstreamed:
Xen: Export host physical CPU information to dom0
xen/mce: Change the machine check point
Add mcelog support from xen platform
mismerged the Makefile. The end result that the line
obj-$(CONFIG_XEN_PVHVM) += platform_pci.o was removed
leading to no PVonHVM driver to tell QEMU to unplug the device.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
x86/PCI: reduce severity of host bridge window conflict warnings
Host bridge windows are top-level resources, so if we find a host bridge
window conflict, it's probably with a hard-coded legacy reservation.
Moving host bridge windows is theoretically possible, but we don't support
it; we just ignore windows with conflicts, and it's not worth making this
a user-visible error.
Jan Kara [Mon, 15 Aug 2011 17:09:33 +0000 (10:09 -0700)]
ocfs2: Avoid livelock in ocfs2_readpage()
[Pulled before push to mainline]
When someone writes to an inode, readers accessing the same inode via
ocfs2_readpage() just busyloop trying to get ip_alloc_sem because
do_generic_file_read() looks up the page again and retries ->readpage()
when previous attempt failed with AOP_TRUNCATED_PAGE. When there are enough
readers, they can occupy all CPUs and in non-preempt kernel the system is
deadlocked because writer holding ip_alloc_sem is never run to release the
semaphore. Fix the problem by making reader block on ip_alloc_sem to break
the busy loop.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Joel Becker <jlbec@evilplan.org>
Mark Fasheh [Mon, 15 Aug 2011 17:08:44 +0000 (10:08 -0700)]
ocfs2: serialize unaligned aio
[Pulled before push to mainline]
Fix a corruption that can happen when we have (two or more) outstanding
aio's to an overlapping unaligned region. Ext4
(e9e3bcecf44c04b9e6b505fd8e2eb9cea58fb94d) and xfs recently had to fix
similar issues.
In our case what happens is that we can have an outstanding aio on a region
and if a write comes in with some bytes overlapping the original aio we may
decide to read that region into a page before continuing (typically because
of buffered-io fallback). Since we have no ordering guarantees with the
aio, we can read stale or bad data into the page and then write it back out.
If the i/o is page and block aligned, then we avoid this issue as there
won't be any need to read data from disk.
I took the same approach as Eric in the ext4 patch and introduced some
serialization of unaligned async direct i/o. I don't expect this to have an
effect on the most common cases of AIO. Unaligned aio will be slower
though, but that's far more acceptable than data corruption.
Signed-off-by: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
Tiger Yang [Mon, 15 Aug 2011 16:54:55 +0000 (09:54 -0700)]
ocfs2: Bugfix for hard readonly mount
[Pulled before push to mainline]
ocfs2 cannot currently mount a device that is readonly at the media
("hard readonly"). Fix the broken places.
see detail: http://oss.oracle.com/bugzilla/show_bug.cgi?id=1322
[ Description edited -- Joel ]
Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Reviewed-by: Sunil Mushran <sunil.mushran@oracle.com> Signed-off-by: Joel Becker <jlbec@evilplan.org>
Dave Kleikamp [Wed, 21 Mar 2012 20:06:42 +0000 (15:06 -0500)]
btrfs: create btrfs_file_write_iter()
Since btrfs has it's own aio_write function, it needs a similar
write_iter. This patch patterns btrfs_file_write_iter on
btrfs_file_aio_write, then has the latter call the former.
Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
When the system is under heavy load, there can be a significant delay
between the getscl() and time_after() calls inside sclhi(). That delay
may cause the time_after() check to trigger after SCL has gone high,
causing sclhi() to return -ETIMEDOUT.
To fix the problem, double check that SCL is still low after the
timeout has been reached, before deciding to return -ETIMEDOUT.
Signed-off-by: Ville Syrjala <syrjala@sci.fi> Signed-off-by: Jean Delvare <khali@linux-fr.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
NCT6775F and NCT6776F have their own set of registers for FAN_STOP_TIME. The
correct registers were used to read FAN_STOP_TIME, but writes used the wrong
registers. Fix it.
Newer version of binutils are more strict about specifying the
correct options to enable certain classes of instructions.
The sparc32 build is done for v7 in order to support sun4c systems
which lack hardware integer multiply and divide instructions.
So we have to pass -Av8 when building the assembler routines that
use these instructions and get patched into the kernel when we find
out that we have a v8 capable cpu.
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When pre-allocating skbs for received packets, we set ip_summed =
CHECKSUM_UNNCESSARY. We used to change it back to CHECKSUM_NONE when
the received packet had an incorrect checksum or unhandled protocol.
Commit bc8acf2c8c3e43fcc192762a9f964b3e9a17748b ('drivers/net: avoid
some skb->ip_summed initializations') mistakenly replaced the latter
assignment with a DEBUG-only assertion that ip_summed ==
CHECKSUM_NONE. This assertion is always false, but it seems no-one
has exercised this code path in a DEBUG build.
Fix this by moving our assignment of CHECKSUM_UNNECESSARY into
efx_rx_packet_gro().
Signed-off-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch (as1519) fixes a bug in the block layer's disk-events
polling. The polling is done by a work routine queued on the
system_nrt_wq workqueue. Since that workqueue isn't freezable, the
polling continues even in the middle of a system sleep transition.
Obviously, polling a suspended drive for media changes and such isn't
a good thing to do; in the case of USB mass-storage devices it can
lead to real problems requiring device resets and even re-enumeration.
The patch fixes things by creating a new system-wide, non-reentrant,
freezable workqueue and using it for disk-events polling.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Then we unblock events, when they are suppose to be blocked. This can
trigger events related block/genhd.c warnings, but also can crash in
sd_check_events() or other places.
I'm able to reproduce crashes with the following scripts (with
connected usb dongle as sdb disk).
function stop_me()
{
for i in `jobs -p` ; do kill $i 2> /dev/null ; done
exit
}
trap stop_me SIGHUP SIGINT SIGTERM
for ((i = 0; i < 10; i++)) ; do
while true; do fdisk -l $DEV 2>&1 > /dev/null ; done &
done
while true ; do
echo 1 > $ENABLE
sleep 1
echo 0 > $ENABLE
done
</snip>
I use the script to verify patch fixing oops in sd_revalidate_disk
http://marc.info/?l=linux-scsi&m=132935572512352&w=2
Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in
sd_revalidate_disk" or this one, script easily crash kernel within
a few seconds. With both patches applied I do not observe crash.
Unfortunately after some time (dozen of minutes), script will hung in:
As drawback, I do not teardown on sysfs file create error, because I do
not know how to nullify disk->ev (since it can be used). However add_disk
error handling practically does not exist too, and things will work
without this sysfs file, except events will not be exported to user
space.
Since 2.6.39 (1196f8b), when a driver returns -ENOMEDIUM for open(),
__blkdev_get() calls rescan_partitions() to remove
in-kernel partition structures and raise KOBJ_CHANGE uevent.
However it ends up calling driver's revalidate_disk without open
and could cause oops.
In the case of SCSI:
process A process B
----------------------------------------------
sys_open
__blkdev_get
sd_open
returns -ENOMEDIUM
scsi_remove_device
<scsi_device torn down>
rescan_partitions
sd_revalidate_disk
<oops>
Oopses are reported here:
http://marc.info/?l=linux-scsi&m=132388619710052
This patch separates the partition invalidation from rescan_partitions()
and use it for -ENOMEDIUM case.
Signed-off-by: Axel Lin <axel.lin@gmail.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch adds support for the Sitecom LN-031 USB adapter with a AX88178 chip.
Added USB id to find correct driver for AX88178 1000 Ethernet adapter.
Signed-off-by: Joerg Neikes <j.neikes@midlandgate.de> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This driver attempts to use two TX rings but lacks proper support :
1) IRQ handler only takes care of TX completion on first TX ring
2) the stop/start logic uses the legacy functions (for non multiqueue
drivers)
This means all packets witk skb mark set to 1 are sent through high
queue but are never cleaned and queue eventualy fills and block the
device, triggering the infamous "NETDEV WATCHDOG" message.
Lets use a single TX ring to fix the problem, this driver is not a real
multiqueue one yet.
Minimal fix for stable kernels.
Reported-by: Thomas Meyer <thomas@m3y3r.de> Tested-by: Thomas Meyer <thomas@m3y3r.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jay Cliburn <jcliburn@gmail.com> Cc: Chris Snook <chris.snook@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When forwarding was set and a new net device is register,
we need add this device to the all-router mcast group.
Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This commit fixes tcp_shift_skb_data() so that it does not shift
SACKed data below snd_una.
This fixes an issue whose symptoms exactly match reports showing
tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at
net/ipv4/tcp_input.c:3418" thread on netdev).
Since 2008 (832d11c5cd076abc0aa1eaf7be96c81d1a59ce41)
tcp_shift_skb_data() had been shifting SACKed ranges that were below
snd_una. It checked that the *end* of the skb it was about to shift
from was above snd_una, but did not check that the end of the actual
shifted range was above snd_una; this commit adds that check.
Shifting SACKed ranges below snd_una is problematic because for such
ranges tcp_sacktag_one() short-circuits: it does not declare anything
as SACKed and does not increase sacked_out.
Before the fixes in commits cc9a672ee522d4805495b98680f4a3db5d0a0af9
and daef52bab1fd26e24e8e9578f8fb33ba1d0cb412, shifting SACKed ranges
below snd_una happened to work because tcp_shifted_skb() was always
(incorrectly) passing in to tcp_sacktag_one() an skb whose end_seq
tcp_shift_skb_data() had already guaranteed was beyond snd_una. Hence
tcp_sacktag_one() never short-circuited and always increased
tp->sacked_out in this case.
After those two fixes, my testing has verified that shifting SACKed
ranges below snd_una could cause tp->sacked_out to go negative with
the following sequence of events:
(1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una,
then shifts a prefix of that skb that is below snd_una
(2) tcp_shifted_skb() increments the packet count of the
already-SACKed prev sk_buff
(3) tcp_sacktag_one() sees the end of the new SACKed range is below
snd_una, so it short-circuits and doesn't increase tp->sacked_out
(5) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed,
decrements tp->sacked_out by this "inflated" pcount that was
missing a matching increase in tp->sacked_out, and hence
tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which casted
to s32 is negative.
(6) this leads to the warnings seen in the recent "WARNING: at
net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.:
tcp_input.c:3418 WARN_ON((int)tp->sacked_out < 0);
More generally, I think this bug can be tickled in some cases where
two or more ACKs from the receiver are lost and then a DSACK arrives
that is immediately above an existing SACKed skb in the write queue.
This fix changes tcp_shift_skb_data() to abort this sequence at step
(1) in the scenario above by noticing that the bytes are below snd_una
and not shifting them.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
otherwise source IPv6 address of ICMPV6_MGM_QUERY packet
might be random junk if IPv6 is disabled on interface or
link-local address is not yet ready (DAD).
Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
In tcp_mark_head_lost() we should not attempt to fragment a SACKed skb
to mark the first portion as lost. This is for two primary reasons:
(1) tcp_shifted_skb() coalesces adjacent regions of SACKed skbs. When
doing this, it preserves the sum of their packet counts in order to
reflect the real-world dynamics on the wire. But given that skbs can
have remainders that do not align to MSS boundaries, this packet count
preservation means that for SACKed skbs there is not necessarily a
direct linear relationship between tcp_skb_pcount(skb) and
skb->len. Thus tcp_mark_head_lost()'s previous attempts to fragment
off and mark as lost a prefix of length (packets - oldcnt)*mss from
SACKed skbs were leading to occasional failures of the WARN_ON(len >
skb->len) in tcp_fragment() (which used to be a BUG_ON(); see the
recent "crash in tcp_fragment" thread on netdev).
(2) there is no real point in fragmenting off part of a SACKed skb and
calling tcp_skb_mark_lost() on it, since tcp_skb_mark_lost() is a NOP
for SACKed skbs.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Shreyas N Bhatewara <sbhatewara@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
When tcp_shifted_skb() shifts bytes from the skb that is currently
pointed to by 'highest_sack' then the increment of
TCP_SKB_CB(skb)->seq implicitly advances tcp_highest_sack_seq(). This
implicit advancement, combined with the recent fix to pass the correct
SACKed range into tcp_sacktag_one(), caused tcp_sacktag_one() to think
that the newly SACKed range was before the tcp_highest_sack_seq(),
leading to a call to tcp_update_reordering() with a degree of
reordering matching the size of the newly SACKed range (typically just
1 packet, which is a NOP, but potentially larger).
This commit fixes this by simply calling tcp_sacktag_one() before the
TCP_SKB_CB(skb)->seq advancement that can advance our notion of the
highest SACKed sequence.
Correspondingly, we can simplify the code a little now that
tcp_shifted_skb() should update the lost_cnt_hint in all cases where
skb == tp->lost_skb_hint.
Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This patch fixes a (mostly cosmetic) bug introduced by the patch
'ppp: Use SKB queue abstraction interfaces in fragment processing'
found here: http://www.spinics.net/lists/netdev/msg153312.html
The above patch rewrote and moved the code responsible for cleaning
up discarded fragments but the new code does not catch every case
where this is necessary. This results in some discarded fragments
remaining in the queue, and triggering a 'bad seq' error on the
subsequent call to ppp_mp_reconstruct. Fragments are discarded
whenever other fragments of the same frame have been lost.
This can generate a lot of unwanted and misleading log messages.
This patch also adds additional detail to the debug logging to
make it clearer which fragments were lost and which other fragments
were discarded as a result of losses. (Run pppd with 'kdebug 1'
option to enable debug logging.)
Signed-off-by: Ben McKeegan <ben@netservers.co.uk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: Niccolò Belli <darkbasic@linuxsystems.it> Tested-by: Niccolò Belli <darkbasic@linuxsystems.it> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1. While function neigh_periodic_work scans the neighbor hash table
pointed by field tbl->nht, it unlocks and locks tbl->lock between
buckets in order to call cond_resched.
2. Assume that function neigh_periodic_work calls cond_resched, that is,
the lock tbl->lock is available, and function neigh_hash_grow runs.
3. Once function neigh_hash_grow finishes, and RCU calls
neigh_hash_free_rcu, the original struct neigh_hash_table that function
neigh_periodic_work was using doesn't exist anymore.
4. Once back at neigh_periodic_work, whenever the old struct
neigh_hash_table is accessed, things can go badly.
Signed-off-by: Michel Machado <michel@digirati.com.br> CC: "David S. Miller" <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
We have several reports which says acer-wmi is loaded on ideapads
and register rfkill for wifi which can not be unblocked.
Since ideapad-laptop also register rfkill for wifi and it works
reliably, it will be fine acer-wmi is not going to register rfkill
for wifi once VPC2004 is found.
Also put IBM0068/LEN0068 in the list. Though thinkpad_acpi has no
wifi rfkill capability, there are reports which says acer-wmi also
block wireless on Thinkpad E520/E420.
Signed-off-by: Ike Panhc <ike.pan@canonical.com> Signed-off-by: Matthew Garrett <mjg@redhat.com> Cc: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
There will be better to check the wireless capability flag
(ACER_CAP_WIRELESS) before register wireless rfkill because maybe
the machine doesn't have wifi module or the module removed by user.
Tested on Acer Travelmate 8572
Tested on Acer Aspire 4739Z
Tested-by: AceLan Kao <acelan.kao@canonical.com> Cc: Carlos Corbacho <carlos@strangeworlds.co.uk> Cc: Matthew Garrett <mjg@redhat.com> Cc: Dmitry Torokhov <dtor@mail.ru> Cc: Corentin Chary <corentincj@iksaif.net> Cc: Thomas Renninger <trenn@suse.de> Signed-off-by: Lee, Chun-Yi <jlee@suse.com> Signed-off-by: Matthew Garrett <mjg@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The AMW0 function in acer-wmi works on Lenovo ideapad S205 for control
the wifi hardware state. We also found there have a 0x78 EC register
exposes the state of wifi hardware switch on the machine.
So, add this patch to support Lenovo ideapad S205 wifi hardware switch
in acer-wmi driver.
complete_walk() returns either ECHILD or ESTALE. do_last() turns this into
ECHILD unconditionally. If not in RCU mode, this error will reach userspace
which is complete nonsense.
Is possible that we stop queue and then do not wake up it again,
especially when packets are transmitted fast. That can be easily
reproduced with modified tx queue entry_num to some small value e.g. 16.
If mac80211 already hold local->queue_stop_reason_lock, then we can wait
on that lock in both rt2x00queue_pause_queue() and
rt2x00queue_unpause_queue(). After drooping ->queue_stop_reason_lock
is possible that __ieee80211_wake_queue() will be performed before
__ieee80211_stop_queue(), hence we stop queue and newer wake up it
again.
Another race condition is possible when between rt2x00queue_threshold()
check and rt2x00queue_pause_queue() we will process all pending tx
buffers on different cpu. This might happen if for example interrupt
will be triggered on cpu performing rt2x00mac_tx().
To prevent race conditions serialize pause/unpause by queue->tx_lock.
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Gertjan van Wingerde <gwingerde@gmail.com> Signed-off-by: John W. Linville <linville@tuxdriver.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Disabling all runtime PM during system shutdown turns out not to be a
good idea, because some devices may need to be woken up from a
low-power state at that time.
The whole point of disabling runtime PM for system shutdown was to
prevent untimely runtime-suspend method calls. This patch (as1504)
accomplishes the same result by incrementing the usage count for each
device and waiting for ongoing runtime-PM callbacks to finish. This
is what we already do during system suspend and hibernation, which
makes sense since the shutdown method is pretty much a legacy analog
of the pm->poweroff method.
This fixes a recent regression on some OMAP systems introduced by
commit af8db1508f2c9f3b6e633e2d2d906c6557c617f9 (PM / driver core:
disable device's runtime PM during shutdown).
Reported-and-tested-by: NeilBrown <neilb@suse.de> Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Kostyantyn Shlyakhovoy <x0155534@ti.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Some older Panasonic made camcorders (Panasonic AG-EZ30 and NV-DX110,
Grundig Scenos DLC 2000) reject requests with ack_busy_X if a request is
sent immediately after they sent a response to a prior transaction.
This causes firewire-core to fail probing of the camcorder with "giving
up on config rom for node id ...". Consequently, programs like kino or
dvgrab are unaware of the presence of a camcorder.
Such transaction failures happen also with the ieee1394 driver stack
(of the 2.4...2.6 kernel series until 2.6.36 inclusive) but with a lower
likelihood, such that kino or dvgrab are generally able to use these
camcorders via the older driver stack. The cause for firewire-ohci's or
firewire-core's worse behavior is not yet known. Gap count optimization
in firewire-core is not the cause. Perhaps the slightly higher latency
of transaction completion in the older stack plays a role. (ieee1394:
AR-resp DMA context tasklet -> packet completion ktread -> user process;
firewire-core: tasklet -> user process.)
This change introduces retries and delays after ack_busy_X into
firewire-core's Config ROM reader, such that at least firewire-core's
probing and /dev/fw* creation are successful. This still leaves the
problem that userland processes are facing transaction failures.
gscanbus's built-in retry routines deal with them successfully, but
neither kino's nor dvgrab's do ever succeed.
But at least DV capture with "dvgrab -noavc -card 0" works now. Live
video preview in kino works too, but not actual capture.
One way to prevent Configuration ROM reading failures in application
programs is to modify libraw1394 to synthesize read responses by means
of firewire-core's Configuration ROM cache. This would only leave
CMP and FCP transaction failures as a potential problem source for
applications.
Reported-and-tested-by: Thomas Seilund <tps@netmaster.dk> Reported-and-tested-by: René Fritz <rene@colorcube.de> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Clemens points out that we need to use compat_ptr() in order to safely
cast from u64 to addresses of a 32-bit usermode client.
Before, our conversion went wrong
- in practice if the client cast from pointer to integer such that
sign-extension happened, (libraw1394 and libdc1394 at least were not
doing that, IOW were not affected)
or
- in theory on s390 (which doesn't have FireWire though) and on the
tile architecture, regardless of what the client does.
The bug would usually be observed as the initial get_info ioctl failing
with "Bad address" (EFAULT).
Reported-by: Carl Karsten <carl@personnelware.com> Reported-by: Clemens Ladisch <clemens@ladisch.de> Signed-off-by: Stefan Richter <stefanr@s5r6.in-berlin.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Right now we won't touch ASPM state if ASPM is disabled, except in the case
where we find a device that appears to be too old to reliably support ASPM.
Right now we'll clear it in that case, which is almost certainly the wrong
thing to do. The easiest way around this is just to disable the blacklisting
when ASPM is disabled.
Commit f0fbf0abc093 ("x86: integrate delay functions") converted
delay_tsc() into a random delay generator for 64 bit. The reason is
that it merged the mostly identical versions of delay_32.c and
delay_64.c. Though the subtle difference of the result was:
static void delay_tsc(unsigned long loops)
{
- unsigned bclock, now;
+ unsigned long bclock, now;
Now the function uses rdtscl() which returns the lower 32bit of the
TSC. On 32bit that's not problematic as unsigned long is 32bit. On 64
bit this fails when the lower 32bit are close to wrap around when
bclock is read, because the following check
if ((now - bclock) >= loops)
break;
evaluated to true on 64bit for e.g. bclock = 0xffffffff and now = 0
because the unsigned long (now - bclock) of these values results in
0xffffffff00000001 which is definitely larger than the loops
value. That explains Tvortkos observation:
"Because I am seeing udelay(500) (_occasionally_) being short, and
that by delaying for some duration between 0us (yep) and 491us."
Make those variables explicitely u32 again, so this works for both 32
and 64 bit.
Current code has put_ioctx() called asynchronously from aio_fput_routine();
that's done *after* we have killed the request that used to pin ioctx,
so there's nothing to stop io_destroy() waiting in wait_for_all_aios()
from progressing. As the result, we can end up with async call of
put_ioctx() being the last one and possibly happening during exit_mmap()
or elf_core_dump(), neither of which expects stray munmap() being done
to them...
We do need to prevent _freeing_ ioctx until aio_fput_routine() is done
with that, but that's all we care about - neither io_destroy() nor
exit_aio() will progress past wait_for_all_aios() until aio_fput_routine()
does really_put_req(), so the ioctx teardown won't be done until then
and we don't care about the contents of ioctx past that point.
Since actual freeing of these suckers is RCU-delayed, we don't need to
bump ioctx refcount when request goes into list for async removal.
All we need is rcu_read_lock held just over the ->ctx_lock-protected
area in aio_fput_routine().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Have ioctx_alloc() return an extra reference, so that caller would drop it
on success and not bother with re-grabbing it on failure exit. The current
code is obviously broken - io_destroy() from another thread that managed
to guess the address io_setup() would've returned would free ioctx right
under us; gets especially interesting if aio_context_t * we pass to
io_setup() points to PROT_READ mapping, so put_user() fails and we end
up doing io_destroy() on kioctx another thread has just got freed...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Acked-by: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kazuya Mio [Thu, 1 Dec 2011 07:51:07 +0000 (16:51 +0900)]
wake up s_wait_unfrozen when ->freeze_fs fails
dd slept infinitely when fsfeeze failed because of EIO.
To fix this problem, if ->freeze_fs fails, freeze_super() wakes up
the tasks waiting for the filesystem to become unfrozen.
When s_frozen isn't SB_UNFROZEN in __generic_file_aio_write(),
the function sleeps until FITHAW ioctl wakes up s_wait_unfrozen.
However, if ->freeze_fs fails, s_frozen is set to SB_UNFROZEN and then
freeze_super() returns an error number. In this case, FITHAW ioctl returns
EINVAL because s_frozen is already SB_UNFROZEN. There is no way to wake up
s_wait_unfrozen, so __generic_file_aio_write() sleeps infinitely.
Signed-off-by: Kazuya Mio <k-mio@sx.jp.nec.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Akira Fujita [Thu, 20 Oct 2011 22:56:10 +0000 (18:56 -0400)]
ext4: fix deadlock in ext4_ordered_write_end()
If ext4_jbd2_file_inode() in ext4_ordered_write_end() fails for some
reasons, this function returns to caller without unlocking the page.
It leads to the deadlock, and the patch fixes this issue.
Jan Kara [Wed, 16 Nov 2011 11:34:48 +0000 (19:34 +0800)]
mm: Make task in balance_dirty_pages() killable
There is no reason why task in balance_dirty_pages() shouldn't be killable
and it helps in recovering from some error conditions (like when filesystem
goes in error state and cannot accept writeback anymore but we still want to
kill processes using it to be able to unmount it).
There will be follow up patches to further abort the generic_perform_write()
and other filesystem write loops, to avoid large write + SIGKILL combination
exceeding the dirty limit and possibly strange OOM.
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com> Reviewed-by: Neil Brown <neilb@suse.de> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Kazuya Mio [Thu, 20 Oct 2011 23:23:08 +0000 (19:23 -0400)]
ext4: fix the deadlock in mpage_da_map_and_submit()
If ext4_jbd2_file_inode() in mpage_da_map_and_submit() fails due to
journal abort, this function returns to caller without unlocking the
page. It leads to the deadlock, and the patch fixes this issue by
calling mpage_da_submit_io().
Jan Kara [Fri, 24 Jun 2011 18:29:41 +0000 (14:29 -0400)]
ext4: Rewrite ext4_page_mkwrite() to use generic helpers
Rewrite ext4_page_mkwrite() to use __block_page_mkwrite() helper. This
removes the need of using i_alloc_sem to avoid races with truncate which
seems to be the wrong locking order according to lock ordering documented in
mm/rmap.c. Also calling ext4_da_write_begin() as used by the old code seems to
be problematic because we can decide to flush delay-allocated blocks which
will acquire s_umount semaphore - again creating unpleasant lock dependency
if not directly a deadlock.
Also add a check for frozen filesystem so that we don't busyloop in page fault
when the filesystem is frozen.
Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Narendra_K@Dell.com [Fri, 18 Mar 2011 17:22:14 +0000 (10:22 -0700)]
x86/PCI: Preserve existing pci=bfsort whitelist for Dell systems
Commit 6e8af08dfa40b747002207d3ce8e8b43a050d99f enables pci=bfsort on
future Dell systems. But the identification string 'Dell System' matches
on already existing whitelist, which do not have SMBIOS type 0xB1,
causing pci=bfsort not being set on existing whitelist.
This patch fixes the regression by moving the type 0xB1 check beyond the
existing whitelist so that existing whitelist is walked before.
Konrad Rzeszutek Wilk [Tue, 13 Mar 2012 22:05:58 +0000 (18:05 -0400)]
xen/blkback: Disable DISCARD support for loopback device (but leave for phy).
Until we back-port the changes from 3.3 which alter the loopback device
to support the full gamma of discard attributes. Otherwise we have to
punch through loop device to retrieve the underlaying disk size and
do other nasty things to get the proper information.
Also there is the outstanding issue that Logical Volumes won't pass
through the DISCARD support, so in most cases we can't take advantage of
this code until that gets fixed.
Fixes Oracle BZ#13779884 Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Mike Snitzer [Wed, 6 Jul 2011 19:30:50 +0000 (21:30 +0200)]
block: eliminate potential for infinite loop in blkdev_issue_discard
Due to the recently identified overflow in read_capacity_16() it was
possible for max_discard_sectors to be zero but still have discards
enabled on the associated device's queue.
Eliminate the possibility for blkdev_issue_discard to infinitely loop.
Interestingly this issue wasn't identified until a device, whose
discard_granularity was 0 due to read_capacity_16 overflow, was consumed
by blk_stack_limits() to construct limits for a higher-level DM
multipath device. The multipath device's resulting limits never had the
discard limits stacked because blk_stack_limits() will only do so if
the bottom device's discard_granularity != 0. This resulted in the
multipath device's limits.max_discard_sectors being 0.
Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
[Fixes Oracle BZ13779884] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk>
Konrad Rzeszutek Wilk [Wed, 14 Mar 2012 00:47:44 +0000 (20:47 -0400)]
config: Use the xen-acpi-processor instead of the cpufreq-xen driver.
The xen-acpi-processor (CONFIG_XEN_ACPI_PROCESSOR=y) is a more modern
version of the cpufreq-xen driver that can automatically inhibit
the cpufreq scaling drivers.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Konrad Rzeszutek Wilk [Fri, 3 Feb 2012 21:03:20 +0000 (16:03 -0500)]
xen/acpi-processor: C and P-state driver that uploads said data to hypervisor.
This driver solves three problems:
1). Parse and upload ACPI0007 (or PROCESSOR_TYPE) information to the
hypervisor - aka P-states (cpufreq data).
2). Upload the the Cx state information (cpuidle data).
3). Inhibit CPU frequency scaling drivers from loading.
The reason for wanting to solve 1) and 2) is such that the Xen hypervisor
is the only one that knows the CPU usage of different guests and can
make the proper decision of when to put CPUs and packages in proper states.
Unfortunately the hypervisor has no support to parse ACPI DSDT tables, hence it
needs help from the initial domain to provide this information. The reason
for 3) is that we do not want the initial domain to change P-states while the
hypervisor is doing it as well - it causes rather some funny cases of P-states
transitions.
For this to work, the driver parses the Power Management data and uploads said
information to the Xen hypervisor. It also calls acpi_processor_notify_smm()
to inhibit the other CPU frequency scaling drivers from being loaded.
Everything revolves around the 'struct acpi_processor' structure which
gets updated during the bootup cycle in different stages. At the startup, when
the ACPI parser starts, the C-state information is processed (processor_idle)
and saved in said structure as 'power' element. Later on, the CPU frequency
scaling driver (powernow-k8 or acpi_cpufreq), would call the the
acpi_processor_* (processor_perflib functions) to parse P-states information
and populate in the said structure the 'performance' element.
Since we do not want the CPU frequency scaling drivers from loading
we have to call the acpi_processor_* functions to parse the P-states and
call "acpi_processor_notify_smm" to stop them from loading.
There is also one oddity in this driver which is that under Xen, the
physical online CPU count can be different from the virtual online CPU count.
Meaning that the macros 'for_[online|possible]_cpu' would process only
up to virtual online CPU count. We on the other hand want to process
the full amount of physical CPUs. For that, the driver checks if the ACPI IDs
count is different from the APIC ID count - which can happen if the user
choose to use dom0_max_vcpu argument. In such a case a backup of the PM
structure is used and uploaded to the hypervisor.
[v1-v2: Initial RFC implementations that were posted]
[v3: Changed the name to passthru suggested by Pasi Kärkkäinen <pasik@iki.fi>]
[v4: Added vCPU != pCPU support - aka dom0_max_vcpus support]
[v5: Cleaned up the driver, fix bug under Athlon XP]
[v6: Changed the driver to a CPU frequency governor]
[v7: Jan Beulich <jbeulich@suse.com> suggestion to make it a cpufreq scaling driver
made me rework it as driver that inhibits cpufreq scaling driver]
[v8: Per Jan's review comments, fixed up the driver] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Conflicts: