linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2006-12-11	[PATCH] LOG2: Make powerpc's __ilog2_u64() take a 64-bit argument	David Howells	1	-1/+1
	Make powerpc's __ilog2_u64() take a 64-bit argument. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-11	Make sure we populate the initroot filesystem late enough	Linus Torvalds	4	-9/+6
	We should not initialize rootfs before all the core initializers have run. So do it as a separate stage just before starting the regular driver initializers. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-11	Make SLES9 "get_kernel_version" work on the kernel binary again	Linus Torvalds	4	-10/+17
	As reported by Andy Whitcroft, at least the SLES9 initrd build process depends on getting the kernel version from the kernel binary. It does that by simply trawling the binary and looking for the signature of the "linux_banner" string (the string "Linux version " to be exact. Which is really broken in itself, but whatever..) That got broken when the string was changed to allow /proc/version to change the UTS release information dynamically, and "get_kernel_version" thus returned "%s" (see commit a2ee8649ba6d71416712e798276bf7c40b64e6e5: "[PATCH] Fix linux banner utsname information"). This just restores "linux_banner" as a static string, which should fix the version finding. And /proc/version simply uses a different string. To avoid wasting even that miniscule amount of memory, the early boot string should really be marked __initdata, but that just causes the same bug in SLES9 to re-appear, since it will then find other occurrences of "Linux version " first. Cc: Andy Whitcroft <apw@shadowen.org> Acked-by: Herbert Poetzl <herbert@13thfloor.at> Cc: Andi Kleen <ak@suse.de> Cc: Andrew Morton <akpm@osdl.org> Cc: Steve Fox <drfickle@us.ibm.com> Acked-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-11	[PATCH] smc91x: Kill off excessive versatile hooks.	Paul Mundt	1	-90/+0
	This looks like a result of too many auto-merges. The CONFIG_ARCH_VERSATILE case was handled a total of 6 times. This kills 5 of them. Signed-off-by: Paul Mundt <lethal@linux-sh.org> -- drivers/net/smc91x.h \| 90 --------------------------------------------------- 1 file changed, 90 deletions(-) Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] myri10ge: update driver version to 1.1.0	Brice Goglin	1	-1/+1
	Update driver version to 1.1.0. Signed-off-by: Brice Goglin <brice@myri.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] myri10ge: fix big_bytes in case of vlan frames	Brice Goglin	1	-2/+2
	Fix sizing of big_bytes in the case of vlan frames. The 4 VLAN_HLEN bytes were omitted, leading to sizing the big buffer 4 bytes smaller than it should be. Due to how rx buffers are carved from pages, this was harmless for the common (9000, 1500) byte MTUs, but could lead to data corruption for some MTUs. Signed-off-by: Brice Goglin <brice@myri.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] myri10ge: Full vlan frame in small_bytes	Brice Goglin	1	-2/+2
	Receive full vlan frames into smalls when running with a jumbo MTU. Signed-off-by: Brice Goglin <brice@myri.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] myri10ge: drop contiguous skb routines	Brice Goglin	1	-204/+8
	Drop the old routines that used the physically contigous skb now that we use the physical pages. And rename myri10ge_page_rx_done() to myri10ge_rx_done() as it was previously. Signed-off-by: Brice Goglin <brice@myri.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] myri10ge: switch to page-based skb	Brice Goglin	1	-79/+93
	Switch to physical page skb, by calling the new page-based allocation routines and using myri10ge_page_rx_done(). Signed-off-by: Brice Goglin <brice@myri.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] myri10ge: add page-based skb routines	Brice Goglin	1	-0/+190
	Add physical page skb allocation routines and page based rx_done, to be used by upcoming patches. Signed-off-by: Brice Goglin <brice@myri.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] myri10ge: indentation cleanups	Brice Goglin	1	-4/+4
	Indentation cleanups to synchronize to our tree which is automatically indent'ed. Signed-off-by: Brice Goglin <brice@myri.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] chelsio: working NAPI	Stephen Hemminger	4	-83/+67
	This driver tries to enable/disable NAPI at runtime, but does so in an unsafe manner, and the NAPI interrupt handling is a mess. Replace it with a compile time selected NAPI implementation. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] MACB: Use __raw register access	Haavard Skinnemoen	2	-3/+3
	Since macb is a chip-internal device, use __raw_readl and __raw_writel instead of readl/writel. This will perform native-endian accesses, which is the right thing to do on both AVR32 and ARM devices. Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] MACB: Use struct delayed_work instead of struct work_struct	Haavard Skinnemoen	2	-4/+4
	The macb driver calls schedule_delayed_work() and friends, so we need to use a struct delayed_work along with it. The conversion was explained by David Howells on lkml Dec 5 2006: http://lkml.org/lkml/2006/12/5/269 Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] ucc_geth: Initialize mdio_lock.	Scott Wood	1	-0/+2
	Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	[PATCH] ucc_geth: compilation error fixes	Scott Wood	1	-5/+5
	Fix compilation failures when building the ucc_geth driver with spinlock debugging. Signed-off-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Jeff Garzik <jeff@garzik.org>
2006-12-11	AT91 MMC update for 2.6.19	Andrew Victor	1	-17/+24
	The driver is usable on the newer SAM9 processors so replace all text references to AT91RM9200 with just AT91. The controller bug where all the words are byte-swapped is fixed on the AT91SAM9 processors. The byte-swapping work-around therefore only needs to be done if cpu_is_at91rm9200(). [Original patch from Wojtek Kaniewski] The AT91RM9200 and AT91SAM9260 processors support two MMC/SD slots - the slot which is connected is now passed via the platform_data and the correct slot selected in the AT91_MCI_SDCR register. The driver should not be calling at91_set_gpio_output() since the VCC pin should have already been configured as an output in the processor/board setup code. The driver should call at91_set_gpio_value(). Signed-off-by: Andrew Victor <andrew@sanpeople.com> Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11	mmc: Change SDHCI iomem error to a warning	Pierre Ossman	1	-2/+2
	Some controllers report an invalid iomem size, but seem to work correctly anyway. Change our current error to just a warning and hope it doesn't cause too much problems. Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11	mmc: fix "prev->state: 2 != TASK_RUNNING??" problem on SD/MMC card removal	Vitaly Wool	1	-1/+3
	Currently on SD/MMC card removal the system exhibits the following message (the platform is ARM Versatile): prev->state: 2 != TASK_RUNNING?? mmcqd/762[CPU#0]: BUG in __schedule at linux-2.6/kernel/sched.c:3826 (akpm: someone tried to fix this, but it's still wrong) Signed-off-by: Vitaly Wool <vitalywool@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11	AT91 MMC 5 : Minor cleanups	Andrew Victor	1	-9/+9
	A number of small cleanups to the AT91RM9200 MMC driver: - fix warnings generated by pr_debug(). - prepend "AT91 MMC:" to printk() messages. Signed-off-by: Andrew Victor <andrew@sanpeople.com> Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11	AT91 MMC 4 : Interrupt handler cleanup	Andrew Victor	1	-45/+43
	This patch simplifies the AT91RM9200 MMC interrupt handler code so that it doesn't re-read the Interrupt Status and Interrupt Mask registers multiple times. Also defined AT91_MCI_ERRORS instead of using the hard-coded 0xffff0000. Signed-off-by: Andrew Victor <andrew@sanpeople.com> Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11	AT91 MMC 3 : Move global mci_clk variable	Andrew Victor	1	-11/+11
	Move the global 'mci_clk' variable into the local 'at91mci_host' structure. Signed-off-by: Andrew Victor <andrew@sanpeople.com> Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11	AT91 MMC 2 : Use platform resources	Andrew Victor	1	-7/+34
	Use the I/O base-address and IRQ passed to the driver via the platform_device resources instead of using hardcoded values. Signed-off-by: Andrew Victor <andrew@sanpeople.com> Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-11	AT91 MMC 1: Pass host structure.	Andrew Victor	1	-89/+81
	The I/O base address is now stored in the 'at91mci_host' structure. We therefore have to pass this structure to at91_mci_read() and at91_mci_write(). Signed-off-by: Andrew Victor <andrew@sanpeople.com> Signed-off-by: Pierre Ossman <drzeus@drzeus.cx>
2006-12-10	[MIPS] Export local_flush_data_cache_page for sake of IDE.	Ralf Baechle	1	-0/+1
	On a CPU with aliases the IDE core needs to flush caches in the special IDE variants of insw, insl etc. If IDE support is built as a module this will only work if local_flush_data_cache_page happens is exported as a module. As per policy export local_flush_data_cache_page as GPL symbol only. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10	[MIPS] Export pm_power_off	Ralf Baechle	1	-0/+2
	This is required for ipmi_poweroff.c to work as a module. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10	[MIPS] Export csum_partial_copy_nocheck.	Ralf Baechle	1	-0/+3
	ibmtr.c and typhoon.c use it. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10	[MIPS] Move die and die_if_kernel() from system.h to ptrace.h	Ralf Baechle	2	-9/+8
	This eleminates the need to include ptrace.h into system.h and fixes a harmless namespace conflict on the PC symbol in bpck.c. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10	[MIPS] Discard .exit.text at linktime.	Ralf Baechle	1	-3/+1
	This fixes fairly unobvious breakage of various drivers. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10	[MIPS] Fix build of several IDE drivers by providing pci_get_legacy_ide_irq	Ralf Baechle	1	-0/+6
	Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2006-12-10	[CRYPTO] dm-crypt: Select CRYPTO_CBC	Herbert Xu	1	-0/+1
	As CBC is the default chaining method for cryptoloop, we should select it from cryptoloop to ease the transition. Spotted by Rene Herman. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] add MODULE_* attributes to bit reversal library	Cal Peake	1	-0/+4
	Add MODULE_* attributes to the new bit reversal library. Most notably MODULE_LICENSE which prevents superfluous kernel tainting. Signed-off-by: Cal Peake <cp@absolutedigital.net> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] kvm: userspace interface	Avi Kivity	18	-0/+9814
	web site: http://kvm.sourceforge.net mailing list: kvm-devel@lists.sourceforge.net (http://lists.sourceforge.net/lists/listinfo/kvm-devel) The following patchset adds a driver for Intel's hardware virtualization extensions to the x86 architecture. The driver adds a character device (/dev/kvm) that exposes the virtualization capabilities to userspace. Using this driver, a process can run a virtual machine (a "guest") in a fully virtualized PC containing its own virtual hard disks, network adapters, and display. Using this driver, one can start multiple virtual machines on a host. Each virtual machine is a process on the host; a virtual cpu is a thread in that process. kill(1), nice(1), top(1) work as expected. In effect, the driver adds a third execution mode to the existing two: we now have kernel mode, user mode, and guest mode. Guest mode has its own address space mapping guest physical memory (which is accessible to user mode by mmap()ing /dev/kvm). Guest mode has no access to any I/O devices; any such access is intercepted and directed to user mode for emulation. The driver supports i386 and x86_64 hosts and guests. All combinations are allowed except x86_64 guest on i386 host. For i386 guests and hosts, both pae and non-pae paging modes are supported. SMP hosts and UP guests are supported. At the moment only Intel hardware is supported, but AMD virtualization support is being worked on. Performance currently is non-stellar due to the naive implementation of the mmu virtualization, which throws away most of the shadow page table entries every context switch. We plan to address this in two ways: - cache shadow page tables across tlb flushes - wait until AMD and Intel release processors with nested page tables Currently a virtual desktop is responsive but consumes a lot of CPU. Under Windows I tried playing pinball and watching a few flash movies; with a recent CPU one can hardly feel the virtualization. Linux/X is slower, probably due to X being in a separate process. In addition to the driver, you need a slightly modified qemu to provide I/O device emulation and the BIOS. Caveats (akpm: might no longer be true): - The Windows install currently bluescreens due to a problem with the virtual APIC. We are working on a fix. A temporary workaround is to use an existing image or install through qemu - Windows 64-bit does not work. That's also true for qemu, so it's probably a problem with the device model. [bero@arklinux.org: build fix] [simon.kagstrom@bth.se: build fix, other fixes] [uril@qumranet.com: KVM: Expose interrupt bitmap] [akpm@osdl.org: i386 build fix] [mingo@elte.hu: i386 fixes] [rdreier@cisco.com: add log levels to all printks] [randy.dunlap@oracle.com: Fix sparse NULL and C99 struct init warnings] [anthony@codemonkey.ws: KVM: AMD SVM: 32-bit host support] Signed-off-by: Yaniv Kamay <yaniv@qumranet.com> Signed-off-by: Avi Kivity <avi@qumranet.com> Cc: Simon Kagstrom <simon.kagstrom@bth.se> Cc: Bernhard Rosenkraenzer <bero@arklinux.org> Signed-off-by: Uri Lublin <uril@qumranet.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Anthony Liguori <anthony@codemonkey.ws> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] clocksource: small cleanup	Daniel Walker	4	-12/+18
	Mostly changing alignment. Just some general cleanup. [akpm@osdl.org: build fix] Signed-off-by: Daniel Walker <dwalker@mvista.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] clocksource: add usage of CONFIG_SYSFS	Daniel Walker	1	-0/+2
	Simply adds some ifdefs to remove clocksoure sysfs code when CONFIG_SYSFS isn't turn on. Signed-off-by: Daniel Walker <dwalker@mvista.com> Acked-by: John Stultz <johnstul@us.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] user of the jiffies rounding patch: Slab	Arjan van de Ven	1	-3/+5
	This patch introduces users of the round_jiffies() function in the slab code. The slab code has a few "run every second" timers for background work; these are obviously not timing critical as long as they happen roughly at the right frequency. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] user of the jiffies rounding code: JBD	Arjan van de Ven	1	-1/+1
	This patch introduces a user: of the round_jiffies() function; the "5 second" ext3/jbd wakeup. While "every 5 seconds" doesn't sound as a problem, there can be many of these (and these timers do add up over all the kernel). The "5 second" wakeup isn't really timing sensitive; in addition even with rounding it'll still happen every 5 seconds (with the exception of the very first time, which is likely to be rounded up to somewhere closer to 6 seconds) Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] round_jiffies infrastructure	Arjan van de Ven	2	-0/+138
	Introduce a round_jiffies() function as well as a round_jiffies_relative() function. These functions round a jiffies value to the next whole second. The primary purpose of this rounding is to cause all "we don't care exactly when" timers to happen at the same jiffy. This avoids multiple timers firing within the second for no real reason; with dynamic ticks these extra timers cause wakeups from deep sleep CPU sleep states and thus waste power. The exact wakeup moment is skewed by the cpu number, to avoid all cpus from waking up at the exact same time (and hitting the same lock/cachelines there) [akpm@osdl.org: fix variable type] Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] fdtable: Implement new pagesize-based fdtable allocator	Vadim Lobanov	2	-142/+72
	This patch provides an improved fdtable allocation scheme, useful for expanding fdtable file descriptor entries. The main focus is on the fdarray, as its memory usage grows 128 times faster than that of an fdset. The allocation algorithm sizes the fdarray in such a way that its memory usage increases in easy page-sized chunks. The overall algorithm expands the allowed size in powers of two, in order to amortize the cost of invoking vmalloc() for larger allocation sizes. Namely, the following sizes for the fdarray are considered, and the smallest that accommodates the requested fd count is chosen: pagesize / 4 pagesize / 2 pagesize <- memory allocator switch point pagesize * 2 pagesize * 4 ...etc... Unlike the current implementation, this allocation scheme does not require a loop to compute the optimal fdarray size, and can be done in efficient straightline code. Furthermore, since the fdarray overflows the pagesize boundary long before any of the fdsets do, it makes sense to optimize run-time by allocating both fdsets in a single swoop. Even together, they will still be, by far, smaller than the fdarray. The fdtable->open_fds is now used as the anchor for the fdset memory allocation. Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] fdtable: Remove the free_files field	Vadim Lobanov	5	-27/+11
	An fdtable can either be embedded inside a files_struct or standalone (after being expanded). When an fdtable is being discarded after all RCU references to it have expired, we must either free it directly, in the standalone case, or free the files_struct it is contained within, in the embedded case. Currently the free_files field controls this behavior, but we can get rid of it entirely, as all the necessary information is already recorded. We can distinguish embedded and standalone fdtables using max_fds, and if it is embedded we can divine the relevant files_struct using container_of(). Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] fdtable: Make fdarray and fdsets equal in size	Vadim Lobanov	13	-81/+48
	Currently, each fdtable supports three dynamically-sized arrays of data: the fdarray and two fdsets. The code allows the number of fds supported by the fdarray (fdtable->max_fds) to differ from the number of fds supported by each of the fdsets (fdtable->max_fdset). In practice, it is wasteful for these two sizes to differ: whenever we hit a limit on the smaller-capacity structure, we will reallocate the entire fdtable and all the dynamic arrays within it, so any delta in the memory used by the larger-capacity structure will never be touched at all. Rather than hogging this excess, we shouldn't even allocate it in the first place, and keep the capacities of the fdarray and the fdsets equal. This patch removes fdtable->max_fdset. As an added bonus, most of the supporting code becomes simpler. Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] fdtable: Delete pointless code in dup_fd()	Vadim Lobanov	1	-6/+5
	The dup_fd() function creates a new files_struct and fdtable embedded inside that files_struct, and then possibly expands the fdtable using expand_files(). The out_release error path is invoked when expand_files() returns an error code. However, when this attempt to expand fails, the fdtable is left in its original embedded form, so it is pointless to try to free the associated fdarray and fdsets. Signed-off-by: Vadim Lobanov <vlobanov@speakeasy.net> Cc: Dipankar Sarma <dipankar@in.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: lock refcount operations	Zach Brown	1	-31/+45
	The wait_for_more_bios() function name was poorly chosen. While looking to clean it up it I noticed that the dio struct refcounting between the bio completion and dio submission paths was racey. The bio submission path was simply freeing the dio struct if atomic_dec_and_test() indicated that it dropped the final reference. The aio bio completion path was dereferencing its dio struct pointer after dropping its reference based on the remaining number of references. These two paths could race and result in the aio bio completion path dereferencing a freed dio, though this was not observed in the wild. This moves the refcount under the bio lock so that bio completion can drop its reference and decide to wake all in one atomic step. Once testing and waking is locked dio_await_one() can test its sleeping condition and mark itself uninterruptible under the lock. It gets simpler and wait_for_more_bios() disappears. The addition of the interrupt masking spin lock acquiry in dio_bio_submit() looks alarming. This lock acquiry existed in that path before the recent dio completion patch set. We shouldn't expect significant performance regression from returning to the behaviour that existed before the completion clean up work. This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: <xfs-masters@oss.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED	Zach Brown	3	-62/+39
	The only time it is safe to call aio_complete() is when the ->ki_retry function returns -EIOCBQUEUED to the AIO core. direct_io_worker() has historically done this by relying on its caller to translate positive return codes into -EIOCBQUEUED for the aio case. It did this by trying to keep conditionals in sync. direct_io_worker() knew when finished_one_bio() was going to call aio_complete(). It would reverse the test and wait and free the dio in the cases it thought that finished_one_bio() wasn't going to. Not surprisingly, it ended up getting it wrong. 'ret' could be a negative errno from the submission path but it failed to communicate this to finished_one_bio(). direct_io_worker() would return < 0, it's callers wouldn't raise -EIOCBQUEUED, and aio_complete() would be called. In the future finished_one_bio()'s tests wouldn't reflect this and aio_complete() would be called for a second time which can manifest as an oops. The previous cleanups have whittled the sync and async completion paths down to the point where we can collapse them and clearly reassert the invariant that we must only call aio_complete() after returning -EIOCBQUEUED. direct_io_worker() will only return -EIOCBQUEUED when it is not the last to drop the dio refcount and the aio bio completion path will only call aio_complete() when it is the last to drop the dio refcount. direct_io_worker() can ensure that it is the last to drop the reference count by waiting for bios to drain. It does this for sync ops, of course, and for partial dio writes that must fall back to buffered and for aio ops that saw errors during submission. This means that operations that end up waiting, even if they were issued as aio ops, will not call aio_complete() from dio. Instead we return the return code of the operation and let the aio core call aio_complete(). This is purposely done to fix a bug where AIO DIO file extensions would call aio_complete() before their callers have a chance to update i_size. Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers no longer have to translate for it. XFS needs to be careful not to free resources that will be used during AIO completion if -EIOCBQUEUED is returned. We maintain the previous behaviour of trying to write fs metadata for O_SYNC aio+dio writes. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: <xfs-masters@oss.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: remove duplicate bio wait code	Zach Brown	1	-29/+12
	Now that we have a single refcount and waiting path we can reuse it in the async 'should_wait' path. It continues to rely on the fragile link between the conditional in dio_complete_aio() which decides to complete the AIO and the conditional in direct_io_worker() which decides to wait and free. By waiting before dropping the reference we stop dio_bio_end_aio() from calling dio_complete_aio() which used to wake up the waiter after seeing the reference count drop to 0. We hoist this wake up into dio_bio_end_aio() which now notices when it's left a single remaining reference that is held by the waiter. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: formalize bio counters as a dio reference count	Zach Brown	1	-74/+66
	Previously we had two confusing counts of bio progress. 'bio_count' was decremented as bios were processed and freed by the dio core. It was used to indicate final completion of the dio operation. 'bios_in_flight' reflected how many bios were between submit_bio() and bio->end_io. It was used by the sync path to decide when to wake up and finish completing bios and was ignored by the async path. This patch collapses the two notions into one notion of a dio reference count. bios hold a dio reference when they're between submit_bio and bio->end_io. Since bios_in_flight was only used in the sync path it is now equivalent to dio->refcount - 1 which accounts for direct_io_worker() holding a reference for the duration of the operation. dio_bio_complete() -> finished_one_bio() was called from the sync path after finding bios on the list that the bio->end_io function had deposited. finished_one_bio() can not drop the dio reference on behalf of these bios now because bio->end_io already has. The is_async test in finished_one_bio() meant that it never actually did anything other than drop the bio_count for sync callers. So we remove its refcount decrement, don't call it from dio_bio_complete(), and hoist its call up into the async dio_bio_complete() caller after an explicit refcount decrement. It is renamed dio_complete_aio() to reflect the remaining work it actually does. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: call blk_run_address_space() once per op	Zach Brown	1	-5/+3
	We only need to call blk_run_address_space() once after all the bios for the direct IO op have been submitted. This removes the chance of calling blk_run_address_space() after spurious wake ups as the sync path waits for bios to drain. It's also one less difference betwen the sync and async paths. In the process we remove a redundant dio_bio_submit() that its caller had already performed. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: centralize completion in dio_complete()	Zach Brown	1	-52/+42
	There have been a lot of bugs recently due to the way direct_io_worker() tries to decide how to finish direct IO operations. In the worst examples it has failed to call aio_complete() at all (hang) or called it too many times (oops). This set of patches cleans up the completion phase with the goal of removing the complexity that lead to these bugs. We end up with one path that calculates the result of the operation after all off the bios have completed. We decide when to generate a result of the operation using that path based on the final release of a refcount on the dio structure. I tried to progress towards the final state in steps that were relatively easy to understand. Each step should compile but I only tested the final result of having all the patches applied. I've tested these on low end PC drives with aio-stress, the direct IO tests I could manage to get running in LTP, orasim, and some home-brew functional tests. In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3 running DIO LTP tests. They found that XFS bug which has since been addressed in the patch series. This patch: The mechanics which decide the result of a direct IO operation were duplicated in the sync and async paths. The async path didn't check page_errors which can manifest as silently returning success when the final pointer in an operation faults and its matching file region is filled with zeros. The sync path and async path differed in whether they passed errors to the caller's dio->end_io operation. The async path was passing errors to it which trips an assertion in XFS, though it is apparently harmless. This centralizes the completion phase of dio ops in one place. AIO will now return EFAULT consistently and all paths fall back to the previously sync behaviour of passing the number of bytes 'transferred' to the dio->end_io callback, regardless of errors. dio_await_completion() doesn't have to propogate EIO from non-uptodate bios now that it's being propogated through dio_complete() via dio->io_error. This lets it return void which simplifies its sole caller. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] md: assorted md and raid1 one-liners	NeilBrown	2	-1/+3
	Fix few bugs that meant that: - superblocks weren't alway written at exactly the right time (this could show up if the array was not written to - writting to the array causes lots of superblock updates and so hides these errors). - restarting device recovery after a clean shutdown (version-1 metadata only) didn't work as intended (or at all). 1/ Ensure superblock is updated when a new device is added. 2/ Remove an inappropriate test on MD_RECOVERY_SYNC in md_do_sync. The body of this if takes one of two branches depending on whether MD_RECOVERY_SYNC is set, so testing it in the clause of the if is wrong. 3/ Flag superblock for updating after a resync/recovery finishes. 4/ If we find the neeed to restart a recovery in the middle (version-1 metadata only) make sure a full recovery (not just as guided by bitmaps) does get done. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] md: return a non-zero error to bi_end_io as appropriate in raid5	NeilBrown	1	-4/+12
	Currently raid5 depends on clearing the BIO_UPTODATE flag to signal an error to higher levels. While this should be sufficient, it is safer to explicitly set the error code as well - less room for confusion. Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>