linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2006-12-10	[PATCH] dio: lock refcount operations	Zach Brown	1	-31/+45
	The wait_for_more_bios() function name was poorly chosen. While looking to clean it up it I noticed that the dio struct refcounting between the bio completion and dio submission paths was racey. The bio submission path was simply freeing the dio struct if atomic_dec_and_test() indicated that it dropped the final reference. The aio bio completion path was dereferencing its dio struct pointer after dropping its reference based on the remaining number of references. These two paths could race and result in the aio bio completion path dereferencing a freed dio, though this was not observed in the wild. This moves the refcount under the bio lock so that bio completion can drop its reference and decide to wake all in one atomic step. Once testing and waking is locked dio_await_one() can test its sleeping condition and mark itself uninterruptible under the lock. It gets simpler and wait_for_more_bios() disappears. The addition of the interrupt masking spin lock acquiry in dio_bio_submit() looks alarming. This lock acquiry existed in that path before the recent dio completion patch set. We shouldn't expect significant performance regression from returning to the behaviour that existed before the completion clean up work. This passed 4k block ext3 O_DIRECT fsx and aio-stress on an SMP machine. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: <xfs-masters@oss.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: only call aio_complete() after returning -EIOCBQUEUED	Zach Brown	1	-55/+35
	The only time it is safe to call aio_complete() is when the ->ki_retry function returns -EIOCBQUEUED to the AIO core. direct_io_worker() has historically done this by relying on its caller to translate positive return codes into -EIOCBQUEUED for the aio case. It did this by trying to keep conditionals in sync. direct_io_worker() knew when finished_one_bio() was going to call aio_complete(). It would reverse the test and wait and free the dio in the cases it thought that finished_one_bio() wasn't going to. Not surprisingly, it ended up getting it wrong. 'ret' could be a negative errno from the submission path but it failed to communicate this to finished_one_bio(). direct_io_worker() would return < 0, it's callers wouldn't raise -EIOCBQUEUED, and aio_complete() would be called. In the future finished_one_bio()'s tests wouldn't reflect this and aio_complete() would be called for a second time which can manifest as an oops. The previous cleanups have whittled the sync and async completion paths down to the point where we can collapse them and clearly reassert the invariant that we must only call aio_complete() after returning -EIOCBQUEUED. direct_io_worker() will only return -EIOCBQUEUED when it is not the last to drop the dio refcount and the aio bio completion path will only call aio_complete() when it is the last to drop the dio refcount. direct_io_worker() can ensure that it is the last to drop the reference count by waiting for bios to drain. It does this for sync ops, of course, and for partial dio writes that must fall back to buffered and for aio ops that saw errors during submission. This means that operations that end up waiting, even if they were issued as aio ops, will not call aio_complete() from dio. Instead we return the return code of the operation and let the aio core call aio_complete(). This is purposely done to fix a bug where AIO DIO file extensions would call aio_complete() before their callers have a chance to update i_size. Now that direct_io_worker() is explicitly returning -EIOCBQUEUED its callers no longer have to translate for it. XFS needs to be careful not to free resources that will be used during AIO completion if -EIOCBQUEUED is returned. We maintain the previous behaviour of trying to write fs metadata for O_SYNC aio+dio writes. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Cc: <xfs-masters@oss.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: remove duplicate bio wait code	Zach Brown	1	-29/+12
	Now that we have a single refcount and waiting path we can reuse it in the async 'should_wait' path. It continues to rely on the fragile link between the conditional in dio_complete_aio() which decides to complete the AIO and the conditional in direct_io_worker() which decides to wait and free. By waiting before dropping the reference we stop dio_bio_end_aio() from calling dio_complete_aio() which used to wake up the waiter after seeing the reference count drop to 0. We hoist this wake up into dio_bio_end_aio() which now notices when it's left a single remaining reference that is held by the waiter. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: formalize bio counters as a dio reference count	Zach Brown	1	-74/+66
	Previously we had two confusing counts of bio progress. 'bio_count' was decremented as bios were processed and freed by the dio core. It was used to indicate final completion of the dio operation. 'bios_in_flight' reflected how many bios were between submit_bio() and bio->end_io. It was used by the sync path to decide when to wake up and finish completing bios and was ignored by the async path. This patch collapses the two notions into one notion of a dio reference count. bios hold a dio reference when they're between submit_bio and bio->end_io. Since bios_in_flight was only used in the sync path it is now equivalent to dio->refcount - 1 which accounts for direct_io_worker() holding a reference for the duration of the operation. dio_bio_complete() -> finished_one_bio() was called from the sync path after finding bios on the list that the bio->end_io function had deposited. finished_one_bio() can not drop the dio reference on behalf of these bios now because bio->end_io already has. The is_async test in finished_one_bio() meant that it never actually did anything other than drop the bio_count for sync callers. So we remove its refcount decrement, don't call it from dio_bio_complete(), and hoist its call up into the async dio_bio_complete() caller after an explicit refcount decrement. It is renamed dio_complete_aio() to reflect the remaining work it actually does. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: call blk_run_address_space() once per op	Zach Brown	1	-5/+3
	We only need to call blk_run_address_space() once after all the bios for the direct IO op have been submitted. This removes the chance of calling blk_run_address_space() after spurious wake ups as the sync path waits for bios to drain. It's also one less difference betwen the sync and async paths. In the process we remove a redundant dio_bio_submit() that its caller had already performed. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] dio: centralize completion in dio_complete()	Zach Brown	1	-52/+42
	There have been a lot of bugs recently due to the way direct_io_worker() tries to decide how to finish direct IO operations. In the worst examples it has failed to call aio_complete() at all (hang) or called it too many times (oops). This set of patches cleans up the completion phase with the goal of removing the complexity that lead to these bugs. We end up with one path that calculates the result of the operation after all off the bios have completed. We decide when to generate a result of the operation using that path based on the final release of a refcount on the dio structure. I tried to progress towards the final state in steps that were relatively easy to understand. Each step should compile but I only tested the final result of having all the patches applied. I've tested these on low end PC drives with aio-stress, the direct IO tests I could manage to get running in LTP, orasim, and some home-brew functional tests. In http://lkml.org/lkml/2006/9/21/103 IBM reports success with ext2 and ext3 running DIO LTP tests. They found that XFS bug which has since been addressed in the patch series. This patch: The mechanics which decide the result of a direct IO operation were duplicated in the sync and async paths. The async path didn't check page_errors which can manifest as silently returning success when the final pointer in an operation faults and its matching file region is filled with zeros. The sync path and async path differed in whether they passed errors to the caller's dio->end_io operation. The async path was passing errors to it which trips an assertion in XFS, though it is apparently harmless. This centralizes the completion phase of dio ops in one place. AIO will now return EFAULT consistently and all paths fall back to the previously sync behaviour of passing the number of bytes 'transferred' to the dio->end_io callback, regardless of errors. dio_await_completion() doesn't have to propogate EIO from non-uptodate bios now that it's being propogated through dio_complete() via dio->io_error. This lets it return void which simplifies its sole caller. Signed-off-by: Zach Brown <zach.brown@oracle.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Acked-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-10	[PATCH] io-accounting: direct-io	Andrew Morton	1	-0/+8
	Account for direct-io. Cc: Jay Lan <jlan@sgi.com> Cc: Shailabh Nagar <nagar@watson.ibm.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Chris Sturtivant <csturtiv@sgi.com> Cc: Tony Ernst <tee@sgi.com> Cc: Guillaume Thouvenin <guillaume.thouvenin@bull.net> Cc: David Wright <daw@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03	[PATCH] lockdep: annotate direct io	Ingo Molnar	1	-2/+4
	Teach special (rwsem-in-irq) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23	[PATCH] Kill PF_SYNCWRITE flag	Jens Axboe	1	-10/+8
	A process flag to indicate whether we are doing sync io is incredibly ugly. It also causes performance problems when one does a lot of async io and then proceeds to sync it. Part of the io will go out as async, and the other part as sync. This causes a disconnect between the previously submitted io and the synced io. For io schedulers such as CFQ, this will cause us lost merges and suboptimal behaviour in scheduling. Remove PF_SYNCWRITE completely from the fsync/msync paths, and let the O_DIRECT path just directly indicate that the writes are sync by using WRITE_SYNC instead. Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-01	BUG_ON() Conversion in fs/direct-io.c	Eric Sesterhenn	1	-2/+1
	this changes if() BUG(); constructs to BUG_ON() which is cleaner and can better optimized away Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-29	Fixes a regression from the recent "remove ->get_blocks() support"	Nathan Scott	1	-3/+4
	change. inode->i_blkbits should be used when making a get_block_t request of a filesystem instead of dio->blkbits, as that does not indicate the filesystem block size all the time (depends on request alignment - see start of __blockdev_direct_IO). Signed-off-by: Nathan Scott <nathans@sgi.com> Acked-by: Badari Pulavarty <pbadari@us.ibm.com>
2006-03-26	[PATCH] remove ->get_blocks() support	Badari Pulavarty	1	-13/+14
	Now that get_block() can handle mapping multiple disk blocks, no need to have ->get_blocks(). This patch removes fs specific ->get_blocks() added for DIO and makes it users use get_block() instead. Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25	[PATCH] direct-io: bug fix in dio handling write error	Chen, Kenneth W	1	-1/+7
	There is a bug in direct-io on propagating write error up to the higher I/O layer. When performing an async ODIRECT write to a block device, if a device error occurred (like media error or disk is pulled), the error code is only propagated from device driver to the DIO layer. The error code stops at finished_one_bio(). The aysnc write, however, is supposedly have a corresponding AIO event with appropriate return code (in this case -EIO). Application which waits on the async write event, will hang forever since such AIO event is lost forever (if such app did not use the timeout option in io_getevents call. Regardless, an AIO event is lost). The discovery of above bug leads to another discovery of potential race window with dio->result. The fundamental problem is that dio->result is overloaded with dual use: an indicator of fall back path for partial dio write, and an error indicator used in the I/O completion path. In the event of device error, the setting of -EIO to dio->result clashes with value used to track partial write that activates the fall back path. It was also pointed out that it is impossible to use dio->result to track partial write and at the same time to track error returned from device driver. Because direct_io_work can only determines whether it is a partial write at the end of io submission and in mid stream of those io submission, a return code could be coming back from the driver. Thus messing up all the subsequent logic. Proposed fix is to separating out error code returned by the IO completion path from partial IO submit tracking. A new variable is added to dio structure specifically to track io error returned in the completion path. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Zach Brown <zach.brown@oracle.com> Acked-by: Suparna Bhattacharya <suparna@in.ibm.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-15	Fix a direct I/O locking issue revealed by the new mutex code.	Nathan Scott	1	-9/+12
	Affects only XFS (i.e. DIO_OWN_LOCKING case) - currently it is not possible to get i_mutex locking correct when using DIO_OWN direct I/O locking in a filesystem due to indeterminism in the possible return code/lock/unlock combinations. This can cause a direct read to attempt a double i_mutex unlock inside XFS. We're now ensuring __blockdev_direct_IO always exits with the inode i_mutex (still) held for a direct reader. Tested with the three different locking modes (via direct block device access, ext3 and XFS) - both reading and writing; cannot find any regressions resulting from this change, and it clearly fixes the mutex_unlock warning originally reported here: http://marc.theaimsgroup.com/?l=linux-kernel&m=114189068126253&w=2 Signed-off-by: Nathan Scott <nathans@sgi.com> Acked-by: Christoph Hellwig <hch@lst.de>
2006-02-03	[PATCH] fix O_DIRECT read of last block in a sparse file	Jeff Moyer	1	-1/+8
	Currently, if you open a file O_DIRECT, truncate it to a size that is not a multiple of the disk block size, and then try to read the last block in the file, the read will return 0. The problem is in do_direct_IO, here: /* Handle holes / if (!buffer_mapped(map_bh)) { char kaddr; ... if (dio->block_in_file >= i_size_read(dio->inode)>>blkbits) { /* We hit eof */ page_cache_release(page); goto out; } We shift off any remaining bytes in the final block of the I/O, resulting in a 0-sized read. I've attached a patch that fixes this. I'm not happy about how ugly the math is getting, so suggestions are more than welcome. I've tested this with a simple program that performs the steps outlined for reproducing the problem above. Without the patch, we get a 0-sized result from read. With the patch, we get the correct return value from the short read. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: Badari Pulavarty <pbadari@us.ibm.com> Cc: Suparna Bhattacharya <suparna@in.ibm.com> Cc: Mingming Cao <cmm@us.ibm.com> Cc: Joel Becker <Joel.Becker@oracle.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-09	[PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem	Jes Sorensen	1	-15/+15
	This patch converts the inode semaphore to a mutex. I have tested it on XFS and compiled as much as one can consider on an ia64. Anyway your luck with it might be different. Modified-by: Ingo Molnar <mingo@elte.hu> (finished the conversion) Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2005-10-29	[PATCH] core remove PageReserved	Nick Piggin	1	-1/+3
	Remove PageReserved() calls from core code by tightening VM_RESERVED handling in mm/ to cover PageReserved functionality. PageReserved special casing is removed from get_page and put_page. All setting and clearing of PageReserved is retained, and it is now flagged in the page_alloc checks to help ensure we don't introduce any refcount based freeing of Reserved pages. MAP_PRIVATE, PROT_WRITE of VM_RESERVED regions is tentatively being deprecated. We never completely handled it correctly anyway, and is be reintroduced in future if required (Hugh has a proof of concept). Once PageReserved() calls are removed from kernel/power/swsusp.c, and all arch/ and driver code, the Set and Clear calls, and the PG_reserved bit can be trivially removed. Last real user of PageReserved is swsusp, which uses PageReserved to determine whether a struct page points to valid memory or not. This still needs to be addressed (a generic page_is_ram() should work). A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. Signed-off-by: Nick Piggin <npiggin@suse.de> Refcount bug fix for filemap_xip.c Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-06-24	[PATCH] pass iocb to dio_iodone_t	Christoph Hellwig	1	-1/+1
	XFS will have to look at iocb->private to fix aio+dio. No other filesystem is using the blockdev_direct_IO* end_io callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16	[PATCH] Direct IO async short read fix	Daniel McNeil	1	-3/+17
	The direct I/O code is mapping the read request to the file system block. If the file size was not on a block boundary, the result would show the the read reading past EOF. This was only happening for the AIO case. The non-AIO case truncates the result to match file size (in direct_io_worker). This patch does the same thing for the AIO case, it truncates the result to match the file size if the read reads past EOF. When I/O completes the result can be truncated to match the file size without using i_size_read(), thus the aio result now matches the number of bytes read to the end of file. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16	Linux-2.6.12-rc2	Linus Torvalds	1	-0/+1258
	Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!