linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2010-10-21	Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw	Linus Torvalds	28	-395/+617
	* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (22 commits) GFS2: fixed typo GFS2: Fix type mapping for demote_rq interface GFS2 fatal: filesystem consistency error on rename GFS2: Improve journal allocation via sysfs GFS2: Add "norecovery" mount option as a synonym for "spectator" GFS2: Fix spectator umount issue GFS2: Fix compiler warning from previous patch GFS2: reserve more blocks for transactions GFS2: Fix journal check for spectator mounts GFS2: Remove upgrade mount option GFS2: Remove localcaching mount option GFS2: Remove ignore_local_fs mount argument GFS2: Make . and .. qstrs constant GFS2: Use new workqueue scheme GFS2: Update handling of DLM return codes to match reality GFS2: Don't enforce min hold time when two demotes occur in rapid succession GFS2: Fix whitespace in previous patch GFS2: fallocate support GFS2: Add a bug trap in allocation code GFS2: No longer experimental ...
2010-10-21	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client	Linus Torvalds	63	-13658/+1061
	* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (22 commits) ceph: do not carry i_lock for readdir from dcache fs/ceph/xattr.c: Use kmemdup rbd: passing wrong variable to bvec_kunmap_irq() rbd: null vs ERR_PTR ceph: fix num_pages_free accounting in pagelist ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl. ceph: don't crash when passed bad mount options ceph: fix debugfs warnings block: rbd: removing unnecessary test block: rbd: fixed may leaks ceph: switch from BKL to lock_flocks() ceph: preallocate flock state without locks held ceph: add pagelist_reserve, pagelist_truncate, pagelist_set_cursor ceph: use mapping->nrpages to determine if mapping is empty ceph: only invalidate on check_caps if we actually have pages ceph: do not hide .snap in root directory rbd: introduce rados block device (rbd), based on libceph ceph: factor out libceph from Ceph file system ceph-rbd: osdc support for osd call and rollback operations ceph: messenger and osdc changes for rbd ...
2010-10-21	Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus	Linus Torvalds	16	-649/+765
	* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus: (29 commits) hfsplus: fix getxattr return value hfsplus: remove the unused hfsplus_kmap/hfsplus_kunmap helpers hfsplus: create correct initial catalog entries for device files hfsplus: remove superflous rootflags field in hfsplus_inode_info hfsplus: fix link corruption hfsplus: validate btree flags hfsplus: handle more on-disk corruptions without oopsing hfsplus: hfs_bnode_find() can fail, resulting in hfs_bnode_split() breakage hfsplus: fix oops on mount with corrupted btree extent records hfsplus: fix rename over directories hfsplus: convert tree_lock to mutex hfsplus: add missing extent locking in hfsplus_write_inode hfsplus: protect readdir against removals from open_dir_list hfsplus: use atomic bitops for the superblock flags hfsplus: add per-superblock lock for volume header updates hfsplus: remove the rsrc_inodes list hfsplus: do not cache and write next_alloc hfsplus: fix error handling in hfsplus_symlink hfsplus: merge mknod/mkdir/creat hfsplus: clean up hfsplus_write_inode ...
2010-10-21	BKL: remove BKL from freevxfs	Arnd Bergmann	2	-22/+2
	All uses of the BKL in freevxfs were the result of a pushdown into code that doesn't really need it. As Christoph points out, this is a read-only file system, which eliminates most of the races in readdir/lookup. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org>
2010-10-21	BKL: remove BKL from qnx4	Arnd Bergmann	3	-21/+1
	All uses of the BKL in qnx4 were the result of a pushdown into code that doesn't really need it. As Christoph points out, this is a read-only file system, which eliminates most of the races in readdir/lookup. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Anders Larsen <al@alarsen.net> Cc: Christoph Hellwig <hch@infradead.org>
2010-10-21	BKL: introduce CONFIG_BKL.	Arnd Bergmann	9	-0/+9
	With all the patches we have queued in the BKL removal tree, only a few dozen modules are left that actually rely on the BKL, and even there are lots of low-hanging fruit. We need to decide what to do about them, this patch illustrates one of the options: Every user of the BKL is marked as 'depends on BKL' in Kconfig, and the CONFIG_BKL becomes a user-visible option. If it gets disabled, no BKL using module can be built any more and the BKL code itself is compiled out. The one exception is file locking, which is practically always enabled and does a 'select BKL' instead. This effectively forces CONFIG_BKL to be enabled until we have solved the fs/lockd mess and can apply the patch that removes the BKL from fs/locks.c. Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-21	Merge branch 'v2.6.36'	Alex Elder	1	-0/+2

2010-10-21	cifs: convert cifs_tcp_ses_lock from a rwlock to a spinlock	Suresh Jayaraman	7	-58/+58
	cifs_tcp_ses_lock is a rwlock with protects the cifs_tcp_ses_list, server->smb_ses_list and the ses->tcon_list. It also protects a few ref counters in server, ses and tcon. In most cases the critical section doesn't seem to be large, in a few cases where it is slightly large, there seem to be really no benefit from concurrent access. I briefly considered RCU mechanism but it appears to me that there is no real need. Replace it with a spinlock and get rid of the last rwlock in the cifs code. Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-21	pipe: fix failure to return error code on ->confirm()	Nicolas Kaiser	1	-1/+1
	The arguments were transposed, we want to assign the error code to 'ret', which is being returned. Signed-off-by: Nicolas Kaiser <nikai@nikai.net> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-21	UBIFS: do not allocate unneeded scan buffer	Artem Bityutskiy	1	-7/+1
	In 'ubifs_replay_journal()' we allocate 'sbuf' for scanning the log. However, we already have 'c->sbuf' for these purposes, so do not allocate yet another one. This reduces UBIFS memory consumption while recovering. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-10-21	UBIFS: do not forget to cancel timers	Artem Bityutskiy	1	-0/+4
	This is a bug-fix: when we unmount, and we are currently in R/O mode because of an error - we do not sync write-buffers, which means we also do not cancel write-buffer timers we may possibly have armed. This patch fixes the issue. The issue can easily be reproduced by enabling UBIFS failure debug mode (echo 4 > /sys/module/ubifs/parameters/debug_tsts) and unmounting as soon as a failure happen. At some point the system oopses because we have an armed hrtimer but UBIFS is unmounted already. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-10-21	UBIFS: remove a bit of unneeded code	Artem Bityutskiy	1	-6/+3
	This is a clean-up patch which: 1. Removes explicite 'hrtimer_cancel()' after 'ubifs_wbuf_sync()' in 'ubifs_remount_ro()', because the timers will be canceled by 'ubifs_wbuf_sync()', no need to cancel them for the second time. 2. Remove "if (c->jheads)" check from 'ubifs_put_super()', because at journal heads must always be allocated there, since we checked earlier that we were mounted R/W, and the olny situation when journal heads are not allocated is when mounter or re-mounted R/O. Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
2010-10-20	ceph: do not carry i_lock for readdir from dcache	Sage Weil	2	-26/+18
	We were taking dcache_lock inside of i_lock, which introduces a dependency not found elsewhere in the kernel, complicationg the vfs locking scalability work. Since we don't actually need it here anyway, remove it. We only need i_lock to test for the I_COMPLETE flag, so be careful to do so without dcache_lock held. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	fs/ceph/xattr.c: Use kmemdup	Julia Lawall	1	-2/+1
	Convert a sequence of kmalloc and memcpy to use kmemdup. The semantic patch that performs this transformation is: (http://coccinelle.lip6.fr/) // <smpl> @@ expression a,flag,len; expression arg,e1,e2; statement S; @@ a = - $kmalloc\\|kzalloc$(len,flag) + kmemdup(arg,len,flag) <... when != a if (a == NULL \|\| ...) S ...> - memcpy(a,arg,len+1); // </smpl> Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl.	Greg Farnum	2	-1/+69
	Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: fix debugfs warnings	Randy Dunlap	1	-1/+2
	Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning: fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20	ceph: switch from BKL to lock_flocks()	Sage Weil	1	-5/+6
	Switch from using the BKL explicitly to the new lock_flocks() interface. Eventually this will turn into a spinlock. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: preallocate flock state without locks held	Greg Farnum	2	-15/+44
	When the lock_kernel() turns into lock_flocks() and a spinlock, we won't be able to do allocations with the lock held. Preallocate space without the lock, and retry if the lock state changes out from underneath us. Signed-off-by: Greg Farnum <gregf@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: use mapping->nrpages to determine if mapping is empty	Sage Weil	1	-12/+1
	This is simpler and faster. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: only invalidate on check_caps if we actually have pages	Sage Weil	1	-1/+1
	The i_rdcache_gen value only implies we MAY have cached pages; actually check the mapping to see if it's worth bothering with an invalidate. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: do not hide .snap in root directory	Sage Weil	1	-1/+0
	Snaps in the root directory are now supported by the MDS, and harmless on older versions. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: factor out libceph from Ceph file system	Yehuda Sadeh	62	-14078/+926
	This factors out protocol and low-level storage parts of ceph into a separate libceph module living in net/ceph and include/linux/ceph. This is mostly a matter of moving files around. However, a few key pieces of the interface change as well: - ceph_client becomes ceph_fs_client and ceph_client, where the latter captures the mon and osd clients, and the fs_client gets the mds client and file system specific pieces. - Mount option parsing and debugfs setup is correspondingly broken into two pieces. - The mon client gets a generic handler callback for otherwise unknown messages (mds map, in this case). - The basic supported/required feature bits can be expanded (and are by ceph_fs_client). No functional change, aside from some subtle error handling cases that got cleaned up in the refactoring process. Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph-rbd: osdc support for osd call and rollback operations	Yehuda Sadeh	3	-0/+29
	This will be used for rbd snapshots administration. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
2010-10-20	ceph: messenger and osdc changes for rbd	Yehuda Sadeh	6	-101/+436
	Allow the messenger to send/receive data in a bio. This is added so that we wouldn't need to copy the data into pages or some other buffer when doing IO for an rbd block device. We can now have trailing variable sized data for osd ops. Also osd ops encoding is more modular. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: refactor osdc requests creation functions	Yehuda Sadeh	2	-57/+155
	The osd requests creation are being decoupled from the vino parameter, allowing clients using the osd to use other arbitrary object names that are not necessarily vino based. Also, calc_raw_layout now takes a snap id. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-20	ceph: lookup pool in osdmap by name	Yehuda Sadeh	2	-0/+15
	Implement a pool lookup by name. This will be used by rbd. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
2010-10-19	Merge branch 'devel-stable' into devel	Russell King	19	-93/+170

2010-10-19	cifs: cancel_delayed_work() + flush_scheduled_work() -> cancel_delayed_work_sync()	Tejun Heo	1	-2/+1
	flush_scheduled_work() is going away. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-19	Clean up two declarations of blob_len	Shirish Pargaonkar	1	-5/+8
	- Eliminate double declaration of variable blob_len - Modify function build_ntlmssp_auth_blob to return error code as well as length of the blob. Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com> Reviewed-by: Jeff Layton <jlayton@samba.org> Signed-off-by: Steve French <sfrench@us.ibm.com>
2010-10-19	fix rawctl compat ioctls breakage on amd64 and itanic	Al Viro	1	-70/+0
	RAW_SETBIND and RAW_GETBIND 32bit versions are fscked in interesting ways. 1) fs/compat_ioctl.c has COMPATIBLE_IOCTL(RAW_SETBIND) followed by HANDLE_IOCTL(RAW_SETBIND, raw_ioctl). The latter is ignored. 2) on amd64 (and itanic) the damn thing is broken - we have int + u64 + u64 and layouts on i386 and amd64 are _not_ the same. raw_ioctl() would work there, but it's never called due to (1). As it is, i386 /sbin/raw definitely doesn't work on amd64 boxen. 3) switching to raw_ioctl() as is would not work on e.g. sparc64 and ppc64, which would be rather sad, seeing that normal userland there is 32bit. The thing is, slapping __packed on the struct in question does not DTRT - it eliminates all padding. The real solution is to use compat_u64. 4) of course, all that stuff has no business being outside of raw.c in the first place - there should be ->compat_ioctl() for /dev/rawctl instead of messing with compat_ioctl.c. [akpm@linux-foundation.org: coding-style fixes] [arnd@arndb.de: port to 2.6.36] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-19	Merge branch 'v2.6.36-rc8' into for-2.6.37/barrier	Jens Axboe	115	-996/+1740
	Conflicts: block/blk-core.c drivers/block/loop.c mm/swapfile.c Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-19	block: fix accounting bug on cross partition merges	Yasuaki Ishimatsu	1	-0/+12
	/proc/diskstats would display a strange output as follows. $ cat /proc/diskstats \|grep sda 8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089 8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691 ~~~~~~~~~~ 8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390 8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92 8 4 sda4 4 0 8 0 0 0 0 0 0 0 0 8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137 Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE. The detailed root cause is as follows. Assuming that there are two partition, sda1 and sda2. 1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight is 0 and sda2's one is 1. \| hd_struct->in_flight --------------------------- sda1 \| 0 sda2 \| 1 --------------------------- 2. A bio belongs to sda1 is issued and is merged into the request mentioned on step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed from sda2 region to sda1 region. However the two partition's hd_struct->in_flight are not changed. \| hd_struct->in_flight --------------------------- sda1 \| 0 sda2 \| 1 --------------------------- 3. The request is finished and blk_account_io_done() is called. In this case, sda2's hd_struct->in_flight, not a sda1's one, is decremented. \| hd_struct->in_flight --------------------------- sda1 \| -1 sda2 \| 1 --------------------------- The patch fixes the problem by caching the partition lookup inside the request structure, hence making sure that the increment and decrement will always happen on the same partition struct. This also speeds up IO with accounting enabled, since it cuts down on the number of lookups we have to do. When reloading partition tables, quiesce IO to ensure that no request references to the partition struct exists. When it is safe to free the partition table, the IO for that device is restarted again. Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: stable@kernel.org Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2010-10-18	xfs: semaphore cleanup	Thomas Gleixner	1	-1/+1
	Get rid of init_MUTEX[_LOCKED]() and use sema_init() instead. (Ported to current XFS code by <aelder@sgi.com>.) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: Extend project quotas to support 32bit project ids	Arkadiusz Mi?kiewicz	16	-44/+80
	This patch adds support for 32bit project quota identifiers. On disk format is backward compatible with 16bit projid numbers. projid on disk is now kept in two 16bit values - di_projid_lo (which holds the same position as old 16bit projid value) and new di_projid_hi (takes existing padding) and converts from/to 32bit value on the fly. xfs_admin (for existing fs), mkfs.xfs (for new fs) needs to be used to enable PROJID32BIT support. Signed-off-by: Arkadiusz Miśkiewicz <arekm@maven.pl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: remove xfs_buf wrappers	Christoph Hellwig	16	-48/+33
	Stop having two different names for many buffer functions and use the more descriptive xfs_buf_* names directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: remove xfs_cred.h	Christoph Hellwig	12	-58/+18
	We're not actually passing around credentials inside XFS for a while now, so remove all xfs_cred.h with it's cred_t typedef and all instances of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: remove xfs_globals.h	Christoph Hellwig	2	-24/+0
	This header only provides one extern that isn't actually declared anywhere, and shadowed by a macro. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: remove xfs_version.h	Christoph Hellwig	3	-30/+1
	It used to have a place when it contained an automatically generated CVS version, but these days it's entirely superflous. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: remove xfs_refcache.h	Christoph Hellwig	1	-52/+0
	This header has been completely unused for a couple of years. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: fix the xfs_trans_committed	Christoph Hellwig	1	-2/+3
	Use the correct prototype for xfs_trans_committed instead of casting it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: remove unused t_callback field in struct xfs_trans	Christoph Hellwig	2	-6/+0
	Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: fix bogus m_maxagi check in xfs_iget	Christoph Hellwig	1	-2/+2
	These days inode64 should only control which AGs we allocate new inodes from, while we still try to support reading all existing inodes. To make this actually work the check ontop of xfs_iget needs to be relaxed to allow inodes in all allocation groups instead of just those that we allow allocating inodes from. Note that we can't simply remove the check - it prevents us from accessing invalid data when fed invalid inode numbers from NFS or bulkstat. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters	Christoph Hellwig	2	-107/+85
	Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb and remove support for per-cpu counters from xfs_mod_incore_sb_batch to simplify it. And added benefit is that we don't have to take m_sb_lock for transactions that only modify per-cpu counters. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: do not use xfs_mod_incore_sb for per-cpu counters	Christoph Hellwig	5	-40/+35
	Export xfs_icsb_modify_counters and always use it for modifying the per-cpu counters. Remove support for per-cpu counters from xfs_mod_incore_sb to simplify it. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: remove XFS_MOUNT_NO_PERCPU_SB	Christoph Hellwig	3	-29/+19
	Fail the mount if we can't allocate memory for the per-CPU counters. This is consistent with how we handle everything else in the mount path and makes the superblock counter modification a lot simpler. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: pack xfs_buf structure more tightly	Dave Chinner	1	-11/+19
	pahole reports the struct xfs_buf has quite a few holes in it, so packing the structure better will reduce the size of it by 16 bytes. Also, move all the fields used in cache lookups into the first cacheline. Before on x86_64: /* size: 320, cachelines: 5 / / sum members: 298, holes: 6, sum holes: 22 / After on x86_64: / size: 304, cachelines: 5 / / padding: 6 / / last cacheline: 48 bytes */ Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: convert buffer cache hash to rbtree	Dave Chinner	4	-76/+74
	The buffer cache hash is showing typical hash scalability problems. In large scale testing the number of cached items growing far larger than the hash can efficiently handle. Hence we need to move to a self-scaling cache indexing mechanism. I have selected rbtrees for indexing becuse they can have O(log n) search scalability, and insert and remove cost is not excessive, even on large trees. Hence we should be able to cache large numbers of buffers without incurring the excessive cache miss search penalties that the hash is imposing on us. To ensure we still have parallel access to the cache, we need multiple trees. Rather than hashing the buffers by disk address to select a tree, it seems more sensible to separate trees by typical access patterns. Most operations use buffers from within a single AG at a time, so rather than searching lots of different lists, separate the buffer indexes out into per-AG rbtrees. This means that searches during metadata operation have a much higher chance of hitting cache resident nodes, and that updates of the tree are less likely to disturb trees being accessed on other CPUs doing independent operations. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: serialise inode reclaim within an AG	Dave Chinner	3	-0/+33
	Memory reclaim via shrinkers has a terrible habit of having N+M concurrent shrinker executions (N = num CPUs, M = num kswapds) all trying to shrink the same cache. When the cache they are all working on is protected by a single spinlock, massive contention an slowdowns occur. Wrap the per-ag inode caches with a reclaim mutex to serialise reclaim access to the AG. This will block concurrent reclaim in each AG but still allow reclaim to scan multiple AGs concurrently. Allow shrinkers to move on to the next AG if it can't get the lock, and if we can't get any AG, then start blocking on locks. To prevent reclaimers from continually scanning the same inodes in each AG, add a cursor that tracks where the last reclaim got up to and start from that point on the next reclaim. This should avoid only ever scanning a small number of inodes at the satart of each AG and not making progress. If we have a non-shrinker based reclaim pass, ignore the cursor and reset it to zero once we are done. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: batch inode reclaim lookup	Dave Chinner	1	-33/+77
	Batch and optimise the per-ag inode lookup for reclaim to minimise scanning overhead. This involves gang lookups on the radix trees to get multiple inodes during each tree walk, and tighter validation of what inodes can be reclaimed without blocking befor we take any locks. This is based on ideas suggested in a proof-of-concept patch posted by Nick Piggin. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>
2010-10-18	xfs: implement batched inode lookups for AG walking	Dave Chinner	2	-23/+45
	With the reclaim code separated from the generic walking code, it is simple to implement batched lookups for the generic walk code. Separate out the inode validation from the execute operations and modify the tree lookups to get a batch of inodes at a time. Reclaim operations will be optimised separately. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Alex Elder <aelder@sgi.com>