Age | Commit message (Collapse) | Author | Files | Lines |
|
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (22 commits)
GFS2: fixed typo
GFS2: Fix type mapping for demote_rq interface
GFS2 fatal: filesystem consistency error on rename
GFS2: Improve journal allocation via sysfs
GFS2: Add "norecovery" mount option as a synonym for "spectator"
GFS2: Fix spectator umount issue
GFS2: Fix compiler warning from previous patch
GFS2: reserve more blocks for transactions
GFS2: Fix journal check for spectator mounts
GFS2: Remove upgrade mount option
GFS2: Remove localcaching mount option
GFS2: Remove ignore_local_fs mount argument
GFS2: Make . and .. qstrs constant
GFS2: Use new workqueue scheme
GFS2: Update handling of DLM return codes to match reality
GFS2: Don't enforce min hold time when two demotes occur in rapid succession
GFS2: Fix whitespace in previous patch
GFS2: fallocate support
GFS2: Add a bug trap in allocation code
GFS2: No longer experimental
...
|
|
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (22 commits)
ceph: do not carry i_lock for readdir from dcache
fs/ceph/xattr.c: Use kmemdup
rbd: passing wrong variable to bvec_kunmap_irq()
rbd: null vs ERR_PTR
ceph: fix num_pages_free accounting in pagelist
ceph: add CEPH_MDS_OP_SETDIRLAYOUT and associated ioctl.
ceph: don't crash when passed bad mount options
ceph: fix debugfs warnings
block: rbd: removing unnecessary test
block: rbd: fixed may leaks
ceph: switch from BKL to lock_flocks()
ceph: preallocate flock state without locks held
ceph: add pagelist_reserve, pagelist_truncate, pagelist_set_cursor
ceph: use mapping->nrpages to determine if mapping is empty
ceph: only invalidate on check_caps if we actually have pages
ceph: do not hide .snap in root directory
rbd: introduce rados block device (rbd), based on libceph
ceph: factor out libceph from Ceph file system
ceph-rbd: osdc support for osd call and rollback operations
ceph: messenger and osdc changes for rbd
...
|
|
* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/hfsplus: (29 commits)
hfsplus: fix getxattr return value
hfsplus: remove the unused hfsplus_kmap/hfsplus_kunmap helpers
hfsplus: create correct initial catalog entries for device files
hfsplus: remove superflous rootflags field in hfsplus_inode_info
hfsplus: fix link corruption
hfsplus: validate btree flags
hfsplus: handle more on-disk corruptions without oopsing
hfsplus: hfs_bnode_find() can fail, resulting in hfs_bnode_split() breakage
hfsplus: fix oops on mount with corrupted btree extent records
hfsplus: fix rename over directories
hfsplus: convert tree_lock to mutex
hfsplus: add missing extent locking in hfsplus_write_inode
hfsplus: protect readdir against removals from open_dir_list
hfsplus: use atomic bitops for the superblock flags
hfsplus: add per-superblock lock for volume header updates
hfsplus: remove the rsrc_inodes list
hfsplus: do not cache and write next_alloc
hfsplus: fix error handling in hfsplus_symlink
hfsplus: merge mknod/mkdir/creat
hfsplus: clean up hfsplus_write_inode
...
|
|
All uses of the BKL in freevxfs were the result of a pushdown into
code that doesn't really need it. As Christoph points out, this
is a read-only file system, which eliminates most of the races in
readdir/lookup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Christoph Hellwig <hch@infradead.org>
|
|
All uses of the BKL in qnx4 were the result of a pushdown into
code that doesn't really need it. As Christoph points out, this
is a read-only file system, which eliminates most of the races in
readdir/lookup.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Anders Larsen <al@alarsen.net>
Cc: Christoph Hellwig <hch@infradead.org>
|
|
With all the patches we have queued in the BKL removal tree, only a
few dozen modules are left that actually rely on the BKL, and even
there are lots of low-hanging fruit. We need to decide what to do
about them, this patch illustrates one of the options:
Every user of the BKL is marked as 'depends on BKL' in Kconfig,
and the CONFIG_BKL becomes a user-visible option. If it gets
disabled, no BKL using module can be built any more and the BKL
code itself is compiled out.
The one exception is file locking, which is practically always
enabled and does a 'select BKL' instead. This effectively forces
CONFIG_BKL to be enabled until we have solved the fs/lockd
mess and can apply the patch that removes the BKL from fs/locks.c.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
|
|
cifs_tcp_ses_lock is a rwlock with protects the cifs_tcp_ses_list,
server->smb_ses_list and the ses->tcon_list. It also protects a few
ref counters in server, ses and tcon. In most cases the critical section
doesn't seem to be large, in a few cases where it is slightly large, there
seem to be really no benefit from concurrent access. I briefly considered RCU
mechanism but it appears to me that there is no real need.
Replace it with a spinlock and get rid of the last rwlock in the cifs code.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
The arguments were transposed, we want to assign the error code to
'ret', which is being returned.
Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
In 'ubifs_replay_journal()' we allocate 'sbuf' for scanning the log.
However, we already have 'c->sbuf' for these purposes, so do not
allocate yet another one. This reduces UBIFS memory consumption while
recovering.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
|
|
This is a bug-fix: when we unmount, and we are currently in R/O
mode because of an error - we do not sync write-buffers, which
means we also do not cancel write-buffer timers we may possibly
have armed. This patch fixes the issue.
The issue can easily be reproduced by enabling UBIFS failure debug
mode (echo 4 > /sys/module/ubifs/parameters/debug_tsts) and
unmounting as soon as a failure happen. At some point the system
oopses because we have an armed hrtimer but UBIFS is unmounted
already.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
|
|
This is a clean-up patch which:
1. Removes explicite 'hrtimer_cancel()' after 'ubifs_wbuf_sync()' in
'ubifs_remount_ro()', because the timers will be canceled by
'ubifs_wbuf_sync()', no need to cancel them for the second time.
2. Remove "if (c->jheads)" check from 'ubifs_put_super()', because
at journal heads must always be allocated there, since we checked
earlier that we were mounted R/W, and the olny situation when
journal heads are not allocated is when mounter or re-mounted R/O.
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
|
|
We were taking dcache_lock inside of i_lock, which introduces a dependency
not found elsewhere in the kernel, complicationg the vfs locking
scalability work. Since we don't actually need it here anyway, remove
it.
We only need i_lock to test for the I_COMPLETE flag, so be careful to do
so without dcache_lock held.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Convert a sequence of kmalloc and memcpy to use kmemdup.
The semantic patch that performs this transformation is:
(http://coccinelle.lip6.fr/)
// <smpl>
@@
expression a,flag,len;
expression arg,e1,e2;
statement S;
@@
a =
- \(kmalloc\|kzalloc\)(len,flag)
+ kmemdup(arg,len,flag)
<... when != a
if (a == NULL || ...) S
...>
- memcpy(a,arg,len+1);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Include "super.h" outside of CONFIG_DEBUG_FS to eliminate a compiler warning:
fs/ceph/debugfs.c:266: warning: 'struct ceph_fs_client' declared inside parameter list
fs/ceph/debugfs.c:266: warning: its scope is only this definition or declaration, which is probably not what you want
fs/ceph/debugfs.c:271: warning: 'struct ceph_fs_client' declared inside parameter list
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
|
|
Switch from using the BKL explicitly to the new lock_flocks() interface.
Eventually this will turn into a spinlock.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
When the lock_kernel() turns into lock_flocks() and a spinlock, we won't
be able to do allocations with the lock held. Preallocate space without
the lock, and retry if the lock state changes out from underneath us.
Signed-off-by: Greg Farnum <gregf@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
This is simpler and faster.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
The i_rdcache_gen value only implies we MAY have cached pages; actually
check the mapping to see if it's worth bothering with an invalidate.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Snaps in the root directory are now supported by the MDS, and harmless on
older versions.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
This factors out protocol and low-level storage parts of ceph into a
separate libceph module living in net/ceph and include/linux/ceph. This
is mostly a matter of moving files around. However, a few key pieces
of the interface change as well:
- ceph_client becomes ceph_fs_client and ceph_client, where the latter
captures the mon and osd clients, and the fs_client gets the mds client
and file system specific pieces.
- Mount option parsing and debugfs setup is correspondingly broken into
two pieces.
- The mon client gets a generic handler callback for otherwise unknown
messages (mds map, in this case).
- The basic supported/required feature bits can be expanded (and are by
ceph_fs_client).
No functional change, aside from some subtle error handling cases that got
cleaned up in the refactoring process.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
This will be used for rbd snapshots administration.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
|
|
Allow the messenger to send/receive data in a bio. This is added
so that we wouldn't need to copy the data into pages or some other buffer
when doing IO for an rbd block device.
We can now have trailing variable sized data for osd
ops. Also osd ops encoding is more modular.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
The osd requests creation are being decoupled from the
vino parameter, allowing clients using the osd to use
other arbitrary object names that are not necessarily
vino based. Also, calc_raw_layout now takes a snap id.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
Implement a pool lookup by name. This will be used by rbd.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
|
|
flush_scheduled_work() is going away.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
- Eliminate double declaration of variable blob_len
- Modify function build_ntlmssp_auth_blob to return error code
as well as length of the blob.
Signed-off-by: Shirish Pargaonkar <shirishpargaonkar@gmail.com>
Reviewed-by: Jeff Layton <jlayton@samba.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>
|
|
RAW_SETBIND and RAW_GETBIND 32bit versions are fscked in interesting ways.
1) fs/compat_ioctl.c has COMPATIBLE_IOCTL(RAW_SETBIND) followed by
HANDLE_IOCTL(RAW_SETBIND, raw_ioctl). The latter is ignored.
2) on amd64 (and itanic) the damn thing is broken - we have int + u64 + u64
and layouts on i386 and amd64 are _not_ the same. raw_ioctl() would
work there, but it's never called due to (1). As it is, i386 /sbin/raw
definitely doesn't work on amd64 boxen.
3) switching to raw_ioctl() as is would *not* work on e.g. sparc64 and ppc64,
which would be rather sad, seeing that normal userland there is 32bit.
The thing is, slapping __packed on the struct in question does not DTRT -
it eliminates *all* padding. The real solution is to use compat_u64.
4) of course, all that stuff has no business being outside of raw.c in the
first place - there should be ->compat_ioctl() for /dev/rawctl instead of
messing with compat_ioctl.c.
[akpm@linux-foundation.org: coding-style fixes]
[arnd@arndb.de: port to 2.6.36]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
Conflicts:
block/blk-core.c
drivers/block/loop.c
mm/swapfile.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
When reloading partition tables, quiesce IO to ensure that no
request references to the partition struct exists. When it is safe
to free the partition table, the IO for that device is restarted
again.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Get rid of init_MUTEX[_LOCKED]() and use sema_init() instead.
(Ported to current XFS code by <aelder@sgi.com>.)
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
This patch adds support for 32bit project quota identifiers.
On disk format is backward compatible with 16bit projid numbers. projid
on disk is now kept in two 16bit values - di_projid_lo (which holds the
same position as old 16bit projid value) and new di_projid_hi (takes
existing padding) and converts from/to 32bit value on the fly.
xfs_admin (for existing fs), mkfs.xfs (for new fs) needs to be used
to enable PROJID32BIT support.
Signed-off-by: Arkadiusz MiĆkiewicz <arekm@maven.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Stop having two different names for many buffer functions and use
the more descriptive xfs_buf_* names directly.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
We're not actually passing around credentials inside XFS for a while
now, so remove all xfs_cred.h with it's cred_t typedef and all
instances of it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
This header only provides one extern that isn't actually declared
anywhere, and shadowed by a macro.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
It used to have a place when it contained an automatically generated
CVS version, but these days it's entirely superflous.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
This header has been completely unused for a couple of years.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Use the correct prototype for xfs_trans_committed instead of casting it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
These days inode64 should only control which AGs we allocate new
inodes from, while we still try to support reading all existing
inodes. To make this actually work the check ontop of xfs_iget
needs to be relaxed to allow inodes in all allocation groups instead
of just those that we allow allocating inodes from. Note that we
can't simply remove the check - it prevents us from accessing
invalid data when fed invalid inode numbers from NFS or bulkstat.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb
and remove support for per-cpu counters from xfs_mod_incore_sb_batch
to simplify it. And added benefit is that we don't have to take
m_sb_lock for transactions that only modify per-cpu counters.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Export xfs_icsb_modify_counters and always use it for modifying
the per-cpu counters. Remove support for per-cpu counters from
xfs_mod_incore_sb to simplify it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
Fail the mount if we can't allocate memory for the per-CPU counters.
This is consistent with how we handle everything else in the mount
path and makes the superblock counter modification a lot simpler.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
|
|
pahole reports the struct xfs_buf has quite a few holes in it, so
packing the structure better will reduce the size of it by 16 bytes.
Also, move all the fields used in cache lookups into the first
cacheline.
Before on x86_64:
/* size: 320, cachelines: 5 */
/* sum members: 298, holes: 6, sum holes: 22 */
After on x86_64:
/* size: 304, cachelines: 5 */
/* padding: 6 */
/* last cacheline: 48 bytes */
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
|
|
The buffer cache hash is showing typical hash scalability problems.
In large scale testing the number of cached items growing far larger
than the hash can efficiently handle. Hence we need to move to a
self-scaling cache indexing mechanism.
I have selected rbtrees for indexing becuse they can have O(log n)
search scalability, and insert and remove cost is not excessive,
even on large trees. Hence we should be able to cache large numbers
of buffers without incurring the excessive cache miss search
penalties that the hash is imposing on us.
To ensure we still have parallel access to the cache, we need
multiple trees. Rather than hashing the buffers by disk address to
select a tree, it seems more sensible to separate trees by typical
access patterns. Most operations use buffers from within a single AG
at a time, so rather than searching lots of different lists,
separate the buffer indexes out into per-AG rbtrees. This means that
searches during metadata operation have a much higher chance of
hitting cache resident nodes, and that updates of the tree are less
likely to disturb trees being accessed on other CPUs doing
independent operations.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
|
|
Memory reclaim via shrinkers has a terrible habit of having N+M
concurrent shrinker executions (N = num CPUs, M = num kswapds) all
trying to shrink the same cache. When the cache they are all working
on is protected by a single spinlock, massive contention an
slowdowns occur.
Wrap the per-ag inode caches with a reclaim mutex to serialise
reclaim access to the AG. This will block concurrent reclaim in each
AG but still allow reclaim to scan multiple AGs concurrently. Allow
shrinkers to move on to the next AG if it can't get the lock, and if
we can't get any AG, then start blocking on locks.
To prevent reclaimers from continually scanning the same inodes in
each AG, add a cursor that tracks where the last reclaim got up to
and start from that point on the next reclaim. This should avoid
only ever scanning a small number of inodes at the satart of each AG
and not making progress. If we have a non-shrinker based reclaim
pass, ignore the cursor and reset it to zero once we are done.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Alex Elder <aelder@sgi.com>
|
|
Batch and optimise the per-ag inode lookup for reclaim to minimise
scanning overhead. This involves gang lookups on the radix trees to
get multiple inodes during each tree walk, and tighter validation of
what inodes can be reclaimed without blocking befor we take any
locks.
This is based on ideas suggested in a proof-of-concept patch
posted by Nick Piggin.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
|
|
With the reclaim code separated from the generic walking code, it is
simple to implement batched lookups for the generic walk code.
Separate out the inode validation from the execute operations and
modify the tree lookups to get a batch of inodes at a time.
Reclaim operations will be optimised separately.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Alex Elder <aelder@sgi.com>
|