aboutsummaryrefslogtreecommitdiffstats
path: root/fs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2010-02-04syslog: use defined constants instead of raw numbersKees Cook1-5/+5
Right now the syslog "type" action are just raw numbers which makes the source difficult to follow. This patch replaces the raw numbers with defined constants for some level of sanity. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2010-02-04syslog: distinguish between /proc/kmsg and syscallsKees Cook1-7/+7
This allows the LSM to distinguish between syslog functions originating from /proc/kmsg access and direct syscalls. By default, the commoncaps will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg file descriptor. For example the kernel syslog reader can now drop privileges after opening /proc/kmsg, instead of staying privileged with CAP_SYS_ADMIN. MAC systems that implement security_syslog have unchanged behavior. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>
2010-01-16nommu: fix shared mmap after truncate shrinkage problemsDavid Howells1-30/+1
Fix a problem in NOMMU mmap with ramfs whereby a shared mmap can happen over the end of a truncation. The problem is that ramfs_nommu_check_mappings() checks that the reduced file size against the VMA tree, but not the vm_region tree. The following sequence of events can cause the problem: fd = open("/tmp/x", O_RDWR|O_TRUNC|O_CREAT, 0600); ftruncate(fd, 32 * 1024); a = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); b = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); munmap(a, 32 * 1024); ftruncate(fd, 16 * 1024); c = mmap(NULL, 32 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'a' creates a vm_region covering 32KB of the file. Mapping 'b' sees that the vm_region from 'a' is covering the region it wants and so shares it, pinning it in memory. Mapping 'a' then goes away and the file is truncated to the end of VMA 'b'. However, the region allocated by 'a' is still in effect, and has _not_ been reduced. Mapping 'c' is then created, and because there's a vm_region covering the desired region, get_unmapped_area() is _not_ called to repeat the check, and the mapping is granted, even though the pages from the latter half of the mapping have been discarded. However: d = mmap(NULL, 16 * 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); Mapping 'd' should work, and should end up sharing the region allocated by 'a'. To deal with this, we shrink the vm_region struct during the truncation, lest do_mmap_pgoff() take it as licence to share the full region automatically without calling the get_unmapped_area() file op again. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-16nommu: fix race between ramfs truncation and shared mmapDavid Howells1-1/+6
Fix the race between the truncation of a ramfs file and an attempt to make a shared mmap of region of that file. The problem is that do_mmap_pgoff() calls f_op->get_unmapped_area() to verify that the file region is made of contiguous pages and to find its base address - but there isn't any locking to guarantee this region until vma_prio_tree_insert() is called by add_vma_to_mm(). Note that moving the functionality into f_op->mmap() doesn't help as that is also called before vma_prio_tree_insert(). Instead make ramfs_nommu_check_mappings() grab nommu_region_sem whilst it does its checks. This means that this function will wait whilst mmaps take place. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15inotify: only warn once for inotify problemsEric Paris1-1/+1
inotify will WARN() if it finds that the idr and the fsnotify internals somehow got out of sync. It was only supposed to do this once but due to this stupid bug it would warn every single time a problem was detected. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-15inotify: do not reuse watch descriptorsEric Paris1-2/+2
Since commit 7e790dd5fc937bc8d2400c30a05e32a9e9eef276 ("inotify: fix error paths in inotify_update_watch") inotify changed the manor in which it gave watch descriptors back to userspace. Previous to this commit inotify acted like the following: inotify_add_watch(X, Y, Z) = 1 inotify_rm_watch(X, 1); inotify_add_watch(X, Y, Z) = 2 but after this patch inotify would return watch descriptors like so: inotify_add_watch(X, Y, Z) = 1 inotify_rm_watch(X, 1); inotify_add_watch(X, Y, Z) = 1 which I saw as equivalent to opening an fd where open(file) = 1; close(1); open(file) = 1; seemed perfectly reasonable. The issue is that quite a bit of userspace apparently relies on the behavior in which watch descriptors will not be quickly reused. KDE relies on it, I know some selinux packages rely on it, and I have heard complaints from other random sources such as debian bug 558981. Although the man page implies what we do is ok, we broke userspace so this patch almost reverts us to the old behavior. It is still slightly racey and I have patches that would fix that, but they are rather large and this will fix it for all real world cases. The race is as follows: - task1 creates a watch and blocks in idr_new_watch() before it updates the hint. - task2 creates a watch and updates the hint. - task1 updates the hint with it's older wd - task removes the watch created by task2 - task adds a new watch and will reuse the wd originally given to task2 it requires moving some locking around the hint (last_wd) but this should solve it for the real world and be -stable safe. As a side effect this patch papers over a bug in the lib/idr code which is causing a large number WARN's to pop on people's system and many reports in kerneloops.org. I'm working on the root cause of that idr bug seperately but this should make inotify immune to that issue. Signed-off-by: Eric Paris <eparis@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-13Merge branch 'fasync-helper'Linus Torvalds1-36/+66
* fasync-helper: fasync: split 'fasync_helper()' into separate add/remove functions
2010-01-12lib: Introduce generic list_sort functionDave Chinner1-95/+1
There are two copies of list_sort() in the tree already, one in the DRM code, another in ubifs. Now XFS needs this as well. Create a generic list_sort() function from the ubifs version and convert existing users to it so we don't end up with yet another copy in the tree. Signed-off-by: Dave Chinner <david@fromorbit.com> Acked-by: Dave Airlie <airlied@redhat.com> Acked-by: Artem Bityutskiy <dedekind@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds5-625/+666
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: Ensure we force all busy extents in range to disk xfs: Don't flush stale inodes xfs: fix timestamp handling in xfs_setattr xfs: use DECLARE_EVENT_CLASS
2010-01-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixesLinus Torvalds4-15/+52
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-fixes: GFS2: Use MAX_LFS_FILESIZE for meta inode size GFS2: Fix gfs2_xattr_acl_chmod() GFS2: Fix locking bug in rename GFS2: Ensure uptodate inode size when using O_APPEND
2010-01-11Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6Linus Torvalds1-0/+3
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: quota: Fix dquot_transfer for filesystems different from ext4
2010-01-11smaps: fix wrong rss countMinchan Kim1-2/+1
A long time ago we regarded zero page as file_rss and vm_normal_page doesn't return NULL. But now, we reinstated ZERO_PAGE and vm_normal_page's implementation can return NULL in case of zero page. Also we don't count it with file_rss any more. Then, RSS and PSS can't be matched. For consistency, Let's ignore zero page in smaps_pte_range. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Matt Mackall <mpm@selenic.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11proc: partially revert "procfs: provide stack information for threads"KOSAKI Motohiro1-89/+0
Commit d899bf7b (procfs: provide stack information for threads) introduced to show stack information in /proc/{pid}/status. But it cause large performance regression. Unfortunately /proc/{pid}/status is used ps command too and ps is one of most important component. Because both to take mmap_sem and page table walk are heavily operation. If many process run, the ps performance is, [before d899bf7b] % perf stat ps >/dev/null Performance counter stats for 'ps': 4090.435806 task-clock-msecs # 0.032 CPUs 229 context-switches # 0.000 M/sec 0 CPU-migrations # 0.000 M/sec 234 page-faults # 0.000 M/sec 8587565207 cycles # 2099.425 M/sec 9866662403 instructions # 1.149 IPC 3789415411 cache-references # 926.409 M/sec 30419509 cache-misses # 7.437 M/sec 128.859521955 seconds time elapsed [after d899bf7b] % perf stat ps > /dev/null Performance counter stats for 'ps': 4305.081146 task-clock-msecs # 0.028 CPUs 480 context-switches # 0.000 M/sec 2 CPU-migrations # 0.000 M/sec 237 page-faults # 0.000 M/sec 9021211334 cycles # 2095.480 M/sec 10605887536 instructions # 1.176 IPC 3612650999 cache-references # 839.160 M/sec 23917502 cache-misses # 5.556 M/sec 152.277819582 seconds time elapsed Thus, this patch revert it. Fortunately /proc/{pid}/task/{tid}/smaps provide almost same information. we can use it. Commit d899bf7b introduced two features: 1) Add the annotattion of [thread stack: xxxx] mark to /proc/{pid}/task/{tid}/maps. 2) Add StackUsage field to /proc/{pid}/status. I only revert (2), because I haven't seen (1) cause regression. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-11quota: Fix dquot_transfer for filesystems different from ext4Jan Kara1-0/+3
Commit fd8fbfc1 modified the way we find amount of reserved space belonging to an inode. The amount of reserved space is checked from dquot_transfer and thus inode_reserved_space gets called even for filesystems that don't provide get_reserved_space callback which results in a BUG. Fix the problem by checking get_reserved_space callback and return 0 if the filesystem does not provide it. CC: Dmitry Monakhov <dmonakhov@openvz.org> Signed-off-by: Jan Kara <jack@suse.cz>
2010-01-11GFS2: Use MAX_LFS_FILESIZE for meta inode sizeSteven Whitehouse1-1/+1
Using ~0ULL was cauing sign issues in filemap_fdatawrite_range, so use MAX_LFS_FILESIZE instead. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-10xfs: Ensure we force all busy extents in range to diskDave Chinner2-29/+27
When we search for and find a busy extent during allocation we force the log out to ensure the extent free transaction is on disk before the allocation transaction. The current implementation has a subtle bug in it--it does not handle multiple overlapping ranges. That is, if we free lots of little extents into a single contiguous extent, then allocate the contiguous extent, the busy search code stops searching at the first extent it finds that overlaps the allocated range. It then uses the commit LSN of the transaction to force the log out to. Unfortunately, the other busy ranges might have more recent commit LSNs than the first busy extent that is found, and this results in xfs_alloc_search_busy() returning before all the extent free transactions are on disk for the range being allocated. This can lead to potential metadata corruption or stale data exposure after a crash because log replay won't replay all the extent free transactions that cover the allocation range. Modified-by: Alex Elder <aelder@sgi.com> (Dropped the "found" argument from the xfs_alloc_busysearch trace event.) Signed-off-by: Dave Chinner <david@fromorbit.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10xfs: Don't flush stale inodesDave Chinner1-3/+7
Because inodes remain in cache much longer than inode buffers do under memory pressure, we can get the situation where we have stale, dirty inodes being reclaimed but the backing storage has been freed. Hence we should never, ever flush XFS_ISTALE inodes to disk as there is no guarantee that the backing buffer is in cache and still marked stale when the flush occurs. Signed-off-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10xfs: fix timestamp handling in xfs_setattrChristoph Hellwig2-55/+41
We currently have some rather odd code in xfs_setattr for updating the a/c/mtime timestamps: - first we do a non-transaction update if all three are updated together - second we implicitly update the ctime for various changes instead of relying on the ATTR_CTIME flag - third we set the timestamps to the current time instead of the arguments in the iattr structure in many cases. This patch makes sure we update it in a consistent way: - always transactional - ctime is only updated if ATTR_CTIME is set or we do a size update, which is a special case - always to the times passed in from the caller instead of the current time The only non-size caller of xfs_setattr that doesn't come from the VFS is updated to set ATTR_CTIME and pass in a valid ctime value. Reported-by: Eric Blake <ebb9@byu.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-10xfs: use DECLARE_EVENT_CLASSChristoph Hellwig1-538/+591
Using DECLARE_EVENT_CLASS allows us to to use trace event code instead of duplicating it in the binary. This was not available before 2.6.33 so it had to be done as a separate step once the prerequisite was merged. This only requires changes to xfs_trace.h and the results are rather impressive: hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o* text data bss dec hex filename 607732 41884 3616 653232 9f7b0 fs/xfs/xfs.o 1026732 41884 3808 1072424 105d28 fs/xfs/xfs.o.old Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-08Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracingLinus Torvalds4-9/+33
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locks reiserfs: Fix unreachable statement reiserfs: Don't call reiserfs_get_acl() with the reiserfs lock reiserfs: Relax lock on xattr removing reiserfs: Relax the lock before truncating pages reiserfs: Fix recursive lock on lchown reiserfs: Fix mistake in down_write() conversion
2010-01-08Merge branch 'for-linus' of git://oss.sgi.com/xfs/xfsLinus Torvalds1-4/+4
* 'for-linus' of git://oss.sgi.com/xfs/xfs: xfs: kill some warnings on i386 builds
2010-01-08Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6Linus Torvalds1-0/+1
* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6: nfs: fix oops in nfs_rename() sunrpc: fix build-time warning sunrpc: on successful gss error pipe write, don't return error SUNRPC: Fix the return value in gss_import_sec_context() SUNRPC: Fix up an error return value in gss_import_sec_context_kerberos()
2010-01-08xfs: kill some warnings on i386 buildsDave Chinner1-4/+4
Randy Dunlap Reported printk() format-related warnings reported on i386 builds in his environment. Dave Chinner provided this patch to eliminate them. Signed-off by: Dave Chinner <david@fromorbit.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Alex Elder <aelder@sgi.com>
2010-01-08GFS2: Fix gfs2_xattr_acl_chmod()Steven Whitehouse1-10/+11
The ref counting for the bh returned by gfs2_ea_find() was wrong. This patch ensures that we always drop the ref count to that bh correctly. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08GFS2: Fix locking bug in renameSteven Whitehouse1-2/+4
The rename code was taking a resource group lock in cases where it wasn't actually needed, this caused problems if the rename was resulting in an inode being unlinked. The patch ensures that we only take the rgrp lock early if it is really needed. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-08GFS2: Ensure uptodate inode size when using O_APPENDSteven Whitehouse1-2/+36
The VFS reads the inode size during generic_file_aio_write() but with no locking around it. In order to get the expected result from O_APPEND opens, this patch updated the inode size before calling generic_file_aio_write() There is of course still a race here, in that there is nothing to prevent another node coming in and extending the file in the mean time. On the other hand, when used with file locking this will ensure that the expected results are obtained. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
2010-01-07reiserfs: Relax reiserfs_xattr_set_handle() while acquiring xattr locksFrederic Weisbecker1-0/+4
Fix remaining xattr locks acquired in reiserfs_xattr_set_handle() while we are holding the reiserfs lock to avoid lock inversions. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-07reiserfs: Fix unreachable statementJiri Slaby1-1/+2
Stanse found an unreachable statement in reiserfs_ioctl. There is a if followed by error assignment and `break' with no braces. Add the braces so that we don't break every time, but only in error case, so that REISERFS_IOC_SETVERSION actually works when it returns no error. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Reiserfs <reiserfs-devel@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-01-07reiserfs: Don't call reiserfs_get_acl() with the reiserfs lockFrederic Weisbecker1-0/+2
reiserfs_get_acl is usually not called under the reiserfs lock, as it doesn't need it. But it happens when it is called by reiserfs_acl_chmod(), which creates a dependency inversion against the private xattr inodes mutexes for the given inode. We need to call it without the reiserfs lock, especially since it's unnecessary. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-06FDPIC: Respect PT_GNU_STACK exec protection markings when creating NOMMU stackMike Frysinger1-2/+11
The current code will load the stack size and protection markings, but then only use the markings in the MMU code path. The NOMMU code path always passes PROT_EXEC to the mmap() call. While this doesn't matter to most people whilst the code is running, it will cause a pointless icache flush when starting every FDPIC application. Typically this icache flush will be of a region on the order of 128KB in size, or may be the entire icache, depending on the facilities available on the CPU. In the case where the arch default behaviour seems to be desired (EXSTACK_DEFAULT), we probe VM_STACK_FLAGS for VM_EXEC to determine whether we should be setting PROT_EXEC or not. For arches that support an MPU (Memory Protection Unit - an MMU without the virtual mapping capability), setting PROT_EXEC or not will make an important difference. It should be noted that this change also affects the executability of the brk region, since ELF-FDPIC has that share with the stack. However, this is probably irrelevant as NOMMU programs aren't likely to use the brk region, preferring instead allocation via mmap(). Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-06Merge branch 'for-2.6.33' of git://linux-nfs.org/~bfields/linuxLinus Torvalds1-4/+1
* 'for-2.6.33' of git://linux-nfs.org/~bfields/linux: sunrpc: fix peername failed on closed listener nfsd: make sure data is on disk before calling ->fsync nfsd: fix "insecure" export option
2010-01-06nfs: fix oops in nfs_rename()OGAWA Hirofumi1-0/+1
Recent change is missing to update "rehash". With that change, it will become the cause of adding dentry to hash twice. This explains the reason of Oops (dereference the freed dentry in __d_lookup()) on my machine. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Reported-by: Marvin <marvin24@gmx.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
2010-01-06nfsd: make sure data is on disk before calling ->fsyncChristoph Hellwig1-4/+1
nfsd is not using vfs_fsync, so I missed it when changing the calling convention during the 2.6.32 window. This patch fixes it to not only start the data writeout, but also wait for it to complete before calling into ->fsync. Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: stable@kernel.org Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
2010-01-06Merge branch 'for-linus' of git://git.open-osd.org/linux-open-osdLinus Torvalds2-9/+18
* 'for-linus' of git://git.open-osd.org/linux-open-osd: exofs: simple_write_end does not mark_inode_dirty exofs: fix pnfs_osd re-definitions in pre-pnfs trees
2010-01-05Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2Linus Torvalds1-6/+15
* 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jlbec/ocfs2: ocfs2: Handle O_DIRECT when writing to a refcounted cluster.
2010-01-05exofs: simple_write_end does not mark_inode_dirtyBoaz Harrosh1-1/+16
exofs uses simple_write_end() for it's .write_end handler. But it is not enough because simple_write_end() does not call mark_inode_dirty() when it extends i_size. So even if we do call mark_inode_dirty at beginning of write out, with a very long IO and a saturated system we might get the .write_inode() called while still extend-writing to file and miss out on the last i_size updates. So override .write_end, call simple_write_end(), and afterwords if i_size was changed call mark_inode_dirty(). It stands to logic that since simple_write_end() was the one extending i_size it should also call mark_inode_dirty(). But it looks like all users of simple_write_end() are memory-bound pseudo filesystems, who could careless about mark_inode_dirty(). I might submit a warning-comment patch to simple_write_end() in future. CC: Stable <stable@kernel.org> Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-01-05exofs: fix pnfs_osd re-definitions in pre-pnfs treesBoaz Harrosh1-8/+2
Some on disk exofs constants and types are defined in the pnfs_osd_xdr.h file. Since we needed these types before the pnfs-objects code was accepted to mainline we duplicated the minimal needed definitions into an exofs local header. The definitions where conditionally included depending on !CONFIG_PNFS defined. So if PNFS was present in the tree definitions are taken from there and if not they are defined locally. That was all good but, the CONFIG_PNFS is planed to be included upstream before the pnfs-objects is also included. (The first pnfs batch might be pnfs-files only) So condition exofs local definitions on the absence of pnfs_osd_xdr.h inclusion (__PNFS_OSD_XDR_H__ not defined). User code must make sure that in future pnfs_osd_xdr.h will be included before fs/exofs/pnfs.h, which happens to be so in current code. Once pnfs-objects hits mainline, exofs's local header will be removed. Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
2010-01-05reiserfs: Relax lock on xattr removingFrederic Weisbecker1-3/+9
When we remove an xattr, we call lookup_and_delete_xattr() that takes some private xattr inodes mutexes. But we hold the reiserfs lock at this time, which leads to dependency inversions. We can safely call lookup_and_delete_xattr() without the reiserfs lock, where xattr inodes lookups only need the xattr inodes mutexes. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-05reiserfs: Relax the lock before truncating pagesFrederic Weisbecker1-1/+10
While truncating a file, reiserfs_setattr() calls inode_setattr() that will truncate the mapping for the given inode, but for that it needs the pages locks. In order to release these, the owners need the reiserfs lock to complete their jobs. But they can't, as we don't release it before calling inode_setattr(). We need to do that to fix the following softlockups: INFO: task flush-8:0:2149 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. flush-8:0 D f51af998 0 2149 2 0x00000000 f51af9ac 00000092 00000002 f51af998 c2803304 00000000 c1894ad0 010f3000 f51af9cc c1462604 c189ef80 f51af974 c1710304 f715b450 f715b5ec c2807c40 00000000 0005bb00 c2803320 c102c55b c1710304 c2807c50 c2803304 00000246 Call Trace: [<c1462604>] ? schedule+0x434/0xb20 [<c102c55b>] ? resched_task+0x4b/0x70 [<c106fa22>] ? mark_held_locks+0x62/0x80 [<c146414d>] ? mutex_lock_nested+0x1fd/0x350 [<c14640b9>] mutex_lock_nested+0x169/0x350 [<c1178cde>] ? reiserfs_write_lock+0x2e/0x40 [<c1178cde>] reiserfs_write_lock+0x2e/0x40 [<c11719a2>] do_journal_end+0xc2/0xe70 [<c1172912>] journal_end+0xb2/0x120 [<c11686b3>] ? pathrelse+0x33/0xb0 [<c11729e4>] reiserfs_end_persistent_transaction+0x64/0x70 [<c1153caa>] reiserfs_get_block+0x12ba/0x15f0 [<c106fa22>] ? mark_held_locks+0x62/0x80 [<c1154b24>] reiserfs_writepage+0xa74/0xe80 [<c1465a27>] ? _raw_spin_unlock_irq+0x27/0x50 [<c11f3d25>] ? radix_tree_gang_lookup_tag_slot+0x95/0xc0 [<c10b5377>] ? find_get_pages_tag+0x127/0x1a0 [<c106fa22>] ? mark_held_locks+0x62/0x80 [<c106fcd4>] ? trace_hardirqs_on_caller+0x124/0x170 [<c10bc1e0>] __writepage+0x10/0x40 [<c10bc9ab>] write_cache_pages+0x16b/0x320 [<c10bc1d0>] ? __writepage+0x0/0x40 [<c10bcb88>] generic_writepages+0x28/0x40 [<c10bcbd5>] do_writepages+0x35/0x40 [<c11059f7>] writeback_single_inode+0xc7/0x330 [<c11067b2>] writeback_inodes_wb+0x2c2/0x490 [<c1106a86>] wb_writeback+0x106/0x1b0 [<c1106cf6>] wb_do_writeback+0x106/0x1e0 [<c1106c18>] ? wb_do_writeback+0x28/0x1e0 [<c1106e0a>] bdi_writeback_task+0x3a/0xb0 [<c10cbb13>] bdi_start_fn+0x63/0xc0 [<c10cbab0>] ? bdi_start_fn+0x0/0xc0 [<c105d1f4>] kthread+0x74/0x80 [<c105d180>] ? kthread+0x0/0x80 [<c100327a>] kernel_thread_helper+0x6/0x10 3 locks held by flush-8:0/2149: #0: (&type->s_umount_key#30){+++++.}, at: [<c110676f>] writeback_inodes_wb+0x27f/0x490 #1: (&journal->j_mutex){+.+...}, at: [<c117199a>] do_journal_end+0xba/0xe70 #2: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178cde>] reiserfs_write_lock+0x2e/0x40 INFO: task fstest:3813 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. fstest D 00000002 0 3813 3812 0x00000000 f5103c94 00000082 f5103c40 00000002 f5ad5450 00000007 f5103c28 011f3000 00000006 f5ad5450 c10bb005 00000480 c1710304 f5ad5450 f5ad55ec c2907c40 00000001 f5ad5450 f5103c74 00000046 00000002 f5ad5450 00000007 f5103c6c Call Trace: [<c10bb005>] ? free_hot_cold_page+0x1d5/0x280 [<c1462d64>] io_schedule+0x74/0xc0 [<c10b5a45>] sync_page+0x35/0x60 [<c146325a>] __wait_on_bit_lock+0x4a/0x90 [<c10b5a10>] ? sync_page+0x0/0x60 [<c10b59e5>] __lock_page+0x85/0x90 [<c105d660>] ? wake_bit_function+0x0/0x60 [<c10bf654>] truncate_inode_pages_range+0x1e4/0x2d0 [<c10bf75f>] truncate_inode_pages+0x1f/0x30 [<c10bf7cf>] truncate_pagecache+0x5f/0xa0 [<c10bf86a>] vmtruncate+0x5a/0x70 [<c10fdb7d>] inode_setattr+0x5d/0x190 [<c1150117>] reiserfs_setattr+0x1f7/0x2f0 [<c1464569>] ? down_write+0x49/0x70 [<c10fde01>] notify_change+0x151/0x330 [<c10e6f3d>] do_truncate+0x6d/0xa0 [<c10f4ce2>] do_filp_open+0x9a2/0xcf0 [<c1465aec>] ? _raw_spin_unlock+0x2c/0x50 [<c10fec50>] ? alloc_fd+0xe0/0x100 [<c10e602d>] do_sys_open+0x6d/0x130 [<c1002cfb>] ? sysenter_exit+0xf/0x16 [<c10e615e>] sys_open+0x2e/0x40 [<c1002ccc>] sysenter_do_call+0x12/0x32 3 locks held by fstest/3813: #0: (&sb->s_type->i_mutex_key#4){+.+.+.}, at: [<c10e6f33>] do_truncate+0x63/0xa0 #1: (&sb->s_type->i_alloc_sem_key#3){+.+.+.}, at: [<c10fdf07>] notify_change+0x257/0x330 #2: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1178c8e>] reiserfs_write_lock_once+0x2e/0x50 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-05reiserfs: Fix recursive lock on lchownFrederic Weisbecker1-3/+5
On chown, reiserfs will call reiserfs_setattr() to change the owner of the given inode, but it may also recursively call reiserfs_setattr() to propagate the owner change to the private xattr files for this inode. Hence, the reiserfs lock may be acquired twice which is not wanted as reiserfs_setattr() calls journal_begin() that is going to try to relax the lock in order to safely acquire the journal mutex. Using reiserfs_write_lock_once() from reiserfs_setattr() solves the problem. This fixes the following warning, that precedes a lockdep report. WARNING: at fs/reiserfs/lock.c:95 reiserfs_lock_check_recursive+0x3f/0x50() Hardware name: MS-7418 Unwanted recursive reiserfs lock! Pid: 4189, comm: fsstress Not tainted 2.6.33-rc2-tip-atom+ #195 Call Trace: [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50 [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50 [<c103f7ac>] warn_slowpath_common+0x6c/0xc0 [<c1178bff>] ? reiserfs_lock_check_recursive+0x3f/0x50 [<c103f84b>] warn_slowpath_fmt+0x2b/0x30 [<c1178bff>] reiserfs_lock_check_recursive+0x3f/0x50 [<c1172ae3>] do_journal_begin_r+0x83/0x350 [<c1172f2d>] journal_begin+0x7d/0x140 [<c106509a>] ? in_group_p+0x2a/0x30 [<c10fda71>] ? inode_change_ok+0x91/0x140 [<c115007d>] reiserfs_setattr+0x15d/0x2e0 [<c10f9bf3>] ? dput+0xe3/0x140 [<c1465adc>] ? _raw_spin_unlock+0x2c/0x50 [<c117831d>] chown_one_xattr+0xd/0x10 [<c11780a3>] reiserfs_for_each_xattr+0x113/0x2c0 [<c1178310>] ? chown_one_xattr+0x0/0x10 [<c14641e9>] ? mutex_lock_nested+0x2a9/0x350 [<c117826f>] reiserfs_chown_xattrs+0x1f/0x60 [<c106509a>] ? in_group_p+0x2a/0x30 [<c10fda71>] ? inode_change_ok+0x91/0x140 [<c1150046>] reiserfs_setattr+0x126/0x2e0 [<c1177c20>] ? reiserfs_getxattr+0x0/0x90 [<c11b0d57>] ? cap_inode_need_killpriv+0x37/0x50 [<c10fde01>] notify_change+0x151/0x330 [<c10e659f>] chown_common+0x6f/0x90 [<c10e67bd>] sys_lchown+0x6d/0x80 [<c1002ccc>] sysenter_do_call+0x12/0x32 ---[ end trace 7c2b77224c1442fc ]--- Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-04sysfs: Add lockdep annotations for the sysfs active referenceEric W. Biederman2-2/+27
Holding locks over device_del -> kobject_del -> sysfs_deactivate can cause deadlocks if those same locks are grabbed in sysfs show or store methods. The I model s_active count + completion as a sleeping read/write lock. I describe to lockdep sysfs_get_active as a read_trylock, sysfs_put_active as a read_unlock, and sysfs_deactivate as a write_lock and write_unlock pair. This seems to capture the essence for purposes of finding deadlocks, and in my testing gives finds real issues and ignores non-issues. This brings us back to holding locks over kobject_del is a problem that ideally we should find a way of addressing, but at least lockdep can tell us about the problems instead of requiring developers to debug rare strange system deadlocks, that happen when sysfs files are removed while being written to. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-04Merge branch 'sh/for-2.6.33' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6Linus Torvalds1-2/+2
* 'sh/for-2.6.33' of git://git.kernel.org/pub/scm/linux/kernel/git/lethal/sh-2.6: binfmt_elf_fdpic: Fix build breakage introduced by coredump changes. sh: update defconfigs. sh: Don't default enable PMB support. sh: Disable PMB for SH4AL-DSP CPUs. sh: Only provide a PCLK definition for legacy CPG CPUs.
2010-01-04Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4Linus Torvalds5-48/+77
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: Calculate metadata requirements more accurately ext4: Fix accounting of reserved metadata blocks
2010-01-04Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2Linus Torvalds4-24/+30
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ryusuke/nilfs2: nilfs2: update mailing list address nilfs2: Storage class should be before const qualifier nilfs2: trivial coding style fix
2010-01-04binfmt_elf_fdpic: Fix build breakage introduced by coredump changes.Daisuke HATAYAMA1-2/+2
Commit f6151dfea21496d43dbaba32cfcd9c9f404769bc introduces build breakage, so this patch fixes it together with some printk formatting cleanup. Signed-off-by: Daisuke HATAYAMA <d.hatayama@jp.fujitsu.com> Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2010-01-03reiserfs: Fix mistake in down_write() conversionFrederic Weisbecker1-1/+1
Fix a mistake in commit 0719d3434747889b314a1e8add776418c4148bcf (reiserfs: Fix reiserfs lock <-> i_xattr_sem dependency inversion) that has converted a down_write() into a down_read() accidentally. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-02Merge branch 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracingLinus Torvalds6-15/+53
* 'reiserfs/kill-bkl' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: reiserfs: Safely acquire i_mutex from xattr_rmdir reiserfs: Safely acquire i_mutex from reiserfs_for_each_xattr reiserfs: Fix journal mutex <-> inode mutex lock inversion reiserfs: Fix unwanted recursive reiserfs lock in reiserfs_unlink() reiserfs: Relax lock before open xattr dir in reiserfs_xattr_set_handle() reiserfs: Relax reiserfs lock while freeing the journal reiserfs: Fix reiserfs lock <-> i_mutex dependency inversion on xattr reiserfs: Warn on lock relax if taken recursively reiserfs: Fix reiserfs lock <-> i_xattr_sem dependency inversion reiserfs: Fix remaining in-reclaim-fs <-> reclaim-fs-on locking inversion reiserfs: Fix reiserfs lock <-> inode mutex dependency inversion reiserfs: Fix reiserfs lock and journal lock inversion dependency reiserfs: Fix possible recursive lock
2010-01-02writeback: add missing kernel-doc notationJaswinder Singh Rajput1-0/+1
Fix the following htmldocs warning: Warning(fs/fs-writeback.c:255): No description found for parameter 'sb' Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com> Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Acked-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jan Kara <jack@suse.cz> Cc: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-01-02reiserfs: Safely acquire i_mutex from xattr_rmdirFrederic Weisbecker1-1/+2
Relax the reiserfs lock before taking the inode mutex from xattr_rmdir() to avoid the usual reiserfs lock <-> inode mutex bad dependency. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2010-01-02reiserfs: Safely acquire i_mutex from reiserfs_for_each_xattrFrederic Weisbecker1-2/+3
Relax the reiserfs lock before taking the inode mutex from reiserfs_for_each_xattr() to avoid the usual bad dependencies: ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.32-atom #179 ------------------------------------------------------- rm/3242 is trying to acquire lock: (&sb->s_type->i_mutex_key#4/3){+.+.+.}, at: [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290 but task is already holding lock: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1143389>] reiserfs_write_lock+0x29/0x40 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&REISERFS_SB(s)->lock){+.+.+.}: [<c105ea7f>] __lock_acquire+0x11ff/0x19e0 [<c105f2c8>] lock_acquire+0x68/0x90 [<c1401aab>] mutex_lock_nested+0x5b/0x340 [<c1143339>] reiserfs_write_lock_once+0x29/0x50 [<c1117022>] reiserfs_lookup+0x62/0x140 [<c10bd85f>] __lookup_hash+0xef/0x110 [<c10bf21d>] lookup_one_len+0x8d/0xc0 [<c1141e3a>] open_xa_dir+0xea/0x1b0 [<c1142720>] reiserfs_for_each_xattr+0x70/0x290 [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60 [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150 [<c10c9c32>] generic_delete_inode+0xa2/0x170 [<c10c9d4f>] generic_drop_inode+0x4f/0x70 [<c10c8b07>] iput+0x47/0x50 [<c10c0965>] do_unlinkat+0xd5/0x160 [<c10c0b13>] sys_unlinkat+0x23/0x40 [<c1002ec4>] sysenter_do_call+0x12/0x32 -> #0 (&sb->s_type->i_mutex_key#4/3){+.+.+.}: [<c105f176>] __lock_acquire+0x18f6/0x19e0 [<c105f2c8>] lock_acquire+0x68/0x90 [<c1401aab>] mutex_lock_nested+0x5b/0x340 [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290 [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60 [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150 [<c10c9c32>] generic_delete_inode+0xa2/0x170 [<c10c9d4f>] generic_drop_inode+0x4f/0x70 [<c10c8b07>] iput+0x47/0x50 [<c10c0965>] do_unlinkat+0xd5/0x160 [<c10c0b13>] sys_unlinkat+0x23/0x40 [<c1002ec4>] sysenter_do_call+0x12/0x32 other info that might help us debug this: 1 lock held by rm/3242: #0: (&REISERFS_SB(s)->lock){+.+.+.}, at: [<c1143389>] reiserfs_write_lock+0x29/0x40 stack backtrace: Pid: 3242, comm: rm Not tainted 2.6.32-atom #179 Call Trace: [<c13ffa13>] ? printk+0x18/0x1a [<c105d33a>] print_circular_bug+0xca/0xd0 [<c105f176>] __lock_acquire+0x18f6/0x19e0 [<c105c932>] ? mark_held_locks+0x62/0x80 [<c105cc3b>] ? trace_hardirqs_on+0xb/0x10 [<c1401098>] ? mutex_unlock+0x8/0x10 [<c105f2c8>] lock_acquire+0x68/0x90 [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290 [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290 [<c1401aab>] mutex_lock_nested+0x5b/0x340 [<c11428ef>] ? reiserfs_for_each_xattr+0x23f/0x290 [<c11428ef>] reiserfs_for_each_xattr+0x23f/0x290 [<c1143180>] ? delete_one_xattr+0x0/0x100 [<c11429ba>] reiserfs_delete_xattrs+0x1a/0x60 [<c1143339>] ? reiserfs_write_lock_once+0x29/0x50 [<c111ea2f>] reiserfs_delete_inode+0x9f/0x150 [<c11b0d4f>] ? _atomic_dec_and_lock+0x4f/0x70 [<c111e990>] ? reiserfs_delete_inode+0x0/0x150 [<c10c9c32>] generic_delete_inode+0xa2/0x170 [<c10c9d4f>] generic_drop_inode+0x4f/0x70 [<c10c8b07>] iput+0x47/0x50 [<c10c0965>] do_unlinkat+0xd5/0x160 [<c1401098>] ? mutex_unlock+0x8/0x10 [<c10c3e0d>] ? vfs_readdir+0x7d/0xb0 [<c10c3af0>] ? filldir64+0x0/0xf0 [<c1002ef3>] ? sysenter_exit+0xf/0x16 [<c105cbe4>] ? trace_hardirqs_on_caller+0x124/0x170 [<c10c0b13>] sys_unlinkat+0x23/0x40 [<c1002ec4>] sysenter_do_call+0x12/0x32 Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Tested-by: Christian Kujau <lists@nerdbynature.de> Cc: Alexander Beregalov <a.beregalov@gmail.com> Cc: Chris Mason <chris.mason@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>