Age | Commit message (Collapse) | Author | Files | Lines |
|
Test generic/224 is failing with a corruption being detected on one
of Michael's test boxes. Debug that Michael added is indicating
that the minleft trimming is resulting in an underflow:
.....
before fixup: rlen 1 args->len 0
after xfs_alloc_fix_len : rlen 1 args->len 1
before goto out_nominleft: rlen 1 args->len 0
before fixup: rlen 1 args->len 0
after xfs_alloc_fix_len : rlen 1 args->len 1
after fixup: rlen 1 args->len 1
before fixup: rlen 1 args->len 0
after xfs_alloc_fix_len : rlen 1 args->len 1
after fixup: rlen 4294967295 args->len 4294967295
XFS: Assertion failed: fs_is_ok, file: fs/xfs/libxfs/xfs_alloc.c, line: 1424
The "goto out_nominleft:" indicates that we are getting close to
ENOSPC in the AG, and a couple of allocations later we underflow
and the corruption check fires in xfs_alloc_ag_vextent_size().
The issue is that the extent length fixups comaprisons are done
with variables of xfs_extlen_t types. These are unsigned so an
underflow looks like a really big value and hence is not detected
as being smaller than the minimum length allowed for the extent.
Hence the corruption check fires as it is noticing that the returned
length is longer than the original extent length passed in.
This can be easily fixed by ensuring we do the underflow test on
signed values, the same way xfs_alloc_fix_len() prevents underflow.
So we realise in future that these casts prevent underflows from
going undetected, add comments to the code indicating this.
Reported-by: Michael L. Semon <mlsemon35@gmail.com>
Tested-by: Michael L. Semon <mlsemon35@gmail.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
If xfs_trans_reserve fails we don't cancel the transaction,
and we'll leak the allocated transaction pointer.
Spotted by Coverity.
Signed-off-by: Eric Sandeen <ssandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
The error messages document the reason for the checks better than the comment
and the comments about volume mounts date back to Irix and so aren't relevant
any more. So just remove the old and redundant comment.
Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Today, when the "failing async writes" get ratelimited, we see:
XFS:: 62836 callbacks suppressed
Aside from the extra ":" it's not entirely clear which message is being
suppressed, especially if other messages or ratelimits are happening
at the same time. Clarify this as i.e.:
XFS (dm-11): Failing async write on buffer block 0x140090. Retrying async write.
XFS: Failing async write: 62836 callbacks suppressed
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
There are times, when doing triage and forensics,
that we would like to know whether a filesystem was unmounted,
or if the plug was pulled without a clean unmount. Log
unmounts at the same level (NOTICE) as we log mounts.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
We shouldn't get here with RENAME_EXCHANGE set and no
target_ip, but let's be defensive, because xfs_cross_rename()
will dereference it.
Spotted by Coverity.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Today, if we hit an XFS_WANT_CORRUPTED_RETURN we don't print any
information about which filesystem hit it. Passing in the mp allows
us to print the filesystem (device) name, which is a pretty critical
piece of information.
Tested by running fsfuzzer 'til I hit some.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Today, if we hit an XFS_WANT_CORRUPTED_GOTO we don't print any
information about which filesystem hit it. Passing in the mp allows
us to print the filesystem (device) name, which is a pretty critical
piece of information.
Tested by running fsfuzzer 'til I hit some.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Al Viro noticed a generic set of issues to do with filehandle lookup
racing with dentry cache setup. They involve a filehandle lookup
occurring while an inode is being created and the filehandle lookup
racing with the dentry creation for the real file. This can lead to
multiple dentries for the one path being instantiated. There are a
host of other issues around this same set of paths.
The underlying cause is that file handle lookup only waits on inode
cache instantiation rather than full dentry cache instantiation. XFS
is mostly immune to the problems discovered due to it's own internal
inode cache, but there are a couple of corner cases where races can
happen.
We currently clear the XFS_INEW flag when the inode is fully set up
after insertion into the cache. Newly allocated inodes are inserted
locked and so aren't usable until the allocation transaction
commits. This, however, occurs before the dentry and security
information is fully initialised and hence the inode is unlocked and
available for lookups to find too early.
To solve the problem, only clear the XFS_INEW flag for newly created
inodes once the dentry is fully instantiated. This means lookups
will retry until the XFS_INEW flag is removed from the inode and
hence avoids the race conditions in questions.
THis also means that xfs_create(), xfs_create_tmpfile() and
xfs_symlink() need to finish the setup of the inode in their error
paths if we had allocated the inode but failed later in the creation
process. xfs_symlink(), in particular, needed a lot of help to make
it's error handling match that of xfs_create().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
A new fsync vs power fail test in xfstests indicated that XFS can
have unreliable data consistency when doing extending truncates that
require block zeroing. The blocks beyond EOF get zeroed in memory,
but we never force those changes to disk before we run the
transaction that extends the file size and exposes those blocks to
userspace. This can result in the blocks not being correctly zeroed
after a crash.
Because in-memory behaviour is correct, tools like fsx don't pick up
any coherency problems - it's not until the filesystem is shutdown
or the system crashes after writing the truncate transaction to the
journal but before the zeroed data in the page cache is flushed that
the issue is exposed.
Fix this by also flushing the dirty data in memory region between
the old size and new size when we've found blocks that need zeroing
in the truncate process.
Reported-by: Liu Bo <bo.li.liu@oracle.com>
cc: <stable@vger.kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
For filesystems without separate project quota inode field in the
superblock we just reuse project quota file for group quotas (and vice
versa) if project quota file is allocated and we need group quota file.
When we reuse the file, quota structures on disk suddenly have wrong
type stored in d_flags though. Nobody really cares about this (although
structure type reported to userspace was wrong as well) except
that after commit 14bf61ffe6ac (quota: Switch ->get_dqblk() and
->set_dqblk() to use bytes as space units) assertion in
xfs_qm_scall_getquota() started to trigger on xfs/106 test (apparently I
was testing without XFS_DEBUG so I didn't notice when submitting the
above commit).
Fix the problem by properly resetting ddq->d_flags when running quotacheck
for a quota file.
CC: stable@vger.kernel.org
Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Extent swap operations are another extent manipulation operation
that we need to ensure does not race against mmap page faults. The
current code returns if the file is mapped prior to the swap being
done, but it could potentially race against new page faults while
the swap is in progress. Hence we should use the XFS_MMAPLOCK_EXCL
for this operation, too.
While there, fix the error path handling that can result in double
unlocks of the inodes when cancelling the swapext transaction.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Now that truncate locks out new page faults, we no longer need to do
special writeback hacks in truncate to work around potential races
between page faults, page cache truncation and file size updates to
ensure we get write page faults for extending truncates on sub-page
block size filesystems. Hence we can remove the code in
xfs_setattr_size() that handles this and update the comments around
the code tha thandles page cache truncate and size updates to
reflect the new reality.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Now we have the i_mmap_lock being held across the page fault IO
path, we now add extent manipulation operation exclusion by adding
the lock to the paths that directly modify extent maps. This
includes truncate, hole punching and other fallocate based
operations. The operations will now take both the i_iolock and the
i_mmaplock in exclusive mode, thereby ensuring that all IO and page
faults block without holding any page locks while the extent
manipulation is in progress.
This gives us the lock order during truncate of i_iolock ->
i_mmaplock -> page_lock -> i_lock, hence providing the same
lock order as the iolock provides the normal IO path without
involving the mmap_sem.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Take the i_mmaplock over write page faults. These come through the
->page_mkwrite callout, so we need to wrap that calls with the
i_mmaplock.
This gives us a lock order of mmap_sem -> i_mmaplock -> page_lock
-> i_lock.
Also, move the page_mkwrite wrapper to the same region of xfs_file.c
as the read fault wrappers and add a tracepoint.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Take the i_mmaplock over read page faults. These come through the
->fault callout, so we need to wrap the generic implementation
with the i_mmaplock. While there, add tracepoints for the read
fault as it passes through XFS.
This gives us a lock order of mmap_sem -> i_mmaplock -> page_lock
-> i_lock.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Right now we cannot serialise mmap against truncate or hole punch
sanely. ->page_mkwrite is not able to take locks that the read IO
path normally takes (i.e. the inode iolock) because that could
result in lock inversions (read - iolock - page fault - page_mkwrite
- iolock) and so we cannot use an IO path lock to serialise page
write faults against truncate operations.
Instead, introduce a new lock that is used *only* in the
->page_mkwrite path that is the equivalent of the iolock. The lock
ordering in a page fault is i_mmaplock -> page lock -> i_ilock,
and so in truncate we can i_iolock -> i_mmaplock and so lock out
new write faults during the process of truncation.
Because i_mmap_lock is outside the page lock, we can hold it across
all the same operations we hold the i_iolock for. The only
difference is that we never hold the i_mmaplock in the normal IO
path and so do not ever have the possibility that we can page fault
inside it. Hence there are no recursion issues on the i_mmap_lock
and so we can use it to serialise page fault IO against inode
modification operations that affect the IO path.
This patch introduces the i_mmaplock infrastructure, lockdep
annotations and initialisation/destruction code. Use of the new lock
will be in subsequent patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Now that there are no users of the bitfield based incore superblock
modification API, just remove the whole damn lot of it, including
all the bitfield definitions. This finally removes a lot of cruft
that has been around for a long time.
Credit goes to Christoph Hellwig for providing a great patch
connecting all the dots to enale us to do this. This patch is
derived from that work.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Introduce helper functions for modifying fields in the superblock
into xfs_trans.c, the only caller of xfs_mod_incore_sb_batch(). We
can then use these directly in xfs_trans_unreserve_and_mod_sb() and
so remove another user of the xfs_mode_incore_sb() API without
losing any functionality or scalability of the transaction commit
code..
Based on a patch from Christoph Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Add a new helper to modify the incore counter of free realtime
extents. This matches the helpers used for inode and data block
counters, and removes a significant users of the xfs_mod_incore_sb()
interface.
Based on a patch originally from Christoph Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
Now that the in-core superblock infrastructure has been replaced with
generic per-cpu counters, we don't need it anymore. Nuke it from
orbit so we are sure that it won't haunt us again...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. The free block counter is
special in that it is used for ENOSPC detection outside transaction
contexts for for delayed allocation. This means that the counter
needs to be accurate at zero. The current per-cpu counter code jumps
through lots of hoops to ensure we never run past zero, but we don't
need to make all those jumps with the generic counter
implementation.
The generic counter implementation allows us to pass a "batch"
threshold at which the addition/subtraction to the counter value
will be folded back into global value under lock. We can use this
feature to reduce the batch size as we approach 0 in a very similar
manner to the existing counters and their rebalance algorithm. If we
use a batch size of 1 as we approach 0, then every addition and
subtraction will be done against the global value and hence allow
accurate detection of zero threshold crossing.
Hence we can replace the handrolled, accurate-at-zero counters with
generic percpu counters.
Note: this removes just enough of the icsb infrastructure to compile
without warnings. The rest will go in subsequent commits.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. The free inode counter is not
used for any limit enforcement - the per-AG free inode counters are
used during allocation to determine if there are inode available for
allocation.
Hence we don't need any of the complexity of the hand-rolled
counters and we can simply replace them with generic per-cpu
counters similar to the inode counter.
This version introduces a xfs_mod_ifree() helper function from
Christoph Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
XFS has hand-rolled per-cpu counters for the superblock since before
there was any generic implementation. There are some warts around
the use of them for the inode counter as the hand rolled counter is
designed to be accurate at zero, but has no specific accurracy at
any other value. This design causes problems for the maximum inode
count threshold enforcement, as there is no trigger that balances
the counters as they get close tothe maximum threshold.
Instead of designing new triggers for balancing, just replace the
handrolled per-cpu counter with a generic counter. This enables us
to update the counter through the normal superblock modification
funtions, but rather than do that we add a xfs_mod_icount() helper
function (from Christoph Hellwig) and keep the percpu counter
outside the superblock in the struct xfs_mount.
This means we still need to initialise the per-cpu counter
specifically when we read the superblock, and vice versa when we
log/write it, but it does mean that we don't need to change any
other code.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
|
|
.. after extensive statistical analysis of my G+ polling, I've come to
the inescapable conclusion that internet polls are bad.
Big surprise.
But "Hurr durr I'ma sheep" trounced "I like online polls" by a 62-to-38%
margin, in a poll that people weren't even supposed to participate in.
Who can argue with solid numbers like that? 5,796 votes from people who
can't even follow the most basic directions?
In contrast, "v4.0" beat out "v3.20" by a slimmer margin of 56-to-44%,
but with a total of 29,110 votes right now.
Now, arguably, that vote spread is only about 3,200 votes, which is less
than the almost six thousand votes that the "please ignore" poll got, so
it could be considered noise.
But hey, I asked, so I'll honor the votes.
|
|
X-Coverup: just ask spender
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
use_pde()/unuse_pde() in ->follow_link()/->put_link() resp.
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
As it is, we have debugfs_remove() racing with symlink traversals.
Supply ->evict_inode() and do freeing there - inode will remain
pinned until we are done with the symlink body.
And rip the idiocy with checking if dentry is positive right after
we'd verified debugfs_positive(), which is a stronger check...
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
I've noticed significant locking contention in memory reclaimer around
sb_lock inside grab_super_passive(). Grab_super_passive() is called from
two places: in icache/dcache shrinkers (function super_cache_scan) and
from writeback (function __writeback_inodes_wb). Both are required for
progress in memory allocator.
Grab_super_passive() acquires sb_lock to increment sb->s_count and check
sb->s_instances. It seems sb->s_umount locked for read is enough here:
super-block deactivation always runs under sb->s_umount locked for write.
Protecting super-block itself isn't a problem: in super_cache_scan() sb
is protected by shrinker_rwsem: it cannot be freed if its slab shrinkers
are still active. Inside writeback super-block comes from inode from bdi
writeback list under wb->list_lock.
This patch removes locking sb_lock and checks s_instances under s_umount:
generic_shutdown_super() unlinks it under sb->s_umount locked for write.
New variant is called trylock_super() and since it only locks semaphore,
callers must call up_read(&sb->s_umount) instead of drop_super(sb) when
they're done.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Fanotify probably doesn't want to watch autodirs so make it use d_can_lookup()
rather than d_is_dir() when checking a dir watch and give an error on fake
directories.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Fix up the following scripted S_ISDIR/S_ISREG/S_ISLNK conversions (or lack
thereof) in cachefiles:
(1) Cachefiles mostly wants to use d_can_lookup() rather than d_is_dir() as
it doesn't want to deal with automounts in its cache.
(2) Coccinelle didn't find S_IS* expressions in ASSERT() statements in
cachefiles.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Convert the following where appropriate:
(1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).
(2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).
(3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry). This is actually more
complicated than it appears as some calls should be converted to
d_can_lookup() instead. The difference is whether the directory in
question is a real dir with a ->lookup op or whether it's a fake dir with
a ->d_automount op.
In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).
Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer. In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.
However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.
There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE. Strictly, this was
intended for special directory entry types that don't have attached inodes.
The following perl+coccinelle script was used:
use strict;
my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
print "No matches\n";
exit(0);
}
my @cocci = (
'@@',
'expression E;',
'@@',
'',
'- S_ISLNK(E->d_inode->i_mode)',
'+ d_is_symlink(E)',
'',
'@@',
'expression E;',
'@@',
'',
'- S_ISDIR(E->d_inode->i_mode)',
'+ d_is_dir(E)',
'',
'@@',
'expression E;',
'@@',
'',
'- S_ISREG(E->d_inode->i_mode)',
'+ d_is_reg(E)' );
my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);
foreach my $file (@callers) {
chomp $file;
print "Processing ", $file, "\n";
system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
die "spatch failed";
}
[AV: overlayfs parts skipped]
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Use d_is_positive() rather than testing dentry->d_inode in SELinux to get rid
of direct references to d_inode outside of the VFS.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Use d_is_positive() rather than testing dentry->d_inode in Smack to get rid of
direct references to d_inode outside of the VFS.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Use d_is_dir() rather than d_inode and S_ISDIR(). Note that this will include
fake directories such as automount triggers.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Use d_is_positive(dentry) or d_is_negative(dentry) rather than testing
dentry->d_inode as the dentry may cover another layer that has an inode when
the top layer doesn't or may hold a 0,0 chardev that's actually a whiteout.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
mediated_filesystem() should use dentry->d_sb not dentry->d_inode->i_sb and
should avoid file_inode() also since it is really dealing with the path.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Split DCACHE_FILE_TYPE into DCACHE_REGULAR_TYPE (dentries representing regular
files) and DCACHE_SPECIAL_TYPE (representing blockdev, chardev, FIFO and
socket files).
d_is_reg() and d_is_special() are added to detect these subtypes and
d_is_file() is left as the union of the two.
This allows a number of places that use S_ISREG(dentry->d_inode->i_mode) to
use d_is_reg(dentry) instead.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Add a DCACHE_FALLTHRU flag to indicate that, in a layered filesystem, this is
a virtual dentry that covers another one in a lower layer that should be used
instead. This may be recorded on medium if directory integration is stored
there.
The flag can be set with d_set_fallthru() and tested with d_is_fallthru().
Original-author: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Add DCACHE_WHITEOUT_TYPE and provide a d_is_whiteout() accessor function. A
d_is_miss() accessor is also added for ordinary cache misses and
d_is_negative() is modified to indicate either an ordinary miss or an enforced
miss (whiteout).
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Introduce some function for getting the inode (and also the dentry) in an
environment where layered/unioned filesystems are in operation.
The problem is that we have places where we need *both* the union dentry and
the lower source or workspace inode or dentry available, but we can only have
a handle on one of them. Therefore we need to derive the handle to the other
from that.
The idea is to introduce an extra field in struct dentry that allows the union
dentry to refer to and pin the lower dentry.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
The use of READ_ONCE() causes lots of warnings witht he pending paravirt
spinlock fixes, because those ends up having passing a member to a
'const' structure to READ_ONCE().
There should certainly be nothing wrong with using READ_ONCE() with a
const source, but the helper function __read_once_size() would cause
warnings because it would drop the 'const' qualifier, but also because
the destination would be marked 'const' too due to the use of 'typeof'.
Use a union of types in READ_ONCE() to avoid this issue.
Also make sure to use parenthesis around the macro arguments to avoid
possible operator precedence issues.
Tested-by: Ingo Molnar <mingo@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When reading blkio.throttle.io_serviced in a recently created blkio
cgroup, it's possible to race against the creation of a throttle policy,
which delays the allocation of stats_cpu.
Like other functions in the throttle code, just checking for a NULL
stats_cpu prevents the following oops caused by that race.
[ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
[ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
[ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
[ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
[ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
[ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
[ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
[ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
[ 1137.734202] REGS: c000000f1d213500 TRAP: 0300 Not tainted (3.19.0)
[ 1137.734230] MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR: 42008884 XER: 20000000
[ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
[ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
[ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
[ 1137.734943] Call Trace:
[ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
[ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
[ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
[ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
[ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
[ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
[ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
[ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
[ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
[ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
[ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
[ 1137.735383] Instruction dump:
[ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
[ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 <7cead02a> e9090008 e9490010 e9290018
And here is one code that allows to easily reproduce this, although this
has first been found by running docker.
void run(pid_t pid)
{
int n;
int status;
int fd;
char *buffer;
buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
fd = open(CGPATH "/test/tasks", O_WRONLY);
write(fd, buffer, n);
close(fd);
if (fork() > 0) {
fd = open("/dev/sda", O_RDONLY | O_DIRECT);
read(fd, buffer, 512);
close(fd);
wait(&status);
} else {
fd = open(CGPATH "/test/blkio.throttle.io_serviced", O_RDONLY);
n = read(fd, buffer, BUFFER_SIZE);
close(fd);
}
free(buffer);
exit(0);
}
void test(void)
{
int status;
mkdir(CGPATH "/test", 0666);
if (fork() > 0)
wait(&status);
else
run(getpid());
rmdir(CGPATH "/test");
}
int main(int argc, char **argv)
{
int i;
for (i = 0; i < NR_TESTS; i++)
test();
return 0;
}
Reported-by: Ricardo Marin Matinata <rmm@br.ibm.com>
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
Commit e9de688dac65 ("irqchip: mips-gic: Support local interrupts")
updated several platforms. This is a copy paste error.
Signed-off-by: Niklas Cassel <niklass@axis.com>
Reviewed-by: Andrew Bresticker <abrestic@chromium.org>
Cc: linux-mips@linux-mips.org
Patchwork: https://patchwork.linux-mips.org/patch/9245/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
|
|
In commit c441d4a54c6e ("MIPS: mm: Only build one microassembler that
is suitable"), the Makefile at arch/mips/mm was rewritten to only
build the "right" microassembler file, depending on whether
CONFIG_CPU_MICROMIPS is set or not.
In the files, however, there are still preprocessor definitions
depending on CONFIG_CPU_MICROMIPS. The #ifdef around them can now
never evaluate to true, so let's remove them altogether.
This inconsistency was found using the undertaker-checkpatch tool.
Signed-off-by: Andreas Ruprecht <rupran@einserver.de>
Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Cc: Valentin Rothberg <valentinrothberg@gmail.com>
Cc: Paul Bolle <pebolle@tiscali.nl>
Patchwork: https://patchwork.linux-mips.org/patch/9267/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
|
|
Upstream checkpatch now requires this.
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Roland Dreier <roland@purestorage.com>
|
|
For, !HAVE_CLK the clk API returns a NULL cookie. Rework the
initialization code to handle that. If clk_get_rate() delivers 0, we use
the fallback mechanisms. The patch is pretty easy when ignoring white
space issues (git diff -b).
Suggested-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
Tested-by: Max Filippov <jcmvbkbc@gmail.com>
|
|
- Use of_irq_init() to initialize interrupt controllers
- Get rid of some unlikely()
- Add CIB to support SATA and other interrupts
- Add support for CIU SUM2 interrupt sources
Signed-off-by: David Daney <david.daney@cavium.com>
Signed-off-by: Leonid Rosenboim <lrosenboim@caviumnetworks.com>
Signed-off-by: Aleksey Makarov <aleksey.makarov@auriga.com>
Signed-off-by: Peter Swain <peter.swain@cavium.com>
Cc: linux-mips@linux-mips.org
Cc: linux-kernel@vger.kernel.org
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Pawel Moll <pawel.moll@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Ian Campbell <ijc+devicetree@hellion.org.uk>
Cc: Kumar Gala <galak@codeaurora.org>
Cc: devicetree@vger.kernel.org
Patchwork: https://patchwork.linux-mips.org/patch/8947/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
|