Age | Commit message (Collapse) | Author | Files | Lines |
|
dq_data_lock is currently used to protect all modifications of quota
accounting information, consistency of quota accounting on the inode,
and dquot pointers from inode. As a result contention on the lock can be
pretty heavy.
Reduce the contention on the lock by protecting quota accounting
information by a new dquot->dq_dqb_lock and consistency of quota
accounting with inode usage by inode->i_lock.
This change reduces time to create 500000 files on ext4 on ramdisk by 50
different processes in separate directories by 6% when user quota is
turned on. When those 50 processes belong to 50 different users, the
improvement is about 9%.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Provide helper __inode_get_bytes() which assumes i_lock is already
acquired. Quota code will need this to be able to use i_lock to protect
consistency of quota accounting information and inode usage.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
dquot_claim_reserved_space() and dquot_reclaim_reserved_space() have
only a single callsite. Inline them there.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
inode_incr_space() and inode_decr_space() have only two callsites.
Inline them there as that will make locking changes simpler.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
inode_add_rsv_space() and inode_sub_rsv_space() had only one callsite.
Inline them there directly. inode_claim_rsv_space() and
inode_reclaim_rsv_space() had two callsites so inline them there as
well. This will simplify further locking changes.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
When journalling quotas, we writeback all dquots immediately after
changing them as part of current transation. Thus there's no need to
write anything in dquot_writeback_dquots() and so we can avoid updating
list of dirty dquots to reduce dq_list_lock contention.
This change reduces time to create 500000 files on ext4 on ramdisk by 50
different processes in separate directories by 15% when user quota is
turned on.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Filesystems that are journalling quotas generally don't need tracking of
dirty dquots in a list since forcing a transaction commit flushes all
quotas anyway. Allow filesystem to say it doesn't want dquots to be
tracked as it reduces contention on the dq_list_lock.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently every dquot carries a wait_queue_head_t used only when we are
turning quotas off to wait for last users to drop dquot references.
Since such rare case is not performance sensitive in any means, just use
a global waitqueue for this and save space in struct dquot. Also convert
the logic to use wait_event() instead of open-coding it.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Move locking of dq_list_lock into clear_dquot_dirty(). It makes the
function more self-contained and will simplify our life later.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently we mark dirty even dquots that are not active (i.e.,
initialization or reading failed for them). Thus later we have to check
whether dirty dquot is really active and just clear the dirty bit if
not. Avoid this complication by just never marking non-active dquot as
dirty.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
dqi_flags modifications are protected by dq_data_lock. However the
modifications in vfs_load_quota_inode() and in mark_info_dirty() were
not which could lead to corruption of dqi_flags. Since modifications to
dqi_flags are rare, this is hard to observe in practice but in theory it
could happen. Fix the problem by always using dq_data_lock for
protection.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently we return -EIO on any error (or short read) from
->quota_read() while reading quota info. Propagate the error code
instead.
Suggested-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
v2_read_file_info() returned -1 instead of proper error codes on error.
Luckily this is not easily visible from userspace as we have called
->check_quota_file shortly before and thus already verified the quota
file is sane. Still set the error codes to proper values.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Push down acquisition of dqio_sem into ->read_file_info() callback. This
is for consistency with other operations and it also allows us to get
rid of an ugliness in OCFS2.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Push down acquisition of dqio_sem into ->write_file_info() callback.
Mostly for consistency with other operations.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Push down acquisition of dqio_sem into ->get_next_id() callback. Mostly
for consistency with other operations.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Push down acquisition of dqio_sem into ->release_dqblk() callback. It
will allow quota formats to decide whether they need it or not.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
The old quota quota format has fixed offset in quota file based on ID so
there's no locking needed against concurrent modifications of the file
(locking against concurrent IO on the same dquot is still provided by
dq_lock).
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
When dquot has space already allocated in a quota file, we just
overwrite that place when writing dquot. So we don't need any protection
against other modifications of quota file as these keep dquot in place.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Push down acquisition of dqio_sem into ->write_dqblk() callback. It will
allow quota formats to decide whether they need it or not.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
The old quota format has fixed offset in quota file based on ID so
there's no locking needed against concurrent modifications of the file
(locking against concurrent IO on the same dquot is still provided by
dq_lock).
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Push down acquisition of dqio_sem into ->read_dqblk() callback. It will
allow quota formats to decide whether they need it or not.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Currently dquot writeout is only protected by dqio_sem held for writing.
As we transition to a finer grained locking we will use dquot->dq_lock
instead. So acquire it in dquot_commit() and move dqio_sem just around
->commit_dqblk() call as it is still needed to serialize quota file
changes.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
vfs_load_quota_inode() needs dqio_sem only for reading. In fact dqio_sem
is not needed there at all since the function can be called only during
quota on when quota file cannot be modified but let's leave the
protection there since it is logical and the path is in no way
performance critical.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
dquot_get_next_id() needs dqio_sem only for reading to protect against
racing with modification of quota file structure.
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
We need dqio_sem held just for reading when calling ->read_dqblk() in
dquot_acquire(). Also dqio_sem is not needed when setting DQ_READ_B and
DQ_ACTIVE_B as concurrent reads and dquot activations are serialized by
dq_lock. So acquire and release dqio_sem closer to the place where it is
needed. This reduces lock hold time and will make locking changes
easier.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
Convert dqio_mutex to rwsem and call it dqio_sem. No functional changes
yet.
Signed-off-by: Jan Kara <jack@suse.cz>
|
|
|
|
The latest change of compat_sys_sigpending in commit 8f13621abced
("sigpending(): move compat to native") has broken it in two ways.
First, it tries to write 4 bytes more than userspace expects:
sizeof(old_sigset_t) == sizeof(long) == 8 instead of
sizeof(compat_old_sigset_t) == sizeof(u32) == 4.
Second, on big endian architectures these bytes are being written in the
wrong order.
This bug was found by strace test suite.
Reported-by: Anatoly Pugachev <matorola@gmail.com>
Inspired-by: Eugene Syromyatnikov <evgsyr@gmail.com>
Fixes: 8f13621abced ("sigpending(): move compat to native")
Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This bug was found by a static code checker tool for copy paste
problems.
Signed-off-by: Maninder Singh <maninder1.s@samsung.com>
Signed-off-by: Vaneet Narang <v.narang@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
On a 32-bit platform, the value of n_blcoks_count may be wrong during
the file system is resized to size larger than 2^32 blocks. This may
caused the superblock being corrupted with zero blocks count.
Fixes: 1c6bd7173d66
Signed-off-by: Jerry Lee <jerrylee@qnap.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org # 3.7+
|
|
When upgrading from old format, try to set project id
to old file first time, it will return EOVERFLOW, but if
that file is dirtied(touch etc), changing project id will
be allowed, this might be confusing for users, we could
try to expand @i_extra_isize here too.
Reported-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Wang Shilong <wshilong@ddn.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
Clean up some goto statement, make ext4_expand_extra_isize_ea() clearer.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
|
|
Current ext4_expand_extra_isize just tries to expand extra isize, if
someone is holding xattr lock or some check fails, it will give up.
So rename its name to ext4_try_to_expand_extra_isize.
Besides that, we clean up unnecessary check and move some relative checks
into it.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
|
|
We should avoid the contention between the i_extra_isize update and
the inline data insertion, so move the xattr trylock in front of
i_extra_isize update.
Signed-off-by: Miao Xie <miaoxie@huawei.com>
Reviewed-by: Wang Shilong <wshilong@ddn.com>
|
|
ext4_xattr_inode_read() currently reads each block sequentially while
waiting for io operation to complete before moving on to the next
block. This prevents request merging in block layer.
Add a ext4_bread_batch() function that starts reads for all blocks
then optionally waits for them to complete. A similar logic is used
in ext4_find_entry(), so update that code to use the new function.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
When an xattr block has a single reference, block is updated inplace
and it is reinserted to the cache. Later, a cache lookup is performed
to see whether an existing block has the same contents. This cache
lookup will most of the time return the just inserted entry so
deduplication is not achieved.
Running the following test script will produce two xattr blocks which
can be observed in "File ACL: " line of debugfs output:
mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G
mount /dev/sdb /mnt/sdb
touch /mnt/sdb/{x,y}
setfattr -n user.1 -v aaa /mnt/sdb/x
setfattr -n user.2 -v bbb /mnt/sdb/x
setfattr -n user.1 -v aaa /mnt/sdb/y
setfattr -n user.2 -v bbb /mnt/sdb/y
debugfs -R 'stat x' /dev/sdb | cat
debugfs -R 'stat y' /dev/sdb | cat
This patch defers the reinsertion to the cache so that we can locate
other blocks with the same contents.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
|
ext4_alloc_file_blocks() does not use its mode parameter. Remove it.
Signed-off-by: Tahsin Erdogan <tahsin@google.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
After commit 62d1034f53e3 ("fortify: use WARN instead of BUG for now"),
we get a warning about possible stack overflow from a memcpy that
was not strictly bounded to the size of the local variable:
inlined from 'ext4_mb_seq_groups_show' at fs/ext4/mballoc.c:2322:2:
include/linux/string.h:309:9: error: '__builtin_memcpy': writing between 161 and 1116 bytes into a region of size 160 overflows the destination [-Werror=stringop-overflow=]
We actually had a bug here that would have been found by the warning,
but it was already fixed last year in commit 30a9d7afe70e ("ext4: fix
stack memory corruption with 64k block size").
This replaces the fixed-length structure on the stack with a variable-length
structure, using the correct upper bound that tells the compiler that
everything is really fine here. I also change the loop count to check
for the same upper bound for consistency, but the existing code is
already correct here.
Note that while clang won't allow certain kinds of variable-length arrays
in structures, this particular instance is fine, as the array is at the
end of the structure, and the size is strictly bounded.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
The dir_nlink feature has been enabled by default for new ext4
filesystems since e2fsprogs-1.41 in 2008, and was automatically
enabled by the kernel for older ext4 filesystems since the
dir_nlink feature was added with ext4 in kernel 2.6.28+ when
the subdirectory count exceeded EXT4_LINK_MAX-1.
Automatically adding the file system features such as dir_nlink is
generally frowned upon, since it could cause the file system to not be
mountable on older kernel, thus preventing the administrator from
rolling back to an older kernel if necessary.
In this case, the administrator might also want to disable the feature
because glibc's fts_read() function does not correctly optimize
directory traversal for directories that use st_nlinks field of 1 to
indicate that the number of links in the directory are not tracked by
the file system, and could fail to traverse the full directory
hierarchy. Fortunately, in the past ten years very few users have
complained about incomplete file system traversal by glibc's
fts_read().
This commit also changes ext4_inc_count() to allow i_nlinks to reach
the full EXT4_LINK_MAX links on the parent directory (including "."
and "..") before changing i_links_count to be 1.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=196405
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
I get a static checker warning:
fs/ext4/ext4.h:3091 ext4_set_de_type()
error: buffer overflow 'ext4_type_by_mode' 15 <= 15
It seems unlikely that we would hit this read overflow in real life, but
it's also simple enough to make the array 16 bytes instead of 15.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|
ext4_find_unwritten_pgoff() does not properly handle a situation when
starting index is in the middle of a page and blocksize < pagesize. The
following command shows the bug on filesystem with 1k blocksize:
xfs_io -f -c "falloc 0 4k" \
-c "pwrite 1k 1k" \
-c "pwrite 3k 1k" \
-c "seek -a -r 0" foo
In this example, neither lseek(fd, 1024, SEEK_HOLE) nor lseek(fd, 2048,
SEEK_DATA) will return the correct result.
Fix the problem by neglecting buffers in a page before starting offset.
Reported-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
CC: stable@vger.kernel.org # 3.8+
|
|
This fixes a problem where the system gets stuck in a loop
unable to wakeup via power button in s2idle.
The problem happens because:
- press power button:
- system emits 0xc0 (power press), event ignored
- system emits 0xc1 (power release), event processed,
emited as KEY_POWER
- set wakeup_mode to true
- system goes to s2idle
- press power button
- system emits 0xc0 (power press), wakeup_mode is true,
system wakes
- system emits 0xc1 (power release), event processed,
emited as KEY_POWER
- system goes to s2idle again
To avoid this situation, process the presses (which matches what
intel-hid does too).
Verified on an Dell XPS 9365
Signed-off-by: Mario Limonciello <mario.limonciello@dell.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Darren Hart (VMware) <dvhart@infradead.org>
|
|
We've changed the discard command handling into parallel manner.
But, in this change, I forgot decreasing the usage count of the bio
which was used to send discard request. I'm sorry about that.
Fixes: a015434480dc ("ext4: send parallel discards on commit completions")
Signed-off-by: Daeho Jeong <daeho.jeong@samsung.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
|
Mikael Pettersson reported that some test programs in the strace-4.18
testsuite cause an OOPS.
After some debugging it turns out that garbage values are returned
when an exception occurs, causing the fixup memset() to be run with
bogus arguments.
The problem is that two of the exception handler stubs write the
successfully copied length into the wrong register.
Fixes: ee841d0aff64 ("sparc64: Convert U3copy_{from,to}_user to accurate exception reporting.")
Reported-by: Mikael Pettersson <mikpelinux@gmail.com>
Tested-by: Mikael Pettersson <mikpelinux@gmail.com>
Reviewed-by: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The bitmask used to define these values produces overflow, as seen by
this compiler warning:
arch/arm64/kernel/head.S:47:8: warning:
integer overflow in preprocessor expression
#elif (PAGE_OFFSET & 0x1fffff) != 0
^~~~~~~~~~~
arch/arm64/include/asm/memory.h:52:46: note:
expanded from macro 'PAGE_OFFSET'
#define PAGE_OFFSET (UL(0xffffffffffffffff) << (VA_BITS -
1))
~~~~~~~~~~~~~~~~~~ ^
It would be preferrable to use GENMASK_ULL() instead, but it's not set
up to be used from assembly (the UL() macro token pastes UL suffixes
when not included in assembly sources).
Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Suggested-by: Yury Norov <ynorov@caviumnetworks.com>
Suggested-by: Matthias Kaehlcke <mka@chromium.org>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
In a system with DBM (dirty bit management) capable agents there is a
possible race between a CPU executing ptep_set_access_flags() (maybe
non-DBM capable) and a hardware update of the dirty state (clearing of
PTE_RDONLY). The scenario:
a) the pte is writable (PTE_WRITE set), clean (PTE_RDONLY set) and old
(PTE_AF clear)
b) ptep_set_access_flags() is called as a result of a read access and it
needs to set the pte to writable, clean and young (PTE_AF set)
c) a DBM-capable agent, as a result of a different write access, is
marking the entry as young (setting PTE_AF) and dirty (clearing
PTE_RDONLY)
The current ptep_set_access_flags() implementation would set the
PTE_RDONLY bit in the resulting value overriding the DBM update and
losing the dirty state.
This patch fixes such race by setting PTE_RDONLY to the most permissive
(lowest value) of the current entry and the new one.
Fixes: 66dbd6e61a52 ("arm64: Implement ptep_set_access_flags() for hardware AF/DBM")
Cc: Will Deacon <will.deacon@arm.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
|
|
RX and TX clock delays are required. Request them explicitly.
Fixes: cad008b8a77e6 ("ARM: dts: tango4: Initial device trees")
Cc: stable@vger.kernel.org
Signed-off-by: Marc Gonzalez <marc_gonzalez@sigmadesigns.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
When resuming, set up registers that have been lost in the sleep state.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
|
|
If the decrementer wraps again and de-asserts the decrementer
exception while hard-disabled, __check_irq_replay() has a test to
notice the wrap when interrupts are re-enabled.
The decrementer check must be done when clearing the PACA_IRQ_HARD_DIS
flag, not when the PACA_IRQ_DEC flag is tested. Previously this worked
because the decrementer interrupt was always the first one checked
after clearing the hard disable flag, but HMI check was moved ahead of
that, which introduced this bug.
This can cause a missed decrementer interrupt if we soft-disable
interrupts then take an HMI which is recorded in irq_happened, then
hard-disable interrupts for > 4s to wrap the decrementer.
Fixes: e0e0d6b7390b ("powerpc/64: Replay hypervisor maintenance interrupt first")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
|