aboutsummaryrefslogtreecommitdiffstats
path: root/fs/btrfs (follow)
AgeCommit message (Collapse)AuthorFilesLines
2019-05-07Merge tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-blockLinus Torvalds5-16/+12
Pull block updates from Jens Axboe: "Nothing major in this series, just fixes and improvements all over the map. This contains: - Series of fixes for sed-opal (David, Jonas) - Fixes and performance tweaks for BFQ (via Paolo) - Set of fixes for bcache (via Coly) - Set of fixes for md (via Song) - Enabling multi-page for passthrough requests (Ming) - Queue release fix series (Ming) - Device notification improvements (Martin) - Propagate underlying device rotational status in loop (Holger) - Removal of mtip32xx trim support, which has been disabled for years (Christoph) - Improvement and cleanup of nvme command handling (Christoph) - Add block SPDX tags (Christoph) - Cleanup/hardening of bio/bvec iteration (Christoph) - A few NVMe pull requests (Christoph) - Removal of CONFIG_LBDAF (Christoph) - Various little fixes here and there" * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits) block: fix mismerge in bvec_advance block: don't drain in-progress dispatch in blk_cleanup_queue() blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release blk-mq: always free hctx after request queue is freed blk-mq: split blk_mq_alloc_and_init_hctx into two parts blk-mq: free hw queue's resource in hctx's release handler blk-mq: move cancel of requeue_work into blk_mq_release blk-mq: grab .q_usage_counter when queuing request from plug code path block: fix function name in comment nvmet: protect discovery change log event list iteration nvme: mark nvme_core_init and nvme_core_exit static nvme: move command size checks to the core nvme-fabrics: check more command sizes nvme-pci: check more command sizes nvme-pci: remove an unneeded variable initialization nvme-pci: unquiesce admin queue on shutdown nvme-pci: shutdown on timeout during deletion nvme-pci: fix psdt field for single segment sgls nvme-multipath: don't print ANA group state by default nvme-multipath: split bios with the ns_head bio_set before submitting ...
2019-05-07Merge tag 'Wimplicit-fallthrough-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linuxLinus Torvalds2-0/+2
Pull Wimplicit-fallthrough updates from Gustavo A. R. Silva: "Mark switch cases where we are expecting to fall through. This is part of the ongoing efforts to enable -Wimplicit-fallthrough. Most of them have been baking in linux-next for a whole development cycle. And with Stephen Rothwell's help, we've had linux-next nag-emails going out for newly introduced code that triggers -Wimplicit-fallthrough to avoid gaining more of these cases while we work to remove the ones that are already present. We are getting close to completing this work. Currently, there are only 32 of 2311 of these cases left to be addressed in linux-next. I'm auditing every case; I take a look into the code and analyze it in order to determine if I'm dealing with an actual bug or a false positive, as explained here: https://lore.kernel.org/lkml/c2fad584-1705-a5f2-d63c-824e9b96cf50@embeddedor.com/ While working on this, I've found and fixed the several missing break/return bugs, some of them introduced more than 5 years ago. Once this work is finished, we'll be able to universally enable "-Wimplicit-fallthrough" to avoid any of these kinds of bugs from entering the kernel again" * tag 'Wimplicit-fallthrough-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits) memstick: mark expected switch fall-throughs drm/nouveau/nvkm: mark expected switch fall-throughs NFC: st21nfca: Fix fall-through warnings NFC: pn533: mark expected switch fall-throughs block: Mark expected switch fall-throughs ASN.1: mark expected switch fall-through lib/cmdline.c: mark expected switch fall-throughs lib: zstd: Mark expected switch fall-throughs scsi: sym53c8xx_2: sym_nvram: Mark expected switch fall-through scsi: sym53c8xx_2: sym_hipd: mark expected switch fall-throughs scsi: ppa: mark expected switch fall-through scsi: osst: mark expected switch fall-throughs scsi: lpfc: lpfc_scsi: Mark expected switch fall-throughs scsi: lpfc: lpfc_nvme: Mark expected switch fall-through scsi: lpfc: lpfc_nportdisc: Mark expected switch fall-through scsi: lpfc: lpfc_hbadisc: Mark expected switch fall-throughs scsi: lpfc: lpfc_els: Mark expected switch fall-throughs scsi: lpfc: lpfc_ct: Mark expected switch fall-throughs scsi: imm: mark expected switch fall-throughs scsi: csiostor: csio_wr: mark expected switch fall-through ...
2019-05-07Merge tag 'for-5.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linuxLinus Torvalds62-2193/+3043
Pull btrfs updates from David Sterba: "This time the majority of changes are cleanups, though there's still a number of changes of user interest. User visible changes: - better read time and write checks to catch errors early and before writing data to disk (to catch potential memory corruption on data that get checksummed) - qgroups + metadata relocation: last speed up patch int the series to address the slowness, there should be no overhead comparing balance with and without qgroups - FIEMAP ioctl does not start a transaction unnecessarily, this can result in a speed up and less blocking due to IO - LOGICAL_INO (v1, v2) does not start transaction unnecessarily, this can speed up the mentioned ioctl and scrub as well - fsync on files with many (but not too many) hardlinks is faster, finer decision if the links should be fsynced individually or completely - send tries harder to find ranges to clone - trim/discard will skip unallocated chunks that haven't been touched since the last mount Fixes: - send flushes delayed allocation before start, otherwise it could miss some changes in case of a very recent rw->ro switch of a subvolume - fix fallocate with qgroups that could lead to space accounting underflow, reported as a warning - trim/discard ioctl honours the requested range - starting send and dedupe on a subvolume at the same time will let only one of them succeed, this is to prevent changes that send could miss due to dedupe; both operations are restartable Core changes: - more tree-checker validations, errors reported by fuzzing tools: - device item - inode item - block group profiles - tracepoints for extent buffer locking - async cow preallocates memory to avoid errors happening too deep in the call chain - metadata reservations for delalloc reworked to better adapt in many-writers/low-space scenarios - improved space flushing logic for intense DIO vs buffered workloads - lots of cleanups - removed unused struct members - redundant argument removal - properties and xattrs - extent buffer locking - selftests - use common file type conversions - many-argument functions reduction" * tag 'for-5.2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (227 commits) btrfs: Use kvmalloc for allocating compressed path context btrfs: Factor out common extent locking code in submit_compressed_extents btrfs: Set io_tree only once in submit_compressed_extents btrfs: Replace clear_extent_bit with unlock_extent btrfs: Make compress_file_range take only struct async_chunk btrfs: Remove fs_info from struct async_chunk btrfs: Rename async_cow to async_chunk btrfs: Preallocate chunks in cow_file_range_async btrfs: reserve delalloc metadata differently btrfs: track DIO bytes in flight btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_prop btrfs: delete unused function btrfs_set_prop_trans btrfs: start transaction in xattr_handler_set_prop btrfs: drop local copy of inode i_mode btrfs: drop old_fsflags in btrfs_ioctl_setflags btrfs: modify local copy of btrfs_inode flags btrfs: drop useless inode i_flags copy and restore btrfs: start transaction in btrfs_ioctl_setflags() btrfs: export btrfs_set_prop btrfs: refactor btrfs_set_props to validate externally ...
2019-05-07Merge branch 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds3-5/+4
Pull vfs inode freeing updates from Al Viro: "Introduction of separate method for RCU-delayed part of ->destroy_inode() (if any). Pretty much as posted, except that destroy_inode() stashes ->free_inode into the victim (anon-unioned with ->i_fops) before scheduling i_callback() and the last two patches (sockfs conversion and folding struct socket_wq into struct socket) are excluded - that pair should go through netdev once davem reopens his tree" * 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits) orangefs: make use of ->free_inode() shmem: make use of ->free_inode() hugetlb: make use of ->free_inode() overlayfs: make use of ->free_inode() jfs: switch to ->free_inode() fuse: switch to ->free_inode() ext4: make use of ->free_inode() ecryptfs: make use of ->free_inode() ceph: use ->free_inode() btrfs: use ->free_inode() afs: switch to use of ->free_inode() dax: make use of ->free_inode() ntfs: switch to ->free_inode() securityfs: switch to ->free_inode() apparmor: switch to ->free_inode() rpcpipe: switch to ->free_inode() bpf: switch to ->free_inode() mqueue: switch to ->free_inode() ufs: switch to ->free_inode() coda: switch to ->free_inode() ...
2019-05-07Merge tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printkLinus Torvalds1-2/+2
Pull printk updates from Petr Mladek: - Allow state reset of printk_once() calls. - Prevent crashes when dereferencing invalid pointers in vsprintf(). Only the first byte is checked for simplicity. - Make vsprintf warnings consistent and inlined. - Treewide conversion of obsolete %pf, %pF to %ps, %pF printf modifiers. - Some clean up of vsprintf and test_printf code. * tag 'printk-for-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: lib/vsprintf: Make function pointer_string static vsprintf: Limit the length of inlined error messages vsprintf: Avoid confusion between invalid address and value vsprintf: Prevent crash when dereferencing invalid pointers vsprintf: Consolidate handling of unknown pointer specifiers vsprintf: Factor out %pO handler as kobject_string() vsprintf: Factor out %pV handler as va_format() vsprintf: Factor out %p[iI] handler as ip_addr_string() vsprintf: Do not check address of well-known strings vsprintf: Consistent %pK handling for kptr_restrict == 0 vsprintf: Shuffle restricted_pointer() printk: Tie printk_once / printk_deferred_once into .data.once for reset treewide: Switch printk users from %pf and %pF to %ps and %pS, respectively lib/test_printf: Switch to bitmap_zalloc()
2019-05-06Merge branch 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-13/+2
Pull stack trace updates from Ingo Molnar: "So Thomas looked at the stacktrace code recently and noticed a few weirdnesses, and we all know how such stories of crummy kernel code meeting German engineering perfection end: a 45-patch series to clean it all up! :-) Here's the changes in Thomas's words: 'Struct stack_trace is a sinkhole for input and output parameters which is largely pointless for most usage sites. In fact if embedded into other data structures it creates indirections and extra storage overhead for no benefit. Looking at all usage sites makes it clear that they just require an interface which is based on a storage array. That array is either on stack, global or embedded into some other data structure. Some of the stack depot usage sites are outright wrong, but fortunately the wrongness just causes more stack being used for nothing and does not have functional impact. Another oddity is the inconsistent termination of the stack trace with ULONG_MAX. It's pointless as the number of entries is what determines the length of the stored trace. In fact quite some call sites remove the ULONG_MAX marker afterwards with or without nasty comments about it. Not all architectures do that and those which do, do it inconsistenly either conditional on nr_entries == 0 or unconditionally. The following series cleans that up by: 1) Removing the ULONG_MAX termination in the architecture code 2) Removing the ULONG_MAX fixups at the call sites 3) Providing plain storage array based interfaces for stacktrace and stackdepot. 4) Cleaning up the mess at the callsites including some related cleanups. 5) Removing the struct stack_trace based interfaces This is not changing the struct stack_trace interfaces at the architecture level, but it removes the exposure to the generic code'" * 'core-stacktrace-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits) x86/stacktrace: Use common infrastructure stacktrace: Provide common infrastructure lib/stackdepot: Remove obsolete functions stacktrace: Remove obsolete functions livepatch: Simplify stack trace retrieval tracing: Remove the last struct stack_trace usage tracing: Simplify stack trace retrieval tracing: Make ftrace_trace_userstack() static and conditional tracing: Use percpu stack trace buffer more intelligently tracing: Simplify stacktrace retrieval in histograms lockdep: Simplify stack trace handling lockdep: Remove save argument from check_prev_add() lockdep: Remove unused trace argument from print_circular_bug() drm: Simplify stacktrace handling dm persistent data: Simplify stack trace handling dm bufio: Simplify stack trace retrieval btrfs: ref-verify: Simplify stack trace retrieval dma/debug: Simplify stracktrace retrieval fault-inject: Simplify stacktrace retrieval mm/page_owner: Simplify stack trace handling ...
2019-05-02btrfs: Use kvmalloc for allocating compressed path contextNikolay Borisov1-2/+7
Recent refactoring of cow_file_range_async means it's now possible to request a rather large physically contiguous memory via kmalloc. The size is dependent on the number of 512k chunks that the compressed range consists of. David reported multiple OOM messages on such large allocations. Fix it by switching to using kvmalloc. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02btrfs: Factor out common extent locking code in submit_compressed_extentsNikolay Borisov1-7/+3
Irrespective of whether the compress code fell back to uncompressed or a compressed extent has to be submitted, the extent range is always locked. So factor out the common lock_extent call at the beginning of the loop. No functional changes just removes one duplicate lock_extent call. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02btrfs: Set io_tree only once in submit_compressed_extentsNikolay Borisov1-4/+1
The inode never changes so it's sufficient to dereference it and get the iotree only once, before the execution of the main loop. No functional changes, only the size of the function is decreased: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44) Function old new delta submit_compressed_extents 1240 1196 -44 Total: Before=88476, After=88432, chg -0.05% Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02btrfs: Replace clear_extent_bit with unlock_extentNikolay Borisov1-2/+1
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02btrfs: Make compress_file_range take only struct async_chunkNikolay Borisov1-11/+9
All context this function needs is held within struct async_chunk. Currently we not only pass the struct but also every individual member. This is redundant, simplify it by only passing struct async_chunk and leaving it to compress_file_range to extract the values it requires. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02btrfs: Remove fs_info from struct async_chunkNikolay Borisov1-7/+3
The associated btrfs_work already contains a reference to the fs_info so use that instead of passing it via async_chunk. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02btrfs: Rename async_cow to async_chunkNikolay Borisov1-30/+30
Now that we have an explicit async_chunk struct rename references to variables of this type to async_chunk. No functional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02btrfs: Preallocate chunks in cow_file_range_asyncNikolay Borisov1-34/+74
This commit changes the implementation of cow_file_range_async in order to get rid of the BUG_ON in the middle of the loop. Additionally it reworks the inner loop in the hopes of making it more understandable. The idea is to make async_cow be a top-level structured, shared amongst all chunks being sent for compression. This allows to perform one memory allocation at the beginning and gracefully fail the IO if there isn't enough memory. Now, each chunk is going to be described by an async_chunk struct. It's the responsibility of the final chunk to actually free the memory. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-02btrfs: reserve delalloc metadata differentlyJosef Bacik1-92/+52
With the per-inode block reserves we started refilling the reserve based on the calculated size of the outstanding csum bytes and extents for the inode, including the amount we were adding with the new operation. However, generic/224 exposed a problem with this approach. With 1000 files all writing at the same time we ended up with a bunch of bytes being reserved but unusable. When you write to a file we reserve space for the csum leaves for those bytes, the number of extent items required to cover those bytes, and a single transaction item for updating the inode at ordered extent finish for that range of bytes. This is held until the ordered extent finishes and we release all of the reserved space. If a second write comes in at this point we would add a single reservation for the new outstanding extent and however many reservations for the csum leaves. At this point we find the delta of how much we have reserved and how much outstanding size this is and attempt to reserve this delta. If the first write finishes it will not release any space, because the space it had reserved for the initial write is still needed for the second write. However some space would have been used, as we have added csums, extent items, and dirtied the inode. Our reserved space would be > 0 but less than the total needed reserved space. This is just for a single inode, now consider generic/224. This has 1000 inodes writing in parallel to a very small file system, 1GiB. In my testing this usually means we get about a 120MiB metadata area to work with, more than enough to allow the writes to continue, but not enough if all of the inodes are stuck trying to reserve the slack space while continuing to hold their leftovers from their initial writes. Fix this by pre-reserved _only_ for the space we are currently trying to add. Then once that is successful modify our inodes csum count and outstanding extents, and then add the newly reserved space to the inodes block_rsv. This allows us to actually pass generic/224 without running out of metadata space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-05-01btrfs: use ->free_inode()Al Viro3-5/+4
a lot of stuff remains in ->destroy_inode() Acked-by: David Sterba <dsterba@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-05-01gcc-9: don't warn about uninitialized btrfs extent_type variableLinus Torvalds1-1/+1
The 'extent_type' variable does seem to be reliably initialized, but it's _very_ non-obvious, since there's a "goto next" case that jumps over the normal initialization. That will then always trigger the "start >= extent_end" test, which will end up never falling through to the use of that variable. But the code is certainly not obvious, and the compiler warning looks reasonable. Make 'extent_type' an int, and initialize it to an invalid negative value, which seems to be the common pattern in other places. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-04-30block: remove the i argument to bio_for_each_segment_allChristoph Hellwig5-16/+12
We only have two callers that need the integer loop iterator, and they can easily maintain it themselves. Suggested-by: Matthew Wilcox <willy@infradead.org> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Acked-by: David Sterba <dsterba@suse.com> Reviewed-by: Hannes Reinecke <hare@suse.com> Acked-by: Coly Li <colyli@suse.de> Reviewed-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-04-29btrfs: track DIO bytes in flightJosef Bacik4-4/+36
When diagnosing a slowdown of generic/224 I noticed we were not doing anything when calling into shrink_delalloc(). This is because all writes in 224 are O_DIRECT, not delalloc, and thus our delalloc_bytes counter is 0, which short circuits most of the work inside of shrink_delalloc(). However O_DIRECT writes still consume metadata resources and generate ordered extents, which we can still wait on. Fix this by tracking outstanding DIO write bytes, and use this as well as the delalloc bytes counter to decide if we need to lookup and wait on any ordered extents. If we have more DIO writes than delalloc bytes we'll go ahead and wait on any ordered extents regardless of our flush state as flushing delalloc is likely to not gain us anything. Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ use dio instead of odirect in identifiers ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: merge calls of btrfs_setxattr and btrfs_setxattr_trans in btrfs_set_propAnand Jain1-19/+6
Since now the trans argument is never NULL in btrfs_set_prop we don't have to check. So delete it and use btrfs_setxattr that makes use of that. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: delete unused function btrfs_set_prop_transAnand Jain2-8/+0
The last consumer of btrfs_set_prop_trans() was taken away by the patch ("btrfs: start transaction in xattr_handler_set_prop") so now this function can be deleted. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: start transaction in xattr_handler_set_propAnand Jain1-1/+20
btrfs specific extended attributes on the inode are set using btrfs_xattr_handler_set_prop(), and the required transaction for this update is started by btrfs_setxattr(). For better visibility of the transaction start and end, do this in btrfs_xattr_handler_set_prop(). For which this patch copied code of btrfs_setxattr() as it is in the original, which needs proper error handling. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: drop local copy of inode i_modeAnand Jain1-5/+2
There isn't real use of making struct inode::i_mode a local copy, it saves a dereference one time, not much. Just use it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: drop old_fsflags in btrfs_ioctl_setflagsAnand Jain1-3/+3
btrfs_inode_flags_to_fsflags() is copied into @old_fsflags and used only once. Instead used it directly. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: modify local copy of btrfs_inode flagsAnand Jain1-31/+26
Instead of updating the binode::flags directly, update a local copy, and then at the point of no error, store copy it to the binode::flags. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: drop useless inode i_flags copy and restoreAnand Jain1-3/+0
The patch ("btrfs: start transaction in btrfs_ioctl_setflags()") used btrfs_set_prop() instead of btrfs_set_prop_trans() by which now the inode::i_flags update functions such as btrfs_sync_inode_flags_to_i_flags() and btrfs_update_inode() is called in btrfs_ioctl_setflags() instead of btrfs_set_prop_trans()->btrfs_setxattr() as earlier. So the inode::i_flags remains unmodified until the thread has checked all the conditions. So drop the saved inode::i_flags in out_i_flags. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: start transaction in btrfs_ioctl_setflags()Anand Jain1-20/+25
Inode attribute can be set through the FS_IOC_SETFLAGS ioctl. This flags also includes compression attribute for which we would set/reset the compression extended attribute. While doing this there is a bit of duplicate code, the following things happens twice: - start/end_transaction - inode_inc_iversion() - current_time update to inode->i_ctime - and btrfs_update_inode() These are updated both at btrfs_ioctl_setflags() and btrfs_set_props() as well. This patch merges these two duplicate codes at btrfs_ioctl_setflags(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: export btrfs_set_propAnand Jain2-3/+6
Make btrfs_set_prop() a non-static function, so that it can be called from btrfs_ioctl_setflags(). We need btrfs_set_prop() instead of btrfs_set_prop_trans() so that we can use the transaction which is already started in the current thread. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: refactor btrfs_set_props to validate externallyAnand Jain4-6/+26
In preparation to merge multiple transactions when setting the compression flags, split btrfs_set_props() validation part outside of it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: ctree: Dump the leaf before BUG_ON in btrfs_set_item_key_safeQu Wenruo1-2/+22
We have a long standing problem with reversed keys that's detected by btrfs_set_item_key_safe. This is hard to reproduce so we'd like to capture more information for later analysis. Let's dump the leaf content before triggering BUG_ON() so that we can have some clue on what's going wrong. The output of tree locks should help us to debug such problem. Sample stacktrace: generic/522 [00:07:05] [26946.113381] run fstests generic/522 at 2019-04-16 00:07:05 [27161.474720] kernel BUG at fs/btrfs/ctree.c:3192! [27161.475923] invalid opcode: 0000 [#1] PREEMPT SMP [27161.477167] CPU: 0 PID: 15676 Comm: fsx Tainted: G W 5.1.0-rc5-default+ #562 [27161.478932] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c89-prebuilt.qemu.org 04/01/2014 [27161.481099] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs] [27161.485369] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286 [27161.486464] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000 [27161.487929] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7 [27161.489289] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000 [27161.490807] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70 [27161.492305] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000 [27161.493845] FS: 00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000 [27161.495608] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27161.496717] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0 [27161.498100] Call Trace: [27161.498771] __btrfs_drop_extents+0x6ec/0xdf0 [btrfs] [27161.499872] btrfs_log_changed_extents.isra.26+0x3a2/0x9e0 [btrfs] [27161.501114] btrfs_log_inode+0x7ff/0xdc0 [btrfs] [27161.502114] ? __mutex_unlock_slowpath+0x4b/0x2b0 [27161.503172] btrfs_log_inode_parent+0x237/0x9c0 [btrfs] [27161.504348] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [27161.505374] btrfs_sync_file+0x1b7/0x480 [btrfs] [27161.506371] __x64_sys_msync+0x180/0x210 [27161.507208] do_syscall_64+0x54/0x180 [27161.507932] entry_SYSCALL_64_after_hwframe+0x49/0xbe [27161.508839] RIP: 0033:0x7f8ea5aa9c61 [27161.512616] RSP: 002b:00007ffea2a06498 EFLAGS: 00000246 ORIG_RAX: 000000000000001a [27161.514161] RAX: ffffffffffffffda RBX: 000000000002a938 RCX: 00007f8ea5aa9c61 [27161.515376] RDX: 0000000000000004 RSI: 000000000001c9b2 RDI: 00007f8ea578d000 [27161.516572] RBP: 000000000001c07a R08: fffffffffffffff8 R09: 000000000002a000 [27161.517883] R10: 00007f8ea57a99b2 R11: 0000000000000246 R12: 0000000000000938 [27161.519080] R13: 00007f8ea578d000 R14: 000000000001c9b2 R15: 0000000000000000 [27161.520281] Modules linked in: btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq loop [last unloaded: scsi_debug] [27161.522272] ---[ end trace d5afec7ccac6a252 ]--- [27161.523111] RIP: 0010:btrfs_set_item_key_safe+0x146/0x1c0 [btrfs] [27161.527253] RSP: 0018:ffffb087499e39b0 EFLAGS: 00010286 [27161.528192] RAX: 00000000ffffffff RBX: ffff941534d80e70 RCX: 0000000000024000 [27161.529392] RDX: 0000000000013039 RSI: ffffb087499e3aa5 RDI: ffffb087499e39c7 [27161.530607] RBP: 000000000000000e R08: ffff9414e0f49008 R09: 0000000000001000 [27161.531802] R10: 0000000000000000 R11: 0000000000000003 R12: ffff9414e0f48e70 [27161.533018] R13: ffffb087499e3aa5 R14: 0000000000000000 R15: 0000000000071000 [27161.534405] FS: 00007f8ea58d0b80(0000) GS:ffff94153d400000(0000) knlGS:0000000000000000 [27161.536048] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [27161.537210] CR2: 00007f8ea57a9000 CR3: 0000000016a33000 CR4: 00000000000006f0 Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: tree-checker: Allow error injection for tree-checkerQu Wenruo1-0/+5
Allowing error injection for btrfs_check_leaf_full() and btrfs_check_node() is useful to test the failure path of btrfs write time tree check. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: Document btrfs_csum_one_bioNikolay Borisov1-0/+10
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29Btrfs: improve performance on fsync of files with multiple hardlinksFilipe Manana3-63/+188
Commit 41bd6067692382 ("Btrfs: fix fsync of files with multiple hard links in new directories") introduced a path that makes fsync fallback to a full transaction commit in order to avoid losing hard links and new ancestors of the fsynced inode. That path is triggered only when the inode has more than one hard link and either has a new hard link created in the current transaction or the inode was evicted and reloaded in the current transaction. That path ends up getting triggered very often (hundreds of times) during the course of pgbench benchmarks, resulting in performance drops of about 20%. This change restores the performance by not triggering the full transaction commit in those cases, and instead iterate the fs/subvolume tree in search of all possible new ancestors, for all hard links, to log them. Reported-by: Zhao Yuhu <zyuhu@suse.com> Tested-by: James Wang <jnwang@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29Btrfs: fix race between send and deduplication that lead to failures and crashesFilipe Manana3-1/+50
Send operates on read only trees and expects them to never change while it is using them. This is part of its initial design, and this expection is due to two different reasons: 1) When it was introduced, no operations were allowed to modifiy read-only subvolumes/snapshots (including defrag for example). 2) It keeps send from having an impact on other filesystem operations. Namely send does not need to keep locks on the trees nor needs to hold on to transaction handles and delay transaction commits. This ends up being a consequence of the former reason. However the deduplication feature was introduced later (on September 2013, while send was introduced in July 2012) and it allowed for deduplication with destination files that belong to read-only trees (subvolumes and snapshots). That means that having a send operation (either full or incremental) running in parallel with a deduplication that has the destination inode in one of the trees used by the send operation, can result in tree nodes and leaves getting freed and reused while send is using them. This problem is similar to the problem solved for the root nodes getting freed and reused when a snapshot is made against one tree that is currenly being used by a send operation, fixed in commits [1] and [2]. These commits explain in detail how the problem happens and the explanation is valid for any node or leaf that is not the root of a tree as well. This problem was also discussed and explained recently in a thread [3]. The problem is very easy to reproduce when using send with large trees (snapshots) and just a few concurrent deduplication operations that target files in the trees used by send. A stress test case is being sent for fstests that triggers the issue easily. The most common error to hit is the send ioctl return -EIO with the following messages in dmesg/syslog: [1631617.204075] BTRFS error (device sdc): did not find backref in send_root. inode=63292, offset=0, disk_byte=5228134400 found extent=5228134400 [1631633.251754] BTRFS error (device sdc): parent transid verify failed on 32243712 wanted 24 found 27 The first one is very easy to hit while the second one happens much less frequently, except for very large trees (in that test case, snapshots with 100000 files having large xattrs to get deep and wide trees). Less frequently, at least one BUG_ON can be hit: [1631742.130080] ------------[ cut here ]------------ [1631742.130625] kernel BUG at fs/btrfs/ctree.c:1806! [1631742.131188] invalid opcode: 0000 [#6] SMP DEBUG_PAGEALLOC PTI [1631742.131726] CPU: 1 PID: 13394 Comm: btrfs Tainted: G B D W 5.0.0-rc8-btrfs-next-45 #1 [1631742.132265] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [1631742.133399] RIP: 0010:read_node_slot+0x122/0x130 [btrfs] (...) [1631742.135061] RSP: 0018:ffffb530021ebaa0 EFLAGS: 00010246 [1631742.135615] RAX: ffff93ac8912e000 RBX: 000000000000009d RCX: 0000000000000002 [1631742.136173] RDX: 000000000000009d RSI: ffff93ac564b0d08 RDI: ffff93ad5b48c000 [1631742.136759] RBP: ffffb530021ebb7d R08: 0000000000000001 R09: ffffb530021ebb7d [1631742.137324] R10: ffffb530021eba70 R11: 0000000000000000 R12: ffff93ac87d0a708 [1631742.137900] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001 [1631742.138455] FS: 00007f4cdb1528c0(0000) GS:ffff93ad76a80000(0000) knlGS:0000000000000000 [1631742.139010] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [1631742.139568] CR2: 00007f5acb3d0420 CR3: 000000012be3e006 CR4: 00000000003606e0 [1631742.140131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [1631742.140719] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [1631742.141272] Call Trace: [1631742.141826] ? do_raw_spin_unlock+0x49/0xc0 [1631742.142390] tree_advance+0x173/0x1d0 [btrfs] [1631742.142948] btrfs_compare_trees+0x268/0x690 [btrfs] [1631742.143533] ? process_extent+0x1070/0x1070 [btrfs] [1631742.144088] btrfs_ioctl_send+0x1037/0x1270 [btrfs] [1631742.144645] _btrfs_ioctl_send+0x80/0x110 [btrfs] [1631742.145161] ? trace_sched_stick_numa+0xe0/0xe0 [1631742.145685] btrfs_ioctl+0x13fe/0x3120 [btrfs] [1631742.146179] ? account_entity_enqueue+0xd3/0x100 [1631742.146662] ? reweight_entity+0x154/0x1a0 [1631742.147135] ? update_curr+0x20/0x2a0 [1631742.147593] ? check_preempt_wakeup+0x103/0x250 [1631742.148053] ? do_vfs_ioctl+0xa2/0x6f0 [1631742.148510] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] [1631742.148942] do_vfs_ioctl+0xa2/0x6f0 [1631742.149361] ? __fget+0x113/0x200 [1631742.149767] ksys_ioctl+0x70/0x80 [1631742.150159] __x64_sys_ioctl+0x16/0x20 [1631742.150543] do_syscall_64+0x60/0x1b0 [1631742.150931] entry_SYSCALL_64_after_hwframe+0x49/0xbe [1631742.151326] RIP: 0033:0x7f4cd9f5add7 (...) [1631742.152509] RSP: 002b:00007ffe91017708 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [1631742.152892] RAX: ffffffffffffffda RBX: 0000000000000105 RCX: 00007f4cd9f5add7 [1631742.153268] RDX: 00007ffe91017790 RSI: 0000000040489426 RDI: 0000000000000007 [1631742.153633] RBP: 0000000000000007 R08: 00007f4cd9e79700 R09: 00007f4cd9e79700 [1631742.153999] R10: 00007f4cd9e799d0 R11: 0000000000000202 R12: 0000000000000003 [1631742.154365] R13: 0000555dfae53020 R14: 0000000000000000 R15: 0000000000000001 (...) [1631742.156696] ---[ end trace 5dac9f96dcc3fd6b ]--- That BUG_ON happens because while send is using a node, that node is COWed by a concurrent deduplication, gets freed and gets reused as a leaf (because a transaction commit happened in between), so when it attempts to read a slot from the extent buffer, at ctree.c:read_node_slot(), the extent buffer contents were wiped out and it now matches a leaf (which can even belong to some other tree now), hitting the BUG_ON(level == 0). Fix this concurrency issue by not allowing send and deduplication to run in parallel if both operate on the same readonly trees, returning EAGAIN to user space and logging an exlicit warning in dmesg/syslog. [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=be6821f82c3cc36e026f5afd10249988852b35ea [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f2f0b394b54e2b159ef969a0b5274e9bbf82ff2 [3] https://lore.kernel.org/linux-btrfs/CAL3q7H7iqSEEyFaEtpRZw3cp613y+4k2Q8b4W7mweR3tZA05bQ@mail.gmail.com/ CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29Btrfs: send, flush dellaloc in order to avoid data lossFilipe Manana1-0/+36
When we set a subvolume to read-only mode we do not flush dellaloc for any of its inodes (except if the filesystem is mounted with -o flushoncommit), since it does not affect correctness for any subsequent operations - except for a future send operation. The send operation will not be able to see the delalloc data since the respective file extent items, inode item updates, backreferences, etc, have not hit yet the subvolume and extent trees. Effectively this means data loss, since the send stream will not contain any data from existing delalloc. Another problem from this is that if the writeback starts and finishes while the send operation is in progress, we have the subvolume tree being being modified concurrently which can result in send failing unexpectedly with EIO or hitting runtime errors, assertion failures or hitting BUG_ONs, etc. Simple reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ btrfs subvolume create /mnt/sv $ xfs_io -f -c "pwrite -S 0xea 0 108K" /mnt/sv/foo $ btrfs property set /mnt/sv ro true $ btrfs send -f /tmp/send.stream /mnt/sv $ od -t x1 -A d /mnt/sv/foo 0000000 ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea ea * 0110592 $ umount /mnt $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt $ btrfs receive -f /tmp/send.stream /mnt $ echo $? 0 $ od -t x1 -A d /mnt/sv/foo 0000000 # ---> empty file Since this a problem that affects send only, fix it in send by flushing dellaloc for all the roots used by the send operation before send starts to process the commit roots. This is a problem that affects send since it was introduced (commit 31db9f7c23fbf7 ("Btrfs: introduce BTRFS_IOC_SEND for btrfs send/receive")) but backporting it to older kernels has some dependencies: - For kernels between 3.19 and 4.20, it depends on commit 3cd24c698004d2 ("btrfs: use tagged writepage to mitigate livelock of snapshot") because the function btrfs_start_delalloc_snapshot() does not exist before that commit. So one has to either pick that commit or replace the calls to btrfs_start_delalloc_snapshot() in this patch with calls to btrfs_start_delalloc_inodes(). - For kernels older than 3.19 it also requires commit e5fa8f865b3324 ("Btrfs: ensure send always works on roots without orphans") because it depends on the function ensure_commit_roots_uptodate() which that commits introduced. - No dependencies for 5.0+ kernels. A test case for fstests follows soon. CC: stable@vger.kernel.org # 3.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29Btrfs: do not start a transaction during fiemapFilipe Manana1-6/+10
During fiemap, for regular extents (non inline) we need to check if they are shared and if they are, set the shared bit. Checking if an extent is shared requires checking the delayed references of the currently running transaction, since some reference might have not yet hit the extent tree and be only in the in-memory delayed references. However we were using a transaction join for this, which creates a new transaction when there is no transaction currently running. That means that two more potential failures can happen: creating the transaction and committing it. Further, if no write activity is currently happening in the system, and fiemap calls keep being done, we end up creating and committing transactions that do nothing. In some extreme cases this can result in the commit of the transaction created by fiemap to fail with ENOSPC when updating the root item of a subvolume tree because a join does not reserve any space, leading to a trace like the following: heisenberg kernel: ------------[ cut here ]------------ heisenberg kernel: BTRFS: Transaction aborted (error -28) heisenberg kernel: WARNING: CPU: 0 PID: 7137 at fs/btrfs/root-tree.c:136 btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: CPU: 0 PID: 7137 Comm: btrfs-transacti Not tainted 4.19.0-4-amd64 #1 Debian 4.19.28-2 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.21 03/19/2018 heisenberg kernel: RIP: 0010:btrfs_update_root+0x22b/0x320 [btrfs] (...) heisenberg kernel: RSP: 0018:ffffb5448828bd40 EFLAGS: 00010286 heisenberg kernel: RAX: 0000000000000000 RBX: ffff8ed56bccef50 RCX: 0000000000000006 heisenberg kernel: RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8ed6bda166a0 heisenberg kernel: RBP: 00000000ffffffe4 R08: 00000000000003df R09: 0000000000000007 heisenberg kernel: R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ed63396a078 heisenberg kernel: R13: ffff8ed092d7c800 R14: ffff8ed64f5db028 R15: ffff8ed6bd03d068 heisenberg kernel: FS: 0000000000000000(0000) GS:ffff8ed6bda00000(0000) knlGS:0000000000000000 heisenberg kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 heisenberg kernel: CR2: 00007f46f75f8000 CR3: 0000000310a0a002 CR4: 00000000003606f0 heisenberg kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 heisenberg kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 heisenberg kernel: Call Trace: heisenberg kernel: commit_fs_roots+0x166/0x1d0 [btrfs] heisenberg kernel: ? _cond_resched+0x15/0x30 heisenberg kernel: ? btrfs_run_delayed_refs+0xac/0x180 [btrfs] heisenberg kernel: btrfs_commit_transaction+0x2bd/0x870 [btrfs] heisenberg kernel: ? start_transaction+0x9d/0x3f0 [btrfs] heisenberg kernel: transaction_kthread+0x147/0x180 [btrfs] heisenberg kernel: ? btrfs_cleanup_transaction+0x530/0x530 [btrfs] heisenberg kernel: kthread+0x112/0x130 heisenberg kernel: ? kthread_bind+0x30/0x30 heisenberg kernel: ret_from_fork+0x35/0x40 heisenberg kernel: ---[ end trace 05de912e30e012d9 ]--- Since fiemap (and btrfs_check_shared()) is a read-only operation, do not do a transaction join to avoid the overhead of creating a new transaction (if there is currently no running transaction) and introducing a potential point of failure when the new transaction gets committed, instead use a transaction attach to grab a handle for the currently running transaction if any. Reported-by: Christoph Anton Mitterer <calestyo@scientia.net> Link: https://lore.kernel.org/linux-btrfs/b2a668d7124f1d3e410367f587926f622b3f03a4.camel@scientia.net/ Fixes: afce772e87c36c ("btrfs: fix check_shared for fiemap ioctl") CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from btrfs_set_disk_extent_flagsDavid Sterba3-4/+2
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from btrfs_add_delayed_extent_opDavid Sterba3-6/+3
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from emit_last_fiemap_cacheDavid Sterba1-3/+2
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from CHECK_FE_ALIGNEDDavid Sterba1-6/+6
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from from tree_advanceDavid Sterba1-4/+3
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from tree_move_downDavid Sterba1-4/+2
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from btrfs_extend_itemDavid Sterba9-17/+12
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from btrfs_truncate_itemDavid Sterba10-17/+12
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: remove unused parameter fs_info from split_itemDavid Sterba1-3/+2
Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: qgroup: Don't scan leaf if we're modifying reloc treeQu Wenruo1-4/+6
Since reloc tree doesn't contribute to qgroup numbers, just skip them. This should catch the final cause of unnecessary data ref processing when running balance of metadata with qgroups on. The 4G data 16 snapshots test (*) should explain it pretty well: | delayed subtree | refactor delayed ref | this patch --------------------------------------------------------------------- relocated | 22653 | 22673 | 22744 qgroup dirty | 122792 | 48360 | 70 time | 24.494 | 11.606 | 3.944 Finally, we're at the stage where qgroup + metadata balance cost no obvious overhead. Test environment: Test VM: - vRAM 8G - vCPU 8 - block dev vitrio-blk, 'unsafe' cache mode - host block 850evo Test workload: - Copy 4G data from /usr/ to one subvolume - Create 16 snapshots of that subvolume, and modify 3 files in each snapshot - Enable quota, rescan - Time "btrfs balance start -m" Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: extent-tree: Use btrfs_ref to refactor btrfs_free_extent()Qu Wenruo5-56/+61
Similar to btrfs_inc_extent_ref(), use btrfs_ref to replace the long parameter list and the confusing @owner parameter. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()Qu Wenruo7-61/+96
Use the new btrfs_ref structure and replace parameter list to clean up the usage of owner and level to distinguish the extent types. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()Qu Wenruo1-16/+10
Since add_pinned_bytes() only needs to know if the extent is metadata and if it's a chunk tree extent, btrfs_ref is a perfect match for it, as we don't need various owner/level trick to determine extent type. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-04-29btrfs: ref-verify: Use btrfs_ref to refactor btrfs_ref_tree_mod()Qu Wenruo3-48/+42
It's a perfect match for btrfs_ref_tree_mod() to use btrfs_ref, as btrfs_ref describes a metadata/data reference update comprehensively. Now we have one less function use confusing owner/level trick. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>