path: root/fs/btrfs/ctree.h
Age | Commit message | Author | Files | Lines
2020-01-31 | Btrfs: fix race between adding and putting tree mod seq elements and nodes | Filipe Manana | 1 | -4/+2

There is a race between adding and removing elements to the tree mod log list and rbtree that can lead to use-after-free problems.

Consider the following example that explains how/why the problem happens:

1) Task A has a mod log element with sequence number 200. It currently is the only element in the mod log list;

2) Task A calls btrfs_put_tree_mod_seq() because it no longer needs to access the tree mod log. When it enters the function, it initializes 'min_seq' to (u64)-1. Then it acquires the lock 'tree_mod_seq_lock' before checking if there are other elements in the mod seq list. Since the list is empty, 'min_seq' remains set to (u64)-1. Then it unlocks the lock 'tree_mod_seq_lock';

3) Before task A acquires the lock 'tree_mod_log_lock', task B adds itself to the mod seq list through btrfs_get_tree_mod_seq() and gets a sequence number of 201;

4) Some other task, name it task C, modifies a btree and because there are elements in the mod seq list, it adds a tree mod elem to the tree mod log rbtree. That node added to the mod log rbtree is assigned a sequence number of 202;

5) Task B, which is doing fiemap and resolving indirect back references, calls get_old_root(), with 'time_seq' == 201, which in turn calls tree_mod_log_search() - the search returns the mod log node from the rbtree with sequence number 202, created by task C;

6) Task A now acquires the lock 'tree_mod_log_lock', starts iterating the mod log rbtree and finds the node with sequence number 202. Since 202 is less than the previously computed 'min_seq', (u64)-1, it removes the node and frees it;

7) Task B still has a pointer to the node with sequence number 202, and it dereferences the pointer itself and through the call to __tree_mod_log_rewind(), resulting in a use-after-free problem.

This issue can be triggered sporadically with the test case generic/561 from fstests, and it happens more frequently with a higher number of duperemove processes.
When it happens to me, it either freezes the VM or it produces a trace like the following before crashing:

  [ 1245.321140] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
  [ 1245.321200] CPU: 1 PID: 26997 Comm: pool Not tainted 5.5.0-rc6-btrfs-next-52 #1
  [ 1245.321235] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
  [ 1245.321287] RIP: 0010:rb_next+0x16/0x50
  [ 1245.321307] Code: ....
  [ 1245.321372] RSP: 0018:ffffa151c4d039b0 EFLAGS: 00010202
  [ 1245.321388] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8ae221363c80 RCX: 6b6b6b6b6b6b6b6b
  [ 1245.321409] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ae221363c80
  [ 1245.321439] RBP: ffff8ae20fcc4688 R08: 0000000000000002 R09: 0000000000000000
  [ 1245.321475] R10: ffff8ae20b120910 R11: 00000000243f8bb1 R12: 0000000000000038
  [ 1245.321506] R13: ffff8ae221363c80 R14: 000000000000075f R15: ffff8ae223f762b8
  [ 1245.321539] FS: 00007fdee1ec7700(0000) GS:ffff8ae236c80000(0000) knlGS:0000000000000000
  [ 1245.321591] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1245.321614] CR2: 00007fded4030c48 CR3: 000000021da16003 CR4: 00000000003606e0
  [ 1245.321642] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [ 1245.321668] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [ 1245.321706] Call Trace:
  [ 1245.321798] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
  [ 1245.321841] btrfs_search_old_slot+0x105/0xd00 [btrfs]
  [ 1245.321877] resolve_indirect_refs+0x1eb/0xc60 [btrfs]
  [ 1245.321912] find_parent_nodes+0x3dc/0x11b0 [btrfs]
  [ 1245.321947] btrfs_check_shared+0x115/0x1c0 [btrfs]
  [ 1245.321980] ? extent_fiemap+0x59d/0x6d0 [btrfs]
  [ 1245.322029] extent_fiemap+0x59d/0x6d0 [btrfs]
  [ 1245.322066] do_vfs_ioctl+0x45a/0x750
  [ 1245.322081] ksys_ioctl+0x70/0x80
  [ 1245.322092] ? trace_hardirqs_off_thunk+0x1a/0x1c
  [ 1245.322113] __x64_sys_ioctl+0x16/0x20
  [ 1245.322126] do_syscall_64+0x5c/0x280
  [ 1245.322139] entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 1245.322155] RIP: 0033:0x7fdee3942dd7
  [ 1245.322177] Code: ....
  [ 1245.322258] RSP: 002b:00007fdee1ec6c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  [ 1245.322294] RAX: ffffffffffffffda RBX: 00007fded40210d8 RCX: 00007fdee3942dd7
  [ 1245.322314] RDX: 00007fded40210d8 RSI: 00000000c020660b RDI: 0000000000000004
  [ 1245.322337] RBP: 0000562aa89e7510 R08: 0000000000000000 R09: 00007fdee1ec6d44
  [ 1245.322369] R10: 0000000000000073 R11: 0000000000000246 R12: 00007fdee1ec6d48
  [ 1245.322390] R13: 00007fdee1ec6d40 R14: 00007fded40210d0 R15: 00007fdee1ec6d50
  [ 1245.322423] Modules linked in: ....
  [ 1245.323443] ---[ end trace 01de1e9ec5dff3cd ]---

Fix this by ensuring that btrfs_put_tree_mod_seq() computes the minimum sequence number and iterates the rbtree while holding the lock 'tree_mod_log_lock' in write mode. Also get rid of the 'tree_mod_seq_lock' lock, since it is now redundant.

Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
Fixes: 097b8a7c9e48e2 ("Btrfs: join tree mod log code with the code holding back delayed refs")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
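
The interleaving described in steps 1) to 7) of this commit can be replayed in a small user-space model. This is only an illustrative Python sketch, not kernel code: the TreeModLog class and all of its method names are hypothetical, and locking is modeled purely by the order in which the steps run. It contrasts computing 'min_seq' and pruning in two separate critical sections (the race) with doing both back to back, as if under a single lock (the idea of the fix).

```python
U64_MAX = 2**64 - 1  # stands in for (u64)-1


class TreeModLog:
    """Hypothetical toy model of the mod seq list and the mod log rbtree."""

    def __init__(self):
        self.last_seq = 200
        self.seq_list = [200]  # step 1: task A holds sequence number 200
        self.nodes = set()     # sequence numbers of nodes in the "rbtree"

    def get_seq(self):
        """Model of joining the mod seq list and getting a new number."""
        self.last_seq += 1
        self.seq_list.append(self.last_seq)
        return self.last_seq

    def log_modification(self):
        """A btree change is only logged while the seq list is non-empty."""
        if not self.seq_list:
            return None
        self.last_seq += 1
        self.nodes.add(self.last_seq)
        return self.last_seq

    def leave_and_get_min_seq(self, seq):
        """First half of the put: drop our element, compute the minimum."""
        self.seq_list.remove(seq)
        return min(self.seq_list, default=U64_MAX)

    def prune(self, min_seq):
        """Second half of the put: free every node older than min_seq."""
        freed = {n for n in self.nodes if n < min_seq}
        self.nodes -= freed
        return freed


def racy_put():
    log = TreeModLog()
    min_seq = log.leave_and_get_min_seq(200)  # step 2: list empty, (u64)-1
    b_seq = log.get_seq()                     # step 3: task B gets 201
    c_node = log.log_modification()           # step 4: task C logs node 202
    freed = log.prune(min_seq)                # step 6: node 202 is freed
    return b_seq, c_node, freed


def fixed_put():
    log = TreeModLog()
    b_seq = log.get_seq()
    c_node = log.log_modification()
    # Both halves of the put run back to back, as if under one lock.
    freed = log.prune(log.leave_and_get_min_seq(200))
    return b_seq, c_node, freed
```

In the racy ordering the node with sequence number 202 is freed while task B (sequence 201) may still use it; in the other ordering the minimum is computed after B joined the list, so the node survives.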
2020-01-23 | btrfs: separate definition of assertion failure handlers | David Sterba | 1 | -8/+12

There's a report where objtool detects unreachable instructions, e.g.:

  fs/btrfs/ctree.o: warning: objtool: btrfs_search_slot()+0x2d4: unreachable instruction

This seems to be a false positive due to compiler version. The cause is in the ASSERT macro implementation that does the conditional check as IS_DEFINED(CONFIG_BTRFS_ASSERT) and not an #ifdef.

To avoid that, use the ifdefs directly.

There are still 2 reports that aren't fixed:

  fs/btrfs/extent_io.o: warning: objtool: __set_extent_bit()+0x71f: unreachable instruction
  fs/btrfs/relocation.o: warning: objtool: find_data_references()+0x4e0: unreachable instruction

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: keep track of discard reuse stats | Dennis Zhou | 1 | -0/+3

Keep track of how much we are discarding and how often we are reusing with async discard. The discard_*_bytes values don't need any special protection because the work item provides the single threaded access.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: have multiple discard lists | Dennis Zhou | 1 | -1/+1

Non-block group destruction discarding currently only had a single list with no minimum discard length. This can lead to caravanning more meaningful discards behind a heavily fragmented block group.

This adds support for multiple lists with minimum discard lengths to prevent the caravan effect. We promote block groups back up when we exceed the BTRFS_ASYNC_DISCARD_MAX_FILTER size. Currently we support only 2 lists, with filters of 1MB and 32KB respectively.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: make max async discard size tunable | Dennis Zhou | 1 | -0/+1

Expose max_discard_size as a tunable via sysfs and switch the current fixed maximum to the default value.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: add kbps discard rate limit for async discard | Dennis Zhou | 1 | -0/+2

Provide the ability to rate limit based on kbps in addition to iops as additional guides for the target discard rate. The delay used ends up being max(kbps_delay, iops_delay).

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
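
The max(kbps_delay, iops_delay) rule can be illustrated with a short Python sketch. Only the max() combination comes from the commit message; how kbps_delay is derived here (time needed for the bytes of the last discard at the configured rate), the function names, and treating a limit of 0 as "no limit" are assumptions made for illustration.

```python
MSEC_PER_SEC = 1000


def kbps_delay_ms(bytes_discarded, kbps_limit):
    """Milliseconds that bytes_discarded takes at kbps_limit KiB/s.

    Assumption for this sketch: a limit of 0 means no kbps limit.
    """
    if kbps_limit == 0:
        return 0
    return bytes_discarded * MSEC_PER_SEC // (kbps_limit * 1024)


def combined_delay_ms(iops_delay_ms, bytes_discarded, kbps_limit):
    """The delay used is the larger of the kbps and iops based delays."""
    return max(kbps_delay_ms(bytes_discarded, kbps_limit), iops_delay_ms)
```

For example, after discarding 1 MiB with a 1024 KiB/s limit the kbps based delay is 1000 ms, which dominates a 10 ms iops based delay; for a 4 KiB discard the iops based delay wins.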
2020-01-20 | btrfs: calculate discard delay based on number of extents | Dennis Zhou | 1 | -0/+2

An earlier patch keeps track of discardable_extents. These are undiscarded extents managed by the free space cache. Here, we will use this to dynamically calculate the discard delay interval.

There are 3 rates to consider. The first is the target convergence rate, the rate to discard all discardable_extents over the BTRFS_DISCARD_TARGET_MSEC time frame. This is clamped by the lower limit, the iops limit or BTRFS_DISCARD_MIN_DELAY (1ms), and the upper limit, BTRFS_DISCARD_MAX_DELAY (1s). We reevaluate this delay every transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
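
A rough Python sketch of the clamping described in this commit follows. The 1 ms and 1 s bounds are taken from the commit message; the 6 hour value used for the target window, the way the iops limit is turned into a lower bound, the handling of zero extents, and all names are assumptions of this sketch rather than the kernel implementation.

```python
MIN_DELAY_MSEC = 1                 # BTRFS_DISCARD_MIN_DELAY, per the message
MAX_DELAY_MSEC = 1000              # BTRFS_DISCARD_MAX_DELAY, per the message
TARGET_MSEC = 6 * 60 * 60 * 1000   # assumption: 6 hour convergence window


def clamp(value, lower, upper):
    return max(lower, min(value, upper))


def convergence_delay_ms(discardable_extents, iops_limit):
    """Delay between discards so all extents are handled within the window."""
    lower = MIN_DELAY_MSEC
    if iops_limit:
        # Assumption: an iops limit raises the lower bound to 1s / iops.
        lower = max(lower, 1000 // iops_limit)
    if discardable_extents <= 0:
        # Assumption: nothing to discard, so wait the maximum delay.
        return MAX_DELAY_MSEC
    target = TARGET_MSEC // discardable_extents
    return clamp(target, lower, MAX_DELAY_MSEC)
```

With few extents the target delay exceeds the upper limit and is clamped to 1 s; with very many extents it is clamped from below by the iops derived limit or by 1 ms.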
2020-01-20 | btrfs: keep track of discardable_bytes for async discard | Dennis Zhou | 1 | -0/+1

Keep track of this metric so that we can understand how far ahead or behind we are in discarding rate. This uses the same accounting method as discardable_extents, deltas between previous/current values and propagating them up.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: track discardable extents for async discard | Dennis Zhou | 1 | -0/+9

The number of discardable extents will serve as the rate limiting metric for how often we should discard. This keeps track of discardable extents in the free space caches by maintaining deltas and propagating them to the global count.

The deltas are calculated from 2 values stored in PREV and CURR entries, then propagated up to the global discard ctl. The current counter value becomes the previous counter value after update.

Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
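
The PREV/CURR delta accounting can be sketched in a few lines of Python. The class and function names below are hypothetical stand-ins for the per-cache and global structures; only the scheme itself (compute CURR minus PREV, add it to the global count, then make CURR the new PREV) is taken from the commit message.

```python
PREV, CURR = 0, 1


class DiscardCtl:
    """Hypothetical stand-in for the global discard control."""

    def __init__(self):
        self.discardable_extents = 0


class FreeSpaceCtl:
    """Hypothetical stand-in for one block group's free space cache."""

    def __init__(self):
        self.discardable_extents = [0, 0]  # indexed by PREV and CURR


def propagate(discard_ctl, ctl):
    """Push the change since the last update up to the global counter."""
    delta = ctl.discardable_extents[CURR] - ctl.discardable_extents[PREV]
    if delta:
        discard_ctl.discardable_extents += delta
        # The current value becomes the previous value after the update.
        ctl.discardable_extents[PREV] = ctl.discardable_extents[CURR]
    return delta
```

Because only deltas are propagated, several caches can update the same global counter without the global side having to know each cache's absolute value.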
2020-01-20 | btrfs: sysfs: add UUID/debug/discard directory | Dennis Zhou | 1 | -0/+1

Set up the base sysfs directory for discard stats + tunables.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: sysfs: make UUID/debug have its own kobject | Dennis Zhou | 1 | -0/+4

Btrfs only allowed attributes to be exposed in debug/. Let's let other groups be created by making debug its own kobject.

This also makes the per-fs debug options separate from the global features mount attributes. This seems to be needed as sysfs_create_files() requires const struct attribute * while sysfs_create_group() can take struct attribute *. This seems nicer as per file system, you'll probably use to_fs_info().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: handle empty block_group removal for async discard | Dennis Zhou | 1 | -2/+7

block_group removal is a little tricky. It can race with the extent allocator, the cleaner thread, and balancing. The current path is for a block_group to be added to the unused_bgs list. Then, when the cleaner thread comes around, it starts a transaction and then proceeds with removing the block_group. Extents that are pinned are subsequently removed from the pinned trees and then eventually a discard is issued for the entire block_group.

Async discard introduces another player into the game, the discard workqueue. While it has none of the racing issues, the new problem is ensuring we don't leave free space untrimmed prior to forgetting the block_group. This is handled by placing fully free block_groups on a separate discard queue. This is necessary to maintain discarding order as in the future we will slowly trim even fully free block_groups. The ordering helps us make progress on the same block_group rather than say the last fully freed block_group or needing to search through the fully freed block groups at the beginning of a list and insert after.

The new order of events is a fully freed block group gets placed on the unused discard queue first. Once it's processed, it will be placed on the unused_bgs list and then the original sequence of events will happen, just without the final whole block_group discard.

The mount flags can change when processing unused_bgs, so when flipping from DISCARD to DISCARD_ASYNC, the unused_bgs must be punted to the discard_list to be trimmed. If we flip off DISCARD_ASYNC, we punt free block groups on the discard_list to the unused_bg queue which will do the final discard for us.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: add the beginning of async discard, discard workqueue | Dennis Zhou | 1 | -0/+21

When discard is enabled, every time a pinned extent is released back to the block_group's free space cache, a discard is issued for the extent. This is an overeager approach when it comes to discarding and helping the SSD maintain enough free space to prevent severe garbage collection situations.

This adds the beginning of async discard. Instead of issuing a discard prior to returning it to the free space, it is just marked as untrimmed. The block_group is then added to a LRU which then feeds into a workqueue to issue discards at a much slower rate. Full discarding of unused block groups is still done and will be addressed in a future patch of the series.

For now, we don't persist the discard state of extents and bitmaps. Therefore, our failure recovery mode will be to consider extents untrimmed. This lets us handle failure and unmounting as one and the same.

On a number of Facebook webservers, I collected data every minute accounting the time we spent in btrfs_finish_extent_commit() (col. 1) and in btrfs_commit_transaction() (col. 2). btrfs_finish_extent_commit() is where we discard extents synchronously before returning them to the free space cache.

discard=sync:
                 p99 total per minute     p99 total per minute
      Drive   |  extent_commit() (ms)  |  commit_trans() (ms)
    ---------------------------------------------------------------
     Drive A  |          434           |         1170
     Drive B  |          880           |         2330
     Drive C  |         2943           |         3920
     Drive D  |         4763           |         5701

discard=async:
                 p99 total per minute     p99 total per minute
      Drive   |  extent_commit() (ms)  |  commit_trans() (ms)
    --------------------------------------------------------------
     Drive A  |          134           |          956
     Drive B  |           64           |         1972
     Drive C  |           59           |         1032
     Drive D  |           62           |         1200

While it's not great that the stats are cumulative over 1m, all of these servers are running the same workload and the delta between the two is substantial. We are spending significantly less time in btrfs_finish_extent_commit() which is responsible for discarding.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: rename DISCARD mount option to to DISCARD_SYNC | Dennis Zhou | 1 | -1/+1

This series introduces async discard which will use the flag DISCARD_ASYNC, so rename the original flag to DISCARD_SYNC as it is synchronously done in transaction commit.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: drop create parameter to btrfs_get_extent() | Omar Sandoval | 1 | -1/+1

We only pass this as 1 from __extent_writepage_io(). The parameter basically means "pretend I didn't pass in a page". This is silly since we can simply not pass in the page. Get rid of the parameter from btrfs_get_extent(), and since it's used as a get_extent_t callback, remove it from get_extent_t and btree_get_extent(), neither of which need it.

While we're here, let's document btrfs_get_extent().

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: get rid of at_offset parameter to btrfs_lookup_bio_sums() | Omar Sandoval | 1 | -1/+1

We can encode this in the offset parameter: -1 means use the page offsets, anything else is a valid offset.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: get rid of trivial __btrfs_lookup_bio_sums() wrappers | Omar Sandoval | 1 | -3/+1

Currently, we have two wrappers for __btrfs_lookup_bio_sums(): btrfs_lookup_bio_sums_dio(), which is used for direct I/O, and btrfs_lookup_bio_sums(), which is used everywhere else. The only difference is that the _dio variant looks up csums starting at the given offset instead of using the page index, which isn't actually direct I/O-specific. Let's clean up the signature and return value of __btrfs_lookup_bio_sums(), rename it to btrfs_lookup_bio_sums(), and get rid of the trivial helpers.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-01-20 | btrfs: Rename __btrfs_free_reserved_extent to btrfs_pin_reserved_extent | Nikolay Borisov | 1 | -2/+2

__btrfs_free_reserved_extent now performs the actions of btrfs_free_and_pin_reserved_extent. But this name is a bit of a misnomer, since the extent is not really freed but just pinned. Reflect this in the new name. No semantic changes.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-12-13 | Btrfs: fix missing data checksums after replaying a log tree | Filipe Manana | 1 | -1/+1

When logging a file that has shared extents (reflinked with other files or with itself), we can end up logging multiple checksum items that cover overlapping ranges. This confuses the search for checksums at log replay time causing some checksums to never be added to the fs/subvolume tree.

Consider the following example of a file that shares the same extent at offsets 0 and 256Kb:

   [ bytenr 13893632, offset 64Kb, len 64Kb  ]
   0                                         64Kb

   [ bytenr 13631488, offset 64Kb, len 192Kb ]
   64Kb                                      256Kb

   [ bytenr 13893632, offset 0, len 256Kb    ]
   256Kb                                     512Kb

When logging the inode, at tree-log.c:copy_items(), when processing the file extent item at offset 0, we log a checksum item covering the range 13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 + 64Kb + 64Kb, respectively.

Later when processing the extent item at offset 256K, we log the checksums for the range from 13893632 to 14155776 (which corresponds to 13893632 + 256Kb). These checksums get merged with the checksum item for the range from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync. So after this we get the two following checksum items in the log tree:

   (...)
   item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
           range start 13631488 end 14155776 length 524288
   item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
           range start 13959168 end 14024704 length 65536

The first one covers the range from the second one, they overlap.

So far this does not cause a problem after replaying the log, because when replaying the file extent item for offset 256K, we copy all the checksums for the extent 13893632 from the log tree to the fs/subvolume tree, since searching for a checksum item for bytenr 13893632 leaves us at the first checksum item, which covers the whole range of the extent.

However if we write 64Kb to file offset 256Kb for example, we will not be able to find and copy the checksums for the last 128Kb of the extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.

After writing 64Kb into file offset 256Kb we get the following extent layout for our file:

   [ bytenr 13893632, offset 64K, len 64Kb   ]
   0                                         64Kb

   [ bytenr 13631488, offset 64Kb, len 192Kb ]
   64Kb                                      256Kb

   [ bytenr 14155776, offset 0, len 64Kb     ]
   256Kb                                     320Kb

   [ bytenr 13893632, offset 64Kb, len 192Kb ]
   320Kb                                     512Kb

After fsync'ing the file, if we have a power failure and then mount the filesystem to replay the log, the following happens:

1) When replaying the file extent item for file offset 320Kb, we look up the checksums for the extent range from 13959168 (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call to btrfs_lookup_csums_range();

2) btrfs_lookup_csums_range() finds the checksum item that starts precisely at offset 13959168 (item 7 in the log tree, shown before);

3) However that checksum item only covers 64Kb of data, and not 192Kb of data;

4) As a result only the checksums for the first 64Kb of data referenced by the file extent item are found and copied to the fs/subvolume tree. The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get the corresponding data checksums found and copied to the fs/subvolume tree.

5) After replaying the log userspace will not be able to read the file range from 384Kb to 512Kb, because the checksums are missing, resulting in an -EIO error.

The following steps reproduce this scenario:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar
  $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar

  $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar

  $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
  $ xfs_io -c "fsync" /mnt/sdc/foobar

  <power failure>

  $ mount /dev/sdc /mnt/sdc
  $ md5sum /mnt/sdc/foobar
  md5sum: /mnt/sdc/foobar: Input/output error

  $ dmesg | tail
  [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
  [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
  [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
  [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
  [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
  [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
  [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
  [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
  [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
  [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1

Fix this simply by deleting first any checksums, from the log tree, for the range of the extent we are logging at copy_items(). This ensures we do not get checksum items in the log tree that have overlapping ranges.

This is a long-standing issue that has been present since we have the clone (and deduplication) ioctl, and can happen both when an extent is shared between different files and within the same file.

A test case for fstests follows soon.

CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
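
The overlapping checksum items and the effect of deleting the range first can be reproduced with a small Python model, using the byte numbers from this commit message. This is a simplified sketch, not the btrfs implementation: checksum items are modeled as a dict mapping range start to range end, and log_csums(), lookup_csums() and build_log() are hypothetical helpers. The lookup mimics a search that lands on an exact key match (or else the preceding item) and then walks forward.

```python
import bisect

KB = 1024


def log_csums(items, start, end, delete_first):
    """Record checksums for [start, end) in items (dict: start -> end)."""
    if delete_first:
        # Modeled fix: drop any logged checksums overlapping the new range.
        for s, e in list(items.items()):
            if s < end and e > start:
                del items[s]
                if s < start:
                    items[s] = start   # keep the part left of the range
                if e > end:
                    items[end] = e     # keep the part right of the range
    # Merge with an item that ends exactly where the new range starts.
    for s, e in items.items():
        if e == start:
            items[s] = end
            return
    items[start] = end


def lookup_csums(items, start, end):
    """Return how many bytes of [start, end) have checksums found."""
    keys = sorted(items)
    first = max(bisect.bisect_right(keys, start) - 1, 0)
    covered, pos = 0, start
    for s in keys[first:]:
        lo, hi = max(s, pos), min(items[s], end)
        if lo < hi:
            covered += hi - lo
            pos = hi
    return covered


def build_log(delete_first):
    items = {13631488: 13893632}  # logged by the previous fsync
    log_csums(items, 13959168, 14024704, delete_first)  # extent at offset 0
    log_csums(items, 13893632, 14155776, delete_first)  # extent at 256Kb
    return items
```

Without the deletion step the model ends up with the same two overlapping items as shown in the log tree dump, and the lookup for the range 13959168 to 14155776 finds only 64Kb of the 192Kb needed. With the deletion step there is a single item and the whole range is found.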
2019-11-18 | btrfs: rename btrfs_block_group_cache | David Sterba | 1 | -7/+7

The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format.

Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: add incompat for raid1 with 3, 4 copies | David Sterba | 1 | -1/+2

The new raid1c3 and raid1c4 profiles are backward incompatible and the name shall be 'raid1c34', the status can be found in the global supported features in /sys/fs/btrfs/features or in the per-filesystem directory.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: add support for 4-copy replication (raid1c4) | David Sterba | 1 | -2/+2

Add new block group profile to store 4 copies in a similar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2- and 3-copy RAID1.

The minimum number of devices is 4, the maximum number of devices/chunks that can be lost/damaged is 3. There is no comparable traditional RAID level, the profile is added for future needs to accompany triple-parity and beyond.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: add support for 3-copy replication (raid1c3) | David Sterba | 1 | -2/+2

Add new block group profile to store 3 copies in a similar way that current RAID1 does. The profile attributes and constraints are defined in the raid table and used by the same code that already handles the 2-copy RAID1.

The minimum number of devices is 3, the maximum number of devices/chunks that can be lost/damaged is 2. Like RAID6 but with 33% space utilization.

Signed-off-by: David Sterba <dsterba@suse.com>
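
The constraints quoted for the N-copy profiles follow a simple pattern, shown here as a tiny Python sketch. The helper name and the dictionary layout are made up for illustration and do not mirror the kernel's raid table; the values for 3 and 4 copies match the numbers given in the two commit messages.

```python
def raid1_copies_profile(copies):
    """Derive the quoted constraints for a profile storing `copies` copies."""
    return {
        "min_devices": copies,               # one device per copy
        "tolerated_failures": copies - 1,    # one surviving copy is enough
        "space_utilization_pct": 100 // copies,
    }
```

For 3 copies this gives a minimum of 3 devices, 2 tolerated failures and 33% space utilization; for 4 copies, 4 devices and 3 tolerated failures.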
2019-11-18 | btrfs: rename extent buffer block group item accessors | David Sterba | 1 | -3/+3

Accessors defined by BTRFS_SETGET_FUNCS take a raw extent buffer and manipulate the items there, there's no special prefix required. The block group accessors had _disk_ because previously the names were occupied by the on-stack accessors. As this has been addressed in the previous patch, we can now unify the naming.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: rename block_group_item on-stack accessors to follow naming | David Sterba | 1 | -3/+3

All accessors defined by BTRFS_SETGET_STACK_FUNCS contain _stack_ in the name, the block group ones were not following that scheme, so let's switch them.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: add member for a specific checksum driver | David Sterba | 1 | -0/+1

Currently all the checksum algorithms generate a fixed size digest and we use it. The on-disk format can hold up to BTRFS_CSUM_SIZE bytes and BLAKE2b produces digest of 512 bits by default. We can't do that and will use the blake2b-256, this needs to be passed to the crypto API.

Separate that from the base algorithm name and add a member to request specific driver, in this case with the digest size.

The only place that uses the driver name is the crypto API setup.

Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: sysfs: export supported checksums | David Sterba | 1 | -0/+2

Export supported checksum algorithms via sysfs in the list of static features:

  /sys/fs/btrfs/features/supported_checksums

Space separated list of checksum algorithm names.

Co-developed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | Btrfs: delete the entire async bio submission framework | Chris Mason | 1 | -1/+0

Now that we're not using btrfs_schedule_bio() anymore, delete all the code that supported it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: add __pure attribute to functions | David Sterba | 1 | -2/+2

The attribute is more relaxed than const and the functions could dereference pointers, as long as the observable state is not changed. We do have such functions, based on -Wsuggest-attribute=pure .

The visible effects of this patch are negligible, there are differences in the assembly but hard to summarize.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: add const function attribute | David Sterba | 1 | -1/+1

For some reason the attribute is called __attribute_const__ and not __const, marks functions that have no observable effects on program state, IOW not reading pointers, just the arguments and calculating a value. Allows the compiler to do some optimizations, based on -Wsuggest-attribute=const . The effects are rather small, though, about 60 bytes decrease of btrfs.ko.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: drop unused parameter is_new from btrfs_iget | David Sterba | 1 | -3/+2

The parameter is now always set to NULL and could be dropped. The last user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs: unify subvol= and subvolid= mounting") and the parameter became unused.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: move btrfs_unlock_up_safe to other locking functions | David Sterba | 1 | -2/+0

The function belongs to the family of locking functions, so move it there. The 'noinline' keyword is dropped as it's now an exported function that does not need it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18 | btrfs: move extent_io_tree defs to their own header | Josef Bacik | 1 | -0/+1

extent_io.c/h are huge, encompassing a bunch of different things. The extent_io_tree code can live on its own, so separate this out.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2019-10-15 | btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents() | Qu Wenruo | 1 | -2/+1

[Background]
Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC.

PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(). While PREALLOC is for delalloc, where we reserve space before joining a transaction, and finally it will be converted to PERTRANS after the writeback is done.

[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.

The most obvious one is:
In btrfs_buffered_write():
	btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);

We always free qgroup PREALLOC meta space.

While in btrfs_truncate_block():
	btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));

We only free qgroup PREALLOC meta space when something went wrong.

[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we should always free PREALLOC metadata space.

The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even if it's not necessary
  In btrfs_delalloc_reserve_metadata()

- Free the unused metadata space
  Normally in:
  btrfs_delalloc_release_extents()
  |- btrfs_inode_rsv_release()
     Here we do calculation on whether we should release or not.

E.g. for 64K buffered write, the metadata rsv works like:

/* The first page */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=0
total:		num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata rsv */

/* The 2nd page */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=calc_inode_reservations()
total:		not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta rsv, so we free what we have reserved */

/* The 3rd~16th pages */
reserve_meta:	num_bytes=calc_inode_reservations()
free_meta:	num_bytes=calc_inode_reservations()
total:		not changed (still space for one outstanding extent)

This means, if btrfs_delalloc_release_extents() determines to free some space, then that space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() rather than btrfs_qgroup_convert_reserved_meta().

The good news is:
- The callers are not that hot
  The hottest caller is in btrfs_buffered_write(), which is already fixed by commit 336a8bb8e36a ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause false EDQUOT.

- The trans commit in advance for qgroup would hide the bug
  Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is low, it will try to commit transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such bug.

[FIX]
So to fix the problem, remove the @qgroup_free parameter for btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release().

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
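
The per-page reserve/free pattern of the 64K buffered write example can be traced with a short Python sketch. It is a deliberately simplified model: `unit` stands in for whatever calc_inode_reservations() would return, the function name is hypothetical, and the only rule modeled is that a page keeps its reservation when it creates the first outstanding extent and frees it otherwise.

```python
def buffered_write_rsv_total(num_pages, unit):
    """Return the metadata reservation left after writing num_pages pages."""
    total = 0
    outstanding_extents = 0
    for _ in range(num_pages):
        total += unit                 # reserve_meta: always reserve first
        if outstanding_extents == 0:
            outstanding_extents = 1   # first page: the reservation is kept
        else:
            total -= unit             # free_meta: no new outstanding extent
    return total
```

For a 64K write with 4K pages (16 pages) the total stays at exactly one reservation unit, which is why the unused reservations have to be freed right away rather than converted.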
2019-10-15 | btrfs: don't needlessly create extent-refs kernel thread | David Sterba | 1 | -2/+0
The patch 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") removed the async delayed refs, but the thread is still created without any use. Remove it to avoid resource consumption. Fixes: 32b593bfcb58 ("Btrfs: remove no longer used function to run delayed refs asynchronously") CC: stable@vger.kernel.org # 5.2+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: create structure to encode checksum type and length | Johannes Thumshirn | 1 | -18/+2
Create a structure to encode the type and length for the known on-disk checksums. This makes it easier to add new checksums later. The structure and helpers are moved from ctree.h so they don't occupy space in all headers including ctree.h. This saves some space in the final object. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
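A table-driven layout of the kind described above might look like the following sketch. All names here (`toy_csum_type`, `toy_csum_info`, `toy_csums`, `toy_csum_size()`, `toy_csum_name()`) are hypothetical and not taken from the patch; only a single CRC32C entry with its 4-byte size is shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type identifiers; only one entry is shown for illustration. */
enum toy_csum_type {
	TOY_CSUM_TYPE_CRC32 = 0,
	TOY_CSUM_TYPE_COUNT
};

/* One table row per checksum: on-disk size in bytes and a readable name. */
struct toy_csum_info {
	uint16_t size;
	const char *name;
};

static const struct toy_csum_info toy_csums[TOY_CSUM_TYPE_COUNT] = {
	[TOY_CSUM_TYPE_CRC32] = { .size = 4, .name = "crc32c" },
};

/* Return the checksum size in bytes, or 0 for an unknown type. */
static uint16_t toy_csum_size(unsigned type)
{
	return type < TOY_CSUM_TYPE_COUNT ? toy_csums[type].size : 0;
}

/* Return the checksum name, or NULL for an unknown type. */
static const char *toy_csum_name(unsigned type)
{
	return type < TOY_CSUM_TYPE_COUNT ? toy_csums[type].name : NULL;
}
```

With this shape, adding a checksum means adding one enum value and one table row; the lookup helpers stay unchanged.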
2019-09-09 | btrfs: tie extent buffer and it's token together | David Sterba | 1 | -1/+3
Further simplification of the get/set helpers is possible when the token is uniquely tied to an extent buffer. A condition and an assignment can be avoided. The initializations are moved closer to the first use when the extent buffer is valid. There's one exception in __push_leaf_left where the token is reused. Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: define separate btrfs_set/get_XX helpers | David Sterba | 1 | -11/+4
There are helpers for all type widths defined via macro and optionally can use a token which is a cached pointer to avoid repeated mapping of the extent buffer. The token value is known at compile time: when it's valid it's always the address of a local variable, otherwise it's NULL passed by the token-less helpers. This can be utilized to remove some branching as the helpers are used frequently. Signed-off-by: David Sterba <dsterba@suse.com>
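The idea of splitting token-aware and token-less helpers can be sketched in plain C as below. This is only an illustration under simplifying assumptions: `struct toy_buffer`, `struct toy_token`, `toy_get_token_32()` and `toy_get_32()` are hypothetical, the "mapping" is just a pointer to an in-memory array, and values are read in host byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical buffer and token types; not the kernel's definitions. */
struct toy_buffer {
	uint8_t data[64];
};

struct toy_token {
	const struct toy_buffer *eb; /* buffer the cached pointer belongs to */
	const uint8_t *cached;       /* cached mapping of eb->data */
};

/* Token-aware getter: reuses the cached mapping when it matches. */
static inline uint32_t toy_get_token_32(const struct toy_buffer *eb,
					 size_t off, struct toy_token *token)
{
	uint32_t val;

	if (!token->cached || token->eb != eb) {
		token->eb = eb;
		token->cached = eb->data; /* stands in for an expensive mapping */
	}
	memcpy(&val, token->cached + off, sizeof(val));
	return val;
}

/* Token-less getter: a separate helper, so it needs no token check at all. */
static inline uint32_t toy_get_32(const struct toy_buffer *eb, size_t off)
{
	uint32_t val;

	memcpy(&val, eb->data + off, sizeof(val));
	return val;
}
```

Because the token-less variant never sees a token argument, there is no NULL test left on its path, which is the kind of branch the commit message refers to.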
2019-09-09 | btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extref | Nikolay Borisov | 1 | -5/+3
btrfs_find_name_in_ext_backref returns either 0 or 1 depending on whether it found a backref for the given name. If it returns true then the actual inode_ref struct is returned in one of its parameters. That's pointless; instead, refactor the function such that it returns either a pointer to the btrfs_inode_extref or NULL if it didn't find anything. This streamlines the function calling convention. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: Make btrfs_find_name_in_backref return btrfs_inode_ref struct | Nikolay Borisov | 1 | -3/+3
btrfs_find_name_in_backref returns either 0 or 1 depending on whether it found a backref for the given name. If it returns true then the actual inode_ref struct is returned in one of its parameters. That's pointless; instead, refactor the function such that it returns either a pointer to the btrfs_inode_ref or NULL if it didn't find anything. This streamlines the function calling convention. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
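The change of calling convention described above can be illustrated with a generic sketch. The types and functions here (`struct toy_ref`, `toy_find_ref_old()`, `toy_find_ref()`) are hypothetical and much simpler than the btrfs helpers; they only contrast "flag plus out parameter" with "pointer or NULL".

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified reference record. */
struct toy_ref {
	const char *name;
};

/* Old style: returns 0 or 1 and hands the match back via an out parameter. */
static int toy_find_ref_old(const struct toy_ref *refs, size_t n,
			    const char *name, const struct toy_ref **ref_ret)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(refs[i].name, name) == 0) {
			*ref_ret = &refs[i];
			return 1;
		}
	}
	return 0;
}

/* New style: the return value is the match itself, or NULL if not found. */
static const struct toy_ref *toy_find_ref(const struct toy_ref *refs, size_t n,
					  const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(refs[i].name, name) == 0)
			return &refs[i];
	}
	return NULL;
}
```

A caller of the new style needs one variable and one NULL check, instead of a result flag plus a separately initialized out pointer.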
2019-09-09 | btrfs: move dev_stats helpers to volumes.c | David Sterba | 1 | -24/+0
The other dev stats functions are already there and the helpers are not used by anything else. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: move struct io_ctl to free-space-cache.h | David Sterba | 1 | -14/+0
The io_ctl structure is used for free space management, and used only by the v1 space cache code, but unfortunately the full definition is required by block-group.h so it can't be moved to free-space-cache.c without additional changes. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: move functions for tree compare to send.c | David Sterba | 1 | -14/+0
Send is the only user of tree_compare, we can move it there along with the other helpers and definitions. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: rename and export read_node_slot | David Sterba | 1 | -0/+3
Preparatory work for code that will be moved out of ctree and uses this function. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: move private raid56 definitions from ctree.h | David Sterba | 1 | -16/+0
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: move cond_wake_up functions out of ctree | David Sterba | 1 | -22/+0
The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09 | btrfs: fix allocation of free space cache v1 bitmap pages | Christophe Leroy | 1 | -0/+1
Various notifications of type "BUG kmalloc-4096 () : Redzone overwritten" have been observed recently in various parts of the kernel. After some time, a relation was established with the use of the BTRFS filesystem with SLUB_DEBUG turned on.

[ 22.809700] BUG kmalloc-4096 (Tainted: G W ): Redzone overwritten
[ 22.810286] INFO: 0xbe1a5921-0xfbfc06cd. First byte 0x0 instead of 0xcc
[ 22.810866] INFO: Allocated in __load_free_space_cache+0x588/0x780 [btrfs] age=22 cpu=0 pid=224
[ 22.811193] __slab_alloc.constprop.26+0x44/0x70
[ 22.811345] kmem_cache_alloc_trace+0xf0/0x2ec
[ 22.811588] __load_free_space_cache+0x588/0x780 [btrfs]
[ 22.811848] load_free_space_cache+0xf4/0x1b0 [btrfs]
[ 22.812090] cache_block_group+0x1d0/0x3d0 [btrfs]
[ 22.812321] find_free_extent+0x680/0x12a4 [btrfs]
[ 22.812549] btrfs_reserve_extent+0xec/0x220 [btrfs]
[ 22.812785] btrfs_alloc_tree_block+0x178/0x5f4 [btrfs]
[ 22.813032] __btrfs_cow_block+0x150/0x5d4 [btrfs]
[ 22.813262] btrfs_cow_block+0x194/0x298 [btrfs]
[ 22.813484] commit_cowonly_roots+0x44/0x294 [btrfs]
[ 22.813718] btrfs_commit_transaction+0x63c/0xc0c [btrfs]
[ 22.813973] close_ctree+0xf8/0x2a4 [btrfs]
[ 22.814107] generic_shutdown_super+0x80/0x110
[ 22.814250] kill_anon_super+0x18/0x30
[ 22.814437] btrfs_kill_super+0x18/0x90 [btrfs]
[ 22.814590] INFO: Freed in proc_cgroup_show+0xc0/0x248 age=41 cpu=0 pid=83
[ 22.814841] proc_cgroup_show+0xc0/0x248
[ 22.814967] proc_single_show+0x54/0x98
[ 22.815086] seq_read+0x278/0x45c
[ 22.815190] __vfs_read+0x28/0x17c
[ 22.815289] vfs_read+0xa8/0x14c
[ 22.815381] ksys_read+0x50/0x94
[ 22.815475] ret_from_syscall+0x0/0x38

Commit 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy") changed the way bitmap blocks are copied. But although bitmaps have the size of a page, they were allocated with kzalloc().

Most of the time, kzalloc() allocates aligned blocks of memory, so copy_page() can be used. But when some debug options like SLAB_DEBUG are activated, kzalloc() may return an unaligned pointer.

On powerpc, memcpy(), copy_page() and other copying functions use the 'dcbz' instruction, which provides an entire zeroed cacheline to avoid a memory read when the intention is to overwrite a full line. Functions like memcpy() are written to care about partial cachelines at the start and end of the destination, but copy_page() assumes it gets pages. As pages are naturally cache aligned, copy_page() doesn't care about partial lines. This means that when copy_page() is called with a misaligned pointer, a few leading bytes are zeroed.

To fix it, allocate bitmaps through kmem_cache instead of using kzalloc(). The cache pool is created with a PAGE_SIZE alignment constraint.

Reported-by: Erhard F. <erhard_f@mailbox.org>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204371
Fixes: 69d2480456d1 ("btrfs: use copy_page for copying pages instead of memcpy")
Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
[ rename to btrfs_free_space_bitmap ]
Signed-off-by: David Sterba <dsterba@suse.com>
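The alignment requirement behind this fix can be demonstrated with a user-space analogue. The sketch below is not the kernel change: it uses C11 `aligned_alloc()` where the patch uses a kmem_cache with a PAGE_SIZE alignment constraint, and `TOY_PAGE_SIZE`, `toy_is_page_aligned()` and `toy_alloc_bitmap()` are hypothetical names with an assumed 4096-byte page.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u /* assumed page size, for illustration only */

/* A page-copy routine may only be used on pointers that pass this check. */
static int toy_is_page_aligned(const void *p)
{
	return ((uintptr_t)p & (TOY_PAGE_SIZE - 1)) == 0;
}

/*
 * Allocate one zeroed, page-sized bitmap with an explicit alignment
 * guarantee instead of relying on what a general-purpose allocator
 * happens to return.
 */
static void *toy_alloc_bitmap(void)
{
	void *p = aligned_alloc(TOY_PAGE_SIZE, TOY_PAGE_SIZE);

	if (p)
		memset(p, 0, TOY_PAGE_SIZE);
	return p;
}
```

The point of the sketch is that the alignment becomes part of the allocation request itself, so it no longer depends on allocator debug options.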
2019-09-09 | btrfs: rename the btrfs_calc_*_metadata_size helpers | Josef Bacik | 1 | -5/+9
btrfs_calc_trunc_metadata_size differs from trans_metadata_size in that it doesn't take into account any splitting at the levels, because truncate will never split nodes. However truncate _and_ changing will never split nodes, so rename btrfs_calc_trunc_metadata_size to btrfs_calc_metadata_size. Also btrfs_calc_trans_metadata_size is purely for inserting items, so rename this to btrfs_calc_insert_metadata_size. Making these clearer will help when I start using them differently in upcoming patches. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
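The difference between the two renamed helpers can be sketched as arithmetic. The multipliers below (a maximum tree depth of 8 and a factor of 2 for the insert case, to leave room for splits at each level) are assumptions made for illustration, as are the names `TOY_MAX_LEVEL`, `toy_calc_metadata_size()` and `toy_calc_insert_metadata_size()`.

```c
#include <stdint.h>

#define TOY_MAX_LEVEL 8 /* assumed maximum tree depth */

/* No splitting expected: one node per level for each item touched. */
static uint64_t toy_calc_metadata_size(uint32_t nodesize, unsigned num_items)
{
	return (uint64_t)nodesize * TOY_MAX_LEVEL * num_items;
}

/* Inserting items may split nodes: assume twice the per-level cost. */
static uint64_t toy_calc_insert_metadata_size(uint32_t nodesize,
					      unsigned num_items)
{
	return (uint64_t)nodesize * TOY_MAX_LEVEL * 2 * num_items;
}
```

Under these assumptions, with a 16 KiB nodesize a single item costs 128 KiB for a change that cannot split nodes and 256 KiB for an insert.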
2019-09-09 | btrfs: tree-checker: Add EXTENT_DATA_REF check | Qu Wenruo | 1 | -0/+1
EXTENT_DATA_REF is a little like DIR_ITEM which contains hash in its key->offset. This patch will check the following contents:
- Key->objectid
  Basic alignment check.
- Hash
  Hash of each extent_data_ref item must match key->offset.
- Offset
  Basic alignment check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
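The three checks listed above can be sketched as a small validation function. This is an illustrative model only: `struct toy_key` and `toy_check_extent_data_ref()` are hypothetical, the hash is passed in already computed instead of being derived from the item, and the return codes are arbitrary negative values, not kernel error codes.

```c
#include <stdint.h>

/* Hypothetical, simplified key: only the fields the checks need. */
struct toy_key {
	uint64_t objectid;
	uint64_t offset;
};

/* sectorsize is assumed to be a power of two. */
static int toy_is_aligned(uint64_t value, uint32_t sectorsize)
{
	return (value & ((uint64_t)sectorsize - 1)) == 0;
}

/*
 * Returns 0 if the item passes all three checks, or a negative value
 * identifying the first check that failed.
 */
static int toy_check_extent_data_ref(const struct toy_key *key,
				     uint64_t ref_offset, uint64_t ref_hash,
				     uint32_t sectorsize)
{
	if (!toy_is_aligned(key->objectid, sectorsize))
		return -1; /* key->objectid alignment */
	if (ref_hash != key->offset)
		return -2; /* hash must match key->offset */
	if (!toy_is_aligned(ref_offset, sectorsize))
		return -3; /* data ref offset alignment */
	return 0;
}
```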
2019-09-09 | btrfs: introduce an evict flushing state | Josef Bacik | 1 | -0/+1
We have this weird space flushing loop inside inode.c for evict where we'll do the normal LIMIT flush, and then commit the transaction and hope we get our space. This is super janky, and in fact there's really nothing stopping us from using FLUSH_ALL except that we run delayed iputs, which means we could deadlock. So introduce a new flush state for eviction that does the normal priority flushing with all of the states that are safe for eviction. The nice side-effect of this is that we'll try harder for evictions. Previously if (for example generic/269) you had a bunch of other operations happening on the fs you could race with those reservations when committing the transaction, and eventually miss getting a reservation for the evict. With this code we'll have our ticket in place through the transaction commit, so any pinned bytes will go to our pending evictions first. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>