path: root/fs/xfs
Age | Commit message | Author | Files | Lines
2017-09-01 | xfs: rename xfs_defer_join to xfs_defer_ijoin | Christoph Hellwig | 3 | -4/+4
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 | xfs: refactor xfs_trans_roll | Christoph Hellwig | 9 | -56/+48
Split xfs_trans_roll into a low-level helper that just rolls the actual transaction and a new higher-level xfs_trans_roll_inode that takes care of logging and rejoining the inode. This gets rid of the NULL inode case and allows simplifying the special cases in the deferred operation code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 | xfs: check for race with xfs_reclaim_inode() in xfs_ifree_cluster() | Omar Sandoval | 2 | -10/+23
After xfs_ifree_cluster() finds an inode in the radix tree and verifies that the inode number is what it expected, xfs_reclaim_inode() can swoop in and free it. xfs_ifree_cluster() will then happily continue working on the freed inode. Most importantly, it will mark the inode stale, which will probably be overwritten when the inode slab object is reallocated, but if it has already been reallocated then we can end up with an inode spuriously marked stale. In 8a17d7ddedb4 ("xfs: mark reclaimed inodes invalid earlier") we added a second check to xfs_iflush_cluster() to detect this race, but the similar RCU lookup in xfs_ifree_cluster() needs the same treatment. Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-09-01 | xfs: evict all inodes involved with log redo item | Darrick J. Wong | 1 | -0/+12
When we introduced the bmap redo log items, we set MS_ACTIVE on the mountpoint and XFS_IRECOVERY on the inode to prevent unlinked inodes from being truncated prematurely during log recovery. This also had the effect of putting linked inodes on the lru instead of evicting them. Unfortunately, we neglected to find all those unreferenced lru inodes and evict them after finishing log recovery, which means that we leak them if anything goes wrong in the rest of xfs_mountfs, because the lru is only cleaned out on unmount. Therefore, evict unreferenced inodes in the lru list immediately after clearing MS_ACTIVE. Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Cc: viro@ZenIV.linux.org.uk Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-08-31 | xfs: perform dax_device lookup at mount | Dan Williams | 6 | -15/+41
The ->iomap_begin() operation is a hot path, so cache the fs_dax_get_by_host() result at mount time to avoid incurring the hash lookup overhead on a per-I/O basis. Reported-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2017-08-23 | block: replace bi_bdev with a gendisk pointer and partitions index | Christoph Hellwig | 2 | -2/+2
This way we don't need a block_device structure to submit I/O. The block_device has different life time rules from the gendisk and request_queue and is usually only available when the block device node is open. Other callers need to explicitly create one (e.g. the lightnvm passthrough code, or the new nvme multipathing code). For the actual I/O path all that we need is the gendisk, which exists once per block device. But given that the block layer also does partition remapping we additionally need a partition index, which is used for said remapping in generic_make_request. Note that all the block drivers generally want request_queue or sometimes the gendisk, so this removes a layer of indirection all over the stack. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-22 | xfs: stop searching for free slots in an inode chunk when there are none | Carlos Maiolino | 1 | -28/+27
In a filesystem without finobt, the space manager selects an AG in which to allocate a new inode, and xfs_dialloc_ag_inobt() then searches that AG for a chunk with a free slot. When the new inode is in the same AG as its parent, the btree is searched starting at the parent's record, and the search is retried from the top if no slot is available beyond the parent's record. To exit this loop, though, xfs_dialloc_ag_inobt() relies on the btree having a free slot available, since its callers relied on agi->freecount when deciding how and where to allocate the new inode. When agi->freecount is corrupted and shows available inodes in an AG that in fact has none, this becomes an infinite loop. Add a way to stop the loop when a free slot is not found in the btree, making the function fall back to the whole-AG scan, which will then be able to detect the corruption and shut the filesystem down. As pointed out by Brian, this might impact performance, given that we no longer reset the search distance when we reach the end of the tree, giving it fewer tries before falling back to the whole-AG search, but it will only affect searches that start within 10 records of the end of the tree. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
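The termination idea in this commit can be sketched in isolation. The following is a simplified model, not the kernel's xfs_dialloc_ag_inobt(): the record type, the function name and the single forward scan with one wrap are all assumptions made for illustration. The point is only that the loop is bounded by the number of records rather than by a counter that may be corrupted.

```c
#include <stddef.h>

/* Hypothetical stand-in for an inode btree record; only freecount matters. */
struct rec {
    int freecount;
};

/*
 * Search for a record with a free slot, starting at index 'start' and
 * wrapping to the beginning once. Returns the index of the first record
 * with a free slot, or -1 once every record has been visited.
 * The loop is bounded by nrecs, so a caller that wrongly believes a free
 * slot must exist cannot make it spin forever.
 */
int find_free_rec(const struct rec *recs, size_t nrecs, size_t start)
{
    size_t i;

    if (nrecs == 0 || start >= nrecs)
        return -1;

    for (i = 0; i < nrecs; i++) {
        size_t idx = (start + i) % nrecs;

        if (recs[idx].freecount > 0)
            return (int)idx;
    }
    return -1; /* caller falls back to a full scan and corruption check */
}
```

A return of -1 here plays the role of "no free slot found in the btree", which is the signal the commit uses to fall back to the whole-AG scan.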
2017-08-22 | xfs: add log recovery tracepoint for head/tail | Brian Foster | 2 | -0/+20
Torn write detection and tail overwrite detection can shift the log head and tail respectively in the event of CRC mismatch or corruption errors. Add a high-level log recovery tracepoint to dump the final log head/tail and make those values easily attainable in debug/diagnostic situations. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 | xfs: handle -EFSCORRUPTED during head/tail verification | Brian Foster | 1 | -4/+3
Torn write and tail overwrite detection both trigger only on -EFSBADCRC errors. While this is the most likely failure scenario for each condition, -EFSCORRUPTED is still possible in certain cases depending on what ends up on disk when a torn write or partial tail overwrite occurs. For example, an invalid log record h_len can lead to an -EFSCORRUPTED error when running the log recovery CRC pass. Therefore, update log head and tail verification to trigger the associated head/tail fixups in the event of -EFSCORRUPTED errors along with -EFSBADCRC. Also, -EFSCORRUPTED can currently be returned from xlog_do_recovery_pass() before rhead_blk is initialized if the first record encountered happens to be corrupted. This leads to an incorrect 'first_bad' return value. Initialize rhead_blk earlier in the function to address that problem as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
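The change in error handling amounts to widening one predicate. A minimal sketch, assuming stand-in numeric values for the two error codes (the values and the helper name below are hypothetical and chosen only so the example compiles on its own):

```c
#include <stdbool.h>

/* Assumption: illustrative values only, not the kernel's definitions. */
#define SKETCH_EFSBADCRC    74
#define SKETCH_EFSCORRUPTED 117

/*
 * Decide whether a CRC-pass error should trigger the head/tail fixup.
 * Before the change only the bad-CRC error qualified; afterwards a
 * corruption error is treated the same way.
 */
bool needs_head_tail_fixup(int error)
{
    return error == -SKETCH_EFSBADCRC || error == -SKETCH_EFSCORRUPTED;
}
```

Any other error value still propagates to the caller unchanged.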
2017-08-22 | xfs: add log item pinning error injection tag | Brian Foster | 3 | -2/+22
Add an error injection tag to force log items in the AIL to the pinned state. This option can be used by test infrastructure to induce head behind tail conditions. Specifically, this is intended to be used by xfstests to reproduce log recovery problems after failed/corrupted log writes overwrite the last good tail LSN in the log. When enabled, AIL push attempts see log items in the AIL in the pinned state. This stalls metadata writeback and thus prevents the current tail of the log from moving forward. When disabled, subsequent AIL pushes observe the log items in their appropriate state and filesystem operation continues as normal. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 | xfs: fix log recovery corruption error due to tail overwrite | Brian Foster | 1 | -31/+77
If we consider the case where the tail (T) of the log is pinned long enough for the head (H) to push and block behind the tail, we can end up blocked in the following state without enough free space (f) in the log to satisfy a transaction reservation:

    0       phys. log       N
    [-------HffT---H'--T'---]

The last good record in the log (before H) refers to T. The tail eventually pushes forward (T') leaving more free space in the log for writes to H. At this point, suppose space frees up in the log for the maximum of 8 in-core log buffers to start flushing out to the log. If this pushes the head from H to H', these next writes overwrite the previous tail T. This is safe because the items logged from T to T' have been written back and removed from the AIL.

If the next log writes (H -> H') happen to fail and result in partial records in the log, the filesystem shuts down having overwritten T with invalid data. Log recovery correctly locates H on the subsequent mount, but H still refers to the now corrupted tail T. This results in log corruption errors and recovery failure.

Since the tail overwrite results from otherwise correct runtime behavior, it is up to log recovery to try and deal with this situation. Update log recovery tail verification to run a CRC pass from the first record past the tail to the head. This facilitates error detection at T and moves the recovery tail to the first good record past H' (similar to truncating the head on torn write detection). If corruption is detected beyond the range possibly affected by the max number of iclogs, the log is legitimately corrupted and log recovery failure is expected. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 | xfs: always verify the log tail during recovery | Brian Foster | 1 | -23/+3
Log tail verification currently only occurs when torn writes are detected at the head of the log. This was introduced because a change in the head block due to torn writes can lead to a change in the tail block (each log record header references the current tail) and the tail block should be verified before log recovery proceeds. Tail corruption is possible outside of torn write scenarios, however. For example, partial log writes can be detected and cleared during the initial head/tail block discovery process. If the partial write coincides with a tail overwrite, the log tail is corrupted and recovery fails. To facilitate correct handling of log tail overwrites, update log recovery to always perform tail verification. This is necessary to detect potential tail overwrite conditions when torn writes may not have occurred. This changes normal (i.e., no torn writes) recovery behavior slightly to detect and return CRC related errors near the tail before actual recovery starts. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 | xfs: fix recovery failure when log record header wraps log end | Brian Foster | 1 | -4/+14
The high-level log recovery algorithm consists of two loops that walk the physical log and process log records from the tail to the head. The first loop handles the case where the tail is beyond the head and processes records up to the end of the physical log. The subsequent loop processes records from the beginning of the physical log to the head. Because log records can wrap around the end of the physical log, the first loop mentioned above must handle this case appropriately. Records are processed from in-core buffers, which means that this algorithm must split the reads of such records into two partial I/Os: 1.) from the beginning of the record to the end of the log and 2.) from the beginning of the log to the end of the record. This is further complicated by the fact that the log record header and log record data are read into independent buffers. The current handling of each buffer correctly splits the reads when either the header or data starts before the end of the log and wraps around the end. The data read does not correctly handle the case where the prior header read wrapped or ends on the physical log end boundary. blk_no is incremented to or beyond the log end after the header read to point to the record data, but the split data read logic triggers, attempts to read from an invalid log block and ultimately causes log recovery to fail. This can be reproduced fairly reliably via xfstests tests generic/047 and generic/388 with large iclog sizes (256k) and small (10M) logs. If the record header read has pushed beyond the end of the physical log, the subsequent data read is actually contiguous. Update the data read logic to detect the case where blk_no has wrapped, mod it against the log size to read from the correct address and issue one contiguous read for the log data buffer. 
The log record is processed as normal from the buffer(s), the loop exits after the current iteration and the subsequent loop picks up with the first new record after the start of the log. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
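The split-read logic described in this commit can be modeled with a few lines of arithmetic. This is a sketch under stated assumptions, not the code in xlog_do_recovery_pass(): the struct, the function name and the block-granular interface are hypothetical, and it assumes log_size > 0 and count <= log_size.

```c
#include <stdint.h>

/* One contiguous read request against the physical log (hypothetical type). */
struct log_read {
    uint32_t start; /* first block to read */
    uint32_t count; /* number of blocks, 0 if the slot is unused */
};

/*
 * Plan the reads for 'count' blocks starting at 'blk_no' in a circular
 * log of 'log_size' blocks. blk_no may be >= log_size when an earlier
 * header read already walked past the physical end; it is then reduced
 * modulo log_size and the read is contiguous.
 * Returns the number of reads stored in out[] (1 or 2).
 */
int plan_log_read(uint32_t blk_no, uint32_t count, uint32_t log_size,
                  struct log_read out[2])
{
    uint32_t start = blk_no % log_size;
    uint32_t to_end = log_size - start;

    out[1].start = 0;
    out[1].count = 0;
    out[0].start = start;

    if (count <= to_end) {
        out[0].count = count;          /* fits before the physical end */
        return 1;
    }
    out[0].count = to_end;             /* part 1: up to the end of the log */
    out[1].count = count - to_end;     /* part 2: from the start of the log */
    return 2;
}
```

With a 100-block log, a read at block 98 for 4 blocks is split in two, while a read whose starting block has already advanced to 100 or 101 maps back to block 0 or 1 and stays contiguous, which is the case the fix addresses.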
2017-08-22 | xfs: Properly retry failed inode items in case of error during buffer writeback | Carlos Maiolino | 6 | -5/+108
When a buffer fails during writeback, the inode items in it are kept flush locked and are never resubmitted because of the flush lock. So if any buffer fails to be written, the items in the AIL are never written to disk and never unlocked. This causes the unmount operation to hang because of these flush-locked items in the AIL, and it also means the items in the AIL are never written back, even when the I/O device returns to normal. I've been testing this patch with a DM-thin device, creating a filesystem larger than the real device. When writing enough data to fill the DM-thin device, XFS receives ENOSPC errors from the device and keeps spinning on xfsaild (when the 'retry forever' configuration is set). At this point, the filesystem cannot be unmounted because of the flush-locked items in the AIL, but worse, the items in the AIL are never retried at all (since xfs_inode_item_push() skips items that are flush locked), even if the underlying DM-thin device is expanded to the proper size. This patch fixes both cases by retrying any item that previously failed, using the infrastructure provided by the previous patch. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 | xfs: Add infrastructure needed for error propagation during buffer IO failure | Carlos Maiolino | 2 | -3/+36
With the current code, XFS never resubmits a failed buffer for I/O, because the failed item in the buffer is kept in the flush locked state forever. To be able to resubmit a log item for I/O, we need a way to mark an item as failed if, for any reason, the buffer that the item belonged to failed during writeback. Add a new log item callback to be used after an I/O completion failure and make the needed cleanups. Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 | xfs: toggle readonly state around xfs_log_mount_finish | Eric Sandeen | 1 | -0/+7
When we do log recovery on a readonly mount, unlinked inode processing does not happen due to the readonly checks in xfs_inactive(), which are trying to prevent any I/O on a readonly mount. This is misguided - we do I/O on readonly mounts all the time, for consistency; for example, log recovery. So do the same RDONLY flag twiddling around xfs_log_mount_finish() as we do around xfs_log_mount(), for the same reason. This all cries out for a big rework but for now this is a simple fix to an obvious problem. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-22 | xfs: write unmount record for ro mounts | Eric Sandeen | 1 | -2/+5
There are dueling comments in the xfs code about intent for log writes when unmounting a readonly filesystem. In xfs_mountfs, we see the intent:

    /*
     * Now the log is fully replayed, we can transition to full read-only
     * mode for read-only mounts. This will sync all the metadata and clean
     * the log so that the recovery we just performed does not have to be
     * replayed again on the next mount.
     */

and it calls xfs_quiesce_attr(), but by the time we get to xfs_log_unmount_write(), it returns early for a RDONLY mount:

     * Don't write out unmount record on read-only mounts.

Because of this, sequential ro mounts of a filesystem with a dirty log will replay the log each time, which seems odd. Fix this by writing an unmount record even for RO mounts, as long as norecovery wasn't specified (don't write a clean log record if a dirty log may still be there!) and the log device is writable. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-17 | xfs: don't leak quotacheck dquots when cow recovery | Darrick J. Wong | 1 | -0/+2
If we fail a mount on account of cow recovery errors, it's possible that a previous quotacheck left some dquots in memory. The bailout clause of xfs_mountfs forgets to purge these, and so we leak them. Fix that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-08-17 | xfs: clear MS_ACTIVE after finishing log recovery | Darrick J. Wong | 2 | -10/+11
Way back when we established inode block-map redo log items, it was discovered that we needed to prevent the VFS from evicting inodes during log recovery because any given inode might have bmap redo items to replay even if the inode has no link count and is ultimately deleted, and any eviction of an unlinked inode causes the inode to be truncated and freed too early. To make this possible, we set MS_ACTIVE so that inodes would not be torn down immediately upon release. Unfortunately, this also results in the quota inodes not being released at all if a later part of the mount process should fail, because we never reclaim the inodes. So, set MS_ACTIVE right before we do the last part of log recovery and clear it immediately after we finish the log recovery so that everything will be torn down properly if we abort the mount. Fixes: 17c12bcd30 ("xfs: when replaying bmap operations, don't let unlinked inodes get reaped") Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-08-11 | xfs: fix inobt inode allocation search optimization | Omar Sandoval | 1 | -1/+1
When we try to allocate a free inode by searching the inobt, we try to find the inode nearest the parent inode by searching chunks both left and right of the chunk containing the parent. As an optimization, we cache the leftmost and rightmost records that we previously searched; if we do another allocation with the same parent inode, we'll pick up the search where it last left off. There's a bug in the case where we found a free inode to the left of the parent's chunk: we need to update the cached left and right records, but because we already reassigned the right record to point to the left, we end up assigning the left record to both the cached left and right records. This isn't a correctness problem strictly, but it can result in the next allocation rechecking chunks unnecessarily or allocating inodes further away from the parent than it needs to. Fix it by swapping the record pointer after we update the cached left and right records. Fixes: bd169565993b ("xfs: speed up free inode search") Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
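The ordering bug is easy to reproduce in a toy model. The sketch below is not the kernel code: the record and cache types and both function names are hypothetical, and "reassigning the right record to point to the left" is modeled as a plain pointer assignment.

```c
/* Hypothetical stand-ins for an inobt record and the per-AG search cache. */
struct rec {
    int startino;
};

struct search_cache {
    int left;
    int right;
};

/* Buggy order: the record pointer is reassigned before the cache update. */
const struct rec *pick_buggy(struct search_cache *c, const struct rec *left,
                             const struct rec *right, int found_left)
{
    if (found_left)
        right = left;            /* 'right' now aliases the left record */
    c->left = left->startino;
    c->right = right->startino;  /* caches the left record twice */
    return right;
}

/* Fixed order: update the cache first, then select the record to use. */
const struct rec *pick_fixed(struct search_cache *c, const struct rec *left,
                             const struct rec *right, int found_left)
{
    c->left = left->startino;
    c->right = right->startino;
    return found_left ? left : right;
}
```

Both variants return the same record for the allocation, which matches the commit's remark that this is not strictly a correctness problem; only the cached right boundary differs.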
2017-08-04 | xfs: Fix per-inode DAX flag inheritance | Lukas Czerner | 1 | -5/+7
According to the commit that implemented per-inode DAX flag: commit 58f88ca2df72 ("xfs: introduce per-inode DAX enablement") the flag is supposed to act as "inherit flag". Currently this only works in the situations where parent directory already has a flag in di_flags set, otherwise inheritance does not work. This is because setting the XFS_DIFLAG2_DAX flag is done in a wrong branch designated for di_flags, not di_flags2. Fix this by moving the code to branch designated for setting di_flags2, which does test for flags in di_flags2. Fixes: 58f88ca2df72 ("xfs: introduce per-inode DAX enablement") Signed-off-by: Lukas Czerner <lczerner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-08-04 | xfs: Fix leak of discard bio | Jan Kara | 1 | -0/+1
The bio describing discard operation is allocated by __blkdev_issue_discard() which returns us a reference to it. That reference is never released and thus we leak this bio. Drop the bio reference once it completes in xlog_discard_endio(). CC: stable@vger.kernel.org Fixes: 4560e78f40cb55bd2ea8f1ef4001c5baa88531c7 Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-26 | xfs: fix multi-AG deadlock in xfs_bunmapi | Christoph Hellwig | 1 | -0/+12
Just like in the allocator we must avoid touching multiple AGs out of order when freeing blocks, as freeing still locks the AGF and can cause the same AB-BA deadlocks as in the allocation path. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-25 | xfs: check that dir block entries don't off the end of the buffer | Darrick J. Wong | 1 | -0/+4
When we're checking the entries in a directory buffer, make sure that the entry length doesn't push us off the end of the buffer. Found via xfs/388 writing ones to the length fields. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
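A bounds check of this kind can be sketched as follows. The entry layout is an assumption made for illustration (a single leading byte giving the total entry length), not the XFS directory block format, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk variable-length entries packed into buf. In this sketch each
 * entry starts with one byte holding the entry's total length,
 * including that byte. Returns true only if every entry ends at or
 * before the end of the buffer.
 */
bool entries_fit(const uint8_t *buf, size_t buflen)
{
    size_t off = 0;

    while (off < buflen) {
        size_t len = buf[off];

        if (len == 0)
            return false;   /* zero length would never advance */
        if (len > buflen - off)
            return false;   /* entry would run off the end */
        off += len;
    }
    return true;
}
```

Comparing len against the remaining space (buflen - off) rather than computing off + len avoids any overflow in the check itself.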
2017-07-24 | xfs: fix quotacheck dquot id overflow infinite loop | Brian Foster | 1 | -0/+3
If a dquot has an id of U32_MAX, the next lookup index increment overflows the uint32_t back to 0. This starts the lookup sequence over from the beginning, repeats indefinitely and results in a livelock. Update xfs_qm_dquot_walk() to explicitly check for the lookup overflow and exit the loop. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
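The wraparound is straightforward to demonstrate outside the kernel. The following is a simplified model of an index-driven walk, not xfs_qm_dquot_walk() itself: the sorted array, the linear lookup helper and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the position of the first id >= key in sorted ids[], or n if none. */
static size_t lookup_ge(const uint32_t *ids, size_t n, uint32_t key)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (ids[i] >= key)
            return i;
    return n;
}

/* Visit every id once and return how many were visited. */
size_t walk_ids(const uint32_t *ids, size_t n)
{
    uint32_t next_index = 0;
    size_t visited = 0;

    for (;;) {
        size_t pos = lookup_ge(ids, n, next_index);

        if (pos == n)
            break;
        visited++;
        next_index = ids[pos] + 1;
        if (next_index == 0)    /* id was UINT32_MAX: the index wrapped */
            break;
    }
    return visited;
}
```

Without the `next_index == 0` check, an id of UINT32_MAX restarts the lookup at 0 and the walk never terminates.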
2017-07-20 | xfs: check _alloc_read_agf buffer pointer before using | Darrick J. Wong | 2 | -0/+6
In some circumstances, _alloc_read_agf can return an error code of zero but also a null AGF buffer pointer. Check for this and jump out. Fixes-coverity-id: 1415250 Fixes-coverity-id: 1415320 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-20 | xfs: set firstfsb to NULLFSBLOCK before feeding it to _bmapi_write | Darrick J. Wong | 2 | -1/+10
We must initialize the firstfsb parameter to _bmapi_write so that it doesn't incorrectly treat stack garbage as a restriction on which AGs it can search for free space. Fixes-coverity-id: 1402025 Fixes-coverity-id: 1415167 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-20 | xfs: check _btree_check_block value | Darrick J. Wong | 1 | -2/+4
Check the _btree_check_block return value for the firstrec and lastrec functions, since we have the ability to signal that the repositioning did not succeed. Fixes-coverity-id: 114067 Fixes-coverity-id: 114068 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
2017-07-17 | VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) | David Howells | 2 | -6/+6
Firstly by applying the following with coccinelle's spatch:

    @@ expression SB; @@
    -SB->s_flags & MS_RDONLY
    +sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

    @@ expression A, SB; @@
    (
    -(!sb_rdonly(SB)) && A
    +!sb_rdonly(SB) && A
    |
    -A != (sb_rdonly(SB))
    +A != sb_rdonly(SB)
    |
    -A == (sb_rdonly(SB))
    +A == sb_rdonly(SB)
    |
    -!(sb_rdonly(SB))
    +!sb_rdonly(SB)
    |
    -A && (sb_rdonly(SB))
    +A && sb_rdonly(SB)
    |
    -A || (sb_rdonly(SB))
    +A || sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) != A
    +sb_rdonly(SB) != A
    |
    -(sb_rdonly(SB)) == A
    +sb_rdonly(SB) == A
    |
    -(sb_rdonly(SB)) && A
    +sb_rdonly(SB) && A
    |
    -(sb_rdonly(SB)) || A
    +sb_rdonly(SB) || A
    )

    @@ expression A, B, SB; @@
    (
    -(sb_rdonly(SB)) ? 1 : 0
    +sb_rdonly(SB)
    |
    -(sb_rdonly(SB)) ? A : B
    +sb_rdonly(SB) ? A : B
    )

to remove left over excess bracketage and finally by applying:

    @@ expression A, SB; @@
    (
    -(A & MS_RDONLY) != sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) != sb_rdonly(SB)
    |
    -(A & MS_RDONLY) == sb_rdonly(SB)
    +(bool)(A & MS_RDONLY) == sb_rdonly(SB)
    )

to make comparisons against the result of sb_rdonly() (which is a bool) work correctly. Signed-off-by: David Howells <dhowells@redhat.com>
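Why the final (bool) cast matters is easiest to see with a flag whose value is not 1. The sketch below uses a hypothetical flag SKETCH_RDONLY with value 0x4 and a hypothetical helper sb_is_rdonly() so that the mismatch becomes visible; it is an illustration of the type issue, not the kernel's sb_rdonly() definition.

```c
#include <stdbool.h>

#define SKETCH_RDONLY 0x4UL /* assumption: illustrative flag value */

struct sketch_sb {
    unsigned long s_flags;
};

/* Bool-returning helper in the style of sb_rdonly(). */
static bool sb_is_rdonly(const struct sketch_sb *sb)
{
    return sb->s_flags & SKETCH_RDONLY;
}

/* Compares a raw mask result (0 or 4) against a bool (0 or 1). */
bool rdonly_differs_naive(unsigned long flags, const struct sketch_sb *sb)
{
    return (flags & SKETCH_RDONLY) != sb_is_rdonly(sb);
}

/* Normalizes the mask result to bool before comparing. */
bool rdonly_differs(unsigned long flags, const struct sketch_sb *sb)
{
    return (bool)(flags & SKETCH_RDONLY) != sb_is_rdonly(sb);
}
```

With the flag set on both sides, the naive version compares 4 against 1 and reports a difference, while the cast version compares 1 against 1.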
2017-07-14 | Merge tag 'xfs-4.13-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux | Linus Torvalds | 5 | -13/+11
Pull XFS fixes from Darrick Wong: "Largely debugging and regression fixes.
- Add some locking assertions for the _ilock helpers.
- Revert the XFS_QMOPT_NOLOCK patch; after discussion with hch the online fsck patch that would have needed it has been redesigned and no longer needs it.
- Fix behavioral regression of SEEK_HOLE/DATA with negative offsets to match 4.12-era XFS behavior"
* tag 'xfs-4.13-merge-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  vfs: in iomap seek_{hole,data}, return -ENXIO for negative offsets
  Revert "xfs: grab dquots without taking the ilock"
  xfs: assert locking precondition in xfs_readlink_bmap_ilocked
  xfs: assert locking precondition in xfs_attr_list_int_ilocked
  xfs: fixup xfs_attr_get_ilocked
2017-07-13 | Revert "xfs: grab dquots without taking the ilock" | Christoph Hellwig | 2 | -12/+4
This reverts commit 50e0bdbe9f48f98bb02eac7030d682f4716884ae. The new XFS_QMOPT_NOLOCK isn't used at all, and conditional locking based on a flag is always the wrong thing to do - we should be having helpers that can be called without the lock instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13 | xfs: assert locking precondition in xfs_readlink_bmap_ilocked | Christoph Hellwig | 1 | -0/+2
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13 | xfs: assert locking precondition in xfs_attr_list_int_ilocked | Christoph Hellwig | 1 | -0/+2
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-13 | xfs: fixup xfs_attr_get_ilocked | Christoph Hellwig | 1 | -1/+3
The comment mentioned the wrong lock. Also add an ASSERT to assert this locking precondition. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-12 | xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL | Michal Hocko | 1 | -0/+10
KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently, so it relied on the default page allocator behavior for the given set of flags. This means that small allocations actually never failed. Now that we have the __GFP_RETRY_MAYFAIL flag, which works independently of the allocation request size, we can map KM_MAYFAIL to it. The allocator will try as hard as it can to fulfill the request but eventually fails if progress cannot be made. It does so without triggering the OOM killer, which can be seen as an improvement because KM_MAYFAIL users should be able to deal with allocation failures. Link: http://lkml.kernel.org/r/20170623085345.11304-4-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alex Belits <alex.belits@cavium.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: David Daney <david.daney@cavium.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: NeilBrown <neilb@suse.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
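The mapping itself is a one-line flag translation. A minimal sketch, assuming made-up flag values and a hypothetical function name (none of the constants below are the kernel's real values, and the real conversion helper handles more flags than this):

```c
/* Assumption: illustrative flag values, not the kernel's definitions. */
#define SKETCH_KM_MAYFAIL        0x0008u
#define SKETCH_GFP_BASE          0x0001u
#define SKETCH_GFP_RETRY_MAYFAIL 0x0100u

/*
 * Translate filesystem-level allocation flags into allocator flags.
 * A caller that said "this allocation may fail" gets the allocator
 * flag that retries hard but is allowed to give up.
 */
unsigned int sketch_flags_convert(unsigned int km_flags)
{
    unsigned int gfp = SKETCH_GFP_BASE;

    if (km_flags & SKETCH_KM_MAYFAIL)
        gfp |= SKETCH_GFP_RETRY_MAYFAIL;
    return gfp;
}
```

Callers that pass the may-fail flag must still check the allocation result, since the whole point of the mapping is that failure becomes possible for small requests too.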
2017-07-10 | Merge tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux | Linus Torvalds | 108 | -1855/+1892
Pull XFS updates from Darrick Wong: "Here are some changes for you for 4.13. For the most part it's fixes for bugs and deadlock problems, and preparation for online fsck in some future merge window.
- Avoid quotacheck deadlocks
- Fix transaction overflows when bunmapping fragmented files
- Refactor directory readahead
- Allow admin to configure if ASSERT is fatal
- Improve transaction usage detail logging during overflows
- Minor cleanups
- Don't leak log items when the log shuts down
- Remove double-underscore typedefs
- Various preparation for online scrubbing
- Introduce new error injection configuration sysfs knobs
- Refactor dq_get_next to use extent map directly
- Fix problems with iterating the page cache for unwritten data
- Implement SEEK_{HOLE,DATA} via iomap
- Refactor XFS to use iomap SEEK_HOLE and SEEK_DATA
- Don't use MAXPATHLEN to check on-disk symlink target lengths"
* tag 'xfs-4.13-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (48 commits)
  xfs: don't crash on unexpected holes in dir/attr btrees
  xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLEN
  xfs: fix contiguous dquot chunk iteration livelock
  xfs: Switch to iomap for SEEK_HOLE / SEEK_DATA
  vfs: Add iomap_seek_hole and iomap_seek_data helpers
  vfs: Add page_cache_seek_hole_data helper
  xfs: remove a whitespace-only line from xfs_fs_get_nextdqblk
  xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent
  xfs: Check for m_errortag initialization in xfs_errortag_test
  xfs: grab dquots without taking the ilock
  xfs: fix semicolon.cocci warnings
  xfs: Don't clear SGID when inheriting ACLs
  xfs: free cowblocks and retry on buffered write ENOSPC
  xfs: replace log_badcrc_factor knob with error injection tag
  xfs: convert drop_writes to use the errortag mechanism
  xfs: remove unneeded parameter from XFS_TEST_ERROR
  xfs: expose errortag knobs via sysfs
  xfs: make errortag a per-mountpoint structure
  xfs: free uncommitted transactions during log recovery
  xfs: don't allow bmap on rt files
  ...
2017-07-07 | Merge tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux | Linus Torvalds | 1 | -1/+1
Pull Writeback error handling updates from Jeff Layton: "This pile represents the bulk of the writeback error handling fixes that I have for this cycle. Some of the earlier patches in this pile may look trivial but they are prerequisites for later patches in the series. The aim of this set is to improve how we track and report writeback errors to userland. Most applications that care about data integrity will periodically call fsync/fdatasync/msync to ensure that their writes have made it to the backing store. For a very long time, we have tracked writeback errors using two flags in the address_space: AS_EIO and AS_ENOSPC. Those flags are set when a writeback error occurs (via mapping_set_error) and are cleared as a side-effect of filemap_check_errors (as you noted yesterday). This model really sucks for userland. Only the first task to call fsync (or msync or fdatasync) will see the error. Any subsequent task calling fsync on a file will get back 0 (unless another writeback error occurs in the interim). If I have several tasks writing to a file and calling fsync to ensure that their writes got stored, then I need to have them coordinate with one another. That's difficult enough, but in a world of containerized setups that coordination may even not be possible. But wait...it gets worse! The calls to filemap_check_errors can be buried pretty far down in the call stack, and there are internal callers of filemap_write_and_wait and the like that also end up clearing those errors. Many of those callers ignore the error return from that function or return it to userland at nonsensical times (e.g. truncate() or stat()). If I get back -EIO on a truncate, there is no reason to think that it was because some previous writeback failed, and a subsequent fsync() will (incorrectly) return 0. 
This pile aims to do three things: 1) ensure that when a writeback error occurs that that error will be reported to userland on a subsequent fsync/fdatasync/msync call, regardless of what internal callers are doing 2) report writeback errors on all file descriptions that were open at the time that the error occurred. This is a user-visible change, but I think most applications are written to assume this behavior anyway. Those that aren't are unlikely to be hurt by it. 3) document what filesystems should do when there is a writeback error. Today, there is very little consistency between them, and a lot of cargo-cult copying. We need to make it very clear what filesystems should do in this situation. To achieve this, the set adds a new data type (errseq_t) and then builds new writeback error tracking infrastructure around that. Once all of that is in place, we change the filesystems to use the new infrastructure for reporting wb errors to userland. Note that this is just the initial foray into cleaning up this mess. There is a lot of work remaining here: 1) convert the rest of the filesystems in a similar fashion. Once the initial set is in, then I think most other fs' will be fairly simple to convert. Hopefully most of those can go in via individual filesystem trees. 2) convert internal waiters on writeback to use errseq_t for detecting errors instead of relying on the AS_* flags. I have some draft patches for this for ext4, but they are not quite ready for prime time yet. This was a discussion topic this year at LSF/MM too. 
If you're interested in the gory details, LWN has some good articles about this: https://lwn.net/Articles/718734/ https://lwn.net/Articles/724307/" * tag 'for-linus-v4.13-2' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux: btrfs: minimal conversion to errseq_t writeback error reporting on fsync xfs: minimal conversion to errseq_t writeback error reporting ext4: use errseq_t based error handling for reporting data writeback errors fs: convert __generic_file_fsync to use errseq_t based reporting block: convert to errseq_t based writeback error tracking dax: set errors in mapping when writeback fails Documentation: flesh out the section in vfs.txt on storing and reporting writeback errors mm: set both AS_EIO/AS_ENOSPC and errseq_t in mapping_set_error fs: new infrastructure for writeback error handling and reporting lib: add errseq_t type and infrastructure for handling it mm: don't TestClearPageError in __filemap_fdatawait_range mm: clear AS_EIO/AS_ENOSPC when writeback initiation fails jbd2: don't clear and reset errors after waiting on writeback buffer: set errors in mapping at the time that the error occurs fs: check for writeback errors after syncing out buffers in generic_file_fsync buffer: use mapping_set_error instead of setting the flag mm: fix mapping_set_error call in me_pagecache_dirty
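The reporting model described in the pull request above (every file description open at the time of the error sees it once on its next fsync) can be sketched with a toy userspace model. This is only an illustration: the kernel's errseq_t packs its state into a single 32-bit value, and the names toy_errseq, toy_file, toy_set_error, toy_open and toy_fsync are hypothetical, not kernel APIs.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for the per-mapping error state (hypothetical layout). */
struct toy_errseq {
	int err;           /* most recent error, stored as a negative errno */
	unsigned int seq;  /* bumped each time an error is recorded */
};

/* Toy stand-in for the per-open-file cursor. */
struct toy_file {
	unsigned int seen; /* sequence value this file last observed */
};

/* Record a writeback error; every open file becomes "behind". */
void toy_set_error(struct toy_errseq *es, int err)
{
	assert(err < 0);
	es->err = err;
	es->seq++;
}

/* Sample the current sequence at open time. */
void toy_open(const struct toy_errseq *es, struct toy_file *f)
{
	f->seen = es->seq;
}

/* Report an error this file has not seen yet, then advance its cursor. */
int toy_fsync(const struct toy_errseq *es, struct toy_file *f)
{
	if (f->seen == es->seq)
		return 0;
	f->seen = es->seq;
	return es->err;
}
```

In this sketch, two files opened before a toy_set_error(&es, -EIO) call each get -EIO from their own first toy_fsync() and 0 afterwards, which is the behavior the old AS_EIO/AS_ENOSPC flags could not provide because the first caller cleared the flag for everyone.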
2017-07-07xfs: don't crash on unexpected holes in dir/attr btreesDarrick J. Wong4-5/+5
In quite a few places we call xfs_da_read_buf with a mappedbno that we don't control, then assume that the function passes back either an error code or a buffer pointer. Unfortunately, if mappedbno == -2 and bno maps to a hole, we get a return code of zero and a NULL buffer, which means that we crash if we actually try to use that buffer pointer. This happens immediately when we set the buffer type for transaction context. Therefore, check that we have no error code and a non-NULL bp before trying to use bp. This patch is a follow-up to an incomplete fix in 96a3aefb8ffde231 ("xfs: don't crash if reading a directory results in an unexpected hole"). Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
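The failure mode in the message above is a reader that reports success yet hands back a NULL buffer. A minimal userspace sketch of the guarded pattern follows; toy_buf, toy_read_buf and toy_read_and_tag are hypothetical stand-ins and do not reproduce the real xfs_da_read_buf signature.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer with a type tag, standing in for a kernel buffer. */
struct toy_buf {
	int type;
};

/*
 * Hypothetical reader: returns 0 on success, but leaves *bpp NULL
 * when the requested block maps to a hole.
 */
int toy_read_buf(int is_hole, struct toy_buf *backing, struct toy_buf **bpp)
{
	assert(bpp != NULL);
	*bpp = is_hole ? NULL : backing;
	return 0;
}

/* Tag the buffer only when there is no error AND a buffer came back. */
int toy_read_and_tag(int is_hole, struct toy_buf *backing, int type,
		     struct toy_buf **bpp)
{
	int error = toy_read_buf(is_hole, backing, bpp);

	if (!error && *bpp)
		(*bpp)->type = type;
	return error;
}
```

Checking only the error code would dereference NULL in the hole case; checking both conditions lets the caller see a zero return with a NULL buffer and decide what that means.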
2017-07-07xfs: rename MAXPATHLEN to XFS_SYMLINK_MAXLENDarrick J. Wong6-8/+8
XFS has a maximum symlink target length of 1024 bytes; this is a holdover from the Irix days. Unfortunately, the constant establishing this is 'MAXPATHLEN' and is /not/ the same as the Linux MAXPATHLEN, which is 4096. The kernel enforces its 1024 byte MAXPATHLEN on symlink targets, but xfsprogs picks up the (Linux) system 4096 byte MAXPATHLEN, which means that xfs_repair doesn't complain about oversized symlinks. Since this is an on-disk format constraint, put the define in the XFS namespace and move everything over to use the new name. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
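The mismatch described above comes down to two different limits hiding behind one name. The sketch below uses the two values given in the message; TOY_LINUX_MAXPATHLEN and toy_target_fits are hypothetical names, and treating a length equal to the limit as acceptable is an assumption of this sketch rather than something the message states.

```c
#include <assert.h>
#include <stddef.h>

#define XFS_SYMLINK_MAXLEN   1024 /* on-disk limit, per the commit message */
#define TOY_LINUX_MAXPATHLEN 4096 /* hypothetical name for the Linux value */

/* Return nonzero if a target of len bytes fits under the given limit. */
int toy_target_fits(size_t len, size_t limit)
{
	assert(limit > 0);
	return len > 0 && len <= limit; /* boundary handling is assumed */
}
```

A 2000-byte target passes a check against the 4096-byte value but fails against the 1024-byte one, which is why a userspace tool built with the system constant would not flag it.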
2017-07-06Merge branch 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpuLinus Torvalds1-2/+2
Pull percpu updates from Tejun Heo: "These are the percpu changes for the v4.13-rc1 merge window. There are a couple visibility related changes - tracepoints and allocator stats through debugfs, along with __ro_after_init markings and a cosmetic rename in percpu_counter. Please note that the simple O(#elements_in_the_chunk) area allocator used by percpu allocator is again showing scalability issues, primarily with bpf allocating and freeing large number of counters. Dennis is working on the replacement allocator and the percpu allocator will be seeing increased churns in the coming cycles" * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: percpu: fix static checker warnings in pcpu_destroy_chunk percpu: fix early calls for spinlock in pcpu_stats percpu: resolve err may not be initialized in pcpu_alloc percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch percpu: add tracepoint support for percpu memory percpu: expose statistics about percpu memory via debugfs percpu: migrate percpu data structures to internal header percpu: add missing lockdep_assert_held to func pcpu_free_area mark most percpu globals as __ro_after_init
2017-07-06xfs: minimal conversion to errseq_t writeback error reportingJeff Layton1-1/+1
Just check and advance the data errseq_t in struct file before returning from fsync on normal files. Internal filemap_* callers are left as-is. Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Layton <jlayton@redhat.com>
2017-07-05xfs: fix contiguous dquot chunk iteration livelockBrian Foster1-2/+7
The patch below updated xfs_dq_get_next_id() to use the XFS iext lookup helpers to locate the next quota id rather than to seek for data in the quota file. The updated code fails to correctly handle the case where the quota inode has contiguous chunks that are part of the same extent. In this case, the start block offset is calculated based on the next expected id, but the extent lookup returns the same start offset as for the previous chunk. This causes the returned id to go backwards and livelocks the quota iteration. This problem is reproduced intermittently by generic/232. To handle this case, check whether the startoff from the extent lookup is behind the startoff calculated from the next quota id. If so, bump up got.br_startoff to the specific file offset that is expected to hold the next dquot chunk. Fixes: bda250dbaf39 ("xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extent") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
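The arithmetic behind the livelock can be shown with a toy model. It assumes one dquot chunk per file block and a caller that has already looked up the extent covering the expected offset; toy_extent, toy_next_id and the apply_fix switch are hypothetical and only mirror the startoff comparison described in the message.

```c
#include <assert.h>

/* Hypothetical extent record: first file block and length in blocks. */
struct toy_extent {
	unsigned long startoff;
	unsigned long blockcount;
};

/*
 * Given the next expected id and the extent the lookup returned,
 * compute the id the iteration would resume from.
 */
unsigned long toy_next_id(unsigned long next_id, unsigned long ids_per_chunk,
			  struct toy_extent got, int apply_fix)
{
	unsigned long start;

	assert(ids_per_chunk > 0);
	/* File block expected to hold the chunk for next_id. */
	start = next_id / ids_per_chunk;

	/* The fix: never resume from before the expected offset. */
	if (apply_fix && got.startoff < start)
		got.startoff = start;

	return got.startoff * ids_per_chunk;
}
```

With 30 ids per chunk, an extent covering blocks 0 to 3 and next_id 30, the unfixed path returns id 0, so the iteration moves backwards and never terminates; with the bump it returns 30.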
2017-07-03Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds2-6/+6
Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Add the SYSTEM_SCHEDULING bootup state to move various scheduler debug checks earlier into the bootup. This turns silent and sporadically deadly bugs into nice, deterministic splats. Fix some of the splats that triggered. (Thomas Gleixner) - A round of restructuring and refactoring of the load-balancing and topology code (Peter Zijlstra) - Another round of consolidating ~20 years of incremental scheduler code history: this time in terms of wait-queue nomenclature. (I didn't get much feedback on these renaming patches, and we can still easily change any names I might have misplaced, so if anyone hates a new name, please holler and I'll fix it.) (Ingo Molnar) - sched/numa improvements, fixes and updates (Rik van Riel) - Another round of x86/tsc scheduler clock code improvements, in hope of making it more robust (Peter Zijlstra) - Improve NOHZ behavior (Frederic Weisbecker) - Deadline scheduler improvements and fixes (Luca Abeni, Daniel Bristot de Oliveira) - Simplify and optimize the topology setup code (Lauro Ramos Venancio) - Debloat and decouple scheduler code some more (Nicolas Pitre) - Simplify code by making better use of llist primitives (Byungchul Park) - ... 
plus other fixes and improvements" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits) sched/cputime: Refactor the cputime_adjust() code sched/debug: Expose the number of RT/DL tasks that can migrate sched/numa: Hide numa_wake_affine() from UP build sched/fair: Remove effective_load() sched/numa: Implement NUMA node level wake_affine() sched/fair: Simplify wake_affine() for the single socket case sched/numa: Override part of migrate_degrades_locality() when idle balancing sched/rt: Move RT related code from sched/core.c to sched/rt.c sched/deadline: Move DL related code from sched/core.c to sched/deadline.c sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled sched/fair: Spare idle load balancing on nohz_full CPUs nohz: Move idle balancer registration to the idle path sched/loadavg: Generalize "_idle" naming to "_nohz" sched/core: Drop the unused try_get_task_struct() helper function sched/fair: WARN() and refuse to set buddy when !se->on_rq sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h> sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h> ...
2017-07-03Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-blockLinus Torvalds5-12/+61
Pull core block/IO updates from Jens Axboe: "This is the main pull request for the block layer for 4.13. Not a huge round in terms of features, but there's a lot of churn related to some core cleanups. Note this depends on the UUID tree pull request, that Christoph already sent out. This pull request contains: - A series from Christoph, unifying the error/stats codes in the block layer. We now use blk_status_t everywhere, instead of using different schemes for different places. - Also from Christoph, some cleanups around request allocation and IO scheduler interactions in blk-mq. - And yet another series from Christoph, cleaning up how we handle and do bounce buffering in the block layer. - A blk-mq debugfs series from Bart, further improving on the support we have for exporting internal information to aid debugging IO hangs or stalls. - Also from Bart, a series that cleans up the request initialization differences across types of devices. - A series from Goldwyn Rodrigues, allowing the block layer to return failure if we will block and the user asked for non-blocking. - Patch from Hannes for supporting setting loop devices block size to that of the underlying device. - Two series of patches from Javier, fixing various issues with lightnvm, particular around pblk. - A series from me, adding support for write hints. This comes with NVMe support as well, so applications can help guide data placement on flash to improve performance, latencies, and write amplification. - A series from Ming, improving and hardening blk-mq support for stopping/starting and quiescing hardware queues. - Two pull requests for NVMe updates. Nothing major on the feature side, but lots of cleanups and bug fixes. From the usual crew. - A series from Neil Brown, greatly improving the bio rescue set support. Most notably, this kills the bio rescue work queues, if we don't really need them. 
- Lots of other little bug fixes that are all over the place" * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits) lightnvm: pblk: set line bitmap check under debug lightnvm: pblk: verify that cache read is still valid lightnvm: pblk: add initialization check lightnvm: pblk: remove target using async. I/Os lightnvm: pblk: use vmalloc for GC data buffer lightnvm: pblk: use right metadata buffer for recovery lightnvm: pblk: schedule if data is not ready lightnvm: pblk: remove unused return variable lightnvm: pblk: fix double-free on pblk init lightnvm: pblk: fix bad le64 assignations nvme: Makefile: remove dead build rule blk-mq: map all HWQ also in hyperthreaded system nvmet-rdma: register ib_client to not deadlock in device removal nvme_fc: fix error recovery on link down. nvmet_fc: fix crashes on bad opcodes nvme_fc: Fix crash when nvme controller connection fails. nvme_fc: replace ioabort msleep loop with completion nvme_fc: fix double calls to nvme_cleanup_cmd() nvme-fabrics: verify that a controller returns the correct NQN nvme: simplify nvme_dev_attrs_are_visible ...
2017-07-03Merge tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuidLinus Torvalds7-117/+16
Pull uuid subsystem from Christoph Hellwig: "This is the new uuid subsystem, in which Amir, Andy and I have started consolidating our uuid/guid helpers and improving the types used for them. Note that various other subsystems have pulled in this tree, so I'd like it to go in early. UUID/GUID summary: - introduce the new uuid_t/guid_t types that are going to replace the somewhat confusing uuid_be/uuid_le types and make the terminology fit the various specs, as well as the userspace libuuid library. (me, based on a previous version from Amir) - consolidated generic uuid/guid helper functions lifted from XFS and libnvdimm (Amir and me) - conversions to the new types and helpers (Amir, Andy and me)" * tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits) ACPI: hns_dsaf_acpi_dsm_guid can be static mmc: sdhci-pci: make guid intel_dsm_guid static uuid: Take const on input of uuid_is_null() and guid_is_null() thermal: int340x_thermal: fix compile after the UUID API switch thermal: int340x_thermal: Switch to use new generic UUID API acpi: always include uuid.h ACPI: Switch to use generic guid_t in acpi_evaluate_dsm() ACPI / extlog: Switch to use new generic UUID API ACPI / bus: Switch to use new generic UUID API ACPI / APEI: Switch to use new generic UUID API acpi, nfit: Switch to use new generic UUID API MAINTAINERS: add uuid entry tmpfs: generate random sb->s_uuid scsi_debug: switch to uuid_t nvme: switch to uuid_t sysctl: switch to use uuid_t partitions/ldm: switch to use uuid_t overlayfs: use uuid_t instead of uuid_be fs: switch ->s_uuid to uuid_t ima/policy: switch to use uuid_t ...
2017-07-02xfs: Switch to iomap for SEEK_HOLE / SEEK_DATAChristoph Hellwig2-364/+14
Switch to the iomap_seek_hole and iomap_seek_data helpers for implementing lseek SEEK_HOLE / SEEK_DATA, and remove all the code that isn't needed any more. Based on patches from Andreas Gruenbacher <agruenba@redhat.com>. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-01xfs: remove a whitespace-only line from xfs_fs_get_nextdqblkChristoph Hellwig1-1/+0
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-01xfs: rewrite xfs_dq_get_next_id using xfs_iext_lookup_extentChristoph Hellwig1-44/+22
This goes straight to a single lookup in the extent list and avoids a roundtrip through two layers that don't add any value for the simple quota file that just has data or holes and no page cache, delayed allocation, unwritten extent or COW fork (which btw, doesn't seem to be handled by the existing SEEK HOLE/DATA code). Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
2017-07-01xfs: Check for m_errortag initialization in xfs_errortag_testCarlos Maiolino1-0/+11
While adding error injection into IO completion, I noticed that xfs_errortag_test() lacks an initialization check, which makes the error injection mechanism unusable there. IO completion is executed a few times before the error injection mechanism is initialized, so to be safer, make xfs_errortag_test() check whether the errortag is properly initialized. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
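The guard described above amounts to refusing to inject anything until the per-mount tag table exists. A simplified userspace sketch, assuming a table that is NULL before setup and in which any nonzero entry always fires (the real mechanism is more involved); toy_mount and toy_errortag_test are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mount structure: the tag table is NULL until set up. */
struct toy_mount {
	const unsigned int *errortag;
};

/* Return true if an error should be injected for this tag. */
bool toy_errortag_test(const struct toy_mount *mp, size_t tag)
{
	assert(mp != NULL);

	/* Early callers may run before the table is initialized. */
	if (!mp->errortag)
		return false;

	/* Simplification: a nonzero entry means "always inject". */
	return mp->errortag[tag] != 0;
}
```

Returning false for an uninitialized table keeps early callers, such as IO completion during mount, on the normal path instead of dereferencing a NULL pointer.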
2017-06-27xfs: grab dquots without taking the ilockDarrick J. Wong2-4/+12
Add a new dqget flag that grabs the dquot without taking the ilock. This will be used by the scrubber (which will have already grabbed the ilock) to perform basic sanity checking of the quota data. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>