linux-dev - Linux kernel development work

Age	Commit message (Collapse)	Author	Files	Lines
2016-07-28	libceph: define new ceph_file_layout structure	Yan, Zheng	7	-45/+43
	Define new ceph_file_layout structure and rename old ceph_file_layout to ceph_file_layout_legacy. This is preparation for adding namespace to ceph_file_layout structure. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-07-28	libceph: add an ONSTACK initializer for oids	Ilya Dryomov	1	-1/+1
	An on-stack oid in ceph_ioctl_get_dataloc() is not initialized, resulting in a WARN and a NULL pointer dereference later on. We will have more of these on-stack in the future, so fix it with a convenience macro. Fixes: d30291b985d1 ("libceph: variable-sized ceph_object_id") Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-07-05	Use the right predicate in ->atomic_open() instances	Al Viro	1	-1/+1
	->atomic_open() can be given an in-lookup dentry or a negative one found in dcache. Use d_in_lookup() to tell one from another, rather than d_unhashed(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-07-01	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	Linus Torvalds	1	-7/+3
	Pull vfs fixes from Al Viro: "Tmpfs readdir throughput regression fix (this cycle) + some -stable fodder all over the place. One missing bit is Miklos' tonight locks.c fix - NFS folks had already grabbed that one by the time I woke up ;-)" [ The locks.c fix came through the nfsd tree just moments ago ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: namespace: update event counter when umounting a deleted dentry 9p: use file_dentry() ceph: fix d_obtain_alias() misuses lockless next_positive() libfs.c: new helper - next_positive() dcache_{readdir,dir_lseek}(): don't bother with nested ->d_lock
2016-06-24	ceph: fix d_obtain_alias() misuses	Al Viro	1	-7/+3
	on failure d_obtain_alias() will have done iput() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-06-10	vfs: make the string hashes salt the hash	Linus Torvalds	2	-3/+3
	We always mixed in the parent pointer into the dentry name hash, but we did it late at lookup time. It turns out that we can simplify that lookup-time action by salting the hash with the parent pointer early instead of late. A few other users of our string hashes also wanted to mix in their own pointers into the hash, and those are updated to use the same mechanism. Hash users that don't have any particular initial salt can just use the NULL pointer as a no-salt. Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: George Spelvin <linux@sciencehorizons.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
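	As a rough illustration of "salting early", the sketch below seeds a string hash with a caller-supplied pointer before any name bytes are mixed in. The function name sketch_salted_hash() and the FNV-style mixing step are assumptions made for this example; this is not the kernel's actual hash function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: seed the hash with the salt pointer (for a dentry
 * this would be the parent), then mix in the name bytes. Passing NULL
 * as the salt gives the "no-salt" case described above.
 * The mixing step is FNV-like and is NOT the kernel's hash.
 */
uint64_t sketch_salted_hash(const void *salt, const char *name, size_t len)
{
    uint64_t h = (uint64_t)(uintptr_t)salt;   /* salt goes in first */

    for (size_t i = 0; i < len; i++)
        h = (h ^ (unsigned char)name[i]) * 0x100000001b3ULL;

    return h;
}
```

	Because the salt is the starting state, the same name hashed under two different parents yields two different values, with no extra mixing step needed at lookup time.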
2016-06-01	ceph: use i_version to check validity of fscache	Yan, Zheng	1	-0/+3
	Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-06-01	ceph: improve fscache revalidation	Yan, Zheng	4	-83/+41
	There are several issues in fscache revalidation code. - In ceph_revalidate_work(), fscache_invalidate() is called when fscache_check_consistency() returns 0. This is completely wrong because 0 means cache is valid. - handle_cap_grant() calls ceph_queue_revalidate() if client already has CAP_FILE_CACHE. This code is confusing. Client should revalidate the cache each time it got CAP_FILE_CACHE anew. - In handle_cap_grant(), fscache_invalidate() is called if MDS revokes CAP_FILE_CACHE. This is inconsistent with the case where the inode gets evicted. In the latter case, the cache is not discarded. Client may use the cache when inode is reloaded. This patch moves the fscache revalidation into ceph_get_caps(). Client revalidates the cache after it gets CAP_FILE_CACHE. i_rdcache_gen should stay constant while CAP_FILE_CACHE is used. If i_fscache_gen is not equal to i_rdcache_gen, client needs to check cache's consistency. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-06-01	ceph: disable fscache when inode is opened for write	Yan, Zheng	4	-53/+52
	All other filesystems do not add dirty pages to fscache. They all disable fscache when inode is opened for write. Only ceph adds dirty pages to fscache, but the code is buggy. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-06-01	ceph: avoid unnecessary fscache invalidation/revlidation	Yan, Zheng	1	-6/+3
	ceph_fill_file_size() has already called ceph_fscache_invalidate() if it returns true. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-06-01	ceph: call __fscache_uncache_page() if readpages fails	Yan, Zheng	1	-1/+3
	If readpages fails, fscache needs to clean up its internal state. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-30	libceph: change ceph_osdmap_flag() to take osdc	Ilya Dryomov	1	-4/+4
	For the benefit of every single caller, take osdc instead of map. Also, now that osdc->osdmap can't ever be NULL, drop the check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-27	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	Linus Torvalds	1	-3/+4
	Pull vfs fixes from Al Viro: "Followups to the parallel lookup work: - update docs - restore killability of the places that used to take ->i_mutex killably now that we have down_write_killable() merged - Additionally, it turns out that I missed a prerequisite for security_d_instantiate() stuff - ->getxattr() wasn't the only thing that could be called before dentry is attached to inode; with smack we needed the same treatment applied to ->setxattr() as well" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: switch ->setxattr() to passing dentry and inode separately switch xattr_handler->set() to passing dentry and inode separately restore killability of old mutex_lock_killable(&inode->i_mutex) users add down_write_killable_nested() update D/f/directory-locking
2016-05-27	switch xattr_handler->set() to passing dentry and inode separately	Al Viro	1	-3/+4
	preparation for similar switch in ->setxattr() (see the next commit for rationale). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-26	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client	Linus Torvalds	14	-418/+773
	Pull Ceph updates from Sage Weil: "This changeset has a few main parts: - Ilya has finished a huge refactoring effort to sync up the client-side logic in libceph with the user-space client code, which has evolved significantly over the last couple years, with lots of additional behaviors (e.g., how requests are handled when cluster is full and transitions from full to non-full). This structure of the code is more closely aligned with userspace now such that it will be much easier to maintain going forward when behavior changes take place. There are some locking improvements bundled in as well. - Zheng adds multi-filesystem support (multiple namespaces within the same Ceph cluster) - Zheng has changed the readdir offsets and directory enumeration so that dentry offsets are hash-based and therefore stable across directory fragmentation events on the MDS. - Zheng has a smorgasbord of bug fixes across fs/ceph" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (71 commits) ceph: fix wake_up_session_cb() ceph: don't use truncate_pagecache() to invalidate read cache ceph: SetPageError() for writeback pages if writepages fails ceph: handle interrupted ceph_writepage() ceph: make ceph_update_writeable_page() uninterruptible libceph: make ceph_osdc_wait_request() uninterruptible ceph: handle -EAGAIN returned by ceph_update_writeable_page() ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM ceph: block non-fatal signals for fault/page_mkwrite ceph: make logical calculation functions return bool ceph: tolerate bad i_size for symlink inode ceph: improve fragtree change detection ceph: keep leaf frag when updating fragtree ceph: fix dir_auth check in ceph_fill_dirfrag() ceph: don't assume frag tree splits in mds reply are sorted ceph: fix inode reference leak ceph: using hash value to compose dentry offset ceph: don't forbid marking directory complete after forward seek ceph: record 'offset' for each entry of readdir result ceph: define 'end/complete' in readdir reply as bit flags ...
2016-05-26	ceph: fix wake_up_session_cb()	Yan, Zheng	1	-1/+1
	We should reset i_requested_max_size before waking the waiters. (A zero i_requested_max_size makes the waiter re-request the max size.) Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: don't use truncate_pagecache() to invalidate read cache	Yan, Zheng	2	-5/+7
	truncate_pagecache() drops dirty pages, so it's dangerous to use it to invalidate read cache. Besides, we shouldn't start invalidating read cache while there are buffer writers, because buffer writers may add dirty pages later. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: SetPageError() for writeback pages if writepages fails	Yan, Zheng	1	-1/+3
	Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: handle interrupted ceph_writepage()	Yan, Zheng	1	-4/+18
	writepage() can be interrupted when it's called by direct memory reclaimer (the direct memory reclaimer is killed). To avoid losing data, we redirty the page. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: make ceph_update_writeable_page() uninterruptible	Yan, Zheng	1	-1/+1
	ceph_update_writeable_page() is used by ceph_write_begin(). It breaks atomicity of write operation if it's interruptible. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: handle -EAGAIN returned by ceph_update_writeable_page()	Yan, Zheng	1	-13/+15
	When ceph_update_writeable_page() returns -EAGAIN, the caller should lock the page and call ceph_update_writeable_page() again. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: make fault/page_mkwrite return VM_FAULT_OOM for -ENOMEM	Yan, Zheng	1	-20/+17
	Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: block non-fatal signals for fault/page_mkwrite	Yan, Zheng	1	-27/+39
	Fault and page_mkwrite are supposed to be uninterruptible. But they call ceph functions that are interruptible. So they should block signals before calling functions that are interruptible. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: make logical calculation functions return bool	Zhang Zhuoyu	2	-2/+2
	This patch makes several logical calculation functions return bool to improve readability, since these particular functions only use 0/1 as their return value. No functional change. Signed-off-by: Zhang Zhuoyu <zhangzhuoyu@cmss.chinamobile.com>
2016-05-26	ceph: tolerate bad i_size for symlink inode	Yan, Zheng	1	-7/+15
	An MDS bug can cause symlink's size to be truncated to zero. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: improve fragtree change detection	Yan, Zheng	2	-4/+21
	check if number of splits in i_fragtree is equal to number of splits in mds reply Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: keep leaf frag when updating fragtree	Yan, Zheng	1	-5/+23
	Nodes in i_fragtree are sorted according to ceph_frag_compare(). It means a frag node in i_fragtree always follows its direct parent node. To check if a leaf node is valid, we just need to check whether it is a child of the previous split node. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: fix dir_auth check in ceph_fill_dirfrag()	Yan, Zheng	1	-0/+3
	-1 is CDIR_AUTH_PARENT, it means dir's auth mds is the same as inode's auth mds Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: don't assume frag tree splits in mds reply are sorted	Yan, Zheng	1	-0/+13
	The algorithm that updates i_fragtree relies on the frag tree splits in the MDS reply being in the same order as i_fragtree. This is not true because the current MDS encodes frag tree splits in ascending order of (unsigned)frag_t, but nodes in i_fragtree are sorted according to ceph_frag_compare(). The fix is to sort the frag tree splits first, then update i_fragtree. Signed-off-by: Yan, Zheng <zyan@redhat.com>
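	A minimal sketch of the "sort first, then update" step follows. It assumes a frag is a 32-bit value with the split-bit count in the top 8 bits and the frag value in the low 24 bits, and that the tree order compares the value first and the bit count second. The sketch_* names are hypothetical and the comparator is a stand-in, not a copy of ceph_frag_compare().

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed frag layout: top 8 bits = number of split bits, low 24 = value. */
static uint32_t sketch_frag_bits(uint32_t f)  { return f >> 24; }
static uint32_t sketch_frag_value(uint32_t f) { return f & 0xffffffu; }

/* Assumed tree order: by value first, then by bit count. */
static int sketch_frag_cmp(const void *pa, const void *pb)
{
    uint32_t a = *(const uint32_t *)pa;
    uint32_t b = *(const uint32_t *)pb;

    if (sketch_frag_value(a) != sketch_frag_value(b))
        return sketch_frag_value(a) < sketch_frag_value(b) ? -1 : 1;
    if (sketch_frag_bits(a) != sketch_frag_bits(b))
        return sketch_frag_bits(a) < sketch_frag_bits(b) ? -1 : 1;
    return 0;
}

/* Reorder splits from raw unsigned order into the assumed tree order. */
void sketch_sort_splits(uint32_t *splits, size_t n)
{
    qsort(splits, n, sizeof(*splits), sketch_frag_cmp);
}
```

	Under these assumptions, raw unsigned order is dominated by the bit count in the top byte, while the tree order is dominated by the value, which is why the two orders can disagree and a sort is needed before merging.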
2016-05-26	ceph: fix inode reference leak	Yan, Zheng	1	-1/+1
	Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: using hash value to compose dentry offset	Yan, Zheng	5	-47/+135
	If MDS sorts dentries in dirfrag in hash order, we use hash value to compose dentry offset. dentry offset is: (0xff << 52) | ((24 bits hash) << 28) | (the nth entry has hash collision) This offset is stable across directory fragmentation. This also means there is no need to reset readdir offset if directory gets fragmented in the middle of readdir. Signed-off-by: Yan, Zheng <zyan@redhat.com>
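	The offset layout described above can be sketched as plain bit packing. The helper names and the SKETCH_* constants are hypothetical and only mirror the formula in the commit message; they are not taken from the kernel source.

```c
#include <stdint.h>

#define SKETCH_NTH_BITS    28                     /* low bits: nth entry with this hash */
#define SKETCH_HASH_BITS   24                     /* middle bits: dentry name hash */
#define SKETCH_HASH_ORDER  ((uint64_t)0xff << 52) /* marker for hash-ordered offsets */

/* Compose (0xff << 52) | (hash << 28) | nth, masking each field to its width. */
uint64_t sketch_make_pos(uint32_t hash, uint32_t nth)
{
    uint64_t h = hash & ((1u << SKETCH_HASH_BITS) - 1);
    uint64_t n = nth & ((1u << SKETCH_NTH_BITS) - 1);

    return SKETCH_HASH_ORDER | (h << SKETCH_NTH_BITS) | n;
}

/* Recover the 24-bit hash from a composed offset. */
uint32_t sketch_pos_hash(uint64_t pos)
{
    return (uint32_t)(pos >> SKETCH_NTH_BITS) & ((1u << SKETCH_HASH_BITS) - 1);
}

/* Recover the collision index from a composed offset. */
uint32_t sketch_pos_nth(uint64_t pos)
{
    return (uint32_t)pos & ((1u << SKETCH_NTH_BITS) - 1);
}
```

	Since the offset depends only on the name hash and the collision index, and not on which fragment currently holds the entry, it stays the same when the directory is split or merged.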
2016-05-26	ceph: don't forbid marking directory complete after forward seek	Yan, Zheng	1	-5/+0
	Forward seek within same frag does not update fi->last_name, so it will not affect contents of later readdir reply. There is no need to forbid marking directory complete. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: record 'offset' for each entry of readdir result	Yan, Zheng	5	-29/+59
	This is preparation for using hash value as dentry 'offset' Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: define 'end/complete' in readdir reply as bit flags	Yan, Zheng	3	-3/+8
	Set a flag in readdir request, which indicates that client interprets 'end/complete' as bit flags. So that mds can reply additional flags in readdir reply. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: define struct for dir entry in readdir reply	Yan, Zheng	4	-52/+50
	This avoids defining multiple arrays for entries in readdir reply Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: simplify 'offset in frag'	Yan, Zheng	2	-13/+4
	don't distinguish leftmost frag from other frags. always use 2 as first entry's offset. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: remove unnecessary checks in __dcache_readdir	Yan, Zheng	1	-2/+0
	we never add snapdir and the hidden .ceph dir into readdir cache Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: search cache postion for dcache readdir	Yan, Zheng	1	-46/+83
	use binary search to find cache index that corresponds to readdir position. Signed-off-by: Yan, Zheng <zyan@redhat.com>
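	A minimal sketch of such a lookup, assuming the cached entries expose their readdir offsets as a sorted array: it returns the index of the first entry whose offset is at or past the requested position. The array representation and the name sketch_find_cache_index() are assumptions for illustration, not the structure used by __dcache_readdir().

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical lower-bound search over offsets sorted in ascending order.
 * Returns the first index i with offsets[i] >= pos, or count if every
 * cached offset is smaller than pos.
 */
size_t sketch_find_cache_index(const uint64_t *offsets, size_t count,
                               uint64_t pos)
{
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (offsets[mid] < pos)
            lo = mid + 1;   /* answer lies strictly to the right */
        else
            hi = mid;       /* mid may be the answer */
    }
    return lo;
}
```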
2016-05-26	ceph: use CEPH_MDS_OP_RMXATTR request to remove xattr	Yan, Zheng	1	-6/+11
	Setxattr with NULL value and XATTR_REPLACE flag should be equivalent to removexattr. But current MDS does not support deleting vxattrs through MDS_OP_SETXATTR request. The workaround is sending MDS_OP_RMXATTR request if setxattr actually removes xattr. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: report mount root in session metadata	Yan, Zheng	3	-15/+23
	Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: don't show symlink target in debugfs/mdsc	Yan, Zheng	1	-1/+1
	symlink target is useless for debug and can be very long. It's annoying to show it in debugfs/mdsc. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: don't call truncate_pagecache in ceph_writepages_start	Yan, Zheng	3	-9/+38
	truncate_pagecache() may decrease inode's reference. This can cause deadlock if inode's last reference is dropped and iput_final() wants to evict the inode. (evict() calls inode_wait_for_writeback(), which waits for ceph_writepages_start() to return). The fix is to use a work thread to truncate dirty pages. Also add 'forced umount' check to ceph_update_writeable_page(), which prevents new pages getting dirty. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: renew caps for read/write if mds session got killed.	Yan, Zheng	4	-11/+93
	When mds session gets killed, read/write operation may hang. Client waits for Frw caps, but mds does not know what caps client wants. To recover this, client sends an open request to mds. The request will tell mds what caps client wants. Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: CEPH_FEATURE_MDSENC support	Yan, Zheng	2	-12/+36
	Signed-off-by: Yan, Zheng <zyan@redhat.com>
2016-05-26	ceph: multiple filesystem support	Yan, Zheng	2	-0/+10
	To access non-default filesystem, we just need to subscribe to mdsmap.<MDS_NAMESPACE_ID> and add a new mount option for mds namespace id. Signed-off-by: Yan, Zheng <zyan@redhat.com> [idryomov@gmail.com: switch to a new libceph API] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26	libceph: a major OSD client update	Ilya Dryomov	2	-8/+8
	This is a major sync up, up to ~Jewel. The highlights are: - per-session request trees (vs a global per-client tree) - per-session locking (vs a global per-client rwlock) - homeless OSD session - no ad-hoc global per-client lists - support for pool quotas - foundation for watch/notify v2 support - foundation for map check (pool deletion detection) support The switchover is incomplete: lingering requests can be set up and torn down but aren't ever reestablished. This functionality is restored with the introduction of the new lingering infrastructure (ceph_osd_linger_request, linger_work, etc) in a later commit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26	libceph: redo callbacks and factor out MOSDOpReply decoding	Ilya Dryomov	2	-2/+3
	If you specify ACK | ONDISK and set ->r_unsafe_callback, both ->r_callback and ->r_unsafe_callback(true) are called on ack. This is very confusing. Redo this so that only one of them is called: ->r_unsafe_callback(true), on ack ->r_unsafe_callback(false), on commit or ->r_callback, on ack|commit Decode everything in decode_MOSDOpReply() to reduce clutter. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26	libceph: drop msg argument from ceph_osdc_callback_t	Ilya Dryomov	2	-9/+7
	finish_read(), its only user, uses it to get to hdr.data_len, which is what ->r_result is set to on success. This gains us the ability to safely call callbacks from contexts other than reply, e.g. map check. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26	libceph: switch to calc_target(), part 2	Ilya Dryomov	2	-23/+9
	The crux of this is getting rid of ceph_osdc_build_request(), so that MOSDOp can be encoded not before but after calc_target() calculates the actual target. Encoding now happens within ceph_osdc_start_request(). Also nuked is the accompanying bunch of pointers into the encoded buffer that was used to update fields on each send - instead, the entire front is re-encoded. If we want to support target->name_len != base->name_len in the future, there is no other way, because oid is surrounded by other fields in the encoded buffer. Encoding OSD ops and adding data items to the request message were mixed together in osd_req_encode_op(). While we want to re-encode OSD ops, we don't want to add duplicate data items to the message when resending, so all calls to ceph_osdc_msg_data_add() are factored out into a new setup_request_data(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-05-26	libceph: introduce ceph_osd_request_target, calc_target()	Ilya Dryomov	2	-2/+2
	Introduce ceph_osd_request_target, containing all mapping-related fields of ceph_osd_request and calc_target() for calculating mappings and populating it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>