aboutsummaryrefslogtreecommitdiffstats
path: root/net/ceph (follow)
AgeCommit message (Collapse)AuthorFilesLines
2016-04-04mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macrosKirill A. Shutemov3-20/+20
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause to much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward: - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>; - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN}; - page_cache_get() -> get_page(); - page_cache_release() -> put_page(); This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CAHCE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch. virtual patch @@ expression E; @@ - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ expression E; @@ - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) + E @@ @@ - PAGE_CACHE_SHIFT + PAGE_SHIFT @@ @@ - PAGE_CACHE_SIZE + PAGE_SIZE @@ @@ - PAGE_CACHE_MASK + PAGE_MASK @@ expression E; @@ - PAGE_CACHE_ALIGN(E) + PAGE_ALIGN(E) @@ expression E; @@ - page_cache_get(E) + get_page(E) @@ expression E; @@ - page_cache_release(E) + put_page(E) Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-26Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-clientLinus Torvalds5-272/+344
Pull Ceph updates from Sage Weil: "There is quite a bit here, including some overdue refactoring and cleanup on the mon_client and osd_client code from Ilya, scattered writeback support for CephFS and a pile of bug fixes from Zheng, and a few random cleanups and fixes from others" [ I already decided not to pull this because of it having been rebased recently, but ended up changing my mind after all. Next time I'll really hold people to it. Oh well. - Linus ] * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: (34 commits) libceph: use KMEM_CACHE macro ceph: use kmem_cache_zalloc rbd: use KMEM_CACHE macro ceph: use lookup request to revalidate dentry ceph: kill ceph_get_dentry_parent_inode() ceph: fix security xattr deadlock ceph: don't request vxattrs from MDS ceph: fix mounting same fs multiple times ceph: remove unnecessary NULL check ceph: avoid updating directory inode's i_size accidentally ceph: fix race during filling readdir cache libceph: use sizeof_footer() more ceph: kill ceph_empty_snapc ceph: fix a wrong comparison ceph: replace CURRENT_TIME by current_fs_time() ceph: scattered page writeback libceph: add helper that duplicates last extent operation libceph: enable large, variable-sized OSD requests libceph: osdc->req_mempool should be backed by a slab pool libceph: make r_request msg_size calculation clearer ...
2016-03-25libceph: use KMEM_CACHE macroGeliang Tang1-8/+2
Use KMEM_CACHE() instead of kmem_cache_create() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: use sizeof_footer() moreIlya Dryomov1-16/+3
Don't open-code sizeof_footer() in read_partial_message() and ceph_msg_revoke(). Also, after switching to sizeof_footer(), it's now possible to use con_out_kvec_add() in prepare_write_message_footer(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-03-25libceph: add helper that duplicates last extent operationYan, Zheng1-0/+22
This helper duplicates last extent operation in OSD request, then adjusts the new extent operation's offset and length. The helper is for scatterd page writeback, which adds nonconsecutive dirty pages to single OSD request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: enable large, variable-sized OSD requestsIlya Dryomov1-15/+28
Turn r_ops into a flexible array member to enable large, consisting of up to 16 ops, OSD requests. The use case is scattered writeback in cephfs and, as far as the kernel client is concerned, 16 is just a made up number. r_ops had size 3 for copyup+hint+write, but copyup is really a special case - it can only happen once. ceph_osd_request_cache is therefore stuffed with num_ops=2 requests, anything bigger than that is allocated with kmalloc(). req_mempool is backed by ceph_osd_request_cache, which means either num_ops=1 or num_ops=2 for use_mempool=true - all existing users (ceph_writepages_start(), ceph_osdc_writepages()) are fine with that. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: osdc->req_mempool should be backed by a slab poolIlya Dryomov1-2/+2
ceph_osd_request_cache was introduced a long time ago. Also, osd_req is about to get a flexible array member, which ceph_osd_request_cache is going to be aware of. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: make r_request msg_size calculation clearerIlya Dryomov1-10/+11
Although msg_size is calculated correctly, the terms are grouped in a misleading way - snaps appears to not have room for a u32 length. Move calculation closer to its use and regroup terms. No functional change. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: move r_reply_op_{len,result} into struct ceph_osd_req_opYan, Zheng1-2/+2
This avoids defining large array of r_reply_op_{len,result} in in struct ceph_osd_request. Signed-off-by: Yan, Zheng <zyan@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: rename ceph_osd_req_op::payload_len to indata_lenIlya Dryomov1-6/+6
Follow userspace nomenclature on this - the next commit adds outdata_len. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: behave in mon_fault() if cur_mon < 0Ilya Dryomov1-14/+9
This can happen if __close_session() in ceph_monc_stop() races with a connection reset. We need to ignore such faults, otherwise it's likely we would take !hunting, call __schedule_delayed() and end up with delayed_work() executing on invalid memory, among other things. The (two!) con->private tests are useless, as nothing ever clears con->private. Nuke them. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: reschedule tick in mon_fault()Ilya Dryomov1-4/+4
Doing __schedule_delayed() in the hunting branch is pointless, as the tick will have already been scheduled by then. What we need to do instead is *reschedule* it in the !hunting branch, after reopen_session() changes hunt_mult, which affects the delay. This helps with spacing out connection attempts and avoiding things like two back-to-back attempts followed by a longer period of waiting around. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: introduce and switch to reopen_session()Ilya Dryomov1-17/+16
hunting is now set in __open_session() and cleared in finish_hunting(), instead of all around. The "session lost" message is printed not only on connection resets, but also on keepalive timeouts. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: monc hunt rate is 3s with backoff up to 30sIlya Dryomov1-9/+16
Unless we are in the process of setting up a client (i.e. connecting to the monitor cluster for the first time), apply a backoff: every time we want to reopen a session, increase our timeout by a multiple (currently 2); when we complete the connection, reduce that multipler by 50%. Mirrors ceph.git commit 794c86fd289bd62a35ed14368fa096c46736e9a2. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: monc ping rate is 10sIlya Dryomov2-7/+2
Split ping interval and ping timeout: ping interval is 10s; keepalive timeout is 30s. Make monc_ping_timeout a constant while at it - it's not actually exported as a mount option (and the rest of tick-related settings won't be either), so it's got no place in ceph_options. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: pick a different monitor when reconnectingIlya Dryomov1-28/+57
Don't try to reconnect to the same monitor when we fail to establish a session within a timeout or it's lost. For that, pick_new_mon() needs to see the old value of cur_mon, so don't clear it in __close_session() - all calls to __close_session() but one are followed by __open_session() anyway. __open_session() is only called when a new session needs to be established, so the "already open?" branch, which is now in the way, is simply dropped. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: revamp subs code, switch to SUBSCRIBE2 protocolIlya Dryomov3-83/+147
It is currently hard-coded in the mon_client that mdsmap and monmap subs are continuous, while osdmap sub is always "onetime". To better handle full clusters/pools in the osd_client, we need to be able to issue continuous osdmap subs. Revamp subs code to allow us to specify for each sub whether it should be continuous or not. Although not strictly required for the above, switch to SUBSCRIBE2 protocol while at it, eliminating the ambiguity between a request for "every map since X" and a request for "just the latest" when we don't have a map yet (i.e. have epoch 0). SUBSCRIBE2 feature bit is now required - it's been supported since pre-argonaut (2010). Move "got mdsmap" call to the end of ceph_mdsc_handle_map() - calling in before we validate the epoch and successfully install the new map can mess up mon_client sub state. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: decouple hunting and subs managementIlya Dryomov1-9/+22
Coupling hunting state with subscribe state is not a good idea. Clear hunting when we complete the authentication handshake. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-25libceph: move debugfs initialization into __ceph_open_session()Ilya Dryomov2-51/+4
Our debugfs dir name is a concatenation of cluster fsid and client unique ID ("global_id"). It used to be the case that we learned global_id first, nowadays we always learn fsid first - the monmap is sent before any auth replies are. ceph_debugfs_client_init() call in ceph_monc_handle_map() is therefore never executed and can be removed. Its counterpart in handle_auth_reply() doesn't really belong there either: having to do monc->client and unlocking early to work around lockdep is a testament to that. Move it into __ceph_open_session(), where it can be called unconditionally. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-03-20Merge branch 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tipLinus Torvalds1-1/+1
Pull x86 protection key support from Ingo Molnar: "This tree adds support for a new memory protection hardware feature that is available in upcoming Intel CPUs: 'protection keys' (pkeys). There's a background article at LWN.net: https://lwn.net/Articles/643797/ The gist is that protection keys allow the encoding of user-controllable permission masks in the pte. So instead of having a fixed protection mask in the pte (which needs a system call to change and works on a per page basis), the user can map a (handful of) protection mask variants and can change the masks runtime relatively cheaply, without having to change every single page in the affected virtual memory range. This allows the dynamic switching of the protection bits of large amounts of virtual memory, via user-space instructions. It also allows more precise control of MMU permission bits: for example the executable bit is separate from the read bit (see more about that below). This tree adds the MM infrastructure and low level x86 glue needed for that, plus it adds a high level API to make use of protection keys - if a user-space application calls: mmap(..., PROT_EXEC); or mprotect(ptr, sz, PROT_EXEC); (note PROT_EXEC-only, without PROT_READ/WRITE), the kernel will notice this special case, and will set a special protection key on this memory range. It also sets the appropriate bits in the Protection Keys User Rights (PKRU) register so that the memory becomes unreadable and unwritable. So using protection keys the kernel is able to implement 'true' PROT_EXEC on x86 CPUs: without protection keys PROT_EXEC implies PROT_READ as well. Unreadable executable mappings have security advantages: they cannot be read via information leaks to figure out ASLR details, nor can they be scanned for ROP gadgets - and they cannot be used by exploits for data purposes either. We know about no user-space code that relies on pure PROT_EXEC mappings today, but binary loaders could start making use of this new feature to map binaries and libraries in a more secure fashion. There is other pending pkeys work that offers more high level system call APIs to manage protection keys - but those are not part of this pull request. Right now there's a Kconfig that controls this feature (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) that is default enabled (like most x86 CPU feature enablement code that has no runtime overhead), but it's not user-configurable at the moment. If there's any serious problem with this then we can make it configurable and/or flip the default" * 'mm-pkeys-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits) x86/mm/pkeys: Fix mismerge of protection keys CPUID bits mm/pkeys: Fix siginfo ABI breakage caused by new u64 field x86/mm/pkeys: Fix access_error() denial of writes to write-only VMA mm/core, x86/mm/pkeys: Add execute-only protection keys support x86/mm/pkeys: Create an x86 arch_calc_vm_prot_bits() for VMA flags x86/mm/pkeys: Allow kernel to modify user pkey rights register x86/fpu: Allow setting of XSAVE state x86/mm: Factor out LDT init from context init mm/core, x86/mm/pkeys: Add arch_validate_pkey() mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() x86/mm/pkeys: Actually enable Memory Protection Keys in the CPU x86/mm/pkeys: Add Kconfig prompt to existing config option x86/mm/pkeys: Dump pkey from VMA in /proc/pid/smaps x86/mm/pkeys: Dump PKRU with other kernel registers mm/core, x86/mm/pkeys: Differentiate instruction fetches x86/mm/pkeys: Optimize fault handling in access_error() mm/core: Do not enforce PKEY permissions on remote mm access um, pkeys: Add UML arch_*_access_permitted() methods mm/gup, x86/mm/pkeys: Check VMAs and PTEs for protection keys x86/mm/gup: Simplify get_user_pages() PTE bit handling ...
2016-03-17Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6Linus Torvalds1-45/+56
Pull crypto update from Herbert Xu: "Here is the crypto update for 4.6: API: - Convert remaining crypto_hash users to shash or ahash, also convert blkcipher/ablkcipher users to skcipher. - Remove crypto_hash interface. - Remove crypto_pcomp interface. - Add crypto engine for async cipher drivers. - Add akcipher documentation. - Add skcipher documentation. Algorithms: - Rename crypto/crc32 to avoid name clash with lib/crc32. - Fix bug in keywrap where we zero the wrong pointer. Drivers: - Support T5/M5, T7/M7 SPARC CPUs in n2 hwrng driver. - Add PIC32 hwrng driver. - Support BCM6368 in bcm63xx hwrng driver. - Pack structs for 32-bit compat users in qat. - Use crypto engine in omap-aes. - Add support for sama5d2x SoCs in atmel-sha. - Make atmel-sha available again. - Make sahara hashing available again. - Make ccp hashing available again. - Make sha1-mb available again. - Add support for multiple devices in ccp. - Improve DMA performance in caam. - Add hashing support to rockchip" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (116 commits) crypto: qat - remove redundant arbiter configuration crypto: ux500 - fix checks of error code returned by devm_ioremap_resource() crypto: atmel - fix checks of error code returned by devm_ioremap_resource() crypto: qat - Change the definition of icp_qat_uof_regtype hwrng: exynos - use __maybe_unused to hide pm functions crypto: ccp - Add abstraction for device-specific calls crypto: ccp - CCP versioning support crypto: ccp - Support for multiple CCPs crypto: ccp - Remove check for x86 family and model crypto: ccp - memset request context to zero during import lib/mpi: use "static inline" instead of "extern inline" lib/mpi: avoid assembler warning hwrng: bcm63xx - fix non device tree compatibility crypto: testmgr - allow rfc3686 aes-ctr variants in fips mode. crypto: qat - The AE id should be less than the maximal AE number lib/mpi: Endianness fix crypto: rockchip - add hash support for crypto engine in rk3288 crypto: xts - fix compile errors crypto: doc - add skcipher API documentation crypto: doc - update AEAD AD handling ...
2016-02-24libceph: don't spam dmesg with stray reply warningsIlya Dryomov1-2/+2
Commit d15f9d694b77 ("libceph: check data_len in ->alloc_msg()") mistakenly bumped the log level on the "tid %llu unknown, skipping" message. Turn it back into a dout() - stray replies are perfectly normal when OSDs flap, crash, get killed for testing purposes, etc. Cc: stable@vger.kernel.org # 4.3+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-02-24libceph: use the right footer size when skipping a messageIlya Dryomov1-2/+9
ceph_msg_footer is 21 bytes long, while ceph_msg_footer_old is only 13. Don't skip too much when CEPH_FEATURE_MSG_AUTH isn't negotiated. Cc: stable@vger.kernel.org # 3.19+ Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-02-24libceph: don't bail early from try_read() when skipping a messageIlya Dryomov1-2/+2
The contract between try_read() and try_write() is that when called each processes as much data as possible. When instructed by osd_client to skip a message, try_read() is violating this contract by returning after receiving and discarding a single message instead of checking for more. try_write() then gets a chance to write out more requests, generating more replies/skips for try_read() to handle, forcing the messenger into a starvation loop. Cc: stable@vger.kernel.org # 3.10+ Reported-by: Varada Kari <Varada.Kari@sandisk.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Varada Kari <Varada.Kari@sandisk.com> Reviewed-by: Alex Elder <elder@linaro.org>
2016-02-16mm/gup: Switch all callers of get_user_pages() to not pass tsk/mmDave Hansen1-1/+1
We will soon modify the vanilla get_user_pages() so it can no longer be used on mm/tasks other than 'current/current->mm', which is by far the most common way it is called. For now, we allow the old-style calls, but warn when they are used. (implemented in previous patch) This patch switches all callers of: get_user_pages() get_user_pages_unlocked() get_user_pages_locked() to stop passing tsk/mm so they will no longer see the warnings. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave@sr71.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: jack@suse.cz Cc: linux-mm@kvack.org Link: http://lkml.kernel.org/r/20160212210156.113E9407@viggo.jf.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-02-04libceph: MOSDOpReply v7 encodingIlya Dryomov1-0/+10
Empty request_redirect_t (struct ceph_request_redirect in the kernel client) is now encoded with a bool. NEW_OSDOPREPLY_ENCODING feature bit overlaps with already supported CRUSH_TUNABLES5. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04crush: decode and initialize chooseleaf_stableIlya Dryomov1-5/+14
Also add missing \n while at it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04crush: add chooseleaf_stable tunableIlya Dryomov1-4/+14
Add a tunable to fix the bug that chooseleaf may cause unnecessary pg migrations when some device fails. Reflects ceph.git commit fdb3f664448e80d984470f32f04e2e6f03ab52ec. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04crush: ensure take bucket value is validIlya Dryomov1-1/+2
Ensure that the take argument is a valid bucket ID before indexing the buckets array. Reflects ceph.git commit 93ec538e8a667699876b72459b8ad78966d89c61. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-02-04crush: ensure bucket id is valid before indexing buckets arrayIlya Dryomov1-2/+10
We were indexing the buckets array without verifying the index was within the [0,max_buckets) range. This could happen because a multistep rule does not have enough buckets and has CRUSH_ITEM_NONE for an intermediate result, which would feed in CRUSH_ITEM_NONE and make us crash. Reflects ceph.git commit 976a24a326da8931e689ee22fce35feab5b67b76. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-27libceph: Remove unnecessary ivsize variablesIlya Dryomov1-12/+8
This patch removes the unnecessary ivsize variabls as they always have the value of AES_BLOCK_SIZE. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-27libceph: Use skcipherHerbert Xu1-41/+56
This patch replaces uses of blkcipher with skcipher. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-01-21libceph: remove outdated commentIlya Dryomov1-4/+0
MClientMount{,Ack} are long gone. The receipt of bare monmap doesn't actually indicate a mount success as we are yet to authenticate at that point in time. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-01-21libceph: kill off ceph_x_ticket_handler::validityIlya Dryomov2-5/+2
With it gone, no need to preserve ceph_timespec in process_one_ticket() either. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21libceph: invalidate AUTH in addition to a service ticketIlya Dryomov1-2/+14
If we fault due to authentication, we invalidate the service ticket we have and request a new one - the idea being that if a service rejected our authorizer, it must have expired, despite mon_client's attempts at periodic renewal. (The other possibility is that our ticket is too new and the service hasn't gotten it yet, in which case invalidating isn't necessary but doesn't hurt.) Invalidating just the service ticket is not enough, though. If we assume a failure on mon_client's part to renew a service ticket, we have to assume the same for the AUTH ticket. If our AUTH ticket is bad, we won't get any service tickets no matter how hard we try, so invalidate AUTH ticket along with the service ticket. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21libceph: fix authorizer invalidation, take 2Ilya Dryomov2-5/+23
Back in 2013, commit 4b8e8b5d78b8 ("libceph: fix authorizer invalidation") tried to fix authorizer invalidation issues by clearing validity field. However, nothing ever consults this field, so it doesn't force us to request any new secrets in any way and therefore we never get out of the exponential backoff mode: [ 129.973812] libceph: osd2 192.168.122.1:6810 connect authorization failure [ 130.706785] libceph: osd2 192.168.122.1:6810 connect authorization failure [ 131.710088] libceph: osd2 192.168.122.1:6810 connect authorization failure [ 133.708321] libceph: osd2 192.168.122.1:6810 connect authorization failure [ 137.706598] libceph: osd2 192.168.122.1:6810 connect authorization failure ... AFAICT this was the case at the time 4b8e8b5d78b8 was merged, too. Using timespec solely as a bool isn't nice, so introduce a new have_key flag, specifically for this purpose. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21libceph: clear messenger auth_retry flag if we faultIlya Dryomov1-3/+7
Commit 20e55c4cc758 ("libceph: clear messenger auth_retry flag when we authenticate") got us only half way there. We clear the flag if the second attempt succeeds, but it also needs to be cleared if that attempt fails, to allow for the exponential backoff to kick in. Otherwise, if ->should_authenticate() thinks our keys are valid, we will busy loop, incrementing auth_retry to no avail: process_connect ffff880079a63830 got BADAUTHORIZER attempt 1 process_connect ffff880079a63830 got BADAUTHORIZER attempt 2 process_connect ffff880079a63830 got BADAUTHORIZER attempt 3 process_connect ffff880079a63830 got BADAUTHORIZER attempt 4 process_connect ffff880079a63830 got BADAUTHORIZER attempt 5 ... Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21libceph: fix ceph_msg_revoke()Ilya Dryomov1-18/+58
There are a number of problems with revoking a "was sending" message: (1) We never make any attempt to revoke data - only kvecs contibute to con->out_skip. However, once the header (envelope) is written to the socket, our peer learns data_len and sets itself to expect at least data_len bytes to follow front or front+middle. If ceph_msg_revoke() is called while the messenger is sending message's data portion, anything we send after that call is counted by the OSD towards the now revoked message's data portion. The effects vary, the most common one is the eventual hang - higher layers get stuck waiting for the reply to the message that was sent out after ceph_msg_revoke() returned and treated by the OSD as a bunch of data bytes. This is what Matt ran into. (2) Flat out zeroing con->out_kvec_bytes worth of bytes to handle kvecs is wrong. If ceph_msg_revoke() is called before the tag is sent out or while the messenger is sending the header, we will get a connection reset, either due to a bad tag (0 is not a valid tag) or a bad header CRC, which kind of defeats the purpose of revoke. Currently the kernel client refuses to work with header CRCs disabled, but that will likely change in the future, making this even worse. (3) con->out_skip is not reset on connection reset, leading to one or more spurious connection resets if we happen to get a real one between con->out_skip is set in ceph_msg_revoke() and before it's cleared in write_partial_skip(). Fixing (1) and (3) is trivial. The idea behind fixing (2) is to never zero the tag or the header, i.e. send out tag+header regardless of when ceph_msg_revoke() is called. That way the header is always correct, no unnecessary resets are induced and revoke stands ready for disabled CRCs. Since ceph_msg_revoke() rips out con->out_msg, introduce a new "message out temp" and copy the header into it before sending. Cc: stable@vger.kernel.org # 4.0+ Reported-by: Matt Conner <matt.conner@keepertech.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Tested-by: Matt Conner <matt.conner@keepertech.com> Reviewed-by: Sage Weil <sage@redhat.com>
2016-01-21libceph: use list_for_each_entry_safeGeliang Tang1-9/+3
Use list_for_each_entry_safe() instead of list_for_each_safe() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> [idryomov@gmail.com: nuke call to list_splice_init() as well] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2016-01-21libceph: use list_next_entry instead of list_entry_nextGeliang Tang1-5/+2
list_next_entry has been defined in list.h, so I replace list_entry_next with it. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-13Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-clientLinus Torvalds5-87/+93
Pull Ceph updates from Sage Weil: "There are several patches from Ilya fixing RBD allocation lifecycle issues, a series adding a nocephx_sign_messages option (and associated bug fixes/cleanups), several patches from Zheng improving the (directory) fsync behavior, a big improvement in IO for direct-io requests when striping is enabled from Caifeng, and several other small fixes and cleanups" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: libceph: clear msg->con in ceph_msg_release() only libceph: add nocephx_sign_messages option libceph: stop duplicating client fields in messenger libceph: drop authorizer check from cephx msg signing routines libceph: msg signing callouts don't need con argument libceph: evaluate osd_req_op_data() arguments only once ceph: make fsync() wait unsafe requests that created/modified inode ceph: add request to i_unsafe_dirops when getting unsafe reply libceph: introduce ceph_x_authorizer_cleanup() ceph: don't invalidate page cache when inode is no longer used rbd: remove duplicate calls to rbd_dev_mapping_clear() rbd: set device_type::release instead of device::release rbd: don't free rbd_dev outside of the release callback rbd: return -ENOMEM instead of pool id if rbd_dev_create() fails libceph: use local variable cursor instead of &msg->cursor libceph: remove con argument in handle_reply() ceph: combine as many iovec as possile into one OSD request ceph: fix message length computation ceph: fix a comment typo rbd: drop null test before destroy functions
2015-11-05Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-securityLinus Torvalds2-4/+4
Pull security subsystem update from James Morris: "This is mostly maintenance updates across the subsystem, with a notable update for TPM 2.0, and addition of Jarkko Sakkinen as a maintainer of that" * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (40 commits) apparmor: clarify CRYPTO dependency selinux: Use a kmem_cache for allocation struct file_security_struct selinux: ioctl_has_perm should be static selinux: use sprintf return value selinux: use kstrdup() in security_get_bools() selinux: use kmemdup in security_sid_to_context_core() selinux: remove pointless cast in selinux_inode_setsecurity() selinux: introduce security_context_str_to_sid selinux: do not check open perm on ftruncate call selinux: change CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE default KEYS: Merge the type-specific data with the payload data KEYS: Provide a script to extract a module signature KEYS: Provide a script to extract the sys cert list from a vmlinux file keys: Be more consistent in selection of union members used certs: add .gitignore to stop git nagging about x509_certificate_list KEYS: use kvfree() in add_key Smack: limited capability for changing process label TPM: remove unnecessary little endian conversion vTPM: support little endian guests char: Drop owner assignment from i2c_driver ...
2015-11-02libceph: clear msg->con in ceph_msg_release() onlyIlya Dryomov2-28/+20
The following bit in ceph_msg_revoke_incoming() is unsafe: struct ceph_connection *con = msg->con; if (!con) return; mutex_lock(&con->mutex); <more msg->con use> There is nothing preventing con from getting destroyed right after msg->con test. One easy way to reproduce this is to disable message signing only on the server side and try to map an image. The system will go into a libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature libceph: read_partial_message ffff880073f0ab68 signature check failed libceph: osd0 192.168.255.155:6801 bad crc/signature loop which has to be interrupted with Ctrl-C. Hit Ctrl-C and you are likely to end up with a random GP fault if the reset handler executes "within" ceph_msg_revoke_incoming(): <yet another reply w/o a signature> ... <Ctrl-C> rbd_obj_request_end ceph_osdc_cancel_request __unregister_request ceph_osdc_put_request ceph_msg_revoke_incoming ... osd_reset __kick_osd_requests __reset_osd remove_osd ceph_con_close reset_connection <clear con->in_msg->con> <put con ref> put_osd <free osd/con> <msg->con use> <-- !!! If ceph_msg_revoke_incoming() executes "before" the reset handler, osd/con will be leaked because ceph_msg_revoke_incoming() clears con->in_msg but doesn't put con ref, while reset_connection() only puts con ref if con->in_msg != NULL. The current msg->con scheme was introduced by commits 38941f8031bf ("libceph: have messages point to their connection") and 92ce034b5a74 ("libceph: have messages take a connection reference"), which defined when messages get associated with a connection and when that association goes away. Part of the problem is that this association is supposed to go away in much too many places; closing this race entirely requires either a rework of the existing or an addition of a new layer of synchronization. In lieu of that, we can make it *much* less likely to hit by disassociating messages only on their destruction and resend through a different connection. This makes the code simpler and is probably a good thing to do regardless - this patch adds a msg_con_set() helper which is is called from only three places: ceph_con_send() and ceph_con_in_msg_alloc() to set msg->con and ceph_msg_release() to clear it. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: add nocephx_sign_messages optionIlya Dryomov3-1/+20
Support for message signing was merged into 3.19, along with nocephx_require_signatures option. But, all that option does is allow the kernel client to talk to clusters that don't support MSG_AUTH feature bit. That's pretty useless, given that it's been supported since bobtail. Meanwhile, if one disables message signing on the server side with "cephx sign messages = false", it becomes impossible to use the kernel client since it expects messages to be signed if MSG_AUTH was negotiated. Add nocephx_sign_messages option to support this use case. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: stop duplicating client fields in messengerIlya Dryomov2-22/+10
supported_features and required_features serve no purpose at all, while nocrc and tcp_nodelay belong to ceph_options::flags. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: drop authorizer check from cephx msg signing routinesIlya Dryomov1-4/+1
I don't see a way for auth->authorizer to be NULL in ceph_x_sign_message() or ceph_x_check_message_signature(). Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: msg signing callouts don't need con argumentIlya Dryomov2-8/+10
We can use msg->con instead - at the point we sign an outgoing message or check the signature on the incoming one, msg->con is always set. We wouldn't know how to sign a message without an associated session (i.e. msg->con == NULL) and being able to sign a message using an explicitly provided authorizer is of no use. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: evaluate osd_req_op_data() arguments only onceIoana Ciornei1-5/+7
This patch changes the osd_req_op_data() macro to not evaluate arguments more than once in order to follow the kernel coding style. Signed-off-by: Ioana Ciornei <ciorneiioana@gmail.com> Reviewed-by: Alex Elder <elder@linaro.org> [idryomov@gmail.com: changelog, formatting] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2015-11-02libceph: introduce ceph_x_authorizer_cleanup()Ilya Dryomov2-12/+20
Commit ae385eaf24dc ("libceph: store session key in cephx authorizer") introduced ceph_x_authorizer::session_key, but didn't update all the exit/error paths. Introduce ceph_x_authorizer_cleanup() to encapsulate ceph_x_authorizer cleanup and switch to it. This fixes ceph_x_destroy(), which currently always leaks key and ceph_x_build_authorizer() error paths. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
2015-11-02libceph: use local variable cursor instead of &msg->cursorShraddha Barke1-6/+5
Use local variable cursor in place of &msg->cursor in read_partial_msg_data() and write_partial_msg_data(). Signed-off-by: Shraddha Barke <shraddha.6596@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>