aboutsummaryrefslogtreecommitdiffstats
path: root/security (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2016-03-03akcipher: Move the RSA DER encoding check to the crypto layerDavid Howells1-0/+1
Move the RSA EMSA-PKCS1-v1_5 encoding from the asymmetric-key public_key subtype to the rsa crypto module's pkcs1pad template. This means that the public_key subtype no longer has any dependencies on public key type. To make this work, the following changes have been made: (1) The rsa pkcs1pad template is now used for RSA keys. This strips off the padding and returns just the message hash. (2) In a previous patch, the pkcs1pad template gained an optional second parameter that, if given, specifies the hash used. We now give this, and pkcs1pad checks the encoded message E(M) for the EMSA-PKCS1-v1_5 encoding and verifies that the correct digest OID is present. (3) The crypto driver in crypto/asymmetric_keys/rsa.c is now reduced to something that doesn't care about what the encryption actually does and and has been merged into public_key.c. (4) CONFIG_PUBLIC_KEY_ALGO_RSA is gone. Module signing must set CONFIG_CRYPTO_RSA=y instead. Thoughts: (*) Should the encoding style (eg. raw, EMSA-PKCS1-v1_5) also be passed to the padding template? Should there be multiple padding templates registered that share most of the code? Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-03-03crypto: Add hash param to pkcs1padTadeusz Struk1-26/+156
This adds hash param to pkcs1pad. The pkcs1pad template can work with or without the hash. When hash param is provided then the verify operation will also verify the output against the known digest. Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
2016-03-03sign-file: fix build with CMS support disabledMarc-Antoine Perennou1-1/+1
Some versions of openssl might have the CMS feature disabled LibreSSL disables this feature too If the feature is disabled, fallback to PKCS7 In file included from scripts/sign-file.c:46:0: /usr/x86_64-pc-linux-gnu/include/openssl/cms.h:62:2: error: #error CMS is disabled. #error CMS is disabled. Signed-off-by: Marc-Antoine Perennou <Marc-Antoine@Perennou.com> Signed-off-by: David Howells <dhowells@redhat.com>
2016-03-01MODSIGN: linux/string.h should be #included to get memcpy()David Howells1-0/+1
linux/string.h should be #included in module_signing.c to get memcpy(), lest the following occur: kernel/module_signing.c: In function 'mod_verify_sig': kernel/module_signing.c:57:2: error: implicit declaration of function 'memcpy' [-Werror=implicit-function-declaration] memcpy(&ms, mod + (modlen - sizeof(ms)), sizeof(ms)); ^ Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-29certs: Fix misaligned data in extra certificate listDavid Howells1-0/+1
Fix the following warning found by kbuild: certs/system_certificates.S:24: Error: misaligned data because: KEYS: Reserve an extra certificate symbol for inserting without recompiling doesn't correctly align system_extra_cert_used. Signed-off-by: David Howells <dhowells@redhat.com> cc: Mehmet Kayaalp <mkayaalp@linux.vnet.ibm.com>
2016-02-29X.509: Handle midnight alternative notation in GeneralizedTimeDavid Howells1-1/+1
The ASN.1 GeneralizedTime object carries an ISO 8601 format date and time. The time is permitted to show midnight as 00:00 or 24:00 (the latter being equivalent of 00:00 of the following day). The permitted value is checked in x509_decode_time() but the actual handling is left to mktime64(). Without this patch, certain X.509 certificates will be rejected and could lead to an unbootable kernel. Note that with this patch we also permit any 24:mm:ss time and extend this to UTCTime, which whilst not strictly correct don't permit much leeway in fiddling date strings. Reported-by: Rudolf Polzer <rpolzer@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> cc: David Woodhouse <David.Woodhouse@intel.com> cc: John Stultz <john.stultz@linaro.org>
2016-02-29X.509: Support leap secondsDavid Howells1-1/+1
The format of ASN.1 GeneralizedTime seems to be specified by ISO 8601 [X.680 46.3] and this apparently supports leap seconds (ie. the seconds field is 60). It's not entirely clear that ASN.1 expects it, but we can relax the seconds check slightly for GeneralizedTime. This results in us passing a time with sec as 60 to mktime64(), which handles it as being a duplicate of the 0th second of the next minute. We can't really do otherwise without giving the kernel much greater knowledge of where all the leap seconds are. Unfortunately, this would require change the mapping of the kernel's current-time-in-seconds. UTCTime, however, only supports a seconds value in the range 00-59, but for the sake of simplicity allow this with UTCTime also. Without this patch, certain X.509 certificates will be rejected, potentially making a kernel unbootable. Reported-by: Rudolf Polzer <rpolzer@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> cc: David Woodhouse <David.Woodhouse@intel.com> cc: John Stultz <john.stultz@linaro.org>
2016-02-29Handle ISO 8601 leap seconds and encodings of midnight in mktime64()David Howells1-1/+8
Handle the following ISO 8601 features in mktime64(): (1) Leap seconds. Leap seconds are indicated by the seconds parameter being the value 60. Handle this by treating it the same as 00 of the following minute. It has been pointed out that a minute may contain two leap seconds. However, pending discussion of what that looks like and how to handle it, I'm not going to concern myself with it. (2) Alternate encodings of midnight. Two different encodings of midnight are permitted - 00:00:00 and 24:00:00 - the first is midnight today and the second is midnight tomorrow and is exactly equivalent to the first with tomorrow's date. As it happens, we don't actually need to change mktime64() to handle either of these - just comment them as valid parameters. These facility will be used by the X.509 parser. Doing it in mktime64() makes the policy common to the whole kernel and easier to find. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Arnd Bergmann <arnd@arndb.de> cc: John Stultz <john.stultz@linaro.org> cc: Rudolf Polzer <rpolzer@google.com> cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>
2016-02-29X.509: Fix leap year handling againDavid Howells1-4/+4
There are still a couple of minor issues in the X.509 leap year handling: (1) To avoid doing a modulus-by-400 in addition to a modulus-by-100 when determining whether the year is a leap year or not, I divided the year by 100 after doing the modulus-by-100, thereby letting the compiler do one instruction for both, and then did a modulus-by-4. Unfortunately, I then passed the now-modified year value to mktime64() to construct a time value. Since this isn't a fast path and since mktime64() does a bunch of divisions, just condense down to "% 400". It's also easier to read. (2) The default month length for any February where the year doesn't divide by four exactly is obtained from the month_length[] array where the value is 29, not 28. This is fixed by altering the table. Reported-by: Rudolf Polzer <rpolzer@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Acked-by: Arnd Bergmann <arnd@arndb.de> cc: stable@vger.kernel.org
2016-02-29PKCS#7: fix unitialized boolean 'want'Colin Ian King1-1/+1
The boolean want is not initialized and hence garbage. The default should be false (later it is only set to true on tne sinfo->authattrs check). Found with static analysis using CoverityScan Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-26KEYS: Use the symbol value for list size, updated by scripts/insert-sys-certMehmet Kayaalp1-8/+21
When a certificate is inserted to the image using scripts/writekey, the value of __cert_list_end does not change. The updated size can be found out by reading the value pointed by the system_certificate_list_size symbol. Signed-off-by: Mehmet Kayaalp <mkayaalp@linux.vnet.ibm.com> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-26KEYS: Reserve an extra certificate symbol for inserting without recompilingMehmet Kayaalp5-0/+440
Place a system_extra_cert buffer of configurable size, right after the system_certificate_list, so that inserted keys can be readily processed by the existing mechanism. Added script takes a key file and a kernel image and inserts its contents to the reserved area. The system_certificate_list_size is also adjusted accordingly. Call the script as: scripts/insert-sys-cert -b <vmlinux> -c <certfile> If vmlinux has no symbol table, supply System.map file with -s flag. Subsequent runs replace the previously inserted key, instead of appending the new one. Signed-off-by: Mehmet Kayaalp <mkayaalp@linux.vnet.ibm.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2016-02-26modsign: hide openssl output in silent buildsArnd Bergmann1-14/+19
When a user calls 'make -s', we can assume they don't want to see any output except for warnings and errors, but instead they see this for a warning free build: ### ### Now generating an X.509 key pair to be used for signing modules. ### ### If this takes a long time, you might wish to run rngd in the ### background to keep the supply of entropy topped up. It ### needs to be run as root, and uses a hardware random ### number generator if one is available. ### Generating a 4096 bit RSA private key .................................................................................................................................................................................................................................++ ..............................................................................................................................++ writing new private key to 'certs/signing_key.pem' ----- ### ### Key pair generated. ### The output can confuse simple build testing scripts that just check for an empty build log. This patch silences all the output: - "echo" is changed to "@$(kecho)", which is dropped when "-s" gets passed - the openssl command itself is only printed with V=1, using the $(Q) macro - The output of openssl gets redirected to /dev/null on "-s" builds. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-18scripts/sign-file.c: Add support for signing with a raw signatureJuerg Haefliger1-90/+146
This patch adds support for signing a kernel module with a raw detached PKCS#7 signature/message. The signature is not converted and is simply appended to the module so it needs to be in the right format. Using openssl, a valid signature can be generated like this: $ openssl smime -sign -nocerts -noattr -binary -in <module> -inkey \ <key> -signer <x509> -outform der -out <raw sig> The resulting raw signature from the above command is (more or less) identical to the raw signature that sign-file itself can produce like this: $ scripts/sign-file -d <hash algo> <key> <x509> <module> Signed-off-by: Juerg Haefliger <juerg.haefliger@hpe.com> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-18security/keys: make big_key.c explicitly non-modularPaul Gortmaker1-14/+1
The Kconfig currently controlling compilation of this code is: config BIG_KEYS bool "Large payload keys" ...meaning that it currently is not being built as a module by anyone. Lets remove the modular code that is essentially orphaned, so that when reading the driver there is no doubt it is builtin-only. Since module_init translates to device_initcall in the non-modular case, the init ordering remains unchanged with this commit. We also delete the MODULE_LICENSE tag since all that information is already contained at the top of the file in the comments. Cc: James Morris <james.l.morris@oracle.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Cc: keyrings@vger.kernel.org Cc: linux-security-module@vger.kernel.org Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-18crypto: public_key: remove MPIs from public_key_signature structTadeusz Struk1-13/+1
After digsig_asymmetric.c is converted the MPIs can be now safely removed from the public_key_signature structure. Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-18integrity: convert digsig to akcipher apiTadeusz Struk2-7/+4
Convert asymmetric_verify to akcipher api. Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-10crypto: KEYS: convert public key and digsig asym to the akcipher apiTadeusz Struk12-295/+134
This patch converts the module verification code to the new akcipher API. Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-10KEYS: CONFIG_KEYS_DEBUG_PROC_KEYS is no longer an optionDavid Howells26-26/+0
CONFIG_KEYS_DEBUG_PROC_KEYS is no longer an option as /proc/keys is now mandatory if the keyrings facility is enabled (it's used by libkeyutils in userspace). The defconfig references were removed with: perl -p -i -e 's/CONFIG_KEYS_DEBUG_PROC_KEYS=y\n//' \ `git grep -l CONFIG_KEYS_DEBUG_PROC_KEYS=y` and the integrity Kconfig fixed by hand. Signed-off-by: David Howells <dhowells@redhat.com> cc: Andreas Ziegler <andreas.ziegler@fau.de> cc: Dmitry Kasatkin <dmitry.kasatkin@huawei.com>
2016-02-09KEYS: Add an alloc flag to convey the builtinness of a keyDavid Howells3-2/+5
Add KEY_ALLOC_BUILT_IN to convey that a key should have KEY_FLAG_BUILTIN set rather than setting it after the fact. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2016-02-09v2 linux-next scripts/sign-file.c Fix LibreSSL supportCodarren Velvindron1-1/+1
In file included from scripts/sign-file.c:47:0: /usr/include/openssl/cms.h:62:2: error: #error CMS is disabled. #error CMS is disabled. ^ scripts/Makefile.host:91: recipe for target 'scripts/sign-file' failed make[1]: *** [scripts/sign-file] Error 1 Makefile:567: recipe for target 'scripts' failed make: *** [scripts] Error 2 Fix SSL headers so that the kernel can build with LibreSSL Signed-off-by: Codarren Velvindron <codarren@hackers.mu> Acked-by: David Woodhouse <David.Woodhouse@intel.com> Signed-off-by: David Howells <dhowells@redhat.com>
2016-02-07Linux 4.5-rc3Linus Torvalds1-1/+1
2016-02-05epoll: restrict EPOLLEXCLUSIVE to POLLIN and POLLOUTJason Baron1-6/+32
In the current implementation of the EPOLLEXCLUSIVE flag (added for 4.5-rc1), if epoll waiters create different POLL* sets and register them as exclusive against the same target fd, the current implementation will stop waking any further waiters once it finds the first idle waiter. This means that waiters could miss wakeups in certain cases. For example, when we wake up a pipe for reading we do: wake_up_interruptible_sync_poll(&pipe->wait, POLLIN | POLLRDNORM); So if one epoll set or epfd is added to pipe p with POLLIN and a second set epfd2 is added to pipe p with POLLRDNORM, only epfd may receive the wakeup since the current implementation will stop after it finds any intersection of events with a waiter that is blocked in epoll_wait(). We could potentially address this by requiring all epoll waiters that are added to p be required to pass the same set of POLL* events. IE the first EPOLL_CTL_ADD that passes EPOLLEXCLUSIVE establishes the set POLL* flags to be used by any other epfds that are added as EPOLLEXCLUSIVE. However, I think it might be somewhat confusing interface as we would have to reference count the number of users for that set, and so userspace would have to keep track of that count, or we would need a more involved interface. It also adds some shared state that we'd have store somewhere. I don't think anybody will want to bloat __wait_queue_head for this. I think what we could do instead, is to simply restrict EPOLLEXCLUSIVE such that it can only be specified with EPOLLIN and/or EPOLLOUT. So that way if the wakeup includes 'POLLIN' and not 'POLLOUT', we can stop once we hit the first idle waiter that specifies the EPOLLIN bit, since any remaining waiters that only have 'POLLOUT' set wouldn't need to be woken. Likewise, we can do the same thing if 'POLLOUT' is in the wakeup bit set and not 'POLLIN'. If both 'POLLOUT' and 'POLLIN' are set in the wake bit set (there is at least one example of this I saw in fs/pipe.c), then we just wake the entire exclusive list. Having both 'POLLOUT' and 'POLLIN' both set should not be on any performance critical path, so I think that's ok (in fs/pipe.c its in pipe_release()). We also continue to include EPOLLERR and EPOLLHUP by default in any exclusive set. Thus, the user can specify EPOLLERR and/or EPOLLHUP but is not required to do so. Since epoll waiters may be interested in other events as well besides EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP, these can still be added by doing a 'dup' call on the target fd and adding that as one normally would with EPOLL_CTL_ADD. Since I think that the POLLIN and POLLOUT events are what we are interest in balancing, I think that the 'dup' thing could perhaps be added to only one of the waiter threads. However, I think that EPOLLIN, EPOLLOUT, EPOLLERR and EPOLLHUP should be sufficient for the majority of use-cases. Since EPOLLEXCLUSIVE is intended to be used with a target fd shared among multiple epfds, where between 1 and n of the epfds may receive an event, it does not satisfy the semantics of EPOLLONESHOT where only 1 epfd would get an event. Thus, it is not allowed to be specified in conjunction with EPOLLEXCLUSIVE. EPOLL_CTL_MOD is also not allowed if the fd was previously added as EPOLLEXCLUSIVE. It seems with the limited number of flags to not be as interesting, but this could be relaxed at some further point. Signed-off-by: Jason Baron <jbaron@akamai.com> Tested-by: Madars Vitolins <m@silodev.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@ftp.linux.org.uk> Cc: Eric Wong <normalperson@yhbt.net> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05radix-tree: fix oops after radix_tree_iter_retryKonstantin Khlebnikov1-3/+3
Helper radix_tree_iter_retry() resets next_index to the current index. In following radix_tree_next_slot current chunk size becomes zero. This isn't checked and it tries to dereference null pointer in slot. Tagged iterator is fine because retry happens only at slot 0 where tag bitmask in iter->tags is filled with single bit. Fixes: 46437f9a554f ("radix-tree: fix race in gang lookup") Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ohad Ben-Cohen <ohad@wizery.com> Cc: Jeremiah Mahler <jmmahler@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05MAINTAINERS: trim the file triggers for ABI/APIMichael Kerrisk (man-pages)1-2/+0
Commit ea8f8fc8631 ("MAINTAINERS: add linux-api for review of API/ABI changes") added file triggers for various paths that likely indicated API/ABI changes. However, catching all changes in Documentation/ABI/ and include/uapi/ produces a large volume of mail to linux-api, rather than only API/ABI changes. Drop those two entries, but leave include/linux/syscalls.h and kernel/sys_ni.c to catch syscall-related changes. [josh@joshtriplett.org: redid changelog] Signed-off-by: Michael Kerrisk <mtk.man-pages@gmail.com> Acked-by: Shuah khan <shuahkh@osg.samsung.com> Cc: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05dax: dirty inode only if requiredDmitry Monakhov1-1/+2
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05thp: make deferred_split_scan() work againKirill A. Shutemov1-1/+1
We need to iterate over split_queue, not local empty list to get anything split from the shrinker. Fixes: e3ae19535c66 ("thp: limit number of object to scan on deferred_split_scan()") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: replace vma_lock_anon_vma with anon_vma_lock_read/writeKonstantin Khlebnikov2-44/+25
Sequence vma_lock_anon_vma() - vma_unlock_anon_vma() isn't safe if anon_vma appeared between lock and unlock. We have to check anon_vma first or call anon_vma_prepare() to be sure that it's here. There are only few users of these legacy helpers. Let's get rid of them. This patch fixes anon_vma lock imbalance in validate_mm(). Write lock isn't required here, read lock is enough. And reorders expand_downwards/expand_upwards: security_mmap_addr() and wrapping-around check don't have to be under anon vma lock. Link: https://lkml.kernel.org/r/CACT4Y+Y908EjM2z=706dv4rV6dWtxTLK9nFg9_7DhRMLppBo2g@mail.gmail.com Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05ocfs2/dlm: clear refmap bit of recovery lock while doing local recovery cleanupxuejiufei1-0/+2
When recovery master down, dlm_do_local_recovery_cleanup() only remove the $RECOVERY lock owned by dead node, but do not clear the refmap bit. Which will make umount thread falling in dead loop migrating $RECOVERY to the dead node. Signed-off-by: xuejiufei <xuejiufei@huawei.com> Reviewed-by: Joseph Qi <joseph.qi@huawei.com> Cc: Mark Fasheh <mfasheh@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05um: asm/page.h: remove the pte_high member from struct pte_tNicolai Stange1-13/+10
Commit 16da306849d0 ("um: kill pfn_t") introduced a compile warning for defconfig (SUBARCH=i386): arch/um/kernel/skas/mmu.c:38:206: warning: right shift count >= width of type [-Wshift-count-overflow] Aforementioned patch changes the definition of the phys_to_pfn() macro from ((pfn_t) ((p) >> PAGE_SHIFT)) to ((p) >> PAGE_SHIFT) This effectively changes the phys_to_pfn() expansion's type from unsigned long long to unsigned long. Through the callchain init_stub_pte() => mk_pte(), the expansion of phys_to_pfn() is (indirectly) fed into the 'phys' argument of the pte_set_val(pte, phys, prot) macro, eventually leading to (pte).pte_high = (phys) >> 32; This results in the warning from above. Since UML only deals with 32 bit addresses, the upper 32 bits from 'phys' used to be always zero anyway. Also, all page protection flags defined by UML don't use any bits beyond bit 9. Since the contents of a PTE are defined within architecture scope only, the ->pte_high member can be safely removed. Remove the ->pte_high member from struct pte_t. Rename ->pte_low to ->pte. Adapt the pte helper macros in arch/um/include/asm/page.h. Noteworthy is the pte_copy() macro where a smp_wmb() gets dropped. This write barrier doesn't seem to be paired with any read barrier though and thus, was useless anyway. Fixes: 16da306849d0 ("um: kill pfn_t") Signed-off-by: Nicolai Stange <nicstange@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Richard Weinberger <richard@nod.at> Cc: Nicolai Stange <nicstange@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, hugetlb: don't require CMA for runtime gigantic pagesVlastimil Babka4-7/+7
Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation at runtime") has added the runtime gigantic page allocation via alloc_contig_range(), making this support available only when CONFIG_CMA is enabled. Because it doesn't depend on MIGRATE_CMA pageblocks and the associated infrastructure, it is possible with few simple adjustments to require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA. After this patch, alloc_contig_range() and related functions are available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION enabled. Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION. This allows supporting runtime gigantic pages without the CMA-specific checks in page allocator fastpaths. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm/hugetlb: fix gigantic page initialization/allocationMike Kravetz1-2/+3
Attempting to preallocate 1G gigantic huge pages at boot time with "hugepagesz=1G hugepages=1" on the kernel command line will prevent booting with the following: kernel BUG at mm/hugetlb.c:1218! When mapcount accounting was reworked, the setting of compound_mapcount_ptr in prep_compound_gigantic_page was overlooked. As a result, the validation of mapcount in free_huge_page fails. The "BUG_ON" checks in free_huge_page were also changed to "VM_BUG_ON_PAGE" to assist with debugging. Fixes: 53f9263baba69 ("mm: rework mapcount accounting to enable 4k mapping of THPs") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Vlastimil Babka <vbabka@suse.cz> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Jerome Marchand <jmarchan@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: downgrade VM_BUG in isolate_lru_page() to warningKirill A. Shutemov1-1/+1
Calling isolate_lru_page() is wrong and shouldn't happen, but it not nessesary fatal: the page just will not be isolated if it's not on LRU. Let's downgrade the VM_BUG_ON_PAGE() to WARN_RATELIMIT(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mempolicy: do not try to queue pages from !vma_migratable()Kirill A. Shutemov1-9/+5
Maybe I miss some point, but I don't see a reason why we try to queue pages from non migratable VMAs. This testcase steps on VM_BUG_ON_PAGE() in isolate_lru_page(): #include <fcntl.h> #include <unistd.h> #include <stdio.h> #include <sys/mman.h> #include <numaif.h> #define SIZE 0x2000 int foo; int main() { int fd; char *p; unsigned long mask = 2; fd = open("/dev/sg0", O_RDWR); p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0); /* Faultin pages */ foo = p[0] + p[0x1000]; mbind(p, SIZE, MPOL_BIND, &mask, 4, MPOL_MF_MOVE | MPOL_MF_STRICT); return 0; } The only case when we can queue pages from such VMA is MPOL_MF_STRICT plus MPOL_MF_MOVE or MPOL_MF_MOVE_ALL for VMA which has pages on LRU, but gfp mask is not sutable for migaration (see mapping_gfp_mask() check in vma_migratable()). That's looks like a bug to me. Let's filter out non-migratable vma at start of queue_pages_test_walk() and go to queue_pages_pte_range() only if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL flag is set. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Signed-off-by: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, vmstat: fix wrong WQ sleep when memory reclaim doesn't make any progressTetsuo Handa1-1/+1
Jan Stancek has reported that system occasionally hanging after "oom01" testcase from LTP triggers OOM. Guessing from a result that there is a kworker thread doing memory allocation and the values between "Node 0 Normal free:" and "Node 0 Normal:" differs when hanging, vmstat is not up-to-date for some reason. According to commit 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress"), it meant to force the kworker thread to take a short sleep, but it by error used schedule_timeout(1). We missed that schedule_timeout() in state TASK_RUNNING doesn't do anything. Fix it by using schedule_timeout_uninterruptible(1) which forces the kworker thread to take a short sleep in order to make sure that vmstat is up-to-date. Fixes: 373ccbe59270 ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make any progress") Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-by: Jan Stancek <jstancek@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Tejun Heo <tj@kernel.org> Cc: Cristopher Lameter <clameter@sgi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Arkadiusz Miskiewicz <arekm@maven.pl> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05vmstat: make vmstat_update deferrableMichal Hocko1-1/+1
Commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and shut down on idle") made vmstat_shepherd deferrable. vmstat_update itself is still useing standard timer which might interrupt idle task. This is possible because "mm, vmstat: make quiet_vmstat lighter" removed cancel_delayed_work from the quiet_vmstat. Change vmstat_work to use DEFERRABLE_WORK to prevent from pointless wakeups from the idle context. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm, vmstat: make quiet_vmstat lighterMichal Hocko1-22/+46
Mike has reported a considerable overhead of refresh_cpu_vm_stats from the idle entry during pipe test: 12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12 4.75% [kernel] [k] __schedule 4.70% [kernel] [k] mutex_unlock 3.14% [kernel] [k] __switch_to This is caused by commit 0eb77e988032 ("vmstat: make vmstat_updater deferrable again and shut down on idle") which has placed quiet_vmstat into cpu_idle_loop. The main reason here seems to be that the idle entry has to get over all zones and perform atomic operations for each vmstat entry even though there might be no per cpu diffs. This is a pointless overhead for _each_ idle entry. Make sure that quiet_vmstat is as light as possible. First of all it doesn't make any sense to do any local sync if the current cpu is already set in oncpu_stat_off because vmstat_update puts itself there only if there is nothing to do. Then we can check need_update which should be a cheap way to check for potential per-cpu diffs and only then do refresh_cpu_vm_stats. The original patch also did cancel_delayed_work which we are not doing here. There are two reasons for that. Firstly cancel_delayed_work from idle context will blow up on RT kernels (reported by Mike): CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.5.0-rt3 #7 Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013 Call Trace: dump_stack+0x49/0x67 ___might_sleep+0xf5/0x180 rt_spin_lock+0x20/0x50 try_to_grab_pending+0x69/0x240 cancel_delayed_work+0x26/0xe0 quiet_vmstat+0x75/0xa0 cpu_idle_loop+0x38/0x3e0 cpu_startup_entry+0x13/0x20 start_secondary+0x114/0x140 And secondly, even on !RT kernels it might add some non trivial overhead which is not necessary. Even if the vmstat worker wakes up and preempts idle then it will be most likely a single shot noop because the stats were already synced and so it would end up on the oncpu_stat_off anyway. We just need to teach both vmstat_shepherd and vmstat_update to stop scheduling the worker if there is nothing to do. [mgalbraith@suse.de: cancel pending work of the cpu_stat_off CPU] Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm/Kconfig: correct description of DEFERRED_STRUCT_PAGE_INITVlastimil Babka1-4/+5
The description mentions kswapd threads, while the deferred struct page initialization is actually done by one-off "pgdatinitX" threads. Fix the description so that potentially users are not confused about pgdatinit threads using CPU after boot instead of kswapd. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05memblock: don't mark memblock_phys_mem_size() as __initDavid Gibson1-1/+1
At the moment memblock_phys_mem_size() is marked as __init, and so is discarded after boot. This is different from most of the memblock functions which are marked __init_memblock, and are only discarded after boot if memory hotplug is not configured. To allow for upcoming code which will need memblock_phys_mem_size() in the hotplug path, change it from __init to __init_memblock. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05dump_stack: avoid potential deadlocksEric Dumazet1-3/+4
Some servers experienced fatal deadlocks because of a combination of bugs, leading to multiple cpus calling dump_stack(). The checksumming bug was fixed in commit 34ae6a1aa054 ("ipv6: update skb->csum when CE mark is propagated"). The second problem is a faulty locking in dump_stack() CPU1 runs in process context and calls dump_stack(), grabs dump_lock. CPU2 receives a TCP packet under softirq, grabs socket spinlock, and call dump_stack() from netdev_rx_csum_fault(). dump_stack() spins on atomic_cmpxchg(&dump_lock, -1, 2), since dump_lock is owned by CPU1 While dumping its stack, CPU1 is interrupted by a softirq, and happens to process a packet for the TCP socket locked by CPU2. CPU1 spins forever in spin_lock() : deadlock Stack trace on CPU1 looked like : NMI backtrace for cpu 1 RIP: _raw_spin_lock+0x25/0x30 ... Call Trace: <IRQ> tcp_v6_rcv+0x243/0x620 ip6_input_finish+0x11f/0x330 ip6_input+0x38/0x40 ip6_rcv_finish+0x3c/0x90 ipv6_rcv+0x2a9/0x500 process_backlog+0x461/0xaa0 net_rx_action+0x147/0x430 __do_softirq+0x167/0x2d0 call_softirq+0x1c/0x30 do_softirq+0x3f/0x80 irq_exit+0x6e/0xc0 smp_call_function_single_interrupt+0x35/0x40 call_function_single_interrupt+0x6a/0x70 <EOI> printk+0x4d/0x4f printk_address+0x31/0x33 print_trace_address+0x33/0x3c print_context_stack+0x7f/0x119 dump_trace+0x26b/0x28e show_trace_log_lvl+0x4f/0x5c show_stack_log_lvl+0x104/0x113 show_stack+0x42/0x44 dump_stack+0x46/0x58 netdev_rx_csum_fault+0x38/0x3c __skb_checksum_complete_head+0x6e/0x80 __skb_checksum_complete+0x11/0x20 tcp_rcv_established+0x2bd5/0x2fd0 tcp_v6_do_rcv+0x13c/0x620 sk_backlog_rcv+0x15/0x30 release_sock+0xd2/0x150 tcp_recvmsg+0x1c1/0xfc0 inet_recvmsg+0x7d/0x90 sock_recvmsg+0xaf/0xe0 ___sys_recvmsg+0x111/0x3b0 SyS_recvmsg+0x5c/0xb0 system_call_fastpath+0x16/0x1b Fixes: b58d977432c8 ("dump_stack: serialize the output from dump_stack()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alex Thorlton <athorlton@sgi.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05mm: validate_mm browse_rb SMP race conditionAndrea Arcangeli1-2/+5
The mmap_sem for reading in validate_mm called from expand_stack is not enough to prevent the argumented rbtree rb_subtree_gap information to change from under us because expand_stack may be running from other threads concurrently which will hold the mmap_sem for reading too. The argumented rbtree is updated with vma_gap_update under the page_table_lock so use it in browse_rb() too to avoid false positives. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05m32r: fix build failure due to SMP and MMUSudip Mukherjee1-0/+1
One of the randconfig build failed with the error: arch/m32r/kernel/smp.c: In function 'smp_flush_tlb_mm': arch/m32r/kernel/smp.c:283:20: error: subscripted value is neither array nor pointer nor vector mmc = &mm->context[cpu_id]; ^ arch/m32r/kernel/smp.c: In function 'smp_flush_tlb_page': arch/m32r/kernel/smp.c:353:20: error: subscripted value is neither array nor pointer nor vector mmc = &mm->context[cpu_id]; ^ arch/m32r/kernel/smp.c: In function 'smp_invalidate_interrupt': arch/m32r/kernel/smp.c:479:41: error: subscripted value is neither array nor pointer nor vector unsigned long *mmc = &flush_mm->context[cpu_id]; It turned out that CONFIG_SMP was defined but CONFIG_MMU was not defined. But arch/m32r/include/asm/mmu.h only defines mm_context_t as an array when both CONFIG_SMP and CONFIG_MMU are defined. And arch/m32r/kernel/smp.c is always using context as an array. So without MMU SMP can not work. Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05block: fix pfn_mkwrite() DAX fault handlerRoss Zwisler1-1/+7
Previously the pfn_mkwrite() fault handler for raw block devices called bldev_dax_fault() -> __dax_fault() to do a full DAX page fault. Really what the pfn_mkwrite() fault handler needs to do is call dax_pfn_mkwrite() to make sure that the radix tree entry for the given PTE is marked as dirty so that a follow-up fsync or msync call will flush it durably to media. Fixes: 5a023cdba50c ("block: enable dax for raw block devices") Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Chinner <david@fromorbit.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05signals: avoid random wakeups in sigsuspend()Sasha Levin1-2/+4
A random wakeup can get us out of sigsuspend() without TIF_SIGPENDING being set. Avoid that by making sure we were signaled, like sys_pause() does. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-02-05drm/dp/mst: deallocate payload on port destructionMykola Lysenko1-8/+83
This is needed to properly deallocate port payload after downstream branch get unplugged. In order to do this unplugged MST topology should be preserved, to find first alive port on path to unplugged MST topology, and send payload deallocation request to branch device of found port. For this mstb and port kref's are used in reversed order to track when port and branch memory could be freed. Added additional functions to find appropriate mstb as described above. Signed-off-by: Mykola Lysenko <Mykola.Lysenko@amd.com> Reviewed-by: Harry Wentland <Harry.Wentland@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-02-05drm/dp/mst: Reverse order of MST enable and clearing VC payload table.Andrey Grodzovsky1-6/+6
On DELL U3014 if you clear the table before enabling MST it sometimes hangs the receiver. Signed-off-by: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Cc: stable@vger.kernel.org Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-02-05drm/dp/mst: move GUID storage from mgr, port to only mst branchHersen Wu2-51/+38
Previous implementation does not handle case below: boot up one MST branch to DP connector of ASIC. After boot up, hot plug 2nd MST branch to DP output of 1st MST, GUID is not created for 2nd MST branch. When downstream port of 2nd MST branch send upstream request, it fails because 2nd MST branch GUID is not available. New Implementation: only create GUID for MST branch and save it within Branch. Signed-off-by: Hersen Wu <hersenxs.wu@amd.com> Reviewed-by: Harry Wentland <harry.wentland@amd.com> Cc: stable@vger.kernel.org Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-02-05drm/dp/mst: change MST detection schemeMykola Lysenko1-18/+19
1. Get edid for all connected MST displays, not only on logical ports, in the same thread as MST topology detection is done: There are displays that have branches inside w/o logical ports. So in case another SST display connected downstream system can end-up in situation when 3 DOWN requests sent: two for ‘remote i2c read’ and one for ‘enum path resources’, making slots full. 2. Call notification callback in one place in the end of topology discovery/update: This is done to reduce number of events sent to userspace in case complex topology discovery is going, adding multiple number of connectors; 3. Remove notification callback call from short pulse interrupt processing function: This is done in order not to block interrupt processing function, in case any MST request will be made from it. Notification will be send from topology discovery/update work item. Signed-off-by: Mykola Lysenko <Mykola.Lysenko@amd.com> Reviewed-by: Harry Wentland <Harry.Wentland@amd.com> Cc: stable@vger.kernel.org Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-02-05drm/dp/mst: Calculate MST PBN with 31.32 fixed pointHarry Wentland1-28/+39
Our PBN value overflows the 20 bits integer part of the 20.12 fixed point. We need to use 31.32 fixed point to avoid this. This happens with display clocks larger than 293122 (at 24 bpp), which we see with the Sharp (and similar) 4k tiled displays. Signed-off-by: Harry Wentland <harry.wentland@amd.com> Cc: stable@vger.kernel.org Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2016-02-05drm: Add drm_fixp_from_fraction and drm_fixp2int_ceilHarry Wentland1-2/+51
drm_fixp_from_fraction allows us to create a fixed point directly from a fraction, rather than creating fixed point values and dividing later. This avoids overflow of our 64 bit value for large numbers. drm_fixp2int_ceil allows us to return the ceiling of our fixed point value. [airlied: squash Jordan's fix] 32-bit-build-fix: Jordan Lazare <Jordan.Lazare@amd.com> Signed-off-by: Harry Wentland <harry.wentland@amd.com> Cc: stable@vger.kernel.org Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Dave Airlie <airlied@redhat.com>