aboutsummaryrefslogtreecommitdiffstatshomepage
path: root/tools/perf/scripts/python/call-graph-from-postgresql.py (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2017-05-08md/md0: optimize raid0 discard handlingShaohua Li1-14/+102
There are complaints that raid0 discard handling is slow. Currently we divide discard request into chunks and dispatch to underlayer disks. The block layer will do merge to form big requests. This causes a lot of request split/merge and uses significant CPU time. A simple idea is to calculate the range for each raid disk for an IO request and send a discard request to raid disks, which will avoid the split/merge completely. Previously Coly tried the approach, but the implementation was too complex because of raid0 zones. This patch always split bio in zone boundary and handle bio within one zone. It simplifies the implementation a lot. Reviewed-by: NeilBrown <neilb@suse.com> Acked-by: Coly Li <colyli@suse.de> Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-08md: don't return -EAGAIN in md_allow_write for external metadata arraysArtur Paszkiewicz4-28/+15
This essentially reverts commit b5470dc5fc18 ("md: resolve external metadata handling deadlock in md_allow_write") with some adjustments. Since commit 6791875e2e53 ("md: make reconfig_mutex optional for writes to md sysfs files.") changing array_state to 'active' does not use mddev_lock() and will not cause a deadlock with md_allow_write(). This revert simplifies userspace tools that write to sysfs attributes like "stripe_cache_size" or "consistency_policy" because it removes the need for special handling for external metadata arrays, checking for EAGAIN and retrying the write. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-04md/raid5: make use of spin_lock_irq over local_irq_disable + spin_lockJulia Cartwright1-10/+7
On mainline, there is no functional difference, just less code, and symmetric lock/unlock paths. On PREEMPT_RT builds, this fixes the following warning, seen by Alexander GQ Gerasiov, due to the sleeping nature of spinlocks. BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:993 in_atomic(): 0, irqs_disabled(): 1, pid: 58, name: kworker/u12:1 CPU: 5 PID: 58 Comm: kworker/u12:1 Tainted: G W 4.9.20-rt16-stand6-686 #1 Hardware name: Supermicro SYS-5027R-WRF/X9SRW-F, BIOS 3.2a 10/28/2015 Workqueue: writeback wb_workfn (flush-253:0) Call Trace: dump_stack+0x47/0x68 ? migrate_enable+0x4a/0xf0 ___might_sleep+0x101/0x180 rt_spin_lock+0x17/0x40 add_stripe_bio+0x4e3/0x6c0 [raid456] ? preempt_count_add+0x42/0xb0 raid5_make_request+0x737/0xdd0 [raid456] Reported-by: Alexander GQ Gerasiov <gq@redlab-i.ru> Tested-by: Alexander GQ Gerasiov <gq@redlab-i.ru> Signed-off-by: Julia Cartwright <julia@ni.com> Signed-off-by: Shaohua Li <shli@fb.com>
2017-05-04cfg80211: make RATE_INFO_BW_20 the defaultJohannes Berg1-1/+1
Due to the way I did the RX bitrate conversions in mac80211 with spatch, going setting flags to setting the value, many drivers now don't set the bandwidth value for 20 MHz, since with the flags it wasn't necessary to (there was no 20 MHz flag, only the others.) Rather than go through and try to fix up all the drivers, instead renumber the enum so that 20 MHz, which is the typical bandwidth, actually has the value 0, making those drivers all work again. If VHT was hit used with a driver not reporting it, e.g. iwlmvm, this manifested in hitting the bandwidth warning in cfg80211_calculate_bitrate_vht(). Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Tested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04ipv6: initialize route null entry in addrconf_init()WANG Cong3-11/+18
Andrey reported a crash on init_net.ipv6.ip6_null_entry->rt6i_idev since it is always NULL. This is clearly wrong, we have code to initialize it to loopback_dev, unfortunately the order is still not correct. loopback_dev is registered very early during boot, we lose a chance to re-initialize it in notifier. addrconf_init() is called after ip6_route_init(), which means we have no chance to correct it. Fix it by moving this initialization explicitly after ipv6_add_dev(init_net.loopback_dev) in addrconf_init(). Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04qede: Fix possible misconfiguration of advertised autoneg value.sudarsana.kalluru@cavium.com1-0/+5
Fail the configuration of advertised speed-autoneg value if the config update is not supported. Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04qed: Fix overriding of supported autoneg value.sudarsana.kalluru@cavium.com3-1/+9
Driver currently uses advertised-autoneg value to populate the supported-autoneg field. When advertised field is updated, user gets the same value for supported field. Supported-autoneg value need to be populated from the link capabilities value returned by the MFW. Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04qed*: Fix possible overflow for status block id field.sudarsana.kalluru@cavium.com5-11/+10
Value for status block id could be more than 256 in 100G mode, need to update its data type from u8 to u16. Signed-off-by: Sudarsana Reddy Kalluru <Sudarsana.Kalluru@cavium.com> Signed-off-by: Yuval Mintz <Yuval.Mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04rtnetlink: NUL-terminate IFLA_PHYS_PORT_NAME stringMichal Schmidt1-1/+1
IFLA_PHYS_PORT_NAME is a string attribute, so terminate it with \0. Otherwise libnl3 fails to validate netlink messages with this attribute. "ip -detail a" assumes too that the attribute is NUL-terminated when printing it. It often was, due to padding. I noticed this as libvirtd failing to start on a system with sfc driver after upgrading it to Linux 4.11, i.e. when sfc added support for phys_port_name. Signed-off-by: Michal Schmidt <mschmidt@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04netvsc: make sure napi enabled before vmbus_openstephen hemminger2-4/+6
This fixes a race where vmbus callback for new packet arriving could occur before NAPI is initialized. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04aquantia: Fix driver name reported by ethtoolPavel Belous1-1/+1
V2: using "aquantia" subsystem tag. The command "ethtool -i ethX" should display driver name (driver: atlantic) instead vendor name (driver: aquantia). Signed-off-by: Pavel Belous <pavel.belous@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04ipv4, ipv6: ensure raw socket message is big enough to hold an IP headerAlexander Potapenko2-0/+5
raw_send_hdrinc() and rawv6_send_hdrinc() expect that the buffer copied from the userspace contains the IPv4/IPv6 header, so if too few bytes are copied, parts of the header may remain uninitialized. This bug has been detected with KMSAN. For the record, the KMSAN report: ================================================================== BUG: KMSAN: use of unitialized memory in nf_ct_frag6_gather+0xf5a/0x44a0 inter: 0 CPU: 0 PID: 1036 Comm: probe Not tainted 4.11.0-rc5+ #2455 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 dump_stack+0x143/0x1b0 lib/dump_stack.c:52 kmsan_report+0x16b/0x1e0 mm/kmsan/kmsan.c:1078 __kmsan_warning_32+0x5c/0xa0 mm/kmsan/kmsan_instr.c:510 nf_ct_frag6_gather+0xf5a/0x44a0 net/ipv6/netfilter/nf_conntrack_reasm.c:577 ipv6_defrag+0x1d9/0x280 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c:68 nf_hook_entry_hookfn ./include/linux/netfilter.h:102 nf_hook_slow+0x13f/0x3c0 net/netfilter/core.c:310 nf_hook ./include/linux/netfilter.h:212 NF_HOOK ./include/linux/netfilter.h:255 rawv6_send_hdrinc net/ipv6/raw.c:673 rawv6_sendmsg+0x2fcb/0x41a0 net/ipv6/raw.c:919 inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 sock_sendmsg net/socket.c:643 SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696 SyS_sendto+0xbc/0xe0 net/socket.c:1664 do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285 entry_SYSCALL64_slow_path+0x25/0x25 arch/x86/entry/entry_64.S:246 RIP: 0033:0x436e03 RSP: 002b:00007ffce48baf38 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 00000000004002b0 RCX: 0000000000436e03 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00007ffce48baf90 R08: 00007ffce48baf50 R09: 000000000000001c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000401790 R14: 0000000000401820 R15: 0000000000000000 origin: 00000000d9400053 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:362 kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:257 kmsan_poison_shadow+0x6d/0xc0 mm/kmsan/kmsan.c:270 slab_alloc_node mm/slub.c:2735 __kmalloc_node_track_caller+0x1f4/0x390 mm/slub.c:4341 __kmalloc_reserve net/core/skbuff.c:138 __alloc_skb+0x2cd/0x740 net/core/skbuff.c:231 alloc_skb ./include/linux/skbuff.h:933 alloc_skb_with_frags+0x209/0xbc0 net/core/skbuff.c:4678 sock_alloc_send_pskb+0x9ff/0xe00 net/core/sock.c:1903 sock_alloc_send_skb+0xe4/0x100 net/core/sock.c:1920 rawv6_send_hdrinc net/ipv6/raw.c:638 rawv6_sendmsg+0x2918/0x41a0 net/ipv6/raw.c:919 inet_sendmsg+0x3f8/0x6d0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 sock_sendmsg net/socket.c:643 SYSC_sendto+0x6a5/0x7c0 net/socket.c:1696 SyS_sendto+0xbc/0xe0 net/socket.c:1664 do_syscall_64+0x72/0xa0 arch/x86/entry/common.c:285 return_from_SYSCALL_64+0x0/0x6a arch/x86/entry/entry_64.S:246 ================================================================== , triggered by the following syscalls: socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3 sendto(3, NULL, 0, 0, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "ff00::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EPERM A similar report is triggered in net/ipv4/raw.c if we use a PF_INET socket instead of a PF_INET6 one. Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04net/sched: remove redundant null check on headColin Ian King1-2/+1
head is previously null checked and so the 2nd null check on head is redundant and therefore can be removed. Detected by CoverityScan, CID#1399505 ("Logically dead code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04tcp: do not inherit fastopen_req from parentEric Dumazet1-0/+1
Under fuzzer stress, it is possible that a child gets a non NULL fastopen_req pointer from its parent at accept() time, when/if parent morphs from listener to active session. We need to make sure this can not happen, by clearing the field after socket cloning. BUG: Double free or freeing an invalid pointer Unexpected shadow byte: 0xFB CPU: 3 PID: 20933 Comm: syz-executor3 Not tainted 4.11.0+ #306 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x292/0x395 lib/dump_stack.c:52 kasan_object_err+0x1c/0x70 mm/kasan/report.c:164 kasan_report_double_free+0x5c/0x70 mm/kasan/report.c:185 kasan_slab_free+0x9d/0xc0 mm/kasan/kasan.c:580 slab_free_hook mm/slub.c:1357 [inline] slab_free_freelist_hook mm/slub.c:1379 [inline] slab_free mm/slub.c:2961 [inline] kfree+0xe8/0x2b0 mm/slub.c:3882 tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline] tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328 inet_child_forget+0xb8/0x600 net/ipv4/inet_connection_sock.c:898 inet_csk_reqsk_queue_add+0x1e7/0x250 net/ipv4/inet_connection_sock.c:928 tcp_get_cookie_sock+0x21a/0x510 net/ipv4/syncookies.c:217 cookie_v4_check+0x1a19/0x28b0 net/ipv4/syncookies.c:384 tcp_v4_cookie_check net/ipv4/tcp_ipv4.c:1384 [inline] tcp_v4_do_rcv+0x731/0x940 net/ipv4/tcp_ipv4.c:1421 tcp_v4_rcv+0x2dc0/0x31c0 net/ipv4/tcp_ipv4.c:1715 ip_local_deliver_finish+0x4cc/0xc20 net/ipv4/ip_input.c:216 NF_HOOK include/linux/netfilter.h:257 [inline] ip_local_deliver+0x1ce/0x700 net/ipv4/ip_input.c:257 dst_input include/net/dst.h:492 [inline] ip_rcv_finish+0xb1d/0x20b0 net/ipv4/ip_input.c:396 NF_HOOK include/linux/netfilter.h:257 [inline] ip_rcv+0xd8c/0x19c0 net/ipv4/ip_input.c:487 __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4210 __netif_receive_skb+0x2a/0x1a0 net/core/dev.c:4248 process_backlog+0xe5/0x6c0 net/core/dev.c:4868 napi_poll net/core/dev.c:5270 [inline] net_rx_action+0xe70/0x18e0 net/core/dev.c:5335 __do_softirq+0x2fb/0xb99 kernel/softirq.c:284 do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:899 </IRQ> do_softirq.part.17+0x1e8/0x230 kernel/softirq.c:328 do_softirq kernel/softirq.c:176 [inline] __local_bh_enable_ip+0x1cf/0x1e0 kernel/softirq.c:181 local_bh_enable include/linux/bottom_half.h:31 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:931 [inline] ip_finish_output2+0x9ab/0x15e0 net/ipv4/ip_output.c:230 ip_finish_output+0xa35/0xdf0 net/ipv4/ip_output.c:316 NF_HOOK_COND include/linux/netfilter.h:246 [inline] ip_output+0x1f6/0x7b0 net/ipv4/ip_output.c:404 dst_output include/net/dst.h:486 [inline] ip_local_out+0x95/0x160 net/ipv4/ip_output.c:124 ip_queue_xmit+0x9a8/0x1a10 net/ipv4/ip_output.c:503 tcp_transmit_skb+0x1ade/0x3470 net/ipv4/tcp_output.c:1057 tcp_write_xmit+0x79e/0x55b0 net/ipv4/tcp_output.c:2265 __tcp_push_pending_frames+0xfa/0x3a0 net/ipv4/tcp_output.c:2450 tcp_push+0x4ee/0x780 net/ipv4/tcp.c:683 tcp_sendmsg+0x128d/0x39b0 net/ipv4/tcp.c:1342 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 SYSC_sendto+0x660/0x810 net/socket.c:1696 SyS_sendto+0x40/0x50 net/socket.c:1664 entry_SYSCALL_64_fastpath+0x1f/0xbe RIP: 0033:0x446059 RSP: 002b:00007faa6761fb58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000017 RCX: 0000000000446059 RDX: 0000000000000001 RSI: 0000000020ba3fcd RDI: 0000000000000017 RBP: 00000000006e40a0 R08: 0000000020ba4ff0 R09: 0000000000000010 R10: 0000000020000000 R11: 0000000000000282 R12: 0000000000708150 R13: 0000000000000000 R14: 00007faa676209c0 R15: 00007faa67620700 Object at ffff88003b5bbcb8, in cache kmalloc-64 size: 64 Allocated: PID = 20909 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:513 set_track mm/kasan/kasan.c:525 [inline] kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616 kmem_cache_alloc_trace+0x82/0x270 mm/slub.c:2745 kmalloc include/linux/slab.h:490 [inline] kzalloc include/linux/slab.h:663 [inline] tcp_sendmsg_fastopen net/ipv4/tcp.c:1094 [inline] tcp_sendmsg+0x221a/0x39b0 net/ipv4/tcp.c:1139 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 SYSC_sendto+0x660/0x810 net/socket.c:1696 SyS_sendto+0x40/0x50 net/socket.c:1664 entry_SYSCALL_64_fastpath+0x1f/0xbe Freed: PID = 20909 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:513 set_track mm/kasan/kasan.c:525 [inline] kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589 slab_free_hook mm/slub.c:1357 [inline] slab_free_freelist_hook mm/slub.c:1379 [inline] slab_free mm/slub.c:2961 [inline] kfree+0xe8/0x2b0 mm/slub.c:3882 tcp_free_fastopen_req net/ipv4/tcp.c:1077 [inline] tcp_disconnect+0xc15/0x13e0 net/ipv4/tcp.c:2328 __inet_stream_connect+0x20c/0xf90 net/ipv4/af_inet.c:593 tcp_sendmsg_fastopen net/ipv4/tcp.c:1111 [inline] tcp_sendmsg+0x23a8/0x39b0 net/ipv4/tcp.c:1139 inet_sendmsg+0x164/0x5b0 net/ipv4/af_inet.c:762 sock_sendmsg_nosec net/socket.c:633 [inline] sock_sendmsg+0xca/0x110 net/socket.c:643 SYSC_sendto+0x660/0x810 net/socket.c:1696 SyS_sendto+0x40/0x50 net/socket.c:1664 entry_SYSCALL_64_fastpath+0x1f/0xbe Fixes: e994b2f0fb92 ("tcp: do not lock listener to process SYN packets") Fixes: 7db92362d2fe ("tcp: fix potential double free issue for fastopen_req") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-04forcedeth: remove unnecessary carrier status checkZhu Yanjun1-4/+2
Since netif_carrier_on() will do nothing if device's carrier is already on, so it's unnecessary to do carrier status check. It's the same for netif_carrier_off(). Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2017-05-03kasan: separate report parts by empty linesAndrey Konovalov1-0/+7
Makes the report easier to read. Link: http://lkml.kernel.org/r/20170302134851.101218-10-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03kasan: improve double-free report formatAndrey Konovalov3-18/+17
Changes double-free report header from BUG: Double free or freeing an invalid pointer Unexpected shadow byte: 0xFB to BUG: KASAN: double-free or invalid-free in kmalloc_oob_left+0xe5/0xef This makes a bug uniquely identifiable by the first report line. To account for removing of the unexpected shadow value, print shadow bytes at the end of the report as in reports for other kinds of bugs. Link: http://lkml.kernel.org/r/20170302134851.101218-9-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03kasan: print page description after stacksAndrey Konovalov1-6/+8
Moves page description after the stacks since it's less important. Link: http://lkml.kernel.org/r/20170302134851.101218-8-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03kasan: improve slab object descriptionAndrey Konovalov1-11/+42
Changes slab object description from: Object at ffff880068388540, in cache kmalloc-128 size: 128 to: The buggy address belongs to the object at ffff880068388540 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 123 bytes inside of 128-byte region [ffff880068388540, ffff8800683885c0) Makes it more explanatory and adds information about relative offset of the accessed address to the start of the object. Link: http://lkml.kernel.org/r/20170302134851.101218-7-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03kasan: change report headerAndrey Konovalov1-4/+4
Change report header format from: BUG: KASAN: use-after-free in unwind_get_return_address+0x28a/0x2c0 at addr ffff880069437950 Read of size 8 by task insmod/3925 to: BUG: KASAN: use-after-free in unwind_get_return_address+0x28a/0x2c0 Read of size 8 at addr ffff880069437950 by task insmod/3925 The exact access address is not usually important, so move it to the second line. This also makes the header look visually balanced. Link: http://lkml.kernel.org/r/20170302134851.101218-6-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03kasan: simplify address description logicAndrey Konovalov1-16/+21
Simplify logic for describing a memory address. Add addr_to_page() helper function. Makes the code easier to follow. Link: http://lkml.kernel.org/r/20170302134851.101218-5-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03kasan: change allocation and freeing stack traces headersAndrey Konovalov1-6/+4
Change stack traces headers from: Allocated: PID = 42 to: Allocated by task 42: Makes the report one line shorter and look better. Link: http://lkml.kernel.org/r/20170302134851.101218-4-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03kasan: unify report headersAndrey Konovalov1-13/+13
Unify KASAN report header format for different kinds of bad memory accesses. Makes the code simpler. Link: http://lkml.kernel.org/r/20170302134851.101218-3-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03kasan: introduce helper functions for determining bug typeAndrey Konovalov1-10/+30
Patch series "kasan: improve error reports", v2. This patchset improves KASAN reports by making them easier to read and a little more detailed. Also improves mm/kasan/report.c readability. Effectively changes a use-after-free report to: ================================================================== BUG: KASAN: use-after-free in kmalloc_uaf+0xaa/0xb6 [test_kasan] Write of size 1 at addr ffff88006aa59da8 by task insmod/3951 CPU: 1 PID: 3951 Comm: insmod Tainted: G B 4.10.0+ #84 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x292/0x398 print_address_description+0x73/0x280 kasan_report.part.2+0x207/0x2f0 __asan_report_store1_noabort+0x2c/0x30 kmalloc_uaf+0xaa/0xb6 [test_kasan] kmalloc_tests_init+0x4f/0xa48 [test_kasan] do_one_initcall+0xf3/0x390 do_init_module+0x215/0x5d0 load_module+0x54de/0x82b0 SYSC_init_module+0x3be/0x430 SyS_init_module+0x9/0x10 entry_SYSCALL_64_fastpath+0x1f/0xc2 RIP: 0033:0x7f22cfd0b9da RSP: 002b:00007ffe69118a78 EFLAGS: 00000206 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 0000555671242090 RCX: 00007f22cfd0b9da RDX: 00007f22cffcaf88 RSI: 000000000004df7e RDI: 00007f22d0399000 RBP: 00007f22cffcaf88 R08: 0000000000000003 R09: 0000000000000000 R10: 00007f22cfd07d0a R11: 0000000000000206 R12: 0000555671243190 R13: 000000000001fe81 R14: 0000000000000000 R15: 0000000000000004 Allocated by task 3951: save_stack_trace+0x16/0x20 save_stack+0x43/0xd0 kasan_kmalloc+0xad/0xe0 kmem_cache_alloc_trace+0x82/0x270 kmalloc_uaf+0x56/0xb6 [test_kasan] kmalloc_tests_init+0x4f/0xa48 [test_kasan] do_one_initcall+0xf3/0x390 do_init_module+0x215/0x5d0 load_module+0x54de/0x82b0 SYSC_init_module+0x3be/0x430 SyS_init_module+0x9/0x10 entry_SYSCALL_64_fastpath+0x1f/0xc2 Freed by task 3951: save_stack_trace+0x16/0x20 save_stack+0x43/0xd0 kasan_slab_free+0x72/0xc0 kfree+0xe8/0x2b0 kmalloc_uaf+0x85/0xb6 [test_kasan] kmalloc_tests_init+0x4f/0xa48 [test_kasan] do_one_initcall+0xf3/0x390 do_init_module+0x215/0x5d0 load_module+0x54de/0x82b0 SYSC_init_module+0x3be/0x430 SyS_init_module+0x9/0x10 entry_SYSCALL_64_fastpath+0x1f/0xc The buggy address belongs to the object at ffff88006aa59da0 which belongs to the cache kmalloc-16 of size 16 The buggy address is located 8 bytes inside of 16-byte region [ffff88006aa59da0, ffff88006aa59db0) The buggy address belongs to the page: page:ffffea0001aa9640 count:1 mapcount:0 mapping: (null) index:0x0 flags: 0x100000000000100(slab) raw: 0100000000000100 0000000000000000 0000000000000000 0000000180800080 raw: ffffea0001abe380 0000000700000007 ffff88006c401b40 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88006aa59c80: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc ffff88006aa59d00: 00 00 fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc >ffff88006aa59d80: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc ^ ffff88006aa59e00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc ffff88006aa59e80: fb fb fc fc 00 00 fc fc 00 00 fc fc 00 00 fc fc ================================================================== from: ================================================================== BUG: KASAN: use-after-free in kmalloc_uaf+0xaa/0xb6 [test_kasan] at addr ffff88006c4dcb28 Write of size 1 by task insmod/3984 CPU: 1 PID: 3984 Comm: insmod Tainted: G B 4.10.0+ #83 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: dump_stack+0x292/0x398 kasan_object_err+0x1c/0x70 kasan_report.part.1+0x20e/0x4e0 __asan_report_store1_noabort+0x2c/0x30 kmalloc_uaf+0xaa/0xb6 [test_kasan] kmalloc_tests_init+0x4f/0xa48 [test_kasan] do_one_initcall+0xf3/0x390 do_init_module+0x215/0x5d0 load_module+0x54de/0x82b0 SYSC_init_module+0x3be/0x430 SyS_init_module+0x9/0x10 entry_SYSCALL_64_fastpath+0x1f/0xc2 RIP: 0033:0x7feca0f779da RSP: 002b:00007ffdfeae5218 EFLAGS: 00000206 ORIG_RAX: 00000000000000af RAX: ffffffffffffffda RBX: 000055a064c13090 RCX: 00007feca0f779da RDX: 00007feca1236f88 RSI: 000000000004df7e RDI: 00007feca1605000 RBP: 00007feca1236f88 R08: 0000000000000003 R09: 0000000000000000 R10: 00007feca0f73d0a R11: 0000000000000206 R12: 000055a064c14190 R13: 000000000001fe81 R14: 0000000000000000 R15: 0000000000000004 Object at ffff88006c4dcb20, in cache kmalloc-16 size: 16 Allocated: PID = 3984 save_stack_trace+0x16/0x20 save_stack+0x43/0xd0 kasan_kmalloc+0xad/0xe0 kmem_cache_alloc_trace+0x82/0x270 kmalloc_uaf+0x56/0xb6 [test_kasan] kmalloc_tests_init+0x4f/0xa48 [test_kasan] do_one_initcall+0xf3/0x390 do_init_module+0x215/0x5d0 load_module+0x54de/0x82b0 SYSC_init_module+0x3be/0x430 SyS_init_module+0x9/0x10 entry_SYSCALL_64_fastpath+0x1f/0xc2 Freed: PID = 3984 save_stack_trace+0x16/0x20 save_stack+0x43/0xd0 kasan_slab_free+0x73/0xc0 kfree+0xe8/0x2b0 kmalloc_uaf+0x85/0xb6 [test_kasan] kmalloc_tests_init+0x4f/0xa48 [test_kasan] do_one_initcall+0xf3/0x390 do_init_module+0x215/0x5d0 load_module+0x54de/0x82b0 SYSC_init_module+0x3be/0x430 SyS_init_module+0x9/0x10 entry_SYSCALL_64_fastpath+0x1f/0xc2 Memory state around the buggy address: ffff88006c4dca00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc ffff88006c4dca80: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc >ffff88006c4dcb00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc ^ ffff88006c4dcb80: fb fb fc fc 00 00 fc fc fb fb fc fc fb fb fc fc ffff88006c4dcc00: fb fb fc fc fb fb fc fc fb fb fc fc fb fb fc fc ================================================================== This patch (of 9): Introduce get_shadow_bug_type() function, which determines bug type based on the shadow value for a particular kernel address. Introduce get_wild_bug_type() function, which determines bug type for addresses which don't have a corresponding shadow value. Link: http://lkml.kernel.org/r/20170302134851.101218-2-andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Acked-by: Dmitry Vyukov <dvyukov@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: hwpoison: call shake_page() after try_to_unmap() for mlocked pageNaoya Horiguchi1-0/+8
Memory error handler calls try_to_unmap() for error pages in various states. If the error page is a mlocked page, error handling could fail with "still referenced by 1 users" message. This is because the page is linked to and stays in lru cache after the following call chain. try_to_unmap_one page_remove_rmap clear_page_mlock putback_lru_page lru_cache_add memory_failure() calls shake_page() to hanlde the similar issue, but current code doesn't cover because shake_page() is called only before try_to_unmap(). So this patches adds shake_page(). Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace") Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop Link: http://lkml.kernel.org/r/1493197841-23986-3-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Xiaolong Ye <xiaolong.ye@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: hwpoison: call shake_page() unconditionallyNaoya Horiguchi2-18/+12
shake_page() is called before going into core error handling code in order to ensure that the error page is flushed from lru_cache lists where pages stay during transferring among LRU lists. But currently it's not fully functional because when the page is linked to lru_cache by calling activate_page(), its PageLRU flag is set and shake_page() is skipped. The result is to fail error handling with "still referenced by 1 users" message. When the page is linked to lru_cache by isolate_lru_page(), its PageLRU is clear, so that's fine. This patch makes shake_page() unconditionally called to avoild the failure. Fixes: 23a003bfd23ea9ea0b7756b920e51f64b284b468 ("mm/madvise: pass return code of memory_failure() to userspace") Link: http://lkml.kernel.org/r/20170417055948.GM31394@yexl-desktop Link: http://lkml.kernel.org/r/1493197841-23986-2-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: kernel test robot <lkp@intel.com> Cc: Xiaolong Ye <xiaolong.ye@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/swapfile.c: fix swap space leak in error path of swap_free_entries()Huang Ying1-2/+0
In swapcache_free_entries(), if swap_info_get_cont() returns NULL, something wrong occurs for the swap entry. But we should still continue to free the following swap entries in the array instead of skip them to avoid swap space leak. This is just problem in error path, where system may be in an inconsistent state, but it is still good to fix it. Link: http://lkml.kernel.org/r/20170421124739.24534-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Shaohua Li <shli@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/gup.c: fix access_ok() argument typeArnd Bergmann1-1/+1
MIPS just got changed to only accept a pointer argument for access_ok(), causing one warning in drivers/scsi/pmcraid.c. I tried changing x86 the same way and found the same warning in __get_user_pages_fast() and nowhere else in the kernel during randconfig testing: mm/gup.c: In function '__get_user_pages_fast': mm/gup.c:1578:6: error: passing argument 1 of '__chk_range_not_ok' makes pointer from integer without a cast [-Werror=int-conversion] It would probably be a good idea to enforce type-safety in general, so let's change this file to not cause a warning if we do that. I don't know why the warning did not appear on MIPS. Fixes: 2667f50e8b81 ("mm: introduce a general RCU get_user_pages_fast()") Link: http://lkml.kernel.org/r/20170421162659.3314521-1-arnd@arndb.de Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/truncate: avoid pointless cleancache_invalidate_inode() calls.Andrey Ryabinin1-5/+7
cleancache_invalidate_inode() called truncate_inode_pages_range() and invalidate_inode_pages2_range() twice - on entry and on exit. It's stupid and waste of time. It's enough to call it once at exit. Link: http://lkml.kernel.org/r/20170424164135.22350-5-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/truncate: bail out early from invalidate_inode_pages2_range() if mapping is emptyAndrey Ryabinin1-0/+3
If mapping is empty (both ->nrpages and ->nrexceptional is zero) we can avoid pointless lookups in empty radix tree and bail out immediately after cleancache invalidation. Link: http://lkml.kernel.org/r/20170424164135.22350-4-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03fs/block_dev: always invalidate cleancache in invalidate_bdev()Andrey Ryabinin1-6/+5
invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0 which doen't make any sense. Make sure that invalidate_bdev() always calls cleancache_invalidate_inode() regardless of mapping->nrpages value. Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") Link: http://lkml.kernel.org/r/20170424164135.22350-3-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03fs: fix data invalidation in the cleancache during direct IOAndrey Ryabinin2-25/+19
Patch series "Properly invalidate data in the cleancache", v2. We've noticed that after direct IO write, buffered read sometimes gets stale data which is coming from the cleancache. The reason for this is that some direct write hooks call call invalidate_inode_pages2[_range]() conditionally iff mapping->nrpages is not zero, so we may not invalidate data in the cleancache. Another odd thing is that we check only for ->nrpages and don't check for ->nrexceptional, but invalidate_inode_pages2[_range] also invalidates exceptional entries as well. So we invalidate exceptional entries only if ->nrpages != 0? This doesn't feel right. - Patch 1 fixes direct IO writes by removing ->nrpages check. - Patch 2 fixes similar case in invalidate_bdev(). Note: I only fixed conditional cleancache_invalidate_inode() here. Do we also need to add ->nrexceptional check in into invalidate_bdev()? - Patches 3-4: some optimizations. This patch (of 4): Some direct IO write fs hooks call invalidate_inode_pages2[_range]() conditionally iff mapping->nrpages is not zero. This can't be right, because invalidate_inode_pages2[_range]() also invalidate data in the cleancache via cleancache_invalidate_inode() call. So if page cache is empty but there is some data in the cleancache, buffered read after direct IO write would get stale data from the cleancache. Also it doesn't feel right to check only for ->nrpages because invalidate_inode_pages2[_range] invalidates exceptional entries as well. Fix this by calling invalidate_inode_pages2[_range]() regardless of nrpages state. Note: nfs,cifs,9p doesn't need similar fix because the never call cleancache_get_page() (nor directly, nor via mpage_readpage[s]()), so they are not affected by this bug. Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache") Link: http://lkml.kernel.org/r/20170424164135.22350-2-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexey Kuznetsov <kuznet@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Nikolay Borisov <n.borisov.lkml@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03zram: reduce load operation in page_same_filledSangwoo Park1-3/+5
In page_same_filled function, all elements in the page is compared with next index value. The current comparison routine compares the (i)th and (i+1)th values of the page. In this case, two load operaions occur for each comparison. But if we store first value of the page stores at 'val' variable and using it to compare with others, the load opearation is reduced. It reduce load operation per page by up to 64times. Link: http://lkml.kernel.org/r/1488428104-7257-1-git-send-email-sangwoo2.park@lge.com Signed-off-by: Sangwoo Park <sangwoo2.park@lge.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03zram: use zram_free_page instead of open-codedMinchan Kim1-14/+3
The zram_free_page already handles NULL handle case and same page so use it to reduce error probability. (Acutaully, I made a mistake when I handled same page feature) Link: http://lkml.kernel.org/r/1492052365-16169-7-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03zram: introduce zram data accessorMinchan Kim1-11/+22
With element, sometime I got confused handle and element access. It might be my bad but I think it's time to introduce accessor to prevent future idiot like me. This patch is just clean-up patch so it shouldn't change any behavior. Link: http://lkml.kernel.org/r/1492052365-16169-6-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03zram: remove zram_meta structureMinchan Kim2-117/+78
It's redundant now. Instead, remove it and use zram structure directly. Link: http://lkml.kernel.org/r/1492052365-16169-5-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03zram: use zram_slot_lock instead of raw bit_spin_lock opMinchan Kim1-14/+27
With this clean-up phase, I want to use zram's wrapper function to lock table access which is more consistent with other zram's functions. Link: http://lkml.kernel.org/r/1492052365-16169-4-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03zram: partial IO refactoringMinchan Kim1-153/+184
For architecture(PAGE_SIZE > 4K), zram have supported partial IO. However, the mixed code for handling normal/partial IO is too mess, error-prone to modify IO handler functions with upcoming feature so this patch aims for cleaning up zram's IO handling functions. Link: http://lkml.kernel.org/r/1492052365-16169-3-git-send-email-minchan@kernel.org Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Johannes Thumshirn <jthumshirn@suse.de> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03zram: handle multiple pages attached bio's bvecMinchan Kim1-29/+11
Patch series "zram clean up", v2. This patchset aims to clean up zram . [1] clean up multiple pages's bvec handling. [2] clean up partial IO handling [3-6] clean up zram via using accessor and removing pointless structure. With [2-6] applied, we can get a few hundred bytes as well as huge readibility enhance. x86: 708 byte save add/remove: 1/1 grow/shrink: 0/11 up/down: 478/-1186 (-708) function old new delta zram_special_page_read - 478 +478 zram_reset_device 317 314 -3 mem_used_max_store 131 128 -3 compact_store 96 93 -3 mm_stat_show 203 197 -6 zram_add 719 712 -7 zram_slot_free_notify 229 214 -15 zram_make_request 819 803 -16 zram_meta_free 128 111 -17 zram_free_page 180 151 -29 disksize_store 432 361 -71 zram_decompress_page.isra 504 - -504 zram_bvec_rw 2592 2080 -512 Total: Before=25350773, After=25350065, chg -0.00% ppc64: 231 byte save add/remove: 2/0 grow/shrink: 1/9 up/down: 681/-912 (-231) function old new delta zram_special_page_read - 480 +480 zram_slot_lock - 200 +200 vermagic 39 40 +1 mm_stat_show 256 248 -8 zram_meta_free 200 184 -16 zram_add 944 912 -32 zram_free_page 348 308 -40 disksize_store 572 492 -80 zram_decompress_page 664 564 -100 zram_slot_free_notify 292 160 -132 zram_make_request 1132 1000 -132 zram_bvec_rw 2768 2396 -372 Total: Before=17565825, After=17565594, chg -0.00% This patch (of 6): Johannes Thumshirn reported system goes the panic when using NVMe over Fabrics loopback target with zram. The reason is zram expects each bvec in bio contains a single page but nvme can attach a huge bulk of pages attached to the bio's bvec so that zram's index arithmetic could be wrong so that out-of-bound access makes system panic. [1] in mainline solved solved the problem by limiting max_sectors with SECTORS_PER_PAGE but it makes zram slow because bio should split with each pages so this patch makes zram aware of multiple pages in a bvec so it could solve without any regression(ie, bio split). [1] 0bc315381fe9, zram: set physical queue limits to avoid array out of bounds accesses Link: http://lkml.kernel.org/r/20170413134057.GA27499@bbox Signed-off-by: Minchan Kim <minchan@kernel.org> Reported-by: Johannes Thumshirn <jthumshirn@suse.de> Tested-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm, page_alloc: remove debug_guardpage_minorder() test in warn_alloc()Tetsuo Handa1-2/+1
Commit c0a32fc5a2e4 ("mm: more intensive memory corruption debugging") changed to check debug_guardpage_minorder() > 0 when reporting allocation failures. The reasoning was When we use guard page to debug memory corruption, it shrinks available pages to 1/2, 1/4, 1/8 and so on, depending on parameter value. In such case memory allocation failures can be common and printing errors can flood dmesg. If somebody debug corruption, allocation failures are not the things he/she is interested about. but this is misguided. Allocation requests with __GFP_NOWARN flag by definition do not cause flooding of allocation failure messages. Allocation requests with __GFP_NORETRY flag likely also have __GFP_NOWARN flag. Costly allocation requests likely also have __GFP_NOWARN flag. Allocation requests without __GFP_DIRECT_RECLAIM flag likely also have __GFP_NOWARN flag or __GFP_HIGH flag. Non-costly allocation requests with __GFP_DIRECT_RECLAIM flag basically retry forever due to the "too small to fail" memory-allocation rule. Therefore, as a whole, shrinking available pages by debug_guardpage_minorder= kernel boot parameter might cause flooding of OOM killer messages but unlikely causes flooding of allocation failure messages. Let's remove debug_guardpage_minorder() > 0 check which would likely be pointless. Link: http://lkml.kernel.org/r/1491910035-4231-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/memory-failure.c: add page flag description in error pathsAnshuman Khandual1-8/+8
It helps to provide page flag description along with the raw value in error paths during soft offline process. From sample experiments Before the patch: soft offline: 0x6100: migration failed 1, type 3ffff800008018 soft offline: 0x7400: migration failed 1, type 3ffff800008018 After the patch: soft offline: 0x5900: migration failed 1, type 3ffff800008018 (uptodate|dirty|head) soft offline: 0x6c00: migration failed 1, type 3ffff800008018 (uptodate|dirty|head) Link: http://lkml.kernel.org/r/20170409023829.10788-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/madvise: move up the behavior parameter validationAnshuman Khandual1-4/+9
madvise_behavior_valid() should be called before acting upon the behavior parameter. Hence move up the function. This also includes MADV_SOFT_OFFLINE and MADV_HWPOISON options as valid behavior parameter for the system call madvise(). Link: http://lkml.kernel.org/r/20170418052844.24891-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm/madvise.c: clean up MADV_SOFT_OFFLINE and MADV_HWPOISONAnshuman Khandual1-14/+20
This cleans up handling MADV_SOFT_OFFLINE and MADV_HWPOISON called through madvise() system call. * madvise_memory_failure() was misleading to accommodate handling of both memory_failure() as well as soft_offline_page() functions. Basically it handles memory error injection from user space which can go either way as memory failure or soft offline. Renamed as madvise_inject_error() instead. * Renamed struct page pointer 'p' to 'page'. * pr_info() was essentially printing PFN value but it said 'page' which was misleading. Made the process virtual address explicit. Before the patch: Soft offlining page 0x15e3e at 0x3fff8c230000 Soft offlining page 0x1f3 at 0x3fffa0da0000 Soft offlining page 0x744 at 0x3fff7d200000 Soft offlining page 0x1634d at 0x3fff95e20000 Soft offlining page 0x16349 at 0x3fff95e30000 Soft offlining page 0x1d6 at 0x3fff9e8b0000 Soft offlining page 0x5f3 at 0x3fff91bd0000 Injecting memory failure for page 0x15c8b at 0x3fff83280000 Injecting memory failure for page 0x16190 at 0x3fff83290000 Injecting memory failure for page 0x740 at 0x3fff9a2e0000 Injecting memory failure for page 0x741 at 0x3fff9a2f0000 After the patch: Soft offlining pfn 0x1484e at process virtual address 0x3fff883c0000 Soft offlining pfn 0x1484f at process virtual address 0x3fff883d0000 Soft offlining pfn 0x14850 at process virtual address 0x3fff883e0000 Soft offlining pfn 0x14851 at process virtual address 0x3fff883f0000 Soft offlining pfn 0x14852 at process virtual address 0x3fff88400000 Soft offlining pfn 0x14853 at process virtual address 0x3fff88410000 Soft offlining pfn 0x14854 at process virtual address 0x3fff88420000 Soft offlining pfn 0x1521c at process virtual address 0x3fff6bc70000 Injecting memory failure for pfn 0x10fcf at process virtual address 0x3fff86310000 Injecting memory failure for pfn 0x10fd0 at process virtual address 0x3fff86320000 Injecting memory failure for pfn 0x10fd1 at process virtual address 0x3fff86330000 Injecting memory failure for pfn 0x10fd2 at process virtual address 0x3fff86340000 Injecting memory failure for pfn 0x10fd3 at process virtual address 0x3fff86350000 Injecting memory failure for pfn 0x10fd4 at process virtual address 0x3fff86360000 Injecting memory failure for pfn 0x10fd5 at process virtual address 0x3fff86370000 Link: http://lkml.kernel.org/r/20170410084701.11248-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03Documentation: vm, add hugetlbfs reservation overviewMike Kravetz2-0/+531
Adding a brief overview of hugetlbfs reservation design and implementation as an aid to those making code modifications in this area. Link: http://lkml.kernel.org/r/1491586995-13085-1-git-send-email-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm, swap: remove unused function prototypeHuang Ying1-3/+0
This is a code cleanup patch, no functionality changes. There are 2 unused function prototype in swap.h, they are removed. Link: http://lkml.kernel.org/r/20170405071017.23677-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: memcontrol: use node page state naming scheme for memcgJohannes Weiner6-68/+68
The memory controllers stat function names are awkwardly long and arbitrarily different from the zone and node stat functions. The current interface is named: mem_cgroup_read_stat() mem_cgroup_update_stat() mem_cgroup_inc_stat() mem_cgroup_dec_stat() mem_cgroup_update_page_stat() mem_cgroup_inc_page_stat() mem_cgroup_dec_page_stat() This patch renames it to match the corresponding node stat functions: memcg_page_state() [node_page_state()] mod_memcg_state() [mod_node_state()] inc_memcg_state() [inc_node_state()] dec_memcg_state() [dec_node_state()] mod_memcg_page_state() [mod_node_page_state()] inc_memcg_page_state() [inc_node_page_state()] dec_memcg_page_state() [dec_node_page_state()] Link: http://lkml.kernel.org/r/20170404220148.28338-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: memcontrol: re-use node VM page state enumJohannes Weiner6-138/+123
The current duplication is a high-maintenance mess, and it's painful to add new items or query memcg state from the rest of the VM. This increases the size of the stat array marginally, but we should aim to track all these stats on a per-cgroup level anyway. Link: http://lkml.kernel.org/r/20170404220148.28338-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: memcontrol: re-use global VM event enumJohannes Weiner2-56/+42
The current duplication is a high-maintenance mess, and it's painful to add new items. This increases the size of the event array, but we'll eventually want most of the VM events tracked on a per-cgroup basis anyway. Link: http://lkml.kernel.org/r/20170404220148.28338-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: memcontrol: clean up memory.events counting functionJohannes Weiner3-18/+10
We only ever count single events, drop the @nr parameter. Rename the function accordingly. Remove low-information kerneldoc. Link: http://lkml.kernel.org/r/20170404220148.28338-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-05-03mm: vmscan: fix IO/refault regression in cache workingset transitionJohannes Weiner5-41/+150
Since commit 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") we noticed bigger IO spikes during changes in cache access patterns. The patch in question shrunk the inactive list size to leave more room for the current workingset in the presence of streaming IO. However, workingset transitions that previously happened on the inactive list are now pushed out of memory and incur more refaults to complete. This patch disables active list protection when refaults are being observed. This accelerates workingset transitions, and allows more of the new set to establish itself from memory, without eating into the ability to protect the established workingset during stable periods. The workloads that were measurably affected for us were hit pretty bad by it, with refault/majfault rates doubling and tripling during cache transitions, and the machines sustaining half-hour periods of 100% IO utilization, where they'd previously have sub-minute peaks at 60-90%. Stateful services that handle user data tend to be more conservative with kernel upgrades. As a result we hit most page cache issues with some delay, as was the case here. The severity seemed to warrant a stable tag. Fixes: 59dc76b0d4df ("mm: vmscan: reduce size of inactive file list") Link: http://lkml.kernel.org/r/20170404220052.27593-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: <stable@vger.kernel.org> [4.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>