aboutsummaryrefslogtreecommitdiffstats
path: root/security/integrity (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2014-03-09bnx2: Fix shutdown sequenceMichael Chan2-4/+38
The pci shutdown handler added in: bnx2: Add pci shutdown handler commit 25bfb1dd4ba3b2d9a49ce9d9b0cd7be1840e15ed created a shutdown down sequence without chip reset if the device was never brought up. This can cause the firmware to shutdown the PHY prematurely and cause MMIO read cycles to be unresponsive. On some systems, it may generate NMI in the bnx2's pci shutdown handler. The fix is to tell the firmware not to shutdown the PHY if there was no prior chip reset. Signed-off-by: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06ipv6: don't set DST_NOCOUNT for remotely added routesSabrina Dubroca1-1/+1
DST_NOCOUNT should only be used if an authorized user adds routes locally. In case of routes which are added on behalf of router advertisments this flag must not get used as it allows an unlimited number of routes getting added remotely. Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06net/mlx4_core: mlx4_init_slave() shouldn't access comm channel before PF is readyAmir Vadai1-0/+11
Currently, the PF call to pci_enable_sriov from the PF probe function stalls for 10 seconds times the number of VFs probed on the host. This happens because the way for such VFs to determine of the PF initialization finished, is by attempting to issue reset on the comm-channel and get timeout (after 10s). The PF probe function is called from a kenernel workqueue, and therefore during that time, rcu lock is being held and kernel's workqueue is stalled. This blocks other processes that try to use the workqueue or rcu lock. For example, interface renaming which is calling rcu_synchronize is blocked, and timedout by systemd. Changed mlx4_init_slave() to allow VF probed on the host to immediatly detect that the PF is not ready, and return EPROBE_DEFER instantly. Only when the PF finishes the initialization, allow such VFs to access the comm channel. This issue and fix are relevant only for probed VFs on the hypervisor, there is no way to pass this information to a VM until comm channel is ready, so in a VM, if PF is not ready, the first command will be timedout after 10 seconds and return EPROBE_DEFER. Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06net/mlx4_core: Fix memory access error in mlx4_QUERY_DEV_CAP_wrapper()Amir Vadai1-3/+3
Fix a regression introduced by [1]. outbox was accessed instead of outbox->buf. Typo was copy-pasted to [2] and [3]. [1] - cc1ade9 mlx4_core: Disable memory windows for virtual functions [2] - 4de6580 mlx4_core: Add support for steerable IB UD QPs [3] - 7ffdf72 net/mlx4_core: Add basic support for TCP/IP offloads under tunneling Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Amir Vadai <amirv@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06bonding: correctly handle out of range parameters for lp_intervalSasha Levin1-0/+1
We didn't correctly check cases where the value for lp_interval is not within the legal range due to a missing table terminator. This would let userspace trigger a kernel panic by specifying a value out of range: echo -1 > /sys/devices/virtual/net/bond0/bonding/lp_interval Introduced by commit 4325b374f84 ("bonding: convert lp_interval to use the new option API"). Acked-by: Nikolay Aleksandrov <nikolay@redhat.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06ipv6: Fix exthdrs offload registration.Anton Nayshtut1-2/+2
Without this fix, ipv6_exthdrs_offload_init doesn't register IPPROTO_DSTOPTS offload, but returns 0 (as the IPPROTO_ROUTING registration actually succeeds). This then causes the ipv6_gso_segment to drop IPv6 packets with IPPROTO_DSTOPTS header. The issue detected and the fix verified by running MS HCK Offload LSO test on top of QEMU Windows guests, as this test sends IPv6 packets with IPPROTO_DSTOPTS. Signed-off-by: Anton Nayshtut <anton@swortex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06ibmveth: Fix endian issues with MAC addressesAnton Blanchard2-10/+16
The code to load a MAC address into a u64 for passing to the hypervisor via a register is broken on little endian. Create a helper function called ibmveth_encode_mac_addr which does the right thing in both big and little endian. We were storing the MAC address in a long in struct ibmveth_adapter. It's never used so remove it - we don't need another place in the driver where we create endian issues with MAC addresses. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06net: unix socket code abuses csum_partialAnton Blanchard1-2/+1
The unix socket code is using the result of csum_partial to hash into a lookup table: unix_hash_fold(csum_partial(sunaddr, len, 0)); csum_partial is only guaranteed to produce something that can be folded into a checksum, as its prototype explains: * returns a 32-bit number suitable for feeding into itself * or csum_tcpudp_magic The 32bit value should not be used directly. Depending on the alignment, the ppc64 csum_partial will return different 32bit partial checksums that will fold into the same 16bit checksum. This difference causes the following testcase (courtesy of Gustavo) to sometimes fail: #include <sys/socket.h> #include <stdio.h> int main() { int fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0); int i = 1; setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &i, 4); struct sockaddr addr; addr.sa_family = AF_LOCAL; bind(fd, &addr, 2); listen(fd, 128); struct sockaddr_storage ss; socklen_t sslen = (socklen_t)sizeof(ss); getsockname(fd, (struct sockaddr*)&ss, &sslen); fd = socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0); if (connect(fd, (struct sockaddr*)&ss, sslen) == -1){ perror(NULL); return 1; } printf("OK\n"); return 0; } As suggested by davem, fix this by using csum_fold to fold the partial 32bit checksum into a 16bit checksum before using it. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: stable@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06net: Improve SO_TIMESTAMPING documentation and fix a minor code bugAndrew Lutomirski2-21/+32
The original documentation was very unclear. The code fix is presumably related to the formerly unclear documentation: SOCK_TIMESTAMPING_RX_SOFTWARE has no effect on __sock_recv_timestamp's behavior, so calling __sock_recv_ts_and_drops from sock_recv_ts_and_drops if only SOCK_TIMESTAMPING_RX_SOFTWARE is set is pointless. This should have no user-observable effect. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06phy: fix compiler array bounds warning on settings[]Bjorn Helgaas1-5/+6
With -Werror=array-bounds, gcc v4.7.x warns that in phy_find_valid(), the settings[] "array subscript is above array bounds", I think because idx is a signed integer and if the caller supplied idx < 0, we pass the guard but still reference out of bounds. Fix this by making idx unsigned here and elsewhere. Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06inet: frag: make sure forced eviction removes all fragsFlorian Westphal1-1/+1
Quoting Alexander Aring: While fragmentation and unloading of 6lowpan module I got this kernel Oops after few seconds: BUG: unable to handle kernel paging request at f88bbc30 [..] Modules linked in: ipv6 [last unloaded: 6lowpan] Call Trace: [<c012af4c>] ? call_timer_fn+0x54/0xb3 [<c012aef8>] ? process_timeout+0xa/0xa [<c012b66b>] run_timer_softirq+0x140/0x15f Problem is that incomplete frags are still around after unload; when their frag expire timer fires, we get crash. When a netns is removed (also done when unloading module), inet_frag calls the evictor with 'force' argument to purge remaining frags. The evictor loop terminates when accounted memory ('work') drops to 0 or the lru-list becomes empty. However, the mem accounting is done via percpu counters and may not be accurate, i.e. loop may terminate prematurely. Alter evictor to only stop once the lru list is empty when force is requested. Reported-by: Phoebe Buckheister <phoebe.buckheister@itwm.fraunhofer.de> Reported-by: Alexander Aring <alex.aring@gmail.com> Tested-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06tipc: don't log disabled tasklet handler errorsErik Hugne1-1/+0
Failure to schedule a TIPC tasklet with tipc_k_signal because the tasklet handler is disabled is not an error. It means TIPC is currently in the process of shutting down. We remove the error logging in this case. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06tipc: fix memory leak during module removalErik Hugne1-3/+34
When the TIPC module is removed, the tasklet handler is disabled before all other subsystems. This will cause lingering publications in the name table because the node_down tasklets responsible to clean up publications from an unreachable node will never run. When the name table is shut down, these publications are detected and an error message is logged: tipc: nametbl_stop(): orphaned hash chain detected This is actually a memory leak, introduced with commit 993b858e37b3120ee76d9957a901cca22312ffaa ("tipc: correct the order of stopping services at rmmod") Instead of just logging an error and leaking memory, we free the orphaned entries during nametable shutdown. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06tipc: drop subscriber connection id invalidationErik Hugne1-11/+0
When a topology server subscriber is disconnected, the associated connection id is set to zero. A check vs zero is then done in the subscription timeout function to see if the subscriber have been shut down. This is unnecessary, because all subscription timers will be cancelled when a subscriber terminates. Setting the connection id to zero is actually harmful because id zero is the identity of the topology server listening socket, and can cause a race that leads to this socket being closed instead. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Acked-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06tipc: avoid to unnecessary process switch under non-block modeYing Xue1-2/+2
When messages are received via tipc socket under non-block mode, schedule_timeout() is called in tipc_wait_for_rcvmsg(), that is, the process of receiving messages will be scheduled once although timeout value passed to schedule_timeout() is 0. The same issue exists in accept()/wait_for_accept(). To avoid this unnecessary process switch, we only call schedule_timeout() if the timeout value is non-zero. Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06tipc: fix connection refcount leakYing Xue1-2/+4
When tipc_conn_sendmsg() calls tipc_conn_lookup() to query a connection instance, its reference count value is increased if it's found. But subsequently if it's found that the connection is closed, the work of sending message is not queued into its server send workqueue, and the connection reference count is not decreased. This will cause a reference count leak. To reproduce this problem, an application would need to open and closes topology server connections with high intensity. We fix this by immediately decrementing the connection reference count if a send fails due to the connection being closed. Signed-off-by: Ying Xue <ying.xue@windriver.com> Acked-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06tipc: allow connection shutdown callback to be invoked in advanceYing Xue3-18/+7
Currently connection shutdown callback function is called when connection instance is released in tipc_conn_kref_release(), and receiving packets and sending packets are running in different threads. Even if connection is closed by the thread of receiving packets, its shutdown callback may not be called immediately as the connection reference count is non-zero at that moment. So, although the connection is shut down by the thread of receiving packets, the thread of sending packets doesn't know it. Before its shutdown callback is invoked to tell the sending thread its connection has been closed, the sending thread may deliver messages by tipc_conn_sendmsg(), this is why the following error information appears: "Sending subscription event failed, no memory" To eliminate it, allow connection shutdown callback function to be called before connection id is removed in tipc_close_conn(), which makes the sending thread know the truth in time that its socket is closed so that it doesn't send message to it. We also remove the "Sending XXX failed..." error reporting for topology and config services. Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06l2tp: fix userspace reception on plain L2TP socketsGuillaume Nault1-5/+7
As pppol2tp_recv() never queues up packets to plain L2TP sockets, pppol2tp_recvmsg() never returns data to userspace, thus making the recv*() system calls unusable. Instead of dropping packets when the L2TP socket isn't bound to a PPP channel, this patch adds them to its reception queue. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-06l2tp: fix manual sequencing (de)activation in L2TPv2Guillaume Nault4-3/+7
Commit e0d4435f "l2tp: Update PPP-over-L2TP driver to work over L2TPv3" broke the PPPOL2TP_SO_SENDSEQ setsockopt. The L2TP header length was previously computed by pppol2tp_l2t_header_len() before each call to l2tp_xmit_skb(). Now that header length is retrieved from the hdr_len session field, this field must be updated every time the L2TP header format is modified, or l2tp_xmit_skb() won't push the right amount of data for the L2TP header. This patch uses l2tp_session_set_header_len() to adjust hdr_len every time sequencing is (de)activated from userspace (either by the PPPOL2TP_SO_SENDSEQ setsockopt or the L2TP_ATTR_SEND_SEQ netlink attribute). Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05hyperv: Move state setting for link queryHaiyang Zhang2-1/+24
It moves the state setting for query into rndis_filter_receive_response(). All callbacks including query-complete and status-callback are synchronized by channel->inbound_lock. This prevents pentential race between them. Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05net: macb: DMA-unmap full rx-bufferSoren Brinkmann1-1/+1
When allocating RX buffers a fixed size is used, while freeing is based on actually received bytes, resulting in the following kernel warning when CONFIG_DMA_API_DEBUG is enabled: WARNING: CPU: 0 PID: 0 at lib/dma-debug.c:1051 check_unmap+0x258/0x894() macb e000b000.ethernet: DMA-API: device driver frees DMA memory with different size [device address=0x000000002d170040] [map size=1536 bytes] [unmap size=60 bytes] Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.14.0-rc3-xilinx-00220-g49f84081ce4f #65 [<c001516c>] (unwind_backtrace) from [<c0011df8>] (show_stack+0x10/0x14) [<c0011df8>] (show_stack) from [<c03c775c>] (dump_stack+0x7c/0xc8) [<c03c775c>] (dump_stack) from [<c00245cc>] (warn_slowpath_common+0x60/0x84) [<c00245cc>] (warn_slowpath_common) from [<c0024670>] (warn_slowpath_fmt+0x2c/0x3c) [<c0024670>] (warn_slowpath_fmt) from [<c0227d44>] (check_unmap+0x258/0x894) [<c0227d44>] (check_unmap) from [<c0228588>] (debug_dma_unmap_page+0x64/0x70) [<c0228588>] (debug_dma_unmap_page) from [<c02ab78c>] (gem_rx+0x118/0x170) [<c02ab78c>] (gem_rx) from [<c02ac4d4>] (macb_poll+0x24/0x94) [<c02ac4d4>] (macb_poll) from [<c031222c>] (net_rx_action+0x6c/0x188) [<c031222c>] (net_rx_action) from [<c0028a28>] (__do_softirq+0x108/0x280) [<c0028a28>] (__do_softirq) from [<c0028e8c>] (irq_exit+0x84/0xf8) [<c0028e8c>] (irq_exit) from [<c000f360>] (handle_IRQ+0x68/0x8c) [<c000f360>] (handle_IRQ) from [<c0008528>] (gic_handle_irq+0x3c/0x60) [<c0008528>] (gic_handle_irq) from [<c0012904>] (__irq_svc+0x44/0x78) Exception stack(0xc056df20 to 0xc056df68) df20: 00000001 c0577430 00000000 c0577430 04ce8e0d 00000002 edfce238 00000000 df40: 04e20f78 00000002 c05981f4 00000000 00000008 c056df68 c0064008 c02d7658 df60: 20000013 ffffffff [<c0012904>] (__irq_svc) from [<c02d7658>] (cpuidle_enter_state+0x54/0xf8) [<c02d7658>] (cpuidle_enter_state) from [<c02d77dc>] (cpuidle_idle_call+0xe0/0x138) [<c02d77dc>] (cpuidle_idle_call) from [<c000f660>] (arch_cpu_idle+0x8/0x3c) [<c000f660>] (arch_cpu_idle) from [<c006bec4>] (cpu_startup_entry+0xbc/0x124) [<c006bec4>] (cpu_startup_entry) from [<c053daec>] (start_kernel+0x350/0x3b0) ---[ end trace d5fdc38641bd3a11 ]--- Mapped at: [<c0227184>] debug_dma_map_page+0x48/0x11c [<c02ab32c>] gem_rx_refill+0x154/0x1f8 [<c02ac7b4>] macb_open+0x270/0x3e0 [<c03152e0>] __dev_open+0x7c/0xfc [<c031554c>] __dev_change_flags+0x8c/0x140 Fixing this by passing the same size which is passed during mapping the memory to the unmap function as well. Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05net: macb: Check DMA mappings for errorSoren Brinkmann1-2/+12
With CONFIG_DMA_API_DEBUG enabled the following warning is printed: WARNING: CPU: 0 PID: 619 at lib/dma-debug.c:1101 check_unmap+0x758/0x894() macb e000b000.ethernet: DMA-API: device driver failed to check map error[device address=0x000000002d171c02] [size=322 bytes] [mapped as single] Modules linked in: CPU: 0 PID: 619 Comm: udhcpc Not tainted 3.14.0-rc3-xilinx-00219-gd158fc7f36a2 #63 [<c001516c>] (unwind_backtrace) from [<c0011df8>] (show_stack+0x10/0x14) [<c0011df8>] (show_stack) from [<c03c7714>] (dump_stack+0x7c/0xc8) [<c03c7714>] (dump_stack) from [<c00245cc>] (warn_slowpath_common+0x60/0x84) [<c00245cc>] (warn_slowpath_common) from [<c0024670>] (warn_slowpath_fmt+0x2c/0x3c) [<c0024670>] (warn_slowpath_fmt) from [<c0228244>] (check_unmap+0x758/0x894) [<c0228244>] (check_unmap) from [<c0228588>] (debug_dma_unmap_page+0x64/0x70) [<c0228588>] (debug_dma_unmap_page) from [<c02aba64>] (macb_interrupt+0x1f8/0x2dc) [<c02aba64>] (macb_interrupt) from [<c006c6e4>] (handle_irq_event_percpu+0x2c/0x178) [<c006c6e4>] (handle_irq_event_percpu) from [<c006c86c>] (handle_irq_event+0x3c/0x5c) [<c006c86c>] (handle_irq_event) from [<c006f548>] (handle_fasteoi_irq+0xb8/0x100) [<c006f548>] (handle_fasteoi_irq) from [<c006c148>] (generic_handle_irq+0x20/0x30) [<c006c148>] (generic_handle_irq) from [<c000f35c>] (handle_IRQ+0x64/0x8c) [<c000f35c>] (handle_IRQ) from [<c0008528>] (gic_handle_irq+0x3c/0x60) [<c0008528>] (gic_handle_irq) from [<c0012904>] (__irq_svc+0x44/0x78) Exception stack(0xed197f60 to 0xed197fa8) 7f60: 00000134 60000013 bd94362e bd94362e be96b37c 00000014 fffffd72 00000122 7f80: c000ebe4 ed196000 00000000 00000011 c032c0d8 ed197fa8 c0064008 c000ea20 7fa0: 60000013 ffffffff [<c0012904>] (__irq_svc) from [<c000ea20>] (ret_fast_syscall+0x0/0x48) ---[ end trace 478f921d0d542d1e ]--- Mapped at: [<c0227184>] debug_dma_map_page+0x48/0x11c [<c02aaca0>] macb_start_xmit+0x184/0x2a8 [<c03143c0>] dev_hard_start_xmit+0x334/0x470 [<c032c09c>] sch_direct_xmit+0x78/0x2f8 [<c0314814>] __dev_queue_xmit+0x318/0x708 due to missing checks of the dma mapping. Add the appropriate checks to fix this. Signed-off-by: Soren Brinkmann <soren.brinkmann@xilinx.com> Reviewed-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05net: sctp: fix skb leakage in COOKIE ECHO path of chunk->auth_chunkDaniel Borkmann2-7/+2
While working on ec0223ec48a9 ("net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capable"), we noticed that there's a skb memory leakage in the error path. Running the same reproducer as in ec0223ec48a9 and by unconditionally jumping to the error label (to simulate an error condition) in sctp_sf_do_5_1D_ce() receive path lets kmemleak detector bark about the unfreed chunk->auth_chunk skb clone: Unreferenced object 0xffff8800b8f3a000 (size 256): comm "softirq", pid 0, jiffies 4294769856 (age 110.757s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 89 ab 75 5e d4 01 58 13 00 00 00 00 00 00 00 00 ..u^..X......... backtrace: [<ffffffff816660be>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8119f328>] kmem_cache_alloc+0xc8/0x210 [<ffffffff81566929>] skb_clone+0x49/0xb0 [<ffffffffa0467459>] sctp_endpoint_bh_rcv+0x1d9/0x230 [sctp] [<ffffffffa046fdbc>] sctp_inq_push+0x4c/0x70 [sctp] [<ffffffffa047e8de>] sctp_rcv+0x82e/0x9a0 [sctp] [<ffffffff815abd38>] ip_local_deliver_finish+0xa8/0x210 [<ffffffff815a64af>] nf_reinject+0xbf/0x180 [<ffffffffa04b4762>] nfqnl_recv_verdict+0x1d2/0x2b0 [nfnetlink_queue] [<ffffffffa04aa40b>] nfnetlink_rcv_msg+0x14b/0x250 [nfnetlink] [<ffffffff815a3269>] netlink_rcv_skb+0xa9/0xc0 [<ffffffffa04aa7cf>] nfnetlink_rcv+0x23f/0x408 [nfnetlink] [<ffffffff815a2bd8>] netlink_unicast+0x168/0x250 [<ffffffff815a2fa1>] netlink_sendmsg+0x2e1/0x3f0 [<ffffffff8155cc6b>] sock_sendmsg+0x8b/0xc0 [<ffffffff8155d449>] ___sys_sendmsg+0x369/0x380 What happens is that commit bbd0d59809f9 clones the skb containing the AUTH chunk in sctp_endpoint_bh_rcv() when having the edge case that an endpoint requires COOKIE-ECHO chunks to be authenticated: ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------> <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------- ------------------ AUTH; COOKIE-ECHO ----------------> <-------------------- COOKIE-ACK --------------------- When we enter sctp_sf_do_5_1D_ce() and before we actually get to the point where we process (and subsequently free) a non-NULL chunk->auth_chunk, we could hit the "goto nomem_init" path from an error condition and thus leave the cloned skb around w/o freeing it. The fix is to centrally free such clones in sctp_chunk_destroy() handler that is invoked from sctp_chunk_free() after all refs have dropped; and also move both kfree_skb(chunk->auth_chunk) there, so that chunk->auth_chunk is either NULL (since sctp_chunkify() allocs new chunks through kmem_cache_zalloc()) or non-NULL with a valid skb pointer. chunk->skb and chunk->auth_chunk are the only skbs in the sctp_chunk structure that need to be handeled. While at it, we should use consume_skb() for both. It is the same as dev_kfree_skb() but more appropriately named as we are not a device but a protocol. Also, this effectively replaces the kfree_skb() from both invocations into consume_skb(). Functions are the same only that kfree_skb() assumes that the frame was being dropped after a failure (e.g. for tools like drop monitor), usage of consume_skb() seems more appropriate in function sctp_chunk_destroy() though. Fixes: bbd0d59809f9 ("[SCTP]: Implement the receive and verification of AUTH chunk") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <yasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05r8152: disable the ECM modehayeswang4-255/+17
There are known issues for switching the drivers between ECM mode and vendor mode. The interrup transfer may become abnormal. The hardware may have the opportunity to die if you change the configuration without unloading the current driver first, because all the control transfers of the current driver would fail after the command of switching the configuration. Although to use the ecm driver and vendor driver independently is fine, it may have problems to change the driver from one to the other by switching the configuration. Additionally, now the vendor mode driver is more powerful than the ECM driver. Thus, disable the ECM mode driver, and let r8152 to set the configuration to vendor mode and reset the device automatically. Signed-off-by: Hayes Wang <hayeswang@realtek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05net/mlx4: Support shutdown() interfaceGavin Shan1-0/+1
In kexec scenario, we failed to load the mlx4 driver in the second kernel because the ownership bit was hold by the first kernel without release correctly. The patch adds shutdown() interface so that the ownership can be released correctly in the first kernel. It also helps avoiding EEH error happened during boot stage of the second kernel because of undesired traffic, which can't be handled by hardware during that stage on Power platform. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Tested-by: Wei Yang <weiyang@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05bridge: multicast: add sanity check for query source addressesLinus Lüssing1-0/+6
MLD queries are supposed to have an IPv6 link-local source address according to RFC2710, section 4 and RFC3810, section 5.1.14. This patch adds a sanity check to ignore such broken MLD queries. Without this check, such malformed MLD queries can result in a denial of service: The queries are ignored by any MLD listener therefore they will not respond with an MLD report. However, without this patch these malformed MLD queries would enable the snooping part in the bridge code, potentially shutting down the according ports towards these hosts for multicast traffic as the bridge did not learn about these listeners. Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Linus Lüssing <linus.luessing@web.de> Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-05net: fix for a race condition in the inet frag codeNikolay Aleksandrov1-1/+2
I stumbled upon this very serious bug while hunting for another one, it's a very subtle race condition between inet_frag_evictor, inet_frag_intern and the IPv4/6 frag_queue and expire functions (basically the users of inet_frag_kill/inet_frag_put). What happens is that after a fragment has been added to the hash chain but before it's been added to the lru_list (inet_frag_lru_add) in inet_frag_intern, it may get deleted (either by an expired timer if the system load is high or the timer sufficiently low, or by the fraq_queue function for different reasons) before it's added to the lru_list, then after it gets added it's a matter of time for the evictor to get to a piece of memory which has been freed leading to a number of different bugs depending on what's left there. I've been able to trigger this on both IPv4 and IPv6 (which is normal as the frag code is the same), but it's been much more difficult to trigger on IPv4 due to the protocol differences about how fragments are treated. The setup I used to reproduce this is: 2 machines with 4 x 10G bonded in a RR bond, so the same flow can be seen on multiple cards at the same time. Then I used multiple instances of ping/ping6 to generate fragmented packets and flood the machines with them while running other processes to load the attacked machine. *It is very important to have the _same flow_ coming in on multiple CPUs concurrently. Usually the attacked machine would die in less than 30 minutes, if configured properly to have many evictor calls and timeouts it could happen in 10 minutes or so. An important point to make is that any caller (frag_queue or timer) of inet_frag_kill will remove both the timer refcount and the original/guarding refcount thus removing everything that's keeping the frag from being freed at the next inet_frag_put. All of this could happen before the frag was ever added to the LRU list, then it gets added and the evictor uses a freed fragment. An example for IPv6 would be if a fragment is being added and is at the stage of being inserted in the hash after the hash lock is released, but before inet_frag_lru_add executes (or is able to obtain the lru lock) another overlapping fragment for the same flow arrives at a different CPU which finds it in the hash, but since it's overlapping it drops it invoking inet_frag_kill and thus removing all guarding refcounts, and afterwards freeing it by invoking inet_frag_put which removes the last refcount added previously by inet_frag_find, then inet_frag_lru_add gets executed by inet_frag_intern and we have a freed fragment in the lru_list. The fix is simple, just move the lru_add under the hash chain locked region so when a removing function is called it'll have to wait for the fragment to be added to the lru_list, and then it'll remove it (it works because the hash chain removal is done before the lru_list one and there's no window between the two list adds when the frag can get dropped). With this fix applied I couldn't kill the same machine in 24 hours with the same setup. Fixes: 3ef0eb0db4bf ("net: frag, move LRU list maintenance outside of rwlock") CC: Florian Westphal <fw@strlen.de> CC: Jesper Dangaard Brouer <brouer@redhat.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-04mm: page_alloc: exempt GFP_THISNODE allocations from zone fairnessJohannes Weiner1-4/+22
Jan Stancek reports manual page migration encountering allocation failures after some pages when there is still plenty of memory free, and bisected the problem down to commit 81c0a2bb515f ("mm: page_alloc: fair zone allocator policy"). The problem is that GFP_THISNODE obeys the zone fairness allocation batches on one hand, but doesn't reset them and wake kswapd on the other hand. After a few of those allocations, the batches are exhausted and the allocations fail. Fixing this means either having GFP_THISNODE wake up kswapd, or GFP_THISNODE not participating in zone fairness at all. The latter seems safer as an acute bugfix, we can clean up later. Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: <stable@kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04mm: numa: bugfix for LAST_CPUPID_NOT_IN_PAGE_FLAGSLiu Ping Fan1-2/+2
When doing some numa tests on powerpc, I triggered an oops bug. I find it is caused by using page->_last_cpupid. It should be initialized as "-1 & LAST_CPUPID_MASK", but not "-1". Otherwise, in task_numa_fault(), we will miss the checking (last_cpupid == (-1 & LAST_CPUPID_MASK)). And finally cause an oops bug in task_numa_group(), since the online cpu is less than possible cpu. This happen with CONFIG_SPARSE_VMEMMAP disabled Call trace: SMP NR_CPUS=64 NUMA PowerNV Modules linked in: CPU: 24 PID: 804 Comm: systemd-udevd Not tainted3.13.0-rc1+ #32 task: c000001e2746aa80 ti: c000001e32c50000 task.ti:c000001e32c50000 REGS: c000001e32c53510 TRAP: 0300 Not tainted(3.13.0-rc1+) MSR: 9000000000009032 <SF,HV,EE,ME,IR,DR,RI> CR:28024424 XER: 20000000 CFAR: c000000000009324 DAR: 7265717569726857 DSISR:40000000 SOFTE: 1 NIP .task_numa_fault+0x1470/0x2370 LR .task_numa_fault+0x1468/0x2370 Call Trace: .task_numa_fault+0x1468/0x2370 (unreliable) .do_numa_page+0x480/0x4a0 .handle_mm_fault+0x4ec/0xc90 .do_page_fault+0x3a8/0x890 handle_page_fault+0x10/0x30 Instruction dump: 3c82fefb 3884b138 48d9cff1 60000000 48000574 3c62fefb3863af78 3c82fefb 3884b138 48d9cfd5 60000000 e93f0100 <812902e4> 7d2907b45529063e 7d2a07b4 ---[ end trace 15f2510da5ae07cf ]--- Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04MAINTAINERS: add and correct types of some "T:" entriesJoe Perches1-8/+9
Tree location entries should start with the appropriate type. Add git to some, hg to another. Neaten tree type description. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04MAINTAINERS: use tab for separatorJoe Perches1-44/+44
Convert whitespace to single tab for separators. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04rapidio/tsi721: fix tasklet termination in dma channel releaseAlexandre Bounine2-9/+19
This patch is a modification of the patch originally proposed by Xiaotian Feng <xtfeng@gmail.com>: https://lkml.org/lkml/2012/11/5/413 This new version disables DMA channel interrupts and ensures that the tasklet wil not be scheduled again before calling tasklet_kill(). Unfortunately the updated patch was not released at that time due to planned rework of Tsi721 mport driver to use threaded interrupts (which has yet to happen). Recently the issue was reported again: https://lkml.org/lkml/2014/2/19/762. Description from the original Xiaotian's patch: "Some drivers use tasklet_disable in device remove/release process, tasklet_disable will inc tasklet->count and return. If the tasklet is not handled yet under some softirq pressure, the tasklet will be placed on the tasklet_vec, never have a chance to be excuted. This might lead to a heavy loaded ksoftirqd, wakeup with pending_softirq, but tasklet is disabled. tasklet_kill should be used in this case." This patch is applicable to kernel versions starting from v3.5. Signed-off-by: Alexandre Bounine <alexandre.bounine@idt.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Xiaotian Feng <xtfeng@gmail.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: Mike Galbraith <bitbucket@online.de> Cc: <stable@vger.kernel.org> [3.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04hfsplus: fix remount issueVyacheslav Dubeyko1-1/+1
Current implementation of HFS+ driver has small issue with remount option. Namely, for example, you are unable to remount from RO mode into RW mode by means of command "mount -o remount,rw /dev/loop0 /mnt/hfsplus". Trying to execute sequence of commands results in an error message: mount /dev/loop0 /mnt/hfsplus mount -o remount,ro /dev/loop0 /mnt/hfsplus mount -o remount,rw /dev/loop0 /mnt/hfsplus mount: you must specify the filesystem type mount -t hfsplus -o remount,rw /dev/loop0 /mnt/hfsplus mount: /mnt/hfsplus not mounted or bad option The reason of such issue is failure of mount syscall: mount("/dev/loop0", "/mnt/hfsplus", 0x2282a60, MS_MGC_VAL|MS_REMOUNT, NULL) = -1 EINVAL (Invalid argument) Namely, hfsplus_parse_options_remount() method receives empty "input" argument and return false in such case. As a result, hfsplus_remount() returns -EINVAL error code. This patch fixes the issue by means of return true for the case of empty "input" argument in hfsplus_parse_options_remount() method. Signed-off-by: Vyacheslav Dubeyko <slava@dubeyko.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04zram: avoid null access when fail to alloc metaMinchan Kim1-0/+2
zram_meta_alloc could fail so caller should check it. Otherwise, your system will hang. Signed-off-by: Minchan Kim <minchan@kernel.org> Acked-by: Jerome Marchand <jmarchan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04sh: prefix sh-specific "CCR" and "CCR2" by "SH_"Geert Uytterhoeven11-18/+20
Commit bcf24e1daa94 ("mmc: omap_hsmmc: use the generic config for omap2plus devices"), enabled the build for other platforms for compile testing. sh-allmodconfig now fails with: include/linux/omap-dma.h:171:8: error: expected identifier before numeric constant make[4]: *** [drivers/mmc/host/omap_hsmmc.o] Error 1 This happens because SuperH #defines "CCR", which is one of the enum values in include/linux/omap-dma.h. There's a similar issue with "CCR2" on sh2a. As "CCR" and "CCR2" are too generic names for global #defines, prefix them with "SH_" to fix this. Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04ocfs2: fix quota file corruptionJan Kara2-14/+17
Global quota files are accessed from different nodes. Thus we cannot cache offset of quota structure in the quota file after we drop our node reference count to it because after that moment quota structure may be freed and reallocated elsewhere by a different node resulting in corruption of quota file. Fix the problem by clearing dq_off when we are releasing dquot structure. We also remove the DB_READ_B handling because it is useless - DQ_ACTIVE_B is set iff DQ_READ_B is set. Signed-off-by: Jan Kara <jack@suse.cz> Cc: Goldwyn Rodrigues <rgoldwyn@suse.de> Cc: Joel Becker <jlbec@evilplan.org> Reviewed-by: Mark Fasheh <mfasheh@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04drivers/rtc/rtc-s3c.c: fix incorrect way of save/restore of S3C2410_TICNT for TYPE_S3C64XXVikas Sajjan1-5/+12
On exynos5250, exynos5420 and exynos5260 it was observed that, after 1 cycle of S2R, the rtc-tick occurs at a very fast rate as compared to the rtc-tick occuring before S2R. This patch fixes the above issue by correcting the wrong way of save/restore of S3C2410_TICNT for TYPE_S3C64XX. Signed-off-by: Vikas Sajjan <vikas.sajjan@samsung.com> Cc: Grant Likely <grant.likely@linaro.org> Cc: Rob Herring <robh+dt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04kallsyms: fix absolute addresses for kASLRAndy Honig1-2/+1
Currently symbols that are absolute addresses are incorrectly displayed in /proc/kallsyms if the kernel is loaded with kASLR. The problem was that the scripts/kallsyms.c file which generates the array of symbol names and addresses uses an relocatable value for all symbols, even absolute symbols. This patch fixes that. Several kallsyms output in different boot states for comparison: $ egrep '_(stext|_per_cpu_(start|end))' /root/kallsyms.nokaslr 0000000000000000 D __per_cpu_start 0000000000014280 D __per_cpu_end ffffffff810001c8 T _stext $ egrep '_(stext|_per_cpu_(start|end))' /root/kallsyms.kaslr1 000000001f200000 D __per_cpu_start 000000001f214280 D __per_cpu_end ffffffffa02001c8 T _stext $ egrep '_(stext|_per_cpu_(start|end))' /root/kallsyms.kaslr2 000000000d400000 D __per_cpu_start 000000000d414280 D __per_cpu_end ffffffff8e4001c8 T _stext $ egrep '_(stext|_per_cpu_(start|end))' /root/kallsyms.kaslr-fixed 0000000000000000 D __per_cpu_start 0000000000014280 D __per_cpu_end ffffffffadc001c8 T _stext Signed-off-by: Andy Honig <ahonig@google.com> Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04scripts/gen_initramfs_list.sh: fix flags for initramfs LZ4 compressionDaniel M. Weeks1-1/+1
LZ4 as implemented in the kernel differs from the default method now used by the reference implementation of LZ4. Until the in-kernel method is updated to support the new default, passing the legacy flag (-l) to the compressor is necessary. Without this flag the kernel-generated, LZ4-compressed initramfs is junk. Kyungsik said: : It seems that lz4 supports legacy format with the same option as lz4c : does. Just looking at the first few bytes of lz4 compressed image, we can : see whether it is new format or not. : : It shows new format magic number without this patch. New format magic : number is 0x184d2204. : : $ hexdump -C ./initramfs_data.cpio.lz4 |more : 00000000 04 22 4d 18 64 70 b9 69 (Little Endian) : ... : : Currently kernel supports legacy format only. Signed-off-by: Daniel M. Weeks <dan@danweeks.net> Cc: Michal Marek <mmarek@suse.cz> Acked-by: Kyungsik Lee <kyungsik.lee@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)lockingVlastimil Babka2-2/+2
Daniel Borkmann reported a VM_BUG_ON assertion failing: ------------[ cut here ]------------ kernel BUG at mm/mlock.c:528! invalid opcode: 0000 [#1] SMP Modules linked in: ccm arc4 iwldvm [...] video CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8 Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012 task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000 RIP: 0010:[<ffffffff81171ad0>] [<ffffffff81171ad0>] munlock_vma_pages_range+0x2e0/0x2f0 Call Trace: do_munmap+0x18f/0x3b0 vm_munmap+0x41/0x60 SyS_munmap+0x22/0x30 system_call_fastpath+0x1a/0x1f RIP munlock_vma_pages_range+0x2e0/0x2f0 ---[ end trace a0088dcf07ae10f2 ]--- because munlock_vma_pages_range() thinks it's unexpectedly in the middle of a THP page. This can be reproduced with default config since 3.11 kernels. A reproducer can be found in the kernel's selftest directory for networking by running ./psock_tpacket. The problem is that an order=2 compound page (allocated by alloc_one_pg_vec_page() is part of the munlocked VM_MIXEDMAP vma (mapped by packet_mmap()) and mistaken for a THP page and assumed to be order=9. The checks for THP in munlock came with commit ff6a6da60b89 ("mm: accelerate munlock() treatment of THP pages"), i.e. since 3.9, but did not trigger a bug. It just makes munlock_vma_pages_range() skip such compound pages until the next 512-pages-aligned page, when it encounters a head page. This is however not a problem for vma's where mlocking has no effect anyway, but it can distort the accounting. Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation and munlock+putback using pagevec") this can trigger a VM_BUG_ON in PageTransHuge() check. This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a list of flags that make vma's non-mlockable and non-mergeable. The reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is already on the VM_SPECIAL list, and both are intended for non-LRU pages where mlocking makes no sense anyway. Related Lkml discussion can be found in [2]. [1] tools/testing/selftests/net/psock_tpacket [2] https://lkml.org/lkml/2014/1/10/427 Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Reported-by: Daniel Borkmann <dborkman@redhat.com> Tested-by: Daniel Borkmann <dborkman@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: John David Anglin <dave.anglin@bell.net> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Jared Hulbert <jaredeh@gmail.com> Tested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> [3.11.x+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04memcg: reparent charges of children before processing parentFilipe Brandenburger1-1/+9
Sometimes the cleanup after memcg hierarchy testing gets stuck in mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0. There may turn out to be several causes, but a major cause is this: the workitem to offline parent can get run before workitem to offline child; parent's mem_cgroup_reparent_charges() circles around waiting for the child's pages to be reparented to its lrus, but it's holding cgroup_mutex which prevents the child from reaching its mem_cgroup_reparent_charges(). Further testing showed that an ordered workqueue for cgroup_destroy_wq is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched stage on the way can mess up the order before reaching the workqueue. Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on all its children (and grandchildren, in the correct order) to have their charges reparented first. Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction") Signed-off-by: Filipe Brandenburger <filbranden@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: <stable@vger.kernel.org> [v3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04memcg: fix endless loop in __mem_cgroup_iter_next()Hugh Dickins1-2/+2
Commit 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter") got the interaction with the commit a few before it d8ad30559715 ("mm/memcg: iteration skip memcgs not yet fully initialized") slightly wrong, and we didn't notice at the time. It's elusive, and harder to get than the original, but for a couple of days before rc1, I several times saw a endless loop similar to that supposedly being fixed. This time it was a tighter loop in __mem_cgroup_iter_next(): because we can get here when our root has already been offlined, and the ordering of conditions was such that we then just cycled around forever. Fixes: 0eef615665ed ("memcg: fix css reference leak and endless loop in mem_cgroup_iter"). Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Cc: <stable@vger.kernel.org> [3.12+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04lib/radix-tree.c: swapoff tmpfs radix_tree: remember to rcu_read_unlockHugh Dickins1-1/+3
Running fsx on tmpfs with concurrent memhog-swapoff-swapon, lots of BUG: sleeping function called from invalid context at kernel/fork.c:606 in_atomic(): 0, irqs_disabled(): 0, pid: 1394, name: swapoff 1 lock held by swapoff/1394: #0: (rcu_read_lock){.+.+.+}, at: [<ffffffff812520a1>] radix_tree_locate_item+0x1f/0x2b6 followed by ================================================ [ BUG: lock held when returning to user space! ] 3.14.0-rc1 #3 Not tainted ------------------------------------------------ swapoff/1394 is leaving the kernel with locks still held! 1 lock held by swapoff/1394: #0: (rcu_read_lock){.+.+.+}, at: [<ffffffff812520a1>] radix_tree_locate_item+0x1f/0x2b6 after which the system recovered nicely. Whoops, I long ago forgot the rcu_read_unlock() on one unlikely branch. Fixes e504f3fdd63d ("tmpfs radix_tree: locate_item to speed up swapoff") Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04dma debug: account for cachelines and read-only mappings in overlap trackingDan Williams1-46/+85
While debug_dma_assert_idle() checks if a given *page* is actively undergoing dma the valid granularity of a dma mapping is a *cacheline*. Sander's testing shows that the warning message "DMA-API: exceeded 7 overlapping mappings of pfn..." is falsely triggering. The test is simply mapping multiple cachelines in a given page. Ultimately we want overlap tracking to be valid as it is a real api violation, so we need to track active mappings by cachelines. Update the active dma tracking to use the page-frame-relative cacheline of the mapping as the key, and update debug_dma_assert_idle() to check for all possible mapped cachelines for a given page. However, the need to track active mappings is only relevant when the dma-mapping is writable by the device. In fact it is fairly standard for read-only mappings to have hundreds or thousands of overlapping mappings at once. Limiting the overlap tracking to writable (!DMA_TO_DEVICE) eliminates this class of false-positive overlap reports. Note, the radix gang lookup is sub-optimal. It would be best if it stopped fetching entries once the search passed a page boundary. Nevertheless, this implementation does not perturb the original net_dma failing case. That is to say the extra overhead does not show up in terms of making the failing case pass due to a timing change. References: http://marc.info/?l=linux-netdev&m=139232263419315&w=2 http://marc.info/?l=linux-netdev&m=139217088107122&w=2 Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Reported-by: Dave Jones <davej@redhat.com> Tested-by: Dave Jones <davej@redhat.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Francois Romieu <romieu@fr.zoreil.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Wei Liu <wei.liu2@citrix.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04mm: close PageTail raceDavid Rientjes9-55/+25
Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned pages") introduces page_count(page) into memory compaction which dereferences page->first_page if PageTail(page). This results in a very rare NULL pointer dereference on the aforementioned page_count(page). Indeed, anything that does compound_head(), including page_count() is susceptible to racing with prep_compound_page() and seeing a NULL or dangling page->first_page pointer. This patch uses Andrea's implementation of compound_trans_head() that deals with such a race and makes it the default compound_head() implementation. This includes a read memory barrier that ensures that if PageTail(head) is true that we return a head page that is neither NULL nor dangling. The patch then adds a store memory barrier to prep_compound_page() to ensure page->first_page is set. This is the safest way to ensure we see the head page that we are expecting, PageTail(page) is already in the unlikely() path and the memory barriers are unfortunately required. Hugetlbfs is the exception, we don't enforce a store memory barrier during init since no race is possible. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Holger Kiehl <Holger.Kiehl@dwd.de> Cc: Christoph Lameter <cl@linux.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-04MAINTAINERS: EDAC: add Mauro and Borislav as interim patch collectorsBorislav Petkov1-0/+2
We're more or less collecting EDAC patches already anyway so let's hold it down so that get_maintainer sees it too. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Mauro Carvalho Chehab <m.chehab@samsung.com> Cc: Doug Thompson <dougthompson@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-03-03macvlan: Add support for 'always_on' offload featuresVlad Yasevich1-2/+5
Macvlan currently inherits all of its features from the lower device. When lower device disables offload support, this causes macvlan to disable offload support as well. This causes performance regression when using macvlan/macvtap in bridge mode. It can be easily demonstrated by creating 2 namespaces using macvlan in bridge mode and running netperf between them: MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 20.00 1204.61 To restore the performance, we add software offload features to the list of "always_on" features for macvlan. This way when a namespace or a guest using macvtap initially sends a packet, this packet will not be segmented at macvlan level. It will only be segmented when macvlan sends the packet to the lower device. MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.0.0.1 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 20.00 5507.35 Fixes: 6acf54f1cf0a6747bac9fea26f34cfc5a9029523 (macvtap: Add support of packet capture on macvtap device.) Fixes: 797f87f83b60685ff8a13fa0572d2f10393c50d3 (macvlan: fix netdev feature propagation from lower device) CC: Florian Westphal <fw@strlen.de> CC: Christian Borntraeger <borntraeger@de.ibm.com> CC: Jason Wang <jasowang@redhat.com> CC: Michael S. Tsirkin <mst@redhat.com> Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03net: sctp: fix sctp_sf_do_5_1D_ce to verify if we/peer is AUTH capableDaniel Borkmann1-0/+7
RFC4895 introduced AUTH chunks for SCTP; during the SCTP handshake RANDOM; CHUNKS; HMAC-ALGO are negotiated (CHUNKS being optional though): ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------> <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------- -------------------- COOKIE-ECHO --------------------> <-------------------- COOKIE-ACK --------------------- A special case is when an endpoint requires COOKIE-ECHO chunks to be authenticated: ---------- INIT[RANDOM; CHUNKS; HMAC-ALGO] ----------> <------- INIT-ACK[RANDOM; CHUNKS; HMAC-ALGO] --------- ------------------ AUTH; COOKIE-ECHO ----------------> <-------------------- COOKIE-ACK --------------------- RFC4895, section 6.3. Receiving Authenticated Chunks says: The receiver MUST use the HMAC algorithm indicated in the HMAC Identifier field. If this algorithm was not specified by the receiver in the HMAC-ALGO parameter in the INIT or INIT-ACK chunk during association setup, the AUTH chunk and all the chunks after it MUST be discarded and an ERROR chunk SHOULD be sent with the error cause defined in Section 4.1. [...] If no endpoint pair shared key has been configured for that Shared Key Identifier, all authenticated chunks MUST be silently discarded. [...] When an endpoint requires COOKIE-ECHO chunks to be authenticated, some special procedures have to be followed because the reception of a COOKIE-ECHO chunk might result in the creation of an SCTP association. If a packet arrives containing an AUTH chunk as a first chunk, a COOKIE-ECHO chunk as the second chunk, and possibly more chunks after them, and the receiver does not have an STCB for that packet, then authentication is based on the contents of the COOKIE-ECHO chunk. In this situation, the receiver MUST authenticate the chunks in the packet by using the RANDOM parameters, CHUNKS parameters and HMAC_ALGO parameters obtained from the COOKIE-ECHO chunk, and possibly a local shared secret as inputs to the authentication procedure specified in Section 6.3. If authentication fails, then the packet is discarded. If the authentication is successful, the COOKIE-ECHO and all the chunks after the COOKIE-ECHO MUST be processed. If the receiver has an STCB, it MUST process the AUTH chunk as described above using the STCB from the existing association to authenticate the COOKIE-ECHO chunk and all the chunks after it. [...] Commit bbd0d59809f9 introduced the possibility to receive and verification of AUTH chunk, including the edge case for authenticated COOKIE-ECHO. On reception of COOKIE-ECHO, the function sctp_sf_do_5_1D_ce() handles processing, unpacks and creates a new association if it passed sanity checks and also tests for authentication chunks being present. After a new association has been processed, it invokes sctp_process_init() on the new association and walks through the parameter list it received from the INIT chunk. It checks SCTP_PARAM_RANDOM, SCTP_PARAM_HMAC_ALGO and SCTP_PARAM_CHUNKS, and copies them into asoc->peer meta data (peer_random, peer_hmacs, peer_chunks) in case sysctl -w net.sctp.auth_enable=1 is set. If in INIT's SCTP_PARAM_SUPPORTED_EXT parameter SCTP_CID_AUTH is set, peer_random != NULL and peer_hmacs != NULL the peer is to be assumed asoc->peer.auth_capable=1, in any other case asoc->peer.auth_capable=0. Now, if in sctp_sf_do_5_1D_ce() chunk->auth_chunk is available, we set up a fake auth chunk and pass that on to sctp_sf_authenticate(), which at latest in sctp_auth_calculate_hmac() reliably dereferences a NULL pointer at position 0..0008 when setting up the crypto key in crypto_hash_setkey() by using asoc->asoc_shared_key that is NULL as condition key_id == asoc->active_key_id is true if the AUTH chunk was injected correctly from remote. This happens no matter what net.sctp.auth_enable sysctl says. The fix is to check for net->sctp.auth_enable and for asoc->peer.auth_capable before doing any operations like sctp_sf_authenticate() as no key is activated in sctp_auth_asoc_init_active_key() for each case. Now as RFC4895 section 6.3 states that if the used HMAC-ALGO passed from the INIT chunk was not used in the AUTH chunk, we SHOULD send an error; however in this case it would be better to just silently discard such a maliciously prepared handshake as we didn't even receive a parameter at all. Also, as our endpoint has no shared key configured, section 6.3 says that MUST silently discard, which we are doing from now onwards. Before calling sctp_sf_pdiscard(), we need not only to free the association, but also the chunk->auth_chunk skb, as commit bbd0d59809f9 created a skb clone in that case. I have tested this locally by using netfilter's nfqueue and re-injecting packets into the local stack after maliciously modifying the INIT chunk (removing RANDOM; HMAC-ALGO param) and the SCTP packet containing the COOKIE_ECHO (injecting AUTH chunk before COOKIE_ECHO). Fixed with this patch applied. Fixes: bbd0d59809f9 ("[SCTP]: Implement the receive and verification of AUTH chunk") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Vlad Yasevich <yasevich@gmail.com> Cc: Neil Horman <nhorman@tuxdriver.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03ip_tunnel:multicast process cause panic due to skb->_skb_refdst NULL pointerXin Long1-1/+0
when ip_tunnel process multicast packets, it may check if the packet is looped back packet though 'rt_is_output_route(skb_rtable(skb))' in ip_tunnel_rcv(), but before that , skb->_skb_refdst has been dropped in iptunnel_pull_header(), so which leads to a panic. fix the bug: https://bugzilla.kernel.org/show_bug.cgi?id=70681 Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03net: cpsw: fix cpdma rx descriptor leak on down interfaceSchuyler Patton1-0/+6
This patch fixes a CPDMA RX Descriptor leak that occurs after taking the interface down when the CPSW is in Dual MAC mode. Previously the CPSW_ALE port was left open up which causes packets to be received and processed by the RX interrupt handler and were passed to the non active network interface where they were ignored. The fix is for the slave_stop function of the selected interface to disable the respective CPSW_ALE Port from forwarding packets. This blocks traffic from being received on the inactive interface. Signed-off-by: Schuyler Patton <spatton@ti.com> Reviewed-by: Felipe Balbi <balbi@ti.com> Signed-off-by: Mugunthan V N <mugunthanvnm@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>