aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/logo.txt (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2014-12-11net/macb: fix compilation warning for print_hex_dump() called with skb->mac_headerCyrille Pitchen1-1/+1
Signed-off-by: Cyrille Pitchen <cyrille.pitchen@atmel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4: Add support for A0 steeringMatan Barak7-19/+191
Add the required firmware commands for A0 steering and a way to enable that. The firmware support focuses on INIT_HCA, QUERY_HCA, QUERY_PORT, QUERY_DEV_CAP and QUERY_FUNC_CAP commands. Those commands are used to configure and query the device. The different A0 DMFS (steering) modes are: Static - optimized performance, but flow steering rules are limited. This mode should be choosed explicitly by the user in order to be used. Dynamic - this mode should be explicitly choosed by the user. In this mode, the FW works in optimized steering mode as long as it can and afterwards automatically drops to classic (full) DMFS. Disable - this mode should be explicitly choosed by the user. The user instructs the system not to use optimized steering, even if the FW supports Dynamic A0 DMFS (and thus will be able to use optimized steering in Default A0 DMFS mode). Default - this mode is implicitly choosed. In this mode, if the FW supports Dynamic A0 DMFS, it'll work in this mode. Otherwise, it'll work at Disable A0 DMFS mode. Under SRIOV configuration, when the A0 steering mode is enabled, older guest VF drivers who aren't using the RX QP allocation flag (MLX4_RESERVE_A0_QP) will get a QP from the general range and fail when attempting to register a steering rule. To avoid that, the PF context behaviour is changed once on A0 static mode, to require support for the allocation flag in VF drivers too. In order to enable A0 steering, we use log_num_mgm_entry_size param. If the value of the parameter is not positive, we treat the absolute value of log_num_mgm_entry_size as a bit field. Setting bit 2 of this bit field enables static A0 steering. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4: Refactor QUERY_PORTMatan Barak3-95/+154
Currently QUERY_PORT is done as a part of QUERY_DEV_CAP firmware command. Since we would like to use it without querying all device capabilities, extract this part to be a function of its own. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4_core: Add explicit error message when rule doesn't meet configurationMatan Barak1-3/+18
When a given flow steering rule is invalid in respect to the current steering configuration, print the correct error message to the system log. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4: Add A0 hybrid steeringMatan Barak8-25/+300
A0 hybrid steering is a form of high performance flow steering. By using this mode, mlx4 cards use a fast limited table based steering, in order to enable fast steering of unicast packets to a QP. In order to implement A0 hybrid steering we allocate resources from different zones: (1) General range (2) Special MAC-assigned QPs [RSS, Raw-Ethernet] each has its own region. When we create a rss QP or a raw ethernet (A0 steerable and BF ready) QP, we try hard to allocate the QP from range (2). Otherwise, we try hard not to allocate from this range. However, when the system is pushed to its limits and one needs every resource, the allocator uses every region it can. Meaning, when we run out of raw-eth qps, the allocator allocates from the general range (and the special-A0 area is no longer active). If we run out of RSS qps, the mechanism tries to allocate from the raw-eth QP zone. If that is also exhausted, the allocator will allocate from the general range (and the A0 region is no longer active). Note that if a raw-eth qp is allocated from the general range, it attempts to allocate the range such that bits 6 and 7 (blueflame bits) in the QP number are not set. When the feature is used in SRIOV, the VF has to notify the PF what kind of QP attributes it needs. In order to do that, along with the "Eth QP blueflame" bit, we reserve a new "A0 steerable QP". According to the combination of these bits, the PF tries to allocate a suitable QP. In order to maintain backward compatibility (with older PFs), the PF notifies which QP attributes it supports via QUERY_FUNC_CAP command. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4: Add mlx4_bitmap zone allocatorMatan Barak2-0/+451
The zone allocator is a mechanism which manages a few mlx4_bitmaps. When allocating a resource, the user indicates the desired zone of which this resource will be allocated from. If possible, the resource will be allocated from this zone. Otherwise, the resource will be allocated from a less-than, equal-to, higher-than priority zone, according to the desired zone's properties with that respective allocation order. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4: Add a check if there are too many reserved QPsDotan Barak1-1/+7
The number of reserved QPs is affected both from the firmware and from the driver's requirements. This patch adds a check that validates that this number is indeed feasable. Signed-off-by: Dotan Barak <dotanb@dev.mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4: Change QP allocation schemeEugenia Emantayev14-38/+137
When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset. The current Ethernet driver code reserves a Tx QP range with 256b alignment. This is wrong because if there are more than 64 Tx QPs in use, QPNs >= base + 65 will have bits 6/7 set. This problem is not specific for the Ethernet driver, any entity that tries to reserve more than 64 BF-enabled QPs should fail. Also, using ranges is not necessary here and is wasteful. The new mechanism introduced here will support reservation for "Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs (when hypervisors support WC in VMs). The flow we use is: 1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation, and request "BF enabled QPs" if BF is supported for the function 2. In the ALLOC_RES FW command, change param1 to: a. param1[23:0] - number of QPs b. param1[31-24] - flags controlling QPs reservation Bit 31 refers to Eth blueflame supported QPs. Those QPs must have bits 6 and 7 unset in order to be used in Ethernet. Bits 24-30 of the flags are currently reserved. When a function tries to allocate a QP, it states the required attributes for this QP. Those attributes are considered "best-effort". If an attribute, such as Ethernet BF enabled QP, is a must-have attribute, the function has to check that attribute is supported before trying to do the allocation. In a lower layer of the code, mlx4_qp_reserve_range masks out the bits which are unsupported. If SRIOV is used, the PF validates those attributes and masks out unsupported attributes as well. In order to notify VFs which attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's mailbox is filled by the PF, which notifies which QP allocation attributes it supports. Signed-off-by: Eugenia Emantayev <eugenia@mellanox.co.il> Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4_core: Use tasklet for user-space CQ completion eventsMatan Barak5-2/+86
Previously, we've fired all our completion callbacks straight from our ISR. Some of those callbacks were lightweight (for example, mlx4_en's and IPoIB napi callbacks), but some of them did more work (for example, the user-space RDMA stack uverbs' completion handler). Besides that, doing more than the minimal work in ISR is generally considered wrong, it could even lead to a hard lockup of the system. Since when a lot of completion events are generated by the hardware, the loop over those events could be so long, that we'll get into a hard lockup by the system watchdog. In order to avoid that, add a new way of invoking completion events callbacks. In the interrupt itself, we add the CQs which receive completion event to a per-EQ list and schedule a tasklet. In the tasklet context we loop over all the CQs in the list and invoke the user callback. Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4_core: Mask out host side virtualization features for guestsOr Gerlitz1-1/+11
When VFs (guests in this context) issue the QUERY_DEV_CAP command, they need not be told that host side virtualization features such as VST, FSM (MAC anti-spoofing) and running > 80 VFs are supported by the device. Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11net/mlx4_en: Set csum level for encapsulated packetsOr Gerlitz1-1/+2
This was dropped by mistake for the napi_gro_frags flow, fix that. Fixes: dd65beac48a5 ('net/mlx4_en: Extend usage of napi_gro_frags') Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11be2net: Export tunnel offloads only when a VxLAN tunnel is createdSriharsha Basavapatna2-10/+33
The encapsulated offload flags shouldn't be unconditionally exported to the stack. The stack expects offloading to work across all tunnel types when those flags are set. This would break other tunnels (like GRE) since be2net currently supports tunnel offload for VxLAN only. Also, with VxLANs Skyhawk-R can offload only 1 UDP dport. If more than 1 UDP port is added, we should disable offloads in that case too. Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11gianfar: Fix dma check map error when DMA_API_DEBUG is enabledKevin Hao1-28/+56
We need to use dma_mapping_error() to check the dma address returned by dma_map_single/page(). Otherwise we would get warning like this: WARNING: at lib/dma-debug.c:1140 Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc2-next-20141029 #196 task: c0834300 ti: effe6000 task.ti: c0874000 NIP: c02b2c98 LR: c02b2c98 CTR: c030abc4 REGS: effe7d70 TRAP: 0700 Not tainted (3.18.0-rc2-next-20141029) MSR: 00021000 <CE,ME> CR: 22044022 XER: 20000000 GPR00: c02b2c98 effe7e20 c0834300 00000098 00021000 00000000 c030b898 00000003 GPR08: 00000001 00000000 00000001 749eec9d 22044022 1001abe0 00000020 ef278678 GPR16: ef278670 ef278668 ef278660 070a8040 c087f99c c08cdc60 00029000 c0840d44 GPR24: c08be6e8 c0840000 effe7e78 ef041340 00000600 ef114e10 00000000 c08be6e0 NIP [c02b2c98] check_unmap+0x51c/0x9e4 LR [c02b2c98] check_unmap+0x51c/0x9e4 Call Trace: [effe7e20] [c02b2c98] check_unmap+0x51c/0x9e4 (unreliable) [effe7e70] [c02b31d8] debug_dma_unmap_page+0x78/0x8c [effe7ed0] [c03d1640] gfar_clean_rx_ring+0x208/0x488 [effe7f40] [c03d1a9c] gfar_poll_rx_sq+0x3c/0xa8 [effe7f60] [c04f8714] net_rx_action+0xc0/0x178 [effe7f90] [c00435a0] __do_softirq+0x100/0x1fc [effe7fe0] [c0043958] irq_exit+0xa4/0xc8 [effe7ff0] [c000d14c] call_do_irq+0x24/0x3c [c0875e90] [c00048a0] do_IRQ+0x8c/0xf8 [c0875eb0] [c000ed10] ret_from_except+0x0/0x18 For TX, we need to unmap the pages which has already been mapped and free the skb before return. For RX, move the dma mapping and error check to gfar_new_skb(). We would reuse the original skb in the rx ring when either allocating skb failure or dma mapping error. Signed-off-by: Kevin Hao <haokexin@gmail.com> Signed-off-by: Claudiu Manoil <claudiu.manoil@freescale.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-11cxgb4/csiostor: Don't use MASTER_MUST for fw_hello callHariprasad Shenai3-16/+3
Remove use of calls into t4_fw_hello() with MASTER_MUST, which results in FW_HELLO_CMD_MASTERFORCE being set. The firmware doesn't support this and of course any existing PF Drivers will totally go for a toss. Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10net: fec: only enable mdio interrupt before phy device link upNimrod Andy1-1/+4
Before phy device link up, we only enable FEC mdio interrupt, which is more reasonable. Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10net: fec: clear all interrupt events to support i.MX6SXNimrod Andy1-1/+1
For i.MX6SX FEC controller, there have interrupt mask and event field extension. To support all SOCs FEC, we clear all interrupt events during MAVC initial process. Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10net: fec: reset fep link status in suspend functionNimrod Andy1-0/+6
On some i.MX6 serial boards, phy power and refrence clock are supplied or controlled by SOC. When do suspend/resume test, the power and clock are disabled, so phy device link down. For current driver, fep->link is still up status, which cause extra operation like below code. To avoid the dumy operation, we set fep->link to down when phy device is real down. ... if (fep->link) { napi_disable(&fep->napi); netif_tx_lock_bh(ndev); fec_stop(ndev); netif_tx_unlock_bh(ndev); napi_enable(&fep->napi); fep->link = phy_dev->link; status_change = 1; } ... Signed-off-by: Fugang Duan <B38611@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10net: sock: fix access via invalid file descriptorAlexei Starovoitov1-2/+2
0day robot reported the following crash: [ 21.233581] BUG: unable to handle kernel NULL pointer dereference at 0000000000000007 [ 21.234709] IP: [<ffffffff8156ebda>] sk_attach_bpf+0x39/0xc2 It's due to bpf_prog_get() returning ERR_PTR. Check it properly. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets") Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10net: introduce helper macro for_each_cmsghdrGu Zheng10-16/+15
Introduce helper macro for_each_cmsghdr as a wrapper of the enumerating cmsghdr from msghdr, just cleanup. Signed-off-by: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10exit: pidns: fix/update the comments in zap_pid_ns_processes()Oleg Nesterov1-4/+24
The comments in zap_pid_ns_processes() are not clear, we need to explain how this code actually works. 1. "Ignore SIGCHLD" looks like optimization but it is not, we also need this for correctness. 2. The comment above sys_wait4() could tell more. EXIT_ZOMBIE child is only possible if it has exited before we ignored SIGCHLD. Or if it is traced from the parent namespace, but in this case it will be reaped by debugger after detach, sys_wait4() acts as a synchronization point. 3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is outdated. Contrary to what it says we do not need to make sure they all go away after 0a01f2cc390e "pidns: Make the pidns proc mount/umount logic obvious". At the same time, we do need to wait for nr_hashed==init_pids, but the reasons are quite different and not obvious: setns(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exitingOleg Nesterov1-0/+2
alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it fails because disable_pid_allocation() was called by the exiting child_reaper. We could simply move get_pid_ns() down to successful return, but this fix tries to be as trivial as possible. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Serge Hallyn <serge.hallyn@ubuntu.com> Cc: Sterling Alexander <stalexan@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: exit_notify: re-use "dead" list to autoreap currentOleg Nesterov1-4/+2
After the previous change we can add just the exiting EXIT_DEAD task to the "dead" list and remove another release_task(tsk). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: call forget_original_parent() under tasklist_lockOleg Nesterov1-24/+23
Shift "release dead children" loop from forget_original_parent() to its caller, exit_notify(). It is safe to reap them even if our parent reaps us right after we drop tasklist_lock, those children no longer have any connection to the exiting task. And this allows us to avoid write_lock_irq(tasklist_lock) right after it was released by forget_original_parent(), we can simply call it with tasklist_lock held. While at it, move the comment about forget_original_parent() up to this function. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: avoid find_new_reaper() if no childrenOleg Nesterov1-0/+3
Now that pid_ns logic was isolated we can change forget_original_parent() to return right after find_child_reaper() when father->children is empty, there is nothing to reparent in this case. In particular this avoids find_alive_thread() and this can help if the whole process exits and it has a lot of PF_EXITING threads at the start of the thread list, this can easily lead to O(nr_threads ** 2) iterations. Trivial test case (tested under KVM, 2 CPUs): static void *tfunc(void *arg) { pause(); return NULL; } static int child(unsigned int nt) { pthread_t pt; while (nt--) assert(pthread_create(&pt, NULL, tfunc, NULL) == 0); pthread_kill(pt, SIGTRAP); pause(); return 0; } int main(int argc, const char *argv[]) { int stat; unsigned int nf = atoi(argv[1]); unsigned int nt = atoi(argv[2]); while (nf--) { if (!fork()) return child(nt); wait(&stat); assert(stat == SIGTRAP); } return 0; } $ time ./test 16 16536 shows: real user sys - 5m37.628s 0m4.437s 8m5.560s + 0m50.032s 0m7.130s 1m4.927s Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: introduce find_alive_thread()Oleg Nesterov1-13/+19
Add the new simple helper to factor out the for_each_thread() code in find_child_reaper() and find_new_reaper(). It can also simplify the potential PF_EXITING -> exit_state change, plus perhaps we can change this code to take SIGNAL_GROUP_EXIT into account. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: introduce find_child_reaper()Oleg Nesterov1-21/+35
find_new_reaper() does 2 completely different things. Not only it finds a reaper, it also updates pid_ns->child_reaper or kills the whole namespace if the caller is ->child_reaper. Now that has_child_subreaper logic doesn't depend on child_reaper check we can move that pid_ns code into a separate helper. IMHO this makes the code more clean, and this allows the next changes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: document the ->has_child_subreaper checksOleg Nesterov1-8/+6
Swap the "init_task" and same_thread_group() checks. This way it is more simple to document these checks and we can remove the link to the previous discussion on lkml. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: s/while_each_thread/for_each_thread/ in find_new_reaper()Oleg Nesterov1-5/+3
Change find_new_reaper() to use for_each_thread() instead of deprecated while_each_thread(). We do not bother to check "thread != father" in the 1st loop, we can rely on PF_EXITING check. Note: this means the minor behavioural change: for_each_thread() starts from the group leader. But this should be fine, nobody should make any assumption about do_wait(__WNOTHREAD) when it comes to reparented tasks. And this can avoid the pointless reparenting to a short-living thread While zombie leaders are not that common. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: fix the cross-namespace PR_SET_CHILD_SUBREAPER reparentingOleg Nesterov1-2/+4
find_new_reaper() assumes that "has_child_subreaper" logic is safe as long as we are not the exiting ->child_reaper and this is doubly wrong: 1. In fact it is safe if "pid_ns->child_reaper == father"; there must be no children after zap_pid_ns_processes() returns, so it doesn't matter what we return in this case and even pid_ns->child_reaper is wrong otherwise: we can't reparent to ->child_reaper == current. This is not a bug, but this is confusing. 2. It is not safe if we are not pid_ns->child_reaper but from the same thread group. We drop tasklist_lock before zap_pid_ns_processes(), so another thread can lock it and choose the new reaper from the upper namespace if has_child_subreaper == T, and this is obviously wrong. This is not that bad, zap_pid_ns_processes() won't return until the the new reaper reaps all zombies, but this should be fixed anyway. We could change for_each_thread() loop to use ->exit_state instead of PF_EXITING which we had to use until 8aac62706ada, or we could change copy_signal() to check CLONE_NEWPID before setting has_child_subreaper, but lets change this code so that it is clear we can't look outside of our namespace, otherwise same_thread_group(reaper, child_reaper) check will look wrong and confusing anyway. We can simply start from "father" and fix the problem. We can't wrongly return a thread from the same thread group if ->is_child_subreaper == T, we know that all threads have PF_EXITING set. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: reparent: fix the dead-parent PR_SET_CHILD_SUBREAPER reparentingOleg Nesterov1-1/+1
The ->has_child_subreaper code in find_new_reaper() finds alive "thread" but returns another "reaper" thread which can be dead. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kay Sievers <kay@vrfy.org> Cc: Lennart Poettering <lennart@poettering.net> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: proc: don't try to flush /proc/tgid/task/tgidOleg Nesterov1-0/+3
proc_flush_task_mnt() always tries to flush task/pid, but this is pointless if we reap the leader. d_invalidate() is recursive, and if nothing else the next d_hash_and_lookup(tgid) should fail anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: release_task: fix the comment about group leader accountingOleg Nesterov1-7/+4
Contrary to what the comment in __exit_signal() says we do account the group leader. Fix this and explain why. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: wait: drop tasklist_lock before psig->c* accountingOleg Nesterov1-7/+5
wait_task_zombie() no longer needs tasklist_lock to accumulate the psig->c* counters, we can drop it right after cmpxchg(exit_state). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: wait: don't use zombie->real_parentOleg Nesterov1-12/+11
1. wait_task_zombie() uses p->real_parent to get psig/siglock. This is correct but needs tasklist_lock, ->real_parent can exit. We can use "current" instead. This is our natural child, its parent must be our sub-thread. 2. Read psig/sig outside of ->siglock, ->signal is no longer protected by this lock. 3. Fix the outdated comments about tasklist_lock. We can not race with __exit_signal(), the whole thread group is dead, nobody but us can call it. Also clarify the usage of ->stats_lock and ->siglock. Note: thread_group_cputime_adjusted() is sub-optimal in this case, we probably want to export cputime_adjust() to avoid thread_group_cputime(). The comment says "all threads" but there are no other threads. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10exit: wait: cleanup the ptrace_reparented() checksOleg Nesterov1-8/+6
Now that EXIT_DEAD is the terminal state we can kill "int traced" variable and check "state == EXIT_DEAD" instead to cleanup the code. In particular, this way it is clear that the check obviously doesn't need tasklist_lock. Also fix the type of "unsigned long state", "long" was always wrong although this doesn't matter because cmpxchg/xchg uses typeof(*ptr). [akpm@linux-foundation.org: don't make me google the C Operator Precedence table] Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Rik van Riel <riel@redhat.com> Cc: Sterling Alexander <stalexan@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10usermodehelper: kill the kmod_thread_locker logicOleg Nesterov1-30/+3
Now that we do not call kernel_thread(CLONE_VFORK) from the worker thread we can not deadlock if do_execve() in turn triggers another call_usermodehelper(), we can remove the kmod_thread_locker code. Note: we should probably kill khelper_wq and simply use one of the global workqueues, say, system_unbound_wq, this special wq for umh buys nothing nowadays. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10usermodehelper: don't use CLONE_VFORK for ____call_usermodehelper()Oleg Nesterov1-9/+3
After "kernel/kmod: fix use-after-free of the sub_infostructure" CLONE_VFORK in __call_usermodehelper() buys nothing, we rely on on umh_complete() in ____call_usermodehelper() anyway. Remove it. This also eliminates the unnecessary sleep/wakeup in the likely case, and this allows the next change. While at it, kill the "int wait" locals in ____call_usermodehelper() and __call_usermodehelper(), they can safely use sub_info->wait. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10fs/hfs/catalog.c: fix comparison bug in hfs_cat_keycmpRasmus Villemoes1-6/+8
Relying on the sign (after casting to int) of the difference of two quantities for comparison is usually wrong. For example, should a-b turn out to be 2^31, the return value of cmp(a,b) is -2^31; but that would also be the return value from cmp(b, a). So a compares less than b and b compares less than a. One can also easily find three values a,b,c such that a compares less than b, b compares less than c, but a does not compare less than c. Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Vyacheslav Dubeyko <slava@dubeyko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() racesRyusuke Konishi2-11/+36
Same story as in commit 41080b5a2401 ("nfsd race fixes: ext2") (similar ext2 fix) except that nilfs2 needs to use insert_inode_locked4() instead of insert_inode_locked() and a bug of a check for dead inodes needs to be fixed. If nilfs_iget() is called from nfsd after nilfs_new_inode() calls insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode. If nilfs_iget() is called before nilfs_new_inode() calls insert_inode_locked4(), it will create an in-core inode and read its data from the on-disk inode. But, nilfs_iget() will find i_nlink equals zero and fail at nilfs_read_inode_common(), which will lead it to call iget_failed() and cleanly fail. However, this sanity check doesn't work as expected for reused on-disk inodes because they leave a non-zero value in i_mode field and it hinders the test of i_nlink. This patch also fixes the issue by removing the test on i_mode that nilfs2 doesn't need. Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10nilfs2: deletion of an unnecessary check before the function call "iput"Markus Elfring1-2/+1
The iput() function tests whether its argument is NULL and then returns immediately. Thus the test around the call is not needed. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10nilfs2: avoid duplicate segment construction for fsync()Andreas Rohner1-8/+2
This patch removes filemap_write_and_wait_range() from nilfs_sync_file(), because it triggers a data segment construction by calling nilfs_writepages() with WB_SYNC_ALL. A data segment construction does not remove the inode from the i_dirty list and it does not clear the NILFS_I_DIRTY flag. Therefore nilfs_inode_dirty() still returns true, which leads to an unnecessary duplicate segment construction in nilfs_sync_file(). A call to filemap_write_and_wait_range() is not needed, because NILFS2 does not rely on the generic writeback mechanisms. Instead it implements its own mechanism to collect all dirty pages and write them into segments. It is more efficient to initiate the segment construction directly in nilfs_sync_file() without the detour over filemap_write_and_wait_range(). Additionally the lock of i_mutex is not needed, because all code blocks that are protected by i_mutex are also protected by a NILFS transaction: Function i_mutex nilfs_transaction ------------------------------------------------------ nilfs_ioctl_setflags: yes yes nilfs_fiemap: yes no nilfs_write_begin: yes yes nilfs_write_end: yes yes nilfs_lookup: yes no nilfs_create: yes yes nilfs_link: yes yes nilfs_mknod: yes yes nilfs_symlink: yes yes nilfs_mkdir: yes yes nilfs_unlink: yes yes nilfs_rmdir: yes yes nilfs_rename: yes yes nilfs_setattr: yes yes For nilfs_lookup() i_mutex is held for the parent directory, to protect it from modification. The segment construction does not modify directory inodes, so no lock is needed. nilfs_fiemap() reads the block layout on the disk, by using nilfs_bmap_lookup_contig(). This is already protected by bmap->b_sem. Signed-off-by: Andreas Rohner <andreas.rohner@gmx.net> Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10rtc: refine rtc_timer_do_work() to consider other set alarm failuresXunlei Pang1-0/+13
rtc_timer_do_work() only judges -ETIME failure of__rtc_set_alarm(), but doesn't handle other failures like -EIO, -EBUSY, etc. If there is a failure other than -ETIME, the next rtc_timer will stay in the timerqueue. Then later rtc_timers will be enqueued directly because they have a later expires time, so the alarm irq will never be programmed. When such failures happen, this patch will retry __rtc_set_alarm(), if still can't program the alarm time, it will remove current rtc_timer from timerqueue and fetch next one, thus preventing it from affecting other rtc timers. Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: John Stultz <john.stultz@linaro.org> Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10rtc/ab8500: set uie_unsupported flagXunlei Pang1-0/+2
Currently, ab8500 doesn't set uie_unsupported of rtc_device, while it doesn't support UIE, see ab8500_rtc_set_alarm(). Thus, when going through rtc_update_irq_enable()->rtc_timer_enqueue(), there's a chance it has an alarm timer1 queued before which is going to fired, so this update timer2 will be queued because it isn't the leftmost one, which means rtc_timer_enqueue() will return 0. This will result in two problems: 1) UIE EMUL will not be used. 2) When the alarm timer1 is fired, in rtc_timer_do_work() timer2 will fail to set the alarm time, so this rtc will disfunctional due to timer2 with the earliest expires in the timerqueue. So, rtc drivers must set this flag if they don't support UIE. Signed-off-by: Xunlei Pang <pang.xunlei@linaro.org> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: John Stultz <john.stultz@linaro.org> Cc: Arnd Bergmann <arnd.bergmann@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10drivers/rtc/rtc-snvs: fix suspend/resumeSanchayan Maity1-1/+4
The alarm interrupt handler also reads registers which are part of SNVS and need clocks enabled. However, the resume function is called after IRQ's have been enabled, hence this leads to a abort: Unhandled fault: external abort on non-linefetch (0x1008) at 0x908c604c Internal error: : 1008 [#1] ARM Modules linked in: CPU: 0 PID: 421 Comm: sh Not tainted 3.18.0-rc5-00135-g0689c67-dirty #1592 task: 8e03e800 ti: 8cad8000 task.ti: 8cad8000 PC is at snvs_rtc_irq_handler+0x14/0x74 LR is at handle_irq_event_percpu+0x3c/0x144 Fix this by using the .{suspend/resume}_noirq callbacks instead of .{suspend/resume} . Signed-off-by: Sanchayan Maity <maitysanchayan@gmail.com> Cc: Shawn Guo <shawn.guo@linaro.org> Cc: Stefan Agner <stefan@agner.ch> Cc: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10drivers/rtc/rtc-snvs: add clock supportSanchayan Maity1-2/+32
Add clock enable and disable support for the SNVS peripheral, which is required for using the RTC within the SNVS block. The clock is not strictly enforced, as this would break the i.MX devices. The clocking for the i.MX devices seems to be enabled elsewhere and enabling RTC SNVS for Vybrid results in a crash. This patch adds the clock support but also makes it optional so Vybrid platform can use the clock if defined while making sure not to break i.MX. Signed-off-by: Sanchayan Maity <maitysanchayan@gmail.com> Cc: Shawn Guo <shawn.guo@linaro.org> Acked-by: Stefan Agner <stefan@agner.ch> Acked-by: Alessandro Zummo <a.zummo@towertech.it> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10rtc: omap: drop vendor-prefix from power-controller dt propertyJohan Hovold3-4/+4
Drop the vendor-prefix from the "ti,system-power-controller" device-tree property name. It has been agreed to make "system-power-controller" a standard property and to drop the vendor-prefix that is currently used by several drivers. Note that drivers that have used "<vendor>,system-power-controller" in a released kernel will need to support both versions. Signed-off-by: Johan Hovold <johan@kernel.org> Cc: Tony Lindgren <tony@atomide.com> Cc: Benot Cousson <bcousson@baylibre.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Felipe Balbi <balbi@ti.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10drivers/rtc/rtc-isl12057.c: report error code upon failure in dev_err() callsArnaud Ebalard1-8/+14
As pointed out by Mark, it is generally useful to log the error code when reporting a failure. This patch improves existing calls to dev_err() in ISL12057 driver to also report error code. Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Suggested-by: Mark Brown <broonie@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Peter Huewe <peter.huewe@infineon.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Grant Likely <grant.likely@linaro.org> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10drivers/rtc/rtc-isl12057.c: add proper handling of oscillator failure bitArnaud Ebalard1-14/+33
As suggested by Uwe, instead of clearing oscillator failure bit unconditionally at driver load, this patch adds proper handling of the flag. The driver now returns -ENODATA when reading time from the device and oscillator failure bit is set. The flag is now cleared only when the a new time value is pushed to the device. Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Reported-by: Uwe Kleine-König <uwe@kleine-koenig.org> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Peter Huewe <peter.huewe@infineon.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Grant Likely <grant.likely@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10drivers/rtc/rtc-isl12057.c: add support for century bitArnaud Ebalard1-4/+16
The month register of ISL12057 RTC chip includes a century bit which reports overflow of year register from 99 to 0. This bit can also be written, which allows using it to extend the time interval the chip can support from 99 to 199 years. This patch adds support for century overflow bit in tm to regs and regs to tm helpers in ISL12057 driver. This was tested by putting a device 100 years in the future (using a specific kernel due to the inability of userland tools such as date or hwclock to pass year 2038), rebooting on a kernel w/ this patch applied and verifying the device was still 100 years in the future. Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Suggested-by: Uwe Kleine-König <uwe@kleine-koenig.org> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Peter Huewe <peter.huewe@infineon.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Grant Likely <grant.likely@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10drivers/rtc/rtc-isl12057.c: fix masking of register valuesArnaud Ebalard1-2/+2
When Intersil ISL12057 support was added by commit 70e123373c05 ("rtc: Add support for Intersil ISL12057 I2C RTC chip"), two masks for time registers values imported from the device were either wrong or omitted, leading to additional bits from those registers to impact read values: - mask for hour register value when reading it in AM/PM mode. As AM/PM mode is not the usual mode used by the driver, this error would only have an impact on an externally configured RTC hour later read by the driver. - mask for month value. The lack of masking would provide an erroneous value if century bit is set. This patch fixes those two masks. Fixes: 70e123373c05 ("rtc: Add support for Intersil ISL12057 I2C RTC chip") Signed-off-by: Arnaud Ebalard <arno@natisbad.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Peter Huewe <peter.huewe@infineon.com> Cc: Linus Walleij <linus.walleij@linaro.org> Cc: Thierry Reding <treding@nvidia.com> Cc: Mark Brown <broonie@kernel.org> Cc: Grant Likely <grant.likely@linaro.org> Acked-by: Uwe Kleine-König <uwe@kleine-koenig.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>