aboutsummaryrefslogtreecommitdiffstats
path: root/lib/spinlock_debug.c (unfollow)
AgeCommit message (Collapse)AuthorFilesLines
2012-08-02um: Add arch/x86/um to MAINTAINERSRichard Weinberger1-0/+1
Signed-off-by: Richard Weinberger <richard@nod.at>
2012-08-02um: pass siginfo to guest processMartin Pärtel10-34/+71
UML guest processes now get correct siginfo_t for SIGTRAP, SIGFPE, SIGILL and SIGBUS. Specifically, si_addr and si_code are now correct where previously they were si_addr = NULL and si_code = 128. Signed-off-by: Martin Pärtel <martin.partel@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2012-08-02um: fix ubd_file_size for read-only filesMartin Pärtel1-1/+1
Made ubd_file_size not request write access. Fixes use of read-only images. Signed-off-by: Martin Pärtel <martin.partel@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>
2012-08-02um: pull interrupt_end() into userspace()Al Viro2-8/+6
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
2012-08-02um: split syscall_trace(), pass pt_regs to itAl Viro3-43/+34
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> [richard@nod.at: Fixed some minor build issues] Signed-off-by: Richard Weinberger <richard@nod.at>
2012-08-01um: switch UPT_SET_RETURN_VALUE and regs_return_value to pt_regsAl Viro3-5/+5
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Richard Weinberger <richard@nod.at>
2012-08-01MIPS: Loongson 2: Sort out clock managment.Ralf Baechle7-51/+33
For unexplainable reasons the Loongson 2 clock API was implemented in a module so fixing this involved shifting large amounts of code around. Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-08-01locks: remove unused lm_release_privateJ. Bruce Fields3-8/+1
In commit 3b6e2723f32d ("locks: prevent side-effects of locks_release_private before file_lock is initialized") we removed the last user of lm_release_private without removing the field itself. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-08-01MIPS: Loongson 1: more clk support and add select HAVE_CLKYoichi Yuasa2-0/+17
This fixes a redefinition of clk_*: arch/mips/loongson1/common/clock.c:23:13: error: redefinition of 'clk_get' include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here arch/mips/loongson1/common/clock.c:41:15: error: redefinition of 'clk_get_rate' include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here make[3]: *** [arch/mips/loongson1/common/clock.o] Error 1 Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org> Cc: linux-mips@linux-mips.org Reviewed-by: John Crispin <blogic@openwrt.org> Acked-by: Kelvin Cheung <keguang.zhang@gmail.com> Patchwork: https://patchwork.linux-mips.org/patch/4143/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-08-01MIPS: txx9: Fix redefinition of clk_* by adding select HAVE_CLKYoichi Yuasa1-0/+1
arch/mips/txx9/generic/setup.c:87:13: error: redefinition of 'clk_get' include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here arch/mips/txx9/generic/setup.c:97:5: error: redefinition of 'clk_enable' include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here arch/mips/txx9/generic/setup.c:103:6: error: redefinition of 'clk_disable' include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here arch/mips/txx9/generic/setup.c:108:15: error: redefinition of 'clk_get_rate' include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here arch/mips/txx9/generic/setup.c:114:6: error: redefinition of 'clk_put' include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here make[3]: *** [arch/mips/txx9/generic/setup.o] Error 1 Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org> Cc: linux-mips@linux-mips.org Reviewed-by: John Crispin <blogic@openwrt.org> Patchwork: https://patchwork.linux-mips.org/patch/4142/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-08-01MIPS: BCM63xx: Fix redefinition of clk_* by adding select HAVE_CLKYoichi Yuasa1-0/+1
arch/mips/bcm63xx/clk.c:249:5: error: redefinition of 'clk_enable' include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here arch/mips/bcm63xx/clk.c:259:6: error: redefinition of 'clk_disable' include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here arch/mips/bcm63xx/clk.c:268:15: error: redefinition of 'clk_get_rate' include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here arch/mips/bcm63xx/clk.c:275:13: error: redefinition of 'clk_get' include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here arch/mips/bcm63xx/clk.c:302:6: error: redefinition of 'clk_put' include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here make[2]: *** [arch/mips/bcm63xx/clk.o] Error 1 Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org> Cc: linux-mips@linux-mips.org Reviewed-by: John Crispin <blogic@openwrt.org> Patchwork: https://patchwork.linux-mips.org/patch/4141/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-08-01MIPS: AR7: Fix redefinition of clk_* by adding select HAVE_CLKYoichi Yuasa1-0/+1
arch/mips/ar7/clock.c:420:5: error: redefinition of 'clk_enable' include/linux/clk.h:295:19: note: previous definition of 'clk_enable' was here arch/mips/ar7/clock.c:426:6: error: redefinition of 'clk_disable' include/linux/clk.h:300:20: note: previous definition of 'clk_disable' was here arch/mips/ar7/clock.c:431:15: error: redefinition of 'clk_get_rate' include/linux/clk.h:302:29: note: previous definition of 'clk_get_rate' was here arch/mips/ar7/clock.c:437:13: error: redefinition of 'clk_get' include/linux/clk.h:281:27: note: previous definition of 'clk_get' was here arch/mips/ar7/clock.c:454:6: error: redefinition of 'clk_put' include/linux/clk.h:291:20: note: previous definition of 'clk_put' was here make[2]: *** [arch/mips/ar7/clock.o] Error 1 Signed-off-by: Yoichi Yuasa <yuasa@linux-mips.org> Cc: linux-mips@linux-mips.org Reviewed-by: John Crispin <blogic@openwrt.org> Acked-by: Florian Fainelli <florian@openwrt.org> Patchwork: https://patchwork.linux-mips.org/patch/4140/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-08-01MIPS: Lantiq: Platform specific CLK fixupJohn Crispin1-0/+5
As we use CLKDEV_LOOKUP but dont have support for COMMON_CLK yet, we need to provide our own version of of_clk_get_from_provider(). Signed-off-by: John Crispin <blogic@openwrt.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/4117/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-08-01MIPS: Lantiq: Add device_tree_init functionJohn Crispin1-0/+22
Add a lantiq specific version of device_tree_init. The generic MIPS version was removed by. commit 594e966bc412d64eec9282d28ce511bdd62fea39 Author: David Daney <david.daney@cavium.com> Date: Thu Jul 5 18:12:38 2012 +0200 MIPS: Prune some target specific code out of prom.c Signed-off-by: John Crispin <blogic@openwrt.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/4116/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-08-01MIPS: Lantiq: Fix interface clock and PCI control register offsetJohn Crispin1-21/+28
The XRX200 based SoC have a different register offset for the interface clock and PCI control registers. This patch detects the SoC and sets the register offset at runtime. This make PCI work on the VR9 SoC. Signed-off-by: John Crispin <blogic@openwrt.org> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/4113/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2012-08-01delousing target_core_file a bitAl Viro1-27/+5
* set_fs(KERNEL_DS) + getname() is probably the weirdest implementation of strdup() I've seen. Especially since they don't to copy it at all... * filp_open() never returns NULL; it's ERR_PTR(-E...) on failure. * file->f_dentry is never going to be NULL, TYVM. * match_strdup() + snprintf() + kfree() is a bloody weird way to spell match_strlcpy(). Pox on cargo-cult programmers... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-08-01DM RAID: Add support for MD RAID10Jonathan Brassow2-5/+116
Support the MD RAID10 personality through dm-raid.c Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2012-08-01block: remove dead func declarationYuanhan Liu1-1/+0
__generic_unplug_device() function is removed with commit 7eaceaccab5f40bbfda044629a6298616aeaed50, which forgot to remove the declaration at meantime. Here remove it. Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01block: add partition resize function to blkpg ioctlVivek Goyal5-9/+132
Add a new operation code (BLKPG_RESIZE_PARTITION) to the BLKPG ioctl that allows altering the size of an existing partition, even if it is currently in use. This patch converts hd_struct->nr_sects into sequence counter because One might extend a partition while IO is happening to it and update of nr_sects can be non-atomic on 32bit machines with 64bit sector_t. This can lead to issues like reading inconsistent size of a partition. Sequence counter have been used so that readers don't have to take bdev mutex lock as we call sector_in_part() very frequently. Now all the access to hd_struct->nr_sects should happen using sequence counter read/update helper functions part_nr_sects_read/part_nr_sects_write. There is one exception though, set_capacity()/get_capacity(). I think theoritically race should exist there too but this patch does not modify set_capacity()/get_capacity() due to sheer number of call sites and I am afraid that change might break something. I have left that as a TODO item. We can handle it later if need be. This patch does not introduce any new races as such w.r.t set_capacity()/get_capacity(). v2: Add CONFIG_LBDAF test to UP preempt case as suggested by Phillip. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Phillip Susi <psusi@ubuntu.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01block: uninitialized ioc->nr_tasks triggers WARN_ONOlof Johansson1-0/+1
Hi, I'm using the old-fashioned 'dump' backup tool, and I noticed that it spews the below warning as of 3.5-rc1 and later (3.4 is fine): [ 10.886893] ------------[ cut here ]------------ [ 10.886904] WARNING: at include/linux/iocontext.h:140 copy_process+0x1488/0x1560() [ 10.886905] Hardware name: Bochs [ 10.886906] Modules linked in: [ 10.886908] Pid: 2430, comm: dump Not tainted 3.5.0-rc7+ #27 [ 10.886908] Call Trace: [ 10.886911] [<ffffffff8107ce8a>] warn_slowpath_common+0x7a/0xb0 [ 10.886912] [<ffffffff8107ced5>] warn_slowpath_null+0x15/0x20 [ 10.886913] [<ffffffff8107c088>] copy_process+0x1488/0x1560 [ 10.886914] [<ffffffff8107c244>] do_fork+0xb4/0x340 [ 10.886918] [<ffffffff8108effa>] ? recalc_sigpending+0x1a/0x50 [ 10.886919] [<ffffffff8108f6b2>] ? __set_task_blocked+0x32/0x80 [ 10.886920] [<ffffffff81091afa>] ? __set_current_blocked+0x3a/0x60 [ 10.886923] [<ffffffff81051db3>] sys_clone+0x23/0x30 [ 10.886925] [<ffffffff8179bd73>] stub_clone+0x13/0x20 [ 10.886927] [<ffffffff8179baa2>] ? system_call_fastpath+0x16/0x1b [ 10.886928] ---[ end trace 32a14af7ee6a590b ]--- Reproducing is easy, I can hit it on a KVM system with a very basic config (x86_64 make defconfig + enable the drivers needed). To hit it, just install dump (on debian/ubuntu, not sure what the package might be called on Fedora), and: dump -o -f /tmp/foo / You'll see the warning in dmesg once it forks off the I/O process and starts dumping filesystem contents. I bisected it down to the following commit: commit f6e8d01bee036460e03bd4f6a79d014f98ba712e Author: Tejun Heo <tj@kernel.org> Date: Mon Mar 5 13:15:26 2012 -0800 block: add io_context->active_ref Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> It seems like the init of ioc->nr_tasks was removed in that patch, so it starts out at 0 instead of 1. Tejun, is the right thing here to add back the init, or should something else be done? The below patch removes the warning, but I haven't done any more extensive testing on it. Signed-off-by: Olof Johansson <olof@lixom.net> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01block: do not artificially constrain max_sectors for stacking driversMike Snitzer1-2/+1
blk_set_stacking_limits is intended to allow stacking drivers to build up the limits of the stacked device based on the underlying devices' limits. But defaulting 'max_sectors' to BLK_DEF_MAX_SECTORS (1024) doesn't allow the stacking driver to inherit a max_sectors larger than 1024 -- due to blk_stack_limits' use of min_not_zero. It is now clear that this artificial limit is getting in the way so change blk_set_stacking_limits's max_sectors to UINT_MAX (which allows stacking drivers like dm-multipath to inherit 'max_sectors' from the underlying paths). Reported-by: Vijay Chauhan <vijay.chauhan@netapp.com> Tested-by: Vijay Chauhan <vijay.chauhan@netapp.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-08-01ALSA: snd-usb: fix clock source validity indexDaniel Mack1-1/+2
uac_clock_source_is_valid() uses the control selector value to access the bmControls bitmap of the clock source unit. This is wrong, as control selector values start from 1, while the bitmap uses all available bits. In other words, "Clock Validity Control" is stored in D3..2, not D5..4 of the clock selector unit's bmControls. Signed-off-by: Daniel Mack <zonque@gmail.com> Reported-by: Andreas Koch <andreas@akdesigninc.com> Cc: stable@kernel.org Signed-off-by: Takashi Iwai <tiwai@suse.de>
2012-07-31rtc/rtc-88pm80x: remove unneed devm_kfreeDevendra Naga1-2/+0
devm_kzalloc() doesn't need a matching devm_kfree(), the freeing mechanism will trigger when driver unloads. Signed-off-by: Devendra Naga <devendra.aaru@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Ashish Jangam <ashish.jangam@kpitcummins.com> Cc: David Dajun Chen <dchen@diasemi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31rtc/rtc-88pm80x: assign ret only when rtc_register_driver failsDevendra Naga1-1/+1
At the probe we are assigning ret to return value of PTR_ERR right after the rtc_register_drive()r, as we would have done it in the if (IS_ERR(ptr)) check, since the function fails and goes inside that case Signed-off-by: Devendra Naga <devendra.aaru@gmail.com> Cc: Alessandro Zummo <a.zummo@towertech.it> Cc: Ashish Jangam <ashish.jangam@kpitcummins.com> Cc: David Dajun Chen <dchen@diasemi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: hugetlbfs: close race during teardown of hugetlbfs shared page tablesMel Gorman3-3/+38
If a process creates a large hugetlbfs mapping that is eligible for page table sharing and forks heavily with children some of whom fault and others which destroy the mapping then it is possible for page tables to get corrupted. Some teardowns of the mapping encounter a "bad pmd" and output a message to the kernel log. The final teardown will trigger a BUG_ON in mm/filemap.c. This was reproduced in 3.4 but is known to have existed for a long time and goes back at least as far as 2.6.37. It was probably was introduced in 2.6.20 by [39dde65c: shared page table for hugetlb page]. The messages look like this; [ ..........] Lots of bad pmd messages followed by this [ 127.164256] mm/memory.c:391: bad pmd ffff880412e04fe8(80000003de4000e7). [ 127.164257] mm/memory.c:391: bad pmd ffff880412e04ff0(80000003de6000e7). [ 127.164258] mm/memory.c:391: bad pmd ffff880412e04ff8(80000003de0000e7). [ 127.186778] ------------[ cut here ]------------ [ 127.186781] kernel BUG at mm/filemap.c:134! [ 127.186782] invalid opcode: 0000 [#1] SMP [ 127.186783] CPU 7 [ 127.186784] Modules linked in: af_packet cpufreq_conservative cpufreq_userspace cpufreq_powersave acpi_cpufreq mperf ext3 jbd dm_mod coretemp crc32c_intel usb_storage ghash_clmulni_intel aesni_intel i2c_i801 r8169 mii uas sr_mod cdrom sg iTCO_wdt iTCO_vendor_support shpchp serio_raw cryptd aes_x86_64 e1000e pci_hotplug dcdbas aes_generic container microcode ext4 mbcache jbd2 crc16 sd_mod crc_t10dif i915 drm_kms_helper drm i2c_algo_bit ehci_hcd ahci libahci usbcore rtc_cmos usb_common button i2c_core intel_agp video intel_gtt fan processor thermal thermal_sys hwmon ata_generic pata_atiixp libata scsi_mod [ 127.186801] [ 127.186802] Pid: 9017, comm: hugetlbfs-test Not tainted 3.4.0-autobuild #53 Dell Inc. OptiPlex 990/06D7TR [ 127.186804] RIP: 0010:[<ffffffff810ed6ce>] [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186809] RSP: 0000:ffff8804144b5c08 EFLAGS: 00010002 [ 127.186810] RAX: 0000000000000001 RBX: ffffea000a5c9000 RCX: 00000000ffffffc0 [ 127.186811] RDX: 0000000000000000 RSI: 0000000000000009 RDI: ffff88042dfdad00 [ 127.186812] RBP: ffff8804144b5c18 R08: 0000000000000009 R09: 0000000000000003 [ 127.186813] R10: 0000000000000000 R11: 000000000000002d R12: ffff880412ff83d8 [ 127.186814] R13: ffff880412ff83d8 R14: 0000000000000000 R15: ffff880412ff83d8 [ 127.186815] FS: 00007fe18ed2c700(0000) GS:ffff88042dce0000(0000) knlGS:0000000000000000 [ 127.186816] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 127.186817] CR2: 00007fe340000503 CR3: 0000000417a14000 CR4: 00000000000407e0 [ 127.186818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 127.186819] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 127.186820] Process hugetlbfs-test (pid: 9017, threadinfo ffff8804144b4000, task ffff880417f803c0) [ 127.186821] Stack: [ 127.186822] ffffea000a5c9000 0000000000000000 ffff8804144b5c48 ffffffff810ed83b [ 127.186824] ffff8804144b5c48 000000000000138a 0000000000001387 ffff8804144b5c98 [ 127.186825] ffff8804144b5d48 ffffffff811bc925 ffff8804144b5cb8 0000000000000000 [ 127.186827] Call Trace: [ 127.186829] [<ffffffff810ed83b>] delete_from_page_cache+0x3b/0x80 [ 127.186832] [<ffffffff811bc925>] truncate_hugepages+0x115/0x220 [ 127.186834] [<ffffffff811bca43>] hugetlbfs_evict_inode+0x13/0x30 [ 127.186837] [<ffffffff811655c7>] evict+0xa7/0x1b0 [ 127.186839] [<ffffffff811657a3>] iput_final+0xd3/0x1f0 [ 127.186840] [<ffffffff811658f9>] iput+0x39/0x50 [ 127.186842] [<ffffffff81162708>] d_kill+0xf8/0x130 [ 127.186843] [<ffffffff81162812>] dput+0xd2/0x1a0 [ 127.186845] [<ffffffff8114e2d0>] __fput+0x170/0x230 [ 127.186848] [<ffffffff81236e0e>] ? rb_erase+0xce/0x150 [ 127.186849] [<ffffffff8114e3ad>] fput+0x1d/0x30 [ 127.186851] [<ffffffff81117db7>] remove_vma+0x37/0x80 [ 127.186853] [<ffffffff81119182>] do_munmap+0x2d2/0x360 [ 127.186855] [<ffffffff811cc639>] sys_shmdt+0xc9/0x170 [ 127.186857] [<ffffffff81410a39>] system_call_fastpath+0x16/0x1b [ 127.186858] Code: 0f 1f 44 00 00 48 8b 43 08 48 8b 00 48 8b 40 28 8b b0 40 03 00 00 85 f6 0f 88 df fe ff ff 48 89 df e8 e7 cb 05 00 e9 d2 fe ff ff <0f> 0b 55 83 e2 fd 48 89 e5 48 83 ec 30 48 89 5d d8 4c 89 65 e0 [ 127.186868] RIP [<ffffffff810ed6ce>] __delete_from_page_cache+0x15e/0x160 [ 127.186870] RSP <ffff8804144b5c08> [ 127.186871] ---[ end trace 7cbac5d1db69f426 ]--- The bug is a race and not always easy to reproduce. To reproduce it I was doing the following on a single socket I7-based machine with 16G of RAM. $ hugeadm --pool-pages-max DEFAULT:13G $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmmax $ echo $((18*1048576*1024)) > /proc/sys/kernel/shmall $ for i in `seq 1 9000`; do ./hugetlbfs-test; done On my particular machine, it usually triggers within 10 minutes but enabling debug options can change the timing such that it never hits. Once the bug is triggered, the machine is in trouble and needs to be rebooted. The machine will respond but processes accessing proc like "ps aux" will hang due to the BUG_ON. shutdown will also hang and needs a hard reset or a sysrq-b. The basic problem is a race between page table sharing and teardown. For the most part page table sharing depends on i_mmap_mutex. In some cases, it is also taking the mm->page_table_lock for the PTE updates but with shared page tables, it is the i_mmap_mutex that is more important. Unfortunately it appears to be also insufficient. Consider the following situation Process A Process B --------- --------- hugetlb_fault shmdt LockWrite(mmap_sem) do_munmap unmap_region unmap_vmas unmap_single_vma unmap_hugepage_range Lock(i_mmap_mutex) Lock(mm->page_table_lock) huge_pmd_unshare/unmap tables <--- (1) Unlock(mm->page_table_lock) Unlock(i_mmap_mutex) huge_pte_alloc ... Lock(i_mmap_mutex) ... vma_prio_walk, find svma, spte ... Lock(mm->page_table_lock) ... share spte ... Unlock(mm->page_table_lock) ... Unlock(i_mmap_mutex) ... hugetlb_no_page <--- (2) free_pgtables unlink_file_vma hugetlb_free_pgd_range remove_vma_list In this scenario, it is possible for Process A to share page tables with Process B that is trying to tear them down. The i_mmap_mutex on its own does not prevent Process A walking Process B's page tables. At (1) above, the page tables are not shared yet so it unmaps the PMDs. Process A sets up page table sharing and at (2) faults a new entry. Process B then trips up on it in free_pgtables. This patch fixes the problem by adding a new function __unmap_hugepage_range_final that is only called when the VMA is about to be destroyed. This function clears VM_MAYSHARE during unmap_hugepage_range() under the i_mmap_mutex. This makes the VMA ineligible for sharing and avoids the race. Superficially this looks like it would then be vunerable to truncate and madvise issues but hugetlbfs has its own truncate handlers so does not use unmap_mapping_range() and does not support madvise(DONTNEED). This should be treated as a -stable candidate if it is merged. Test program is as follows. The test case was mostly written by Michal Hocko with a few minor changes to reproduce this bug. ==== CUT HERE ==== static size_t huge_page_size = (2UL << 20); static size_t nr_huge_page_A = 512; static size_t nr_huge_page_B = 5632; unsigned int get_random(unsigned int max) { struct timeval tv; gettimeofday(&tv, NULL); srandom(tv.tv_usec); return random() % max; } static void play(void *addr, size_t size) { unsigned char *start = addr, *end = start + size, *a; start += get_random(size/2); /* we could itterate on huge pages but let's give it more time. */ for (a = start; a < end; a += 4096) *a = 0; } int main(int argc, char **argv) { key_t key = IPC_PRIVATE; size_t sizeA = nr_huge_page_A * huge_page_size; size_t sizeB = nr_huge_page_B * huge_page_size; int shmidA, shmidB; void *addrA = NULL, *addrB = NULL; int nr_children = 300, n = 0; if ((shmidA = shmget(key, sizeA, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrA = shmat(shmidA, addrA, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } if ((shmidB = shmget(key, sizeB, IPC_CREAT|SHM_HUGETLB|0660)) == -1) { perror("shmget:"); return 1; } if ((addrB = shmat(shmidB, addrB, SHM_R|SHM_W)) == (void *)-1UL) { perror("shmat"); return 1; } fork_child: switch(fork()) { case 0: switch (n%3) { case 0: play(addrA, sizeA); break; case 1: play(addrB, sizeB); break; case 2: break; } break; case -1: perror("fork:"); break; default: if (++n < nr_children) goto fork_child; play(addrA, sizeA); break; } shmdt(addrA); shmdt(addrB); do { wait(NULL); } while (--n > 0); shmctl(shmidA, IPC_RMID, NULL); shmctl(shmidB, IPC_RMID, NULL); return 0; } [akpm@linux-foundation.org: name the declaration's args, fix CONFIG_HUGETLBFS=n build] Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31tmpfs: distribute interleave better across nodesNathan Zimmer1-2/+4
When tmpfs has the interleave memory policy, it always starts allocating for each file from node 0 at offset 0. When there are many small files, the lower nodes fill up disproportionately. This patch spreads out node usage by starting files at nodes other than 0, by using the inode number to bias the starting node for interleave. Signed-off-by: Nathan Zimmer <nzimmer@sgi.com> Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: remove redundant initializationMinchan Kim2-12/+2
pg_data_t is zeroed before reaching free_area_init_core(), so remove the now unnecessary initializations. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: warn if pg_data_t isn't initialized with zeroMinchan Kim1-0/+3
Warn if memory-hotplug/boot code doesn't initialize pg_data_t with zero when it is allocated. Arch code and memory hotplug already initiailize pg_data_t. So this warning should never happen. I select fields randomly near the beginning, middle and end of pg_data_t for checking. This patch isn't for performance but for removing initialization code which is necessary to add whenever we adds new field to pg_data_t or zone. Firstly, Andrew suggested clearing out of pg_data_t in MM core part but Tejun doesn't like it because in the future, some archs can initialize some fields in arch code and pass them into general MM part so blindly clearing it out in mm core part would be very annoying. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mips: zero out pg_data_t when it's allocatedMinchan Kim1-0/+1
This patch is preparation for the next patch which removes the zeroing of the pg_data_t in core MM. All archs except MIPS already do this. Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: gix memory accounting scalability in shrink_page_listTim Chen1-0/+2
I noticed in a multi-process parallel files reading benchmark I ran on a 8 socket machine, throughput slowed down by a factor of 8 when I ran the benchmark within a cgroup container. I traced the problem to the following code path (see below) when we are trying to reclaim memory from file cache. The res_counter_uncharge function is called on every page that's reclaimed and created heavy lock contention. The patch below allows the reclaimed pages to be uncharged from the resource counter in batch and recovered the regression. Tim 40.67% usemem [kernel.kallsyms] [k] _raw_spin_lock | --- _raw_spin_lock | |--92.61%-- res_counter_uncharge | | | |--100.00%-- __mem_cgroup_uncharge_common | | | | | |--100.00%-- mem_cgroup_uncharge_cache_page | | | __remove_mapping | | | shrink_page_list | | | shrink_inactive_list | | | shrink_mem_cgroup_zone | | | shrink_zone | | | do_try_to_free_pages | | | try_to_free_pages | | | __alloc_pages_nodemask | | | alloc_pages_current Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm/sparse: remove index_init_lockGavin Shan1-13/+1
sparse_index_init() uses the index_init_lock spinlock to protect root mem_section assignment. The lock is not necessary anymore because the function is called only during boot (during paging init which is executed only from a single CPU) and from the hotplug code (by add_memory() via arch_add_memory()) which uses mem_hotplug_mutex. The lock was introduced by 28ae55c9 ("sparsemem extreme: hotplug preparation") and sparse_index_init() was used only during boot at that time. Later when the hotplug code (and add_memory()) was introduced there was no synchronization so it was possible to online more sections from the same root probably (though I am not 100% sure about that). The first synchronization has been added by 6ad696d2 ("mm: allow memory hotplug and hibernation in the same kernel") which was later replaced by the mem_hotplug_mutex - 20d6c96b ("mem-hotplug: introduce {un}lock_memory_hotplug()"). Let's remove the lock as it is not needed and it makes the code more confusing. [mhocko@suse.cz: changelog] Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm/sparse: more checks on mem_section numberGavin Shan1-0/+2
__section_nr() was implemented to retrieve the corresponding memory section number according to its descriptor. It's possible that the specified memory section descriptor doesn't exist in the global array. So add more checking on that and report an error for a wrong case. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Acked-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm/sparse: optimize sparse_index_allocGavin Shan1-6/+4
With CONFIG_SPARSEMEM_EXTREME, the two levels of memory section descriptors are allocated from slab or bootmem. When allocating from slab, let slab/bootmem allocator clear the memory chunk. We needn't clear it explicitly. Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: add mem_cgroup_from_css() helperWanpeng Li1-8/+11
Add a mem_cgroup_from_css() helper to replace open-coded invokations of container_of(). To clarify the code and to add a little more type safety. [akpm@linux-foundation.org: fix extensive breakage] Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Gavin Shan <shangw@linux.vnet.ibm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: further prevent OOM with too many dirty pagesHugh Dickins1-9/+24
The may_enter_fs test turns out to be too restrictive: though I saw no problem with it when testing on 3.5-rc6, it very soon OOMed when I tested on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just slightly changed the way I started off the testing: dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M memory.limit_in_bytes cgroup to ext4 on USB stick. ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why the transaction needs to be started even before allocating pagecache memory. But it may not be worth worrying about these days: if direct reclaim avoids FS writeback, does __GFP_FS now mean anything? Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop device; but since that also masks off __GFP_IO, we can test for __GFP_IO directly, ignoring may_enter_fs and __GFP_FS. But even so, the test still OOMs sometimes: when originally testing on 3.5-rc6, it OOMed about one time in five or ten; when testing just now on 3.5-rc6-mm1, it OOMed on the first iteration. This residual problem comes from an accumulation of pages under ordinary writeback, not marked PageReclaim, so rightly not causing the memcg check to wait on their writeback: these too can prevent shrink_page_list() from freeing any pages, so many times that memcg reclaim fails and OOMs. Deal with these in the same way as direct reclaim now deals with dirty FS pages: mark them PageReclaim. It is appropriate to rotate these to tail of list when writepage completes, but more importantly, the PageReclaim flag makes memcg reclaim wait on them if encountered again. Increment NR_VMSCAN_IMMEDIATE? That's arguable: I chose not. Setting PageReclaim here may occasionally race with end_page_writeback() clearing it: lru_deactivate_fn() already faced the same race, and correctly concluded that the window is small and the issue non-critical. With these changes, the test runs indefinitely without OOMing on ext4, ext3 and ext2: I'll move on to test with other filesystems later. Trivia: invert conditions for a clearer block without an else, and goto keep_locked to do the unlock_page. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31memcg: prevent OOM with too many dirty pagesMichal Hocko1-3/+20
The current implementation of dirty pages throttling is not memcg aware which makes it easy to have memcg LRUs full of dirty pages. Without throttling, these LRUs can be scanned faster than the rate of writeback, leading to memcg OOM conditions when the hard limit is small. This patch fixes the problem by throttling the allocating process (possibly a writer) during the hard limit reclaim by waiting on PageReclaim pages. We are waiting only for PageReclaim pages because those are the pages that made one full round over LRU and that means that the writeback is much slower than scanning. The solution is far from being ideal - long term solution is memcg aware dirty throttling - but it is meant to be a band aid until we have a real fix. We are seeing this happening during nightly backups which are placed into containers to prevent from eviction of the real working set. The change affects only memcg reclaim and only when we encounter PageReclaim pages which is a signal that the reclaim doesn't catch up on with the writers so somebody should be throttled. This could be potentially unfair because it could be somebody else from the group who gets throttled on behalf of the writer but as writers need to allocate as well and they allocate in higher rate the probability that only innocent processes would be penalized is not that high. I have tested this change by a simple dd copying /dev/zero to tmpfs or ext3 running under small memcg (1G copy under 5M, 60M, 300M and 2G containers) and dd got killed by OOM killer every time. With the patch I could run the dd with the same size under 5M controller without any OOM. The issue is more visible with slower devices for output. * With the patch ================ * tmpfs size=2G --------------- $ vim cgroup_cache_oom_test.sh $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 30.4049 s, 34.5 MB/s $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 31.4561 s, 33.3 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 20.4618 s, 51.2 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 1.42172 s, 738 MB/s * ext3 ------ $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 27.9547 s, 37.5 MB/s $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 30.3221 s, 34.6 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 24.5764 s, 42.7 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 3.35828 s, 312 MB/s * Without the patch =================== * tmpfs size=2G --------------- $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group ./cgroup_cache_oom_test.sh: line 46: 4668 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 25.4989 s, 41.1 MB/s $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 24.3928 s, 43.0 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 1.49797 s, 700 MB/s * ext3 ------ $ ./cgroup_cache_oom_test.sh 5M using Limit 5M for group ./cgroup_cache_oom_test.sh: line 46: 4689 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 60M using Limit 60M for group ./cgroup_cache_oom_test.sh: line 46: 4692 Killed dd if=/dev/zero of=$OUT/zero bs=1M count=$count $ ./cgroup_cache_oom_test.sh 300M using Limit 300M for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 20.248 s, 51.8 MB/s $ ./cgroup_cache_oom_test.sh 2G using Limit 2G for group 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 2.85201 s, 368 MB/s [akpm@linux-foundation.org: tweak changelog, reordered the test to optimize for CONFIG_CGROUP_MEM_RES_CTLR=n] [hughd@google.com: fix deadlock with loop driver] Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Reviewed-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: mmu_notifier: fix freed page still mapped in secondary MMUXiao Guangrong1-22/+23
mmu_notifier_release() is called when the process is exiting. It will delete all the mmu notifiers. But at this time the page belonging to the process is still present in page tables and is present on the LRU list, so this race will happen: CPU 0 CPU 1 mmu_notifier_release: try_to_unmap: hlist_del_init_rcu(&mn->hlist); ptep_clear_flush_notify: mmu nofifler not found free page !!!!!! /* * At the point, the page has been * freed, but it is still mapped in * the secondary MMU. */ mn->ops->release(mn, mm); Then the box is not stable and sometimes we can get this bug: [ 738.075923] BUG: Bad page state in process migrate-perf pfn:03bec [ 738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping: (null) index:0x8076 [ 738.075936] page flags: 0x20000000000014(referenced|dirty) The same issue is present in mmu_notifier_unregister(). We can call ->release before deleting the notifier to ensure the page has been unmapped from the secondary MMU before it is freed. Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Avi Kivity <avi@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: only check anon swapin page charges for swap cacheJohannes Weiner1-8/+14
shmem knows for sure that the page is in swap cache when attempting to charge a page, because the cache charge entry function has a check for it. Only anon pages may be removed from swap cache already when trying to charge their swapin. Adjust the comment, though: '4969c11 mm: fix swapin race condition' added a stable PageSwapCache check under the page lock in the do_swap_page() before calling the memory controller, so it's unuse_pte()'s pte_same() that may fail. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: only check swap cache pages for repeated chargingJohannes Weiner1-5/+12
Only anon and shmem pages in the swap cache are attempted to be charged multiple times, from every swap pte fault or from shmem_unuse(). No other pages require checking PageCgroupUsed(). Charging pages in the swap cache is also serialized by the page lock, and since both the try_charge and commit_charge are called under the same page lock section, the PageCgroupUsed() check might as well happen before the counter charging, let alone reclaim. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: split swapin charge function into private and public partJohannes Weiner1-9/+15
When shmem is charged upon swapin, it does not need to check twice whether the memory controller is enabled. Also, shmem pages do not have to be checked for everything that regular anon pages have to be checked for, so let shmem use the internal version directly and allow future patches to move around checks that are only required when swapping in anon pages. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: remove needless !mm fixup to init_mm when chargingJohannes Weiner1-6/+1
It does not matter to __mem_cgroup_try_charge() if the passed mm is NULL or init_mm, it will charge the root memcg in either case. Also fix up the comment in __mem_cgroup_try_charge() that claimed the init_mm would be charged when no mm was passed. It's not really incorrect, but confusing. Clarify that the root memcg is charged in this case. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: remove unneeded shmem charge typeJohannes Weiner1-10/+1
shmem page charges have not needed a separate charge type to tell them from regular file pages since 08e552c ("memcg: synchronized LRU"). Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: move swapin charge functions above callsitesJohannes Weiner1-36/+32
Charging cache pages may require swapin in the shmem case. Save the forward declaration and just move the swapin functions above the cache charging functions. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: only check for PageSwapCache when uncharging anonJohannes Weiner1-9/+4
Only anon pages that are uncharged at the time of the last page table mapping vanishing may be in swapcache. When shmem pages, file pages, swap-freed anon pages, or just migrated pages are uncharged, they are known for sure to be not in swapcache. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: push down PageSwapCache check into uncharge entry functionsJohannes Weiner1-6/+12
Not all uncharge paths need to check if the page is swapcache, some of them can know for sure. Push down the check into all callsites of uncharge_common() so that the patch that removes some of them is more obvious. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: swapfile: clean up unuse_pte race handlingJohannes Weiner1-2/+1
The conditional mem_cgroup_cancel_charge_swapin() is a leftover from when the function would continue to reestablish the page even after mem_cgroup_try_charge_swapin() failed. After 85d9fc8 "memcg: fix refcnt handling at swapoff", the condition is always true when this code is reached. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31mm: memcg: fix compaction/migration failing due to memcg limitsJohannes Weiner3-46/+43
Compaction (and page migration in general) can currently be hindered through pages being owned by memory cgroups that are at their limits and unreclaimable. The reason is that the replacement page is being charged against the limit while the page being replaced is also still charged. But this seems unnecessary, given that only one of the two pages will still be in use after migration finishes. This patch changes the memcg migration sequence so that the replacement page is not charged. Whatever page is still in use after successful or failed migration gets to keep the charge of the page that was going to be replaced. The replacement page will still show up temporarily in the rss/cache statistics, this can be fixed in a later patch as it's less urgent. Reported-by: David Rientjes <rientjes@google.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Wanpeng Li <liwp.linux@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31swapfile: avoid dereferencing bd_disk during swap_entry_free for network storageMel Gorman1-4/+6
Commit b3a27d ("swap: Add swap slot free callback to block_device_operations") dereferences p->bdev->bd_disk but this is a NULL dereference if using swap-over-NFS. This patch checks SWP_BLKDEV on the swap_info_struct before dereferencing. With reference to this callback, Christoph Hellwig stated "Please just remove the callback entirely. It has no user outside the staging tree and was added clearly against the rules for that staging tree". This would also be my preference but there was not an obvious way of keeping zram in staging/ happy. Signed-off-by: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31nfs: prevent page allocator recursions with swap over NFS.Mel Gorman2-3/+3
GFP_NOFS is _more_ permissive than GFP_NOIO in that it will initiate IO, just not of any filesystem data. The problem is that previously NOFS was correct because that avoids recursion into the NFS code. With swap-over-NFS, it is no longer correct as swap IO can lead to this recursion. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-31nfs: enable swap on NFSMel Gorman9-34/+149
Implement the new swapfile a_ops for NFS and hook up ->direct_IO. This will set the NFS socket to SOCK_MEMALLOC and run socket reconnect under PF_MEMALLOC as well as reset SOCK_MEMALLOC before engaging the protocol ->connect() method. PF_MEMALLOC should allow the allocation of struct socket and related objects and the early (re)setting of SOCK_MEMALLOC should allow us to receive the packets required for the TCP connection buildup. [jlayton@redhat.com: Restore PF_MEMALLOC task flags in all cases] [dfeng@redhat.com: Fix handling of multiple swap files] [a.p.zijlstra@chello.nl: Original patch] Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Eric B Munson <emunson@mgebm.net> Cc: Eric Paris <eparis@redhat.com> Cc: James Morris <jmorris@namei.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Christie <michaelc@cs.wisc.edu> Cc: Neil Brown <neilb@suse.de> Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Xiaotian Feng <dfeng@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>