aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2007-05-02[PATCH] x86-64: Inhibit machine from asserting an NMI when doing Alt-SysRq-M operation.Konrad Rzeszutek1-0/+6
This patch touches the NMI watchdog every MAX_ORDER_NR_PAGES to inhibit the machine from triggering an NMI while the CPUs are locked. This situation is happening on boxes with more than 64CPUs and 128GB of RAM when Alt-SysRq-m is performed. It has been succesfully tested for regression on uni, 2, 4, 8 32, and 64 CPU boxes with various memory configuration. Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: vsyscall_gtod_data diet and vgettimeofday() fixEric Dumazet1-17/+36
Current vsyscall_gtod_data is large (3 or 4 cache lines dirtied at timer interrupt). We can shrink it to exactly 64 bytes (1 cache line on AMD64) Instead of copying a whole struct clocksource, we copy only needed fields. I deleted an unused field : offset_base This patch fixes one oddity in vgettimeofday(): It can returns a timeval with tv_usec = 1000000. Maybe not a bug, but why not doing the right thing ? Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: fix vtime() vsyscallEric Dumazet1-3/+5
There is a tiny probability that the return value from vtime(time_t *t) is Signed-off-by: Andi Kleen <ak@suse.de> different than the value stored in *t Using a temporary variable solves the problem and gives a faster code. 17: 48 85 ff test %rdi,%rdi 1a: 48 8b 05 00 00 00 00 mov 0(%rip),%rax # __vsyscall_gtod_data.wall_time_tv.tv_sec 21: 74 03 je 26 23: 48 89 07 mov %rax,(%rdi) 26: c9 leaveq 27: c3 retq Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
2007-05-02[PATCH] x86: remove UNEXPECTED_IO_APIC()Adrian Bunk2-58/+0
Many years ago, UNEXPECTED_IO_APIC() contained printk()'s (but nothing more). Now that it's completely empty for years, we can as well remove it. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86: sys_ioperm() prototype cleanupAdrian Bunk3-1/+2
- there's no reason for duplicating the prototype from include/linux/syscalls.h in include/asm-x86_64/unistd.h - every file should #include the headers containing the prototypes for it's global functions Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: use lru instead of page->index and page->private for pgd lists management.Christoph Lameter3-15/+6
x86_64 currently simulates a list using the index and private fields of the page struct. Seems that the code was inherited from i386. But x86_64 does not use the slab to allocate pgds and pmds etc. So the lru field is not used by the slab and therefore available. This patch uses standard list operations on page->lru to realize pgd tracking. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: remove the APM_RTC_IS_GMT config option.Parag Warudkar1-13/+0
Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: Remove unused stext symbolAndi Kleen1-1/+0
suggested by Jan Beulich Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] i386: Use X86_EFLAGS_IF in irqflags.h.Andi Kleen3-22/+29
Move X86_EFLAGS_IF et al out to a new header: processor-flags.h, so we can include it from irqflags.h and use it in raw_irqs_disabled_flags(). As a side-effect, we could now use these flags in .S files. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] i386: get rid of unused variablesParag Warudkar1-7/+0
Signed-off-by: Parag Warudkar <parag.warudkar@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86: tighten kernel image page access rightsJan Beulich6-23/+40
On x86-64, kernel memory freed after init can be entirely unmapped instead of just getting 'poisoned' by overwriting with a debug pattern. On i386 and x86-64 (under CONFIG_DEBUG_RODATA), kernel text and bug table can also be write-protected. Compared to the first version, this one prevents re-creating deleted mappings in the kernel image range on x86-64, if those got removed previously. This, together with the original changes, prevents temporarily having inconsistent mappings when cacheability attributes are being changed on such pages (e.g. from AGP code). While on i386 such duplicate mappings don't exist, the same change is done there, too, both for consistency and because checking pte_present() before using various other pte_XXX functions is a requirement anyway. At once, i386 code gets adjusted to use pte_huge() instead of open coding this. AK: split out cpa() changes Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86: Improve handling of kernel mappings in change_page_attrJan Beulich3-6/+16
Fix various broken corner cases in i386 and x86-64 change_page_attr. AK: split off from tighten kernel image access rights Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] i386: rationalize paravirt wrappersRusty Russell8-475/+389
paravirt.c used to implement native versions of all low-level functions. Far cleaner is to have the native versions exposed in the headers and as inline native_XXX, and if !CONFIG_PARAVIRT, then simply #define XXX native_XXX. There are several nice side effects: 1) write_dt_entry() now takes the correct "struct Xgt_desc_struct *" not "void *". 2) load_TLS is reintroduced to the for loop, not manually unrolled with a #error in case the bounds ever change. 3) Macros become inlines, with type checking. 4) Access to the native versions is trivial for KVM, lguest, Xen and others who might want it. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@muc.de> Cc: Avi Kivity <avi@qumranet.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: Rename boot_gdt_table to boot_gdtSebastien Dugue2-11/+10
Rename boot_gdt_table to boot_gdt to avoid the duplicate T(able). Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: clean up cpu_init()Rusty Russell3-30/+14
We now have cpu_init() and secondary_cpu_init() doing nothing but calling _cpu_init() with the same arguments. Rename _cpu_init() to cpu_init() and use it as a replcement for secondary_cpu_init(). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: Use per-cpu GDT immediately upon bootRusty Russell6-120/+75
Now we are no longer dynamically allocating the GDT, we don't need the "cpu_gdt_table" at all: we can switch straight from "boot_gdt_table" to the per-cpu GDT. This means initializing the cpu_gdt array in C. The boot CPU uses the per-cpu var directly, then in smp_prepare_cpus() it switches to the per-cpu copy just allocated. For secondary CPUs, the early_gdt_descr is set to point directly to their per-cpu copy. For UP the code is very simple: it keeps using the "per-cpu" GDT as per SMP, but we never have to move. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: Use per-cpu variables for GDT, PDARusty Russell7-115/+21
Allocating PDA and GDT at boot is a pain. Using simple per-cpu variables adds happiness (although we need the GDT page-aligned for Xen, which we do in a followup patch). [akpm@linux-foundation.org: build fix] Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86: add command line length to boot protocolBernhard Walle3-8/+29
Because the command line is increased to 2048 characters after 2.6.21, it's not possible for boot loaders and userspace tools to determine the length of the command line the kernel can understand. The benefit of knowing the length is that users can be warned if the command line size is too long which prevents surprise if things don't work after bootup. This patch updates the boot protocol to contain a field called "cmdline_size" that contain the length of the command line (excluding the terminating zero). The patch also adds missing fields (of protocol version 2.05) to the x86_64 setup code. Signed-off-by: Bernhard Walle <bwalle@suse.de> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Alon Bar-Lev <alon.barlev@gmail.com> Acked-by: H. Peter Anvin <hpa@zytor.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: Allow i386 crash kernels to handle x86_64 dumpsIan Campbell3-1/+12
The specific case I am encountering is kdump under Xen with a 64 bit hypervisor and 32 bit kernel/userspace. The dump created is 64 bit due to the hypervisor but the dump kernel is 32 bit for maximum compatibility. It's possibly less likely to be useful in a purely native scenario but I see no reason to disallow it. [akpm@linux-foundation.org: build fix] Signed-off-by: Ian Campbell <ian.campbell@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Vivek Goyal <vgoyal@in.ibm.com> Cc: Horms <horms@verge.net.au> Cc: Magnus Damm <magnus.damm@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86-64: Introduce load_TLS to the "for" loop.Rusty Russell1-7/+4
GCC (4.1 at least) unrolls it anyway, but I can't believe this code was ever justifiable. (I've also submitted a patch which cleans up i386, which is even uglier). Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: Initialize esp0 properly all the timeRusty Russell1-0/+1
Whenever we schedule, __switch_to calls load_esp0 which does: tss->esp0 = thread->esp0; This is never initialized for the initial thread (ie "swapper"), so when we're scheduling that, we end up setting esp0 to 0. This is fine: the swapper never leaves ring 0, so this field is never used. lguest, however, gets upset that we're trying to used an unmapped page as our kernel stack. Rather than work around it there, let's initialize it. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: VDSO_PRELINK warning fixAndrew Morton2-3/+3
The lguest patches somehow managed to trigger this: In file included from arch/i386/lguest/lguest.c:38: include/asm/asm-offsets.h:67:1: warning: "VDSO_PRELINK" redefined In file included from include/linux/elf.h:7, from include/linux/module.h:15, from include/linux/device.h:21, from include/linux/interrupt.h:15, from arch/i386/lguest/lguest.c:27: include/asm/elf.h:140:1: warning: this is the location of the previous definition I assume that using the same identifier twice was a bad idea.. Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: fake numa for cpusets documentDavid Rientjes1-0/+66
Create a document to explain how to use numa=fake in conjunction with cpusets for coarse memory resource management. An attempt to get more awareness and testing for this feature. Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86: remove constant_tsc reporting from /proc/cpuinfo' power flagsJoerg Roedel2-5/+3
remove the reporting of the constant_tsc flag from the "power management" field in /proc/cpuinfo. The NULL value there was replaced by "" because the former would result in a printout of [8] if the flag is set. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: fixed size remaining fake nodesDavid Rientjes2-15/+46
Extends the numa=fake x86_64 command-line option to split the remaining system memory into nodes of fixed size. Any leftover memory is allocated to a final node unless the command-line ends with a comma. For example: numa=fake=2*512,*128 gives two 512M nodes and the remaining system memory is split into nodes of 128M each. This is beneficial for systems where the exact size of RAM is unknown or not necessarily relevant, but the size of the remaining nodes to be allocated is known based on their capacity for resource management. Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86-64: split remaining fake nodes equallyDavid Rientjes2-5/+21
Extends the numa=fake x86_64 command-line option to split the remaining system memory into equal-sized nodes. For example: numa=fake=2*512,4* gives two 512M nodes and the remaining system memory is split into four approximately equal chunks. This is beneficial for systems where the exact size of RAM is unknown or not necessarily relevant, but the granularity with which nodes shall be allocated is known. Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86-64: configurable fake numa node sizesDavid Rientjes3-110/+161
Extends the numa=fake x86_64 command-line option to allow for configurable node sizes. These nodes can be used in conjunction with cpusets for coarse memory resource management. The old command-line option is still supported: numa=fake=32 gives 32 fake NUMA nodes, ignoring the NUMA setup of the actual machine. But now you may configure your system for the node sizes of your choice: numa=fake=2*512,1024,2*256 gives two 512M nodes, one 1024M node, two 256M nodes, and the rest of system memory to a sixth node. The existing hash function is maintained to support the various node sizes that are possible with this implementation. Each node of the same size receives roughly the same amount of available pages, regardless of any reserved memory with its address range. The total available pages on the system is calculated and divided by the number of equal nodes to allocate. These nodes are then dynamically allocated and their borders extended until such time as their number of available pages reaches the required size. Configurable node sizes are recommended when used in conjunction with cpusets for memory control because it eliminates the overhead associated with scanning the zonelists of many smaller full nodes on page_alloc(). Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: fix GDT's number of quadwords in commentAhmed S. Darwish1-2/+2
Fix comments to represent the true number of quadwords in GDT. Signed-off-by: Ahmed S. Darwish <darwish.07@gmail.com> Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: vmi_pmd_clear() staticAdrian Bunk1-1/+1
This patch makes the needlessly global vmi_pmd_clear() static. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86-64: make simnow_init() staticAdrian Bunk1-1/+1
Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: remove extra smp_processor_id callingYinghai Lu1-2/+1
Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: fix ia32_binfmt.c build errorRalf Baechle1-4/+6
Reorder code to avoid multiple inclusion of elf.h. #undef several symbols to avoid build errors over redefinitions. Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86: Log reason why TSC was marked unstablejohn stultz9-13/+15
Change mark_tsc_unstable() so it takes a string argument, which holds the reason the TSC was marked unstable. This is then displayed the first time mark_tsc_unstable is called. This should help us better debug why the TSC was marked unstable on certain systems and allow us to make sure we're not being overly paranoid when throwing out this troublesome clocksource. Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] i386: workaround for a -Wmissing-prototypes warningAdrian Bunk1-0/+3
Work around a warning with -Wmissing-prototypes in arch/i386/kernel/asm-offsets.c The warning isn't gcc's fault - asm-offsets.c is simply a special file. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: type cast clean up for find_next_zero_bitKen Chen1-2/+2
clean up unneeded type cast by properly declare data type. Signed-off-by: Ken Chen <kenchen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] i386: make struct vmi_ops staticAdrian Bunk1-1/+1
Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Cc: Zachary Amsden <zach@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: modpost apic related warning fixesVivek Goyal4-39/+42
o Modpost generates warnings for i386 if compiled with CONFIG_RELOCATABLE=y WARNING: vmlinux - Section mismatch: reference to .init.text:find_unisys_acpi_oem_table from .text between 'acpi_madt_oem_check' (at offset 0xc0101eda) and 'enable_apic_mode' WARNING: vmlinux - Section mismatch: reference to .init.text:acpi_get_table_header_early from .text between 'acpi_madt_oem_check' (at offset 0xc0101ef0) and 'enable_apic_mode' WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'acpi_madt_oem_check' (at offset 0xc0101f2e) and 'enable_apic_mode' WARNING: vmlinux - Section mismatch: reference to .init.text:setup_unisys from .text between 'acpi_madt_oem_check' (at offset 0xc0101f37) and 'enable_apic_mode'WARNING: vmlinux - Section mismatch: reference to .init.text:parse_unisys_oem from .text between 'mps_oem_check' (at offset 0xc0101ec7) and 'acpi_madt_oem_check' WARNING: vmlinux - Section mismatch: reference to .init.text:es7000_sw_apic from .text between 'enable_apic_mode' (at offset 0xc0101f48) and 'check_apicid_present' o Some functions which are inline (acpi_madt_oem_check) are not inlined by compiler as these functions are accessed using function pointer. These functions are put in .text section and they in-turn access __init type functions hence modpost generates warnings. o Do not iniline acpi_madt_oem_check, instead make it __init. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Cc: Len Brown <lenb@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86-64: Set HASHDIST_DEFAULT to 1 for x86_64 NUMARavikiran G Thirumalai1-2/+2
Enable system hashtable memory to be distributed among nodes on x86_64 NUMA Forcing the kernel to use node interleaved vmalloc instead of bootmem for the system hashtable memory (alloc_large_system_hash) reduces the memory imbalance on node 0 by around 40MB on a 8 node x86_64 NUMA box: Before the following patch, on bootup of a 8 node box: Node 0 MemTotal: 3407488 kB Node 0 MemFree: 3206296 kB Node 0 MemUsed: 201192 kB Node 0 Active: 7012 kB Node 0 Inactive: 512 kB Node 0 Dirty: 0 kB Node 0 Writeback: 0 kB Node 0 FilePages: 1912 kB Node 0 Mapped: 420 kB Node 0 AnonPages: 5612 kB Node 0 PageTables: 468 kB Node 0 NFS_Unstable: 0 kB Node 0 Bounce: 0 kB Node 0 Slab: 5408 kB Node 0 SReclaimable: 644 kB Node 0 SUnreclaim: 4764 kB After the patch (or using hashdist=1 on the kernel command line): Node 0 MemTotal: 3407488 kB Node 0 MemFree: 3247608 kB Node 0 MemUsed: 159880 kB Node 0 Active: 3012 kB Node 0 Inactive: 616 kB Node 0 Dirty: 0 kB Node 0 Writeback: 0 kB Node 0 FilePages: 2424 kB Node 0 Mapped: 380 kB Node 0 AnonPages: 1200 kB Node 0 PageTables: 396 kB Node 0 NFS_Unstable: 0 kB Node 0 Bounce: 0 kB Node 0 Slab: 6304 kB Node 0 SReclaimable: 1596 kB Node 0 SUnreclaim: 4708 kB I guess it is a good idea to keep HASHDIST_DEFAULT "on" for x86_64 NUMA since x86_64 has no dearth of vmalloc space? Or maybe enable hash distribution for all 64bit NUMA arches? The following patch does it only for x86_64. I ran a HPC MPI benchmark -- 'Ansys wingsolid', which takes up quite a bit of memory and uses up tlb entries. This was on a 4 way, 2 socket Tyan AMD box (non vsmp), with 8G total memory (4G pernode). The results with and without hash distribution are: 1. Vanilla - runtime of 1188.000s 2. With hashdist=1 runtime of 1154.000s Oprofile output for the duration of run is: 1. Vanilla: PU: AMD64 processors, speed 2411.16 MHz (estimated) Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit mask of 0x00 (No unit mask) count 500 samples % app name symbol name 163054 6.5513 libansys1.so MultiFront::decompose(int, int, Elemset *, int *, int, int, int) 162061 6.5114 libansys3.so blockSaxpy6L_fd 162042 6.5107 libansys3.so blockInnerProduct6L_fd 156286 6.2794 libansys3.so maxb33_ 87879 3.5309 libansys1.so elmatrixmultpcg_ 84857 3.4095 libansys4.so saxpy_pcg 58637 2.3560 libansys4.so .st4560 46612 1.8728 libansys4.so .st4282 43043 1.7294 vmlinux-t copy_user_generic_string 41326 1.6604 libansys3.so blockSaxpyBackSolve6L_fd 41288 1.6589 libansys3.so blockInnerProductBackSolve6L_fd 2. With hashdist=1 CPU: AMD64 processors, speed 2411.13 MHz (estimated) Counted L1_AND_L2_DTLB_MISSES events (L1 and L2 DTLB misses) with a unit mask of 0x00 (No unit mask) count 500 samples % app name symbol name 162993 6.9814 libansys1.so MultiFront::decompose(int, int, Elemset *, int *, int, int, int) 160799 6.8874 libansys3.so blockInnerProduct6L_fd 160459 6.8729 libansys3.so blockSaxpy6L_fd 156018 6.6826 libansys3.so maxb33_ 84700 3.6279 libansys4.so saxpy_pcg 83434 3.5737 libansys1.so elmatrixmultpcg_ 58074 2.4875 libansys4.so .st4560 46000 1.9703 libansys4.so .st4282 41166 1.7632 libansys3.so blockSaxpyBackSolve6L_fd 41033 1.7575 libansys3.so blockInnerProductBackSolve6L_fd 35762 1.5318 libansys1.so inner_product_sub 35591 1.5245 libansys1.so inner_product_sub2 28259 1.2104 libansys4.so addVectors Signed-off-by: Pravin B. Shelar <pravin.shelar@calsoftinc.com> Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86-64: Minor white space cleanup in traps.cAndi Kleen1-3/+1
Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: Allow sys_uselib unconditionallyAndi Kleen1-4/+0
Previously it wasn't enabled in the binfmt_aout is a module case. Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: Don't disable basic block reorderingAndi Kleen1-3/+0
When compiling with -Os (which is default) the compiler defaults to it anyways. And with -O2 it probably generates somewhat better (although also larger) code. Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: fix x86_64-mm-sched-clock-shareAndrew Morton1-3/+16
Fix for the following patch. Provide dummy cpufreq functions when CPUFREQ is not compiled in. Cc: Andi Kleen <ak@suse.de> Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: Move cpu verification code to common fileVivek Goyal5-123/+152
o This patch moves the code to verify long mode and SSE to a common file. This code is now shared by trampoline.S, wakeup.S, boot/setup.S and boot/compressed/head.S o So far we used to do very limited check in trampoline.S, wakeup.S and in 32bit entry point. Now all the entry paths are forced to do the exhaustive check, including SSE because verify_cpu is shared. o I am keeping this patch as last in the x86 relocatable series because previous patches have got quite some amount of testing done and don't want to distrub that. So that if there is problem introduced by this patch, at least it can be easily isolated. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: Extend bzImage protocol for relocatable bzImageVivek Goyal1-2/+11
o Extend the bzImage protocol (same as i386) to allow bzImage loaders to load the protected mode kernel at non-1MB address. Now protected mode component is relocatable and can be loaded at non-1MB addresses. o As of today kdump uses it to run a second kernel from a reserved memory area. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: build-time checkingVivek Goyal3-1/+9
o X86_64 kernel should run from 2MB aligned address for two reasons. - Performance. - For relocatable kernels, page tables are updated based on difference between compile time address and load time physical address. This difference should be multiple of 2MB as kernel text and data is mapped using 2MB pages and PMD should be pointing to a 2MB aligned address. Life is simpler if both compile time and load time kernel addresses are 2MB aligned. o Flag the error at compile time if one is trying to build a kernel which does not meet alignment restrictions. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] x86-64: Relocatable Kernel SupportVivek Goyal9-332/+597
This patch modifies the x86_64 kernel so that it can be loaded and run at any 2M aligned address, below 512G. The technique used is to compile the decompressor with -fPIC and modify it so the decompressor is fully relocatable. For the main kernel the page tables are modified so the kernel remains at the same virtual address. In addition a variable phys_base is kept that holds the physical address the kernel is loaded at. __pa_symbol is modified to add that when we take the address of a kernel symbol. When loaded with a normal bootloader the decompressor will decompress the kernel to 2M and it will run there. This both ensures the relocation code is always working, and makes it easier to use 2M pages for the kernel and the cpu. AK: changed to not make RELOCATABLE default in Kconfig Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86: __pa and __pa_symbol address space separationVivek Goyal10-47/+53
Currently __pa_symbol is for use with symbols in the kernel address map and __pa is for use with pointers into the physical memory map. But the code is implemented so you can usually interchange the two. __pa which is much more common can be implemented much more cheaply if it is it doesn't have to worry about any other kernel address spaces. This is especially true with a relocatable kernel as __pa_symbol needs to peform an extra variable read to resolve the address. There is a third macro that is added for the vsyscall data __pa_vsymbol for finding the physical addesses of vsyscall pages. Most of this patch is simply sorting through the references to __pa or __pa_symbol and using the proper one. A little of it is continuing to use a physical address when we have it instead of recalculating it several times. swapper_pgd is now NULL. leave_mm now uses init_mm.pgd and init_mm.pgd is initialized at boot (instead of compile time) to the physmem virtual mapping of init_level4_pgd. The physical address changed. Except for the for EMPTY_ZERO page all of the remaining references to __pa_symbol appear to be during kernel initialization. So this should reduce the cost of __pa in the common case, even on a relocated kernel. As this is technically a semantic change we need to be on the lookout for anything I missed. But it works for me (tm). Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: do not use virt_to_page on kernel data addressVivek Goyal1-15/+27
o virt_to_page() call should be used on kernel linear addresses and not on kernel text and data addresses. Swsusp code uses it on kernel data (statically allocated swsusp_header). o Allocate swsusp_header dynamically so that virt_to_page() can be used safely. o I am changing this because in next few patches, __pa() on x86_64 will no longer support kernel text and data addresses and hibernation breaks. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86: Move swsusp __pa() dependent code to arch portionVivek Goyal6-14/+55
o __pa() should be used only on kernel linearly mapped virtual addresses and not on kernel text and data addresses. o Hibernation code needs to determine the physical address associated with kernel symbol to mark a section boundary which contains pages which don't have to be saved and restored during hibernate/resume operation. o Move this piece of code in arch dependent section. So that architectures which don't have kernel text/data mapped into kernel linearly mapped region can come up with their own ways of determining physical addresses associated with a kernel text. Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86-64: Remove the identity mapping as early as possibleVivek Goyal7-61/+25
With the rewrite of the SMP trampoline and the early page allocator there is nothing that needs identity mapped pages, once we start executing C code. So add zap_identity_mappings into head64.c and remove zap_low_mappings() from much later in the code. The functions are subtly different thus the name change. This also kills boot_level4_pgt which was from an earlier attempt to move the identity mappings as early as possible, and is now no longer needed. Essentially I have replaced boot_level4_pgt with trampoline_level4_pgt in trampoline.S Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de>