aboutsummaryrefslogtreecommitdiffstats
path: root/arch/i386/mm (follow)
AgeCommit message (Collapse)AuthorFilesLines
2007-05-02[PATCH] i386: PARAVIRT: flush lazy mmu updates on kunmap_atomicJeremy Fitzhardinge1-0/+1
kunmap_atomic should flush any pending lazy mmu updates, mainly to be consistent with kmap_atomic, and to preserve its normal behaviour. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] i386: PARAVIRT: add kmap_atomic_pte for mapping highpte pagesJeremy Fitzhardinge1-2/+7
Xen and VMI both have special requirements when mapping a highmem pte page into the kernel address space. These can be dealt with by adding a new kmap_atomic_pte() function for mapping highptes, and hooking it into the paravirt_ops infrastructure. Xen specifically wants to map the pte page RO, so this patch exposes a helper function, kmap_atomic_prot, which maps the page with the specified page protections. This also adds a kmap_flush_unused() function to clear out the cached kmap mappings. Xen needs this to clear out any potential stray RW mappings of pages which will become part of a pagetable. [ Zach - vmi.c will need some attention after this patch. It wasn't immediately obvious to me what needs to be done. ] Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Zachary Amsden <zach@vmware.com>
2007-05-02[PATCH] i386: PARAVIRT: Allow paravirt backend to choose kernel PMD sharingJeremy Fitzhardinge4-24/+89
Normally when running in PAE mode, the 4th PMD maps the kernel address space, which can be shared among all processes (since they all need the same kernel mappings). Xen, however, does not allow guests to have the kernel pmd shared between page tables, so parameterize pgtable.c to allow both modes of operation. There are several side-effects of this. One is that vmalloc will update the kernel address space mappings, and those updates need to be propagated into all processes if the kernel mappings are not intrinsically shared. In the non-PAE case, this is done by maintaining a pgd_list of all processes; this list is used when all process pagetables must be updated. pgd_list is threaded via otherwise unused entries in the page structure for the pgd, which means that the pgd must be page-sized for this to work. Normally the PAE pgd is only 4x64 byte entries large, but Xen requires the PAE pgd to page aligned anyway, so this patch forces the pgd to be page aligned+sized when the kernel pmd is unshared, to accomodate both these requirements. Also, since there may be several distinct kernel pmds (if the user/kernel split is below 3G), there's no point in allocating them from a slab cache; they're just allocated with get_free_page and initialized appropriately. (Of course the could be cached if there is just a single kernel pmd - which is the default with a 3G user/kernel split - but it doesn't seem worthwhile to add yet another case into this code). [ Many thanks to wli for review comments. ] Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: William Lee Irwin III <wli@holomorphy.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Zachary Amsden <zach@vmware.com> Cc: Christoph Lameter <clameter@sgi.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2007-05-02[PATCH] i386: PARAVIRT: Hooks to set up initial pagetableJeremy Fitzhardinge1-47/+91
This patch introduces paravirt_ops hooks to control how the kernel's initial pagetable is set up. In the case of a native boot, the very early bootstrap code creates a simple non-PAE pagetable to map the kernel and physical memory. When the VM subsystem is initialized, it creates a proper pagetable which respects the PAE mode, large pages, etc. When booting under a hypervisor, there are many possibilities for what paging environment the hypervisor establishes for the guest kernel, so the constructon of the kernel's pagetable depends on the hypervisor. In the case of Xen, the hypervisor boots the kernel with a fully constructed pagetable, which is already using PAE if necessary. Also, Xen requires particular care when constructing pagetables to make sure all pagetables are always mapped read-only. In order to make this easier, kernel's initial pagetable construction has been changed to only allocate and initialize a pagetable page if there's no page already present in the pagetable. This allows the Xen paravirt backend to make a copy of the hypervisor-provided pagetable, allowing the kernel to establish any more mappings it needs while keeping the existing ones. A slightly subtle point which is worth highlighting here is that Xen requires all kernel mappings to share the same pte_t pages between all pagetables, so that updating a kernel page's mapping in one pagetable is reflected in all other pagetables. This makes it possible to allocate a page and attach it to a pagetable without having to explicitly enumerate that page's mapping in all pagetables. And: +From: "Eric W. Biederman" <ebiederm@xmission.com> If we don't set the leaf page table entries it is quite possible that will inherit and incorrect page table entry from the initial boot page table setup in head.S. So we need to redo the effort here, so we pick up PSE, PGE and the like. Hypervisors like Xen require that their page tables be read-only, which is slightly incompatible with our low identity mappings, however I discussed this with Jeremy he has modified the Xen early set_pte function to avoid problems in this area. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: William Irwin <bill.irwin@oracle.com> Cc: Ingo Molnar <mingo@elte.hu>
2007-05-02[PATCH] i386: Relocate VDSO ELF headers to match mapped location with COMPAT_VDSOJeremy Fitzhardinge1-6/+0
Some versions of libc can't deal with a VDSO which doesn't have its ELF headers matching its mapped address. COMPAT_VDSO maps the VDSO at a specific system-wide fixed address. Previously this was all done at build time, on the grounds that the fixed VDSO address is always at the top of the address space. However, a hypervisor may reserve some of that address space, pushing the fixmap address down. This patch does the adjustment dynamically at runtime, depending on the runtime location of the VDSO fixmap. [ Patch has been through several hands: Jan Beulich wrote the orignal version; Zach reworked it, and Jeremy converted it to relocate phdrs as well as sections. ] Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Zachary Amsden <zach@vmware.com> Cc: "Jan Beulich" <JBeulich@novell.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Andi Kleen <ak@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com>
2007-05-02[PATCH] x86: tighten kernel image page access rightsJan Beulich1-6/+19
On x86-64, kernel memory freed after init can be entirely unmapped instead of just getting 'poisoned' by overwriting with a debug pattern. On i386 and x86-64 (under CONFIG_DEBUG_RODATA), kernel text and bug table can also be write-protected. Compared to the first version, this one prevents re-creating deleted mappings in the kernel image range on x86-64, if those got removed previously. This, together with the original changes, prevents temporarily having inconsistent mappings when cacheability attributes are being changed on such pages (e.g. from AGP code). While on i386 such duplicate mappings don't exist, the same change is done there, too, both for consistency and because checking pte_present() before using various other pte_XXX functions is a requirement anyway. At once, i386 code gets adjusted to use pte_huge() instead of open coding this. AK: split out cpa() changes Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86: Improve handling of kernel mappings in change_page_attrJan Beulich1-2/+2
Fix various broken corner cases in i386 and x86-64 change_page_attr. AK: split off from tighten kernel image access rights Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] x86: __pa and __pa_symbol address space separationVivek Goyal1-7/+8
Currently __pa_symbol is for use with symbols in the kernel address map and __pa is for use with pointers into the physical memory map. But the code is implemented so you can usually interchange the two. __pa which is much more common can be implemented much more cheaply if it is it doesn't have to worry about any other kernel address spaces. This is especially true with a relocatable kernel as __pa_symbol needs to peform an extra variable read to resolve the address. There is a third macro that is added for the vsyscall data __pa_vsymbol for finding the physical addesses of vsyscall pages. Most of this patch is simply sorting through the references to __pa or __pa_symbol and using the proper one. A little of it is continuing to use a physical address when we have it instead of recalculating it several times. swapper_pgd is now NULL. leave_mm now uses init_mm.pgd and init_mm.pgd is initialized at boot (instead of compile time) to the physmem virtual mapping of init_level4_pgd. The physical address changed. Except for the for EMPTY_ZERO page all of the remaining references to __pa_symbol appear to be during kernel initialization. So this should reduce the cost of __pa in the common case, even on a relocated kernel. As this is technically a semantic change we need to be on the lookout for anything I missed. But it works for me (tm). Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-05-02[PATCH] i386: adjustments to page table dump during oops (v4)Jan Beulich1-20/+35
- make the page table contents printing PAE capable - make sure the address stored in current->thread.cr2 is unmodified from what was read from CR2 - don't call oops_may_print() multiple times, when one time suffices - print pte even in highpte case, as long as the pte page isn't in actually in high memory (which is specifically the case for all page tables covering kernel space) (Changes to v3: Use sizeof()*2 rather than the suggested sizeof()*4 for printing width, use fixed 16-nibble width for PAE, and also apply the max_low_pfn range check to the middle level lookup on PAE.) Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-04-08[PATCH] Proper fix for highmem kmap_atomic functions for VMI for 2.6.21Zachary Amsden1-0/+2
Since lazy MMU batching mode still allows interrupts to enter, it is possible for interrupt handlers to try to use kmap_atomic, which fails when lazy mode is active, since the PTE update to highmem will be delayed. The best workaround is to issue an explicit flush in kmap_atomic_functions case; this is the only way nested PTE updates can happen in the interrupt handler. Thanks to Jeremy Fitzhardinge for noting the bug and suggestions on a fix. This patch gets reverted again when we start 2.6.22 and the bug gets fixed differently. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Andi Kleen <ak@muc.de> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-13[PATCH] i386: Remove extern declaration from mm/discontig.c, put in header.Rusty Russell1-1/+0
Extern declarations belong in headers. Times, they are a'changin. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de> ===================================================================
2007-02-13[PATCH] x86: simplify notify_page_fault()Jan Beulich1-10/+8
Remove all parameters from this function that aren't really variable. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
2007-02-13[PATCH] i386: vMI backend for paravirt-opsZachary Amsden1-0/+2
Fairly straightforward implementation of VMI backend for paravirt-ops. [Adrian Bunk: some cleanups] Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org>
2007-02-13[PATCH] MM: page allocation hooks for VMI backendZachary Amsden3-4/+26
The VMI backend uses explicit page type notification to track shadow page tables. The allocation of page table roots is especially tricky. We need to clone the root for non-PAE mode while it is protected under the pgd lock to correctly copy the shadow. We don't need to allocate pgds in PAE mode, (PDPs in Intel terminology) as they only have 4 entries, and are cached entirely by the processor, which makes shadowing them rather simple. For base page table level allocation, pmd_populate provides the exact hook point we need. Also, we need to allocate pages when splitting a large page, and we must release pages before returning the page to any free pool. Despite being required with these slightly odd semantics for VMI, Xen also uses these hooks to determine the exact moment when page tables are created or released. AK: All nops for other architectures Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org>
2007-02-11[PATCH] highmem: catch illegal nestingIngo Molnar1-3/+4
Catch illegally nested kmap_atomic()s even if the page that is mapped by the 'inner' instance is from lowmem. This avoids spuriously zapped kmap-atomic ptes and turns hard to find crashes into clear asserts at the bug site. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-11[PATCH] Consolidate bust_spinlocks()Kirill Korotaev1-26/+0
Part of long forgotten patch http://groups.google.com/group/fa.linux.kernel/msg/e98e941ce1cf29f6?dmode=source Since then, m32r grabbed two copies. Leave s390 copy because of important absence of CONFIG_VT, but remove references to non-existent timerlist_lock. ia64 also loses timerlist_lock. Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Andi Kleen <ak@muc.de> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hirokazu Takata <takata@linux-m32r.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-02-09[PATCH] misc NULL noise removalAl Viro1-1/+1
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-11[PATCH] i386: Fix memory hotplug related MODPOST generated warningVivek Goyal1-2/+2
o Fix modpost generated warning. WARNING: vmlinux - Section mismatch: reference to .init.text: from .text between 'add_one_highpage_hotplug' (at offset 0xc0113d3f) and 'online_page' Signed-off-by: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-22[PATCH] memory hotplug: fix compile error for i386 with NUMA configYasunori Goto2-8/+30
Fix compile error when config memory hotplug with numa on i386. The cause of compile error was missing of arch_add_memory(), remove_memory(), and memory_add_physaddr_to_nid(). Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: David Rientjes <rientjes@cs.washington.edu> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6Linus Torvalds6-18/+30
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (156 commits) [PATCH] x86-64: Export smp_call_function_single [PATCH] i386: Clean up smp_tune_scheduling() [PATCH] unwinder: move .eh_frame to RODATA [PATCH] unwinder: fully support linker generated .eh_frame_hdr section [PATCH] x86-64: don't use set_irq_regs() [PATCH] x86-64: check vector in setup_ioapic_dest to verify if need setup_IO_APIC_irq [PATCH] x86-64: Make ix86 default to HIGHMEM4G instead of NOHIGHMEM [PATCH] i386: replace kmalloc+memset with kzalloc [PATCH] x86-64: remove remaining pc98 code [PATCH] x86-64: remove unused variable [PATCH] x86-64: Fix constraints in atomic_add_return() [PATCH] x86-64: fix asm constraints in i386 atomic_add_return [PATCH] x86-64: Correct documentation for bzImage protocol v2.05 [PATCH] x86-64: replace kmalloc+memset with kzalloc in MTRR code [PATCH] x86-64: Fix numaq build error [PATCH] x86-64: include/asm-x86_64/cpufeature.h isn't a userspace header [PATCH] unwinder: Add debugging output to the Dwarf2 unwinder [PATCH] x86-64: Clarify error message in GART code [PATCH] x86-64: Fix interrupt race in idle callback (3rd try) [PATCH] x86-64: Remove unwind stack pointer alignment forcing again ... Fixed conflict in include/linux/uaccess.h manually Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] slab: remove kmem_cache_tChristoph Lameter2-5/+5
Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] Fix kunmap_atomic's use of kpte_clear_flush()Jeremy Fitzhardinge1-10/+8
kunmap_atomic() will call kpte_clear_flush with vaddr/ptep arguments which don't correspond if the vaddr is just a normal lowmem address (ie, not in the KMAP area). This patch makes sure that the pte is only cleared if kmap area was actually used for the mapping. Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Zachary Amsden <zach@vmware.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] mm: pagefault_{disable,enable}()Peter Zijlstra1-6/+4
Introduce pagefault_{disable,enable}() and use these where previously we did manual preempt increments/decrements to make the pagefault handler do the atomic thing. Currently they still rely on the increased preempt count, but do not rely on the disabled preemption, this might go away in the future. (NOTE: the extra barrier() in pagefault_disable might fix some holes on machines which have too many registers for their own good) [heiko.carstens@de.ibm.com: s390 fix] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] shared page table for hugetlb pageChen, Kenneth W1-1/+111
Following up with the work on shared page table done by Dave McCracken. This set of patch target shared page table for hugetlb memory only. The shared page table is particular useful in the situation of large number of independent processes sharing large shared memory segments. In the normal page case, the amount of memory saved from process' page table is quite significant. For hugetlb, the saving on page table memory is not the primary objective (as hugetlb itself already cuts down page table overhead significantly), instead, the purpose of using shared page table on hugetlb is to allow faster TLB refill and smaller cache pollution upon TLB miss. With PT sharing, pte entries are shared among hundreds of processes, the cache consumption used by all the page table is smaller and in return, application gets much higher cache hit ratio. One other effect is that cache hit ratio with hardware page walker hitting on pte in cache will be higher and this helps to reduce tlb miss latency. These two effects contribute to higher application performance. Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Hugh Dickins <hugh@veritas.com> Cc: Dave McCracken <dmccr@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Adam Litke <agl@us.ibm.com> Cc: Paul Mundt <lethal@linux-sh.org> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] i386: Preserve EFI run time regions with memmap parameterArtiom Myaskouvskey1-2/+0
When using memmap kernel parameter in EFI boot we should also add to memory map memory regions of runtime services to enable their mapping later. AK: merged and cleaned up the patch Signed-off-by: Artiom Myaskouvskey <artiom.myaskouvskey@intel.com> Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07[PATCH] i386: clear_fixmap() should not use set_pte()Jan Beulich1-2/+5
While not strictly required with the current code (as the upper half of page table entries generated by __set_fixmap() cannot be non-zero due to the second parameter of this function being 'unsigned long'), the use of set_pte() in __set_fixmap() in the context of clear_fixmap() is still improper with CONFIG_X86_PAE (see the respective comment in include/asm-i386/pgtable-3level.h) and would turn into a bug if that second parameter ever gets changed to a 64-bit type. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07[PATCH] paravirt: Add MMU virtualization to paravirt_opsRusty Russell1-0/+1
Add the three bare TLB accessor functions to paravirt-ops. Most amusingly, flush_tlb is redefined on SMP, so I can't call the paravirt op flush_tlb. Instead, I chose to indicate the actual flush type, kernel (global) vs. user (non-global). Global in this sense means using the global bit in the page table entry, which makes TLB entries persistent across CR3 reloads, not global as in the SMP sense of invoking remote shootdowns, so the term is confusingly overloaded. AK: folded in fix from Zach for PAE compilation Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Andrew Morton <akpm@osdl.org>
2006-12-07[PATCH] i386: substitute __va lookup with pfn_to_kaddrDavid Rientjes1-1/+1
Substitutes allocate_pgdat virtual address lookup with pfn_to_kaddr macro. Signed-off-by: David Rientjes <rientjes@cs.washington.edu> Signed-off-by: Andi Kleen <ak@suse.de> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org>
2006-12-07[PATCH] i386: Use probe_kernel_address instead of __get_user in fault pathsAndi Kleen1-6/+6
Makes the intention of the code cleaner to read and avoids a potential deadlock on mmap_sem. Also change the types of the arguments to not include __user because they're really not user addresses. Signed-off-by: Andi Kleen <ak@suse.de>
2006-12-07[PATCH] i386: Use CLFLUSH instead of WBINVD in change_page_attrAndi Kleen1-7/+17
CLFLUSH is a lot faster than WBINVD so try to use that. Signed-off-by: Andi Kleen <ak@suse.de>
2006-10-11[PATCH] mm: use symbolic names instead of indices for zone initialisationMel Gorman1-5/+6
Arch-independent zone-sizing is using indices instead of symbolic names to offset within an array related to zones (max_zone_pfns). The unintended impact is that ZONE_DMA and ZONE_NORMAL is initialised on powerpc instead of ZONE_DMA and ZONE_HIGHMEM when CONFIG_HIGHMEM is set. As a result, the the machine fails to boot but will boot with CONFIG_HIGHMEM turned off. The following patch properly initialises the max_zone_pfns[] array and uses symbolic names instead of indices in each architecture using arch-independent zone-sizing. Two users have successfully booted their powerpcs with it (one an ibook G4). It has also been boot tested on x86, x86_64, ppc64 and ia64. Please merge for 2.6.19-rc2. Credit to Benjamin Herrenschmidt for identifying the bug and rolling the first fix. Additional credit to Johannes Berg and Andreas Schwab for reporting the problem and testing on powerpc. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-03BUG_ON cleanups in arch/i386Eric Sesterhenn2-4/+2
This changes a couple of if() BUG(); constructs to BUG_ON(); so it can be safely optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-10-01[PATCH] paravirt: update pte hookZachary Amsden1-0/+1
Add a pte_update_hook which notifies about pte changes that have been made without using the set_pte / clear_pte interfaces. This allows shadow mode hypervisors which do not trap on page table access to maintain synchronized shadows. It also turns out, there was one pte update in PAE mode that wasn't using any accessor interface at all for setting NX protection. Considering it is PAE specific, and the accessor is i386 specific, I didn't want to add a generic encapsulation of this behavior yet. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01[PATCH] paravirt: kpte flushZachary Amsden1-11/+7
Create a new PTE function which combines clearing a kernel PTE with the subsequent flush. This allows the two to be easily combined into a single hypercall or paravirt-op. More subtly, reverse the order of the flush for kmap_atomic. Instead of flushing on establishing a mapping, flush on clearing a mapping. This eliminates the possibility of leaving stale kmap entries which may still have valid TLB mappings. This is required for direct mode hypervisors, which need to reprotect all mappings of a given page when changing the page type from a normal page to a protected page (such as a page table or descriptor table page). But it also provides some nicer semantics for real hardware, by providing extra debug-proofing against using stale mappings, as well as ensuring that no stale mappings exist when changing the cacheability attributes of a page, which could lead to cache conflicts when two different types of mappings exist for the same page. Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-01[PATCH] Generic ioremap_page_range: i386 conversionHaavard Skinnemoen1-78/+6
Convert i386 to use generic ioremap_page_range() [bunk@stusta.de: build fix] Signed-off-by: Haavard Skinnemoen <hskinnemoen@atmel.com> Acked-by: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29[PATCH] pidspace: is_init()Sukadev Bhattiprolu1-1/+1
This is an updated version of Eric Biederman's is_init() patch. (http://lkml.org/lkml/2006/2/6/280). It applies cleanly to 2.6.18-rc3 and replaces a few more instances of ->pid == 1 with is_init(). Further, is_init() checks pid and thus removes dependency on Eric's other patches for now. Eric's original description: There are a lot of places in the kernel where we test for init because we give it special properties. Most significantly init must not die. This results in code all over the kernel test ->pid == 1. Introduce is_init to capture this case. With multiple pid spaces for all of the cases affected we are looking for only the first process on the system, not some other process that has pid == 1. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Serge Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: <lxc-devel@lists.sourceforge.net> Acked-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-29[PATCH] make PROT_WRITE imply PROT_READJason Baron1-1/+1
Make PROT_WRITE imply PROT_READ for a number of architectures which don't support write only in hardware. While looking at this, I noticed that some architectures which do not support write only mappings already take the exact same approach. For example, in arch/alpha/mm/fault.c: " if (cause < 0) { if (!(vma->vm_flags & VM_EXEC)) goto bad_area; } else if (!cause) { /* Allow reads even for write-only mappings */ if (!(vma->vm_flags & (VM_READ | VM_WRITE))) goto bad_area; } else { if (!(vma->vm_flags & VM_WRITE)) goto bad_area; } " Thus, this patch brings other architectures which do not support write only mappings in-line and consistent with the rest. I've verified the patch on ia64, x86_64 and x86. Additional discussion: Several architectures, including x86, can not support write-only mappings. The pte for x86 reserves a single bit for protection and its two states are read only or read/write. Thus, write only is not supported in h/w. Currently, if i 'mmap' a page write-only, the first read attempt on that page creates a page fault and will SEGV. That check is enforced in arch/blah/mm/fault.c. However, if i first write that page it will fault in and the pte will be set to read/write. Thus, any subsequent reads to the page will succeed. It is this inconsistency in behavior that this patch is attempting to address. Furthermore, if the page is swapped out, and then brought back the first read will also cause a SEGV. Thus, any arbitrary read on a page can potentially result in a SEGV. According to the SuSv3 spec, "if the application requests only PROT_WRITE, the implementation may also allow read access." Also as mentioned, some archtectures, such as alpha, shown above already take the approach that i am suggesting. The counter-argument to this raised by Arjan, is that the kernel is enforcing the write only mapping the best it can given the h/w limitations. This is true, however Alan Cox, and myself would argue that the inconsitency in behavior, that is applications can sometimes work/sometimes fails is highly undesireable. If you read through the thread, i think people, came to an agreement on the last patch i posted, as nobody has objected to it... Signed-off-by: Jason Baron <jbaron@redhat.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Roman Zippel <zippel@linux-m68k.org> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Andi Kleen <ak@muc.de> Acked-by: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Arjan van de Ven <arjan@linux.intel.com> Acked-by: Paul Mundt <lethal@linux-sh.org> Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp> Cc: Ian Molton <spyro@f2s.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-27[PATCH] Have x86 use add_active_range() and free_area_init_nodesMel Gorman1-52/+17
Size zones and holes in an architecture independent manner for x86. [akpm@osdl.org: build fix] Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Andi Kleen <ak@muc.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Keith Mannthey" <kmannth@gmail.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26Merge branch 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6Linus Torvalds5-45/+27
* 'for-linus' of git://one.firstfloor.org/home/andi/git/linux-2.6: (225 commits) [PATCH] Don't set calgary iommu as default y [PATCH] i386/x86-64: New Intel feature flags [PATCH] x86: Add a cumulative thermal throttle event counter. [PATCH] i386: Make the jiffies compares use the 64bit safe macros. [PATCH] x86: Refactor thermal throttle processing [PATCH] Add 64bit jiffies compares (for use with get_jiffies_64) [PATCH] Fix unwinder warning in traps.c [PATCH] x86: Allow disabling early pci scans with pci=noearly or disallowing conf1 [PATCH] x86: Move direct PCI scanning functions out of line [PATCH] i386/x86-64: Make all early PCI scans dependent on CONFIG_PCI [PATCH] Don't leak NT bit into next task [PATCH] i386/x86-64: Work around gcc bug with noreturn functions in unwinder [PATCH] Fix some broken white space in ia32_signal.c [PATCH] Initialize argument registers for 32bit signal handlers. [PATCH] Remove all traces of signal number conversion [PATCH] Don't synchronize time reading on single core AMD systems [PATCH] Remove outdated comment in x86-64 mmconfig code [PATCH] Use string instructions for Core2 copy/clear [PATCH] x86: - restore i8259A eoi status on resume [PATCH] i386: Split multi-line printk in oops output. ...
2006-09-26[PATCH] x86: make __FIXADDR_TOP variable to allow it to make space for a hypervisorJeremy Fitzhardinge2-0/+68
Make __FIXADDR_TOP a variable, so that it can be set to not get in the way of address space a hypervisor may want to reserve. Original patch by Gerd Hoffmann <kraxel@suse.de> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Cc: Gerd Hoffmann <kraxel@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] NUMA: Add zone_to_nid functionChristoph Lameter1-1/+1
There are many places where we need to determine the node of a zone. Currently we use a difficult to read sequence of pointer dereferencing. Put that into an inline function and use throughout VM. Maybe we can find a way to optimize the lookup in the future. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] ZVC: Support NR_SLAB_RECLAIMABLE / NR_SLAB_UNRECLAIMABLEChristoph Lameter1-1/+3
Remove the atomic counter for slab_reclaim_pages and replace the counter and NR_SLAB with two ZVC counter that account for unreclaimable and reclaimable slab pages: NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE. Change the check in vmscan.c to refer to to NR_SLAB_RECLAIMABLE. The intend seems to be to check for slab pages that could be freed. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] reduce MAX_NR_ZONES: fix MAX_NR_ZONES array initializationsChristoph Lameter1-1/+1
Fix array initialization in lots of arches The number of zones may now be reduced from 4 to 2 for many arches. Fix the array initialization for the zones array for all architectures so that it is not initializing a fixed number of elements. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] reduce MAX_NR_ZONES: remove two strange uses of MAX_NR_ZONESChristoph Lameter1-1/+1
I keep seeing zones on various platforms that are never used and wonder why we compile support for them into the kernel. Counters show up for HIGHMEM and DMA32 that are alway zero. This patch allows the removal of ZONE_DMA32 for non x86_64 architectures and it will get rid of ZONE_HIGHMEM for arches not using highmem (like 64 bit architectures). If an arch does not define CONFIG_HIGHMEM then ZONE_HIGHMEM will not be defined. Similarly if an arch does not define CONFIG_ZONE_DMA32 then ZONE_DMA32 will not be defined. No current architecture uses all the 4 zones (DMA,DMA32,NORMAL,HIGH) that we have now. The patchset will reduce the number of zones for all platforms. On many platforms that do not have DMA32 or HIGHMEM this will reduce the number of zones by 50%. F.e. ia64 only uses DMA and NORMAL. Large amounts of memory can be saved for larger systemss that may have a few hundred NUMA nodes. With ZONE_DMA32 and ZONE_HIGHMEM support optional MAX_NR_ZONES will be 2 for many non i386 platforms and even for i386 without CONFIG_HIGHMEM set. Tested on ia64, x86_64 and on i386 with and without highmem. The patchset consists of 11 patches that are following this message. One could go even further than this patchset and also make ZONE_DMA optional because some platforms do not need a separate DMA zone and can do DMA to all of memory. This could reduce MAX_NR_ZONES to 1. Such a patchset will hopefully follow soon. This patch: Fix strange uses of MAX_NR_ZONES Sometimes we use MAX_NR_ZONES - x to refer to a zone. Make that explicit. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] convert i386 NUMA KVA space to bootmemkeith mannthey1-8/+21
Address a long standing issue of booting with an initrd on an i386 numa system. Currently (and always) the numa kva area is mapped into low memory by finding the end of low memory and moving that mark down (thus creating space for the kva). The issue with this is that Grub loads initrds into this similar space so when the kernel check the initrd it finds it outside max_low_pfn and disables it (it thinks the initrd is not mapped into usable memory) thus initrd enabled kernels can't boot i386 numa :( My solution to the problem just converts the numa kva area to use the bootmem allocator to save it's area (instead of moving the end of low memory). Using bootmem allows the kva area to be mapped into more diverse addresses (not just the end of low memory) and enables the kva area to be mapped below the initrd if present. I have tested this patch on numaq(no initrd) and summit(initrd) i386 numa based systems. [akpm@osdl.org: cleanups] Signed-off-by: Keith Mannthey <kmannth@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-09-26[PATCH] i386: Allow a kernel not to be in ring 0Rusty Russell2-8/+5
We allow for the fact that the guest kernel may not run in ring 0. This requires some abstraction in a few places when setting %cs or checking privilege level (user vs kernel). This is Chris' [RFC PATCH 15/33] move segment checks to subarch, except rather than using #define USER_MODE_MASK which depends on a config option, we use Zach's more flexible approach of assuming ring 3 == userspace. I also used "get_kernel_rpl()" over "get_kernel_cs()" because I think it reads better in the code... 1) Remove the hardcoded 3 and introduce #define SEGMENT_RPL_MASK 3 2) Add a get_kernel_rpl() macro, and don't assume it's zero. And: Clean up of patch for letting kernel run other than ring 0: a. Add some comments about the SEGMENT_IS_*_CODE() macros. b. Add a USER_RPL macro. (Code was comparing a value to a mask in some places and to the magic number 3 in other places.) c. Add macros for table indicator field and use them. d. Change the entry.S tests for LDT stack segment to use the macros Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Zachary Amsden <zach@vmware.com> Signed-off-by: Jeremy Fitzhardinge <jeremy@xensource.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26[PATCH] i386: make fault notifier unconditional and export itAndi Kleen1-10/+4
It's needed for external debuggers and overhead is very small. Also make the actual notifier chain they use static Cc: jbeulich@novell.com Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26[PATCH] i386: Replace i386 open-coded cmdline parsing withRusty Russell1-6/+12
This patch replaces the open-coded early commandline parsing throughout the i386 boot code with the generic mechanism (already used by ppc, powerpc, ia64 and s390). The code was inconsistent with whether it deletes the option from the cmdline or not, meaning some of these will get passed through the environment into init. This transformation is mainly mechanical, but there are some notable parts: 1) Grammar: s/linux never set's it up/linux never sets it up/ 2) Remove hacked-in earlyprintk= option scanning. When someone actually implements CONFIG_EARLY_PRINTK, then they can use early_param(). [AK: actually it is implemented, but I'm adding the early_param it in the next x86-64 patch] 3) Move declaration of generic_apic_probe() from setup.c into asm/apic.h 4) Various parameters now moved into their appropriate files (thanks Andi). 5) All parse functions which examine arg need to check for NULL, except one where it has subtle humor value. AK: readded acpi_sci handling which was completely dropped AK: moved some more variables into acpi/boot.c Cc: len.brown@intel.com Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-26[PATCH] i386: initialize end-of-memory variables as early as possibleJan Beulich3-21/+6
Move initialization of all memory end variables to as early as possible, so that dependent code doesn't need to check whether these variables have already been set. Change the range check in kunmap_atomic to actually make use of this so that the no-mapping-estabished path (under CONFIG_DEBUG_HIGHMEM) gets used only when the address is inside the lowmem area (and BUG() otherwise). Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de>
2006-09-25[PATCH] i386 bootioremap / kexec fixkeith mannthey1-2/+5
With CONFIG_PHYSICAL_START set to a non default values the i386 boot_ioremap code calculated its pte index wrong and users of boot_ioremap have their areas incorrectly mapped (for me SRAT table not mapped during early boot). This patch removes the addr < BOOT_PTE_PTRS constraint. [ Keith says this is applicable to 2.6.16 and 2.6.17 as well ] Signed-off-by: Keith Mannthey<kmannth@us.ibm.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: <stable@kernel.org> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>