<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-dev/fs/hugetlbfs, branch master</title>
<subtitle>Linux kernel development work - see feature branches</subtitle>
<id>https://git.zx2c4.com/linux-dev/atom/fs/hugetlbfs?h=master</id>
<link rel='self' href='https://git.zx2c4.com/linux-dev/atom/fs/hugetlbfs?h=master'/>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/'/>
<updated>2022-11-08T23:57:22Z</updated>
<entry>
<title>hugetlbfs: don't delete error page from pagecache</title>
<updated>2022-11-08T23:57:22Z</updated>
<author>
<name>James Houghton</name>
<email>jthoughton@google.com</email>
</author>
<published>2022-10-18T20:01:25Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=8625147cafaa9ba74713d682f5185eb62cb2aedb'/>
<id>urn:sha1:8625147cafaa9ba74713d682f5185eb62cb2aedb</id>
<content type='text'>
This change is very similar to the change that was made for shmem [1], and
it solves the same problem for hugetlbfs.

Currently, when poison is found in a HugeTLB page, the page is removed
from the page cache.  That means that attempting to map or read that
hugepage in the future will result in a new hugepage being allocated
instead of the user being notified that the page was poisoned.  As [1] states,
this is effectively memory corruption.

The fix is to leave the page in the page cache.  If the user attempts to
use a poisoned HugeTLB page with a syscall, the syscall will fail with
EIO, the same error code that shmem uses.  For attempts to map the page,
the thread will get a BUS_MCEERR_AR SIGBUS.
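
As a rough userspace illustration (the mount point and path here are
hypothetical, and this program is not part of the patch), a read() of a
poisoned hugetlb page now fails with EIO instead of silently returning
data from a freshly allocated page:

    #include &lt;errno.h&gt;
    #include &lt;fcntl.h&gt;
    #include &lt;stdio.h&gt;
    #include &lt;unistd.h&gt;

    int main(void)
    {
            char buf[4096];
            int fd = open("/mnt/huge/file", O_RDONLY);  /* hypothetical path */

            if (fd &lt; 0)
                    return 1;
            /* Before this change, this read could return zeroes from a
             * newly allocated hugepage; after it, the read fails. */
            if (read(fd, buf, sizeof(buf)) &lt; 0 &amp;&amp; errno == EIO)
                    printf("poisoned page reported as EIO\n");
            close(fd);
            return 0;
    }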

[1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
Signed-off-by: James Houghton &lt;jthoughton@google.com&gt;
Reviewed-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Reviewed-by: Naoya Horiguchi &lt;naoya.horiguchi@nec.com&gt;
Tested-by: Naoya Horiguchi &lt;naoya.horiguchi@nec.com&gt;
Reviewed-by: Yang Shi &lt;shy828301@gmail.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: James Houghton &lt;jthoughton@google.com&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2022-10-11T02:45:17Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2022-10-11T02:45:17Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=f721d24e5dae8358b49b24399d27ba5d12a7e049'/>
<id>urn:sha1:f721d24e5dae8358b49b24399d27ba5d12a7e049</id>
<content type='text'>
Pull vfs tmpfile updates from Al Viro:
 "Miklos' -&gt;tmpfile() signature change; pass an unopened struct file to
  it, let it open the damn thing. Allows to add tmpfile support to FUSE"

* tag 'pull-tmpfile' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fuse: implement -&gt;tmpfile()
  vfs: open inside -&gt;tmpfile()
  vfs: move open right after -&gt;tmpfile()
  vfs: make vfs_tmpfile() static
  ovl: use vfs_tmpfile_open() helper
  cachefiles: use vfs_tmpfile_open() helper
  cachefiles: only pass inode to *mark_inode_inuse() helpers
  cachefiles: tmpfile error handling cleanup
  hugetlbfs: cleanup mknod and tmpfile
  vfs: add vfs_tmpfile_open() helper
</content>
</entry>
<entry>
<title>hugetlb: clean up code checking for fault/truncation races</title>
<updated>2022-10-03T21:03:17Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2022-09-14T22:18:10Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=fa27759af4a6d7494c986c44695b13bcd6eaf46b'/>
<id>urn:sha1:fa27759af4a6d7494c986c44695b13bcd6eaf46b</id>
<content type='text'>
With the new hugetlb vma lock in place, it can also be used to handle page
fault races with file truncation.  The lock is taken in read mode at the
beginning of the fault path.  During truncation, it is taken in write
mode for each vma which has the file mapped.  The file's size (i_size) is
modified before taking the vma lock to unmap.

How are races handled?

The page fault code checks i_size early in processing after taking the vma
lock.  If the fault is beyond i_size, the fault is aborted.  If the fault
is not beyond i_size, the fault will continue and a new page will be added
to the file.  It could be that the truncation code modifies i_size after
the check in the fault code.  That is OK, as the truncation code will soon
remove the page.  The truncation code will wait until the fault is
finished, as it
must obtain the vma lock in write mode.
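
A minimal sketch of that early check (simplified; the exact fault path
code differs, and the helper names are those added by this series):

    hugetlb_vma_lock_read(vma);
    size = i_size_read(mapping-&gt;host) &gt;&gt; huge_page_shift(h);
    if (idx &gt;= size) {
            /* Fault is beyond i_size: abort, truncation wins. */
            hugetlb_vma_unlock_read(vma);
            return VM_FAULT_SIGBUS;
    }
    /*
     * Otherwise continue the fault.  If truncation shrinks i_size
     * after this check, it must take the vma lock in write mode and
     * so waits for this fault to finish, then removes the page.
     */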

This patch cleans up/removes late checks in the fault paths that try to
back out pages racing with truncation.  As noted above, we just let the
truncation code remove the pages.

[mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
  Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: James Houghton &lt;jthoughton@google.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@linux.dev&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Prakash Sangappa &lt;prakash.sangappa@oracle.com&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>hugetlb: use new vma_lock for pmd sharing synchronization</title>
<updated>2022-10-03T21:03:17Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2022-09-14T22:18:09Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=40549ba8f8e0ed1f8b235979563f619e9aa34fdf'/>
<id>urn:sha1:40549ba8f8e0ed1f8b235979563f619e9aa34fdf</id>
<content type='text'>
The new hugetlb vma lock is used to address this race:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   &lt;------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...

The vma_lock is used as follows:
- During fault processing.  The lock is acquired in read mode before
  taking the page table lock and allocating entries (huge_pte_alloc).  The
  lock is held until code is finished with the page table entry (ptep).
- The lock must be held in write mode whenever huge_pmd_unshare is
  called (see the sketch below).
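
A minimal sketch of these rules (illustrative only; argument lists are
simplified and may not match the exact kernel signatures):

    /* fault path: read mode around page table lookup/allocation */
    hugetlb_vma_lock_read(vma);
    ptep = huge_pte_alloc(mm, vma, address, huge_page_size(h));
    /* ... operate on the page table entry ... */
    hugetlb_vma_unlock_read(vma);

    /* unshare path: write mode excludes all faulting threads */
    hugetlb_vma_lock_write(vma);
    i_mmap_lock_write(vma-&gt;vm_file-&gt;f_mapping);
    huge_pmd_unshare(mm, vma, &amp;address, ptep);
    i_mmap_unlock_write(vma-&gt;vm_file-&gt;f_mapping);
    hugetlb_vma_unlock_write(vma);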

Lock ordering issues come into play when unmapping a page from all
vmas mapping the page.  The i_mmap_rwsem must be held to search for the
vmas, and the vma lock must be held before calling the unmap code, which
will call huge_pmd_unshare.  This is done today in:
- try_to_migrate_one and try_to_unmap_one for page migration and memory
  error handling.  In these routines we 'try' to obtain the vma lock and
  fail to unmap if unsuccessful.  Calling routines already deal with the
  failure of unmapping.
- hugetlb_vmdelete_list for truncation and hole punch.  This routine
  also tries to acquire the vma lock.  If it fails, it skips the
  unmapping.  However, we can not have file truncation or hole punch
  fail because of contention.  After hugetlb_vmdelete_list, truncation
  and hole punch call remove_inode_hugepages.  remove_inode_hugepages
  checks for mapped pages and calls hugetlb_unmap_file_folio to unmap
  them.  hugetlb_unmap_file_folio is designed to drop locks and reacquire
  them in the correct order to guarantee unmap success.
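
The 'try' pattern used in the first two cases is roughly (an illustrative
sketch; hugetlb_vma_trylock_write is added by this series):

    if (!hugetlb_vma_trylock_write(vma)) {
            /* A fault path holds the lock: skip this vma.  Callers
             * already tolerate a failed unmap. */
            return;
    }
    /* ... unmap, possibly calling huge_pmd_unshare ... */
    hugetlb_vma_unlock_write(vma);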

Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: James Houghton &lt;jthoughton@google.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@linux.dev&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Prakash Sangappa &lt;prakash.sangappa@oracle.com&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>hugetlb: create hugetlb_unmap_file_folio to unmap single file folio</title>
<updated>2022-10-03T21:03:17Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2022-09-14T22:18:08Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=378397ccb8e5a695a42e819df545ccd28641b683'/>
<id>urn:sha1:378397ccb8e5a695a42e819df545ccd28641b683</id>
<content type='text'>
Create the new routine hugetlb_unmap_file_folio that will unmap a single
file folio.  This is refactored code from hugetlb_vmdelete_list, modified
to do locking within the routine itself and to check whether the folio is
mapped within a specific vma before unmapping.

This refactoring will be put to use and expanded upon in a subsequent
patch adding vma specific locking.

Link: https://lkml.kernel.org/r/20220914221810.95771-8-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: James Houghton &lt;jthoughton@google.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@linux.dev&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Prakash Sangappa &lt;prakash.sangappa@oracle.com&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>hugetlb: create remove_inode_single_folio to remove single file folio</title>
<updated>2022-10-03T21:03:16Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2022-09-14T22:18:05Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=c86272287bc65cb3d698a95c19651265e9f287cd'/>
<id>urn:sha1:c86272287bc65cb3d698a95c19651265e9f287cd</id>
<content type='text'>
Create the new routine remove_inode_single_folio that will remove a single
folio from a file.  This is refactored code from remove_inode_hugepages.
It checks for the uncommon case in which the folio is still mapped and, if
so, unmaps it.

No functional change.  This refactoring will be put to use and expanded
upon in subsequent patches.

Link: https://lkml.kernel.org/r/20220914221810.95771-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: James Houghton &lt;jthoughton@google.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@linux.dev&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Prakash Sangappa &lt;prakash.sangappa@oracle.com&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache</title>
<updated>2022-10-03T21:03:16Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2022-09-14T22:18:04Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=7e1813d48dd30e6c6f235f6661d1bc108fcab528'/>
<id>urn:sha1:7e1813d48dd30e6c6f235f6661d1bc108fcab528</id>
<content type='text'>
remove_huge_page removes a hugetlb page from the page cache.  Rename it to
hugetlb_delete_from_page_cache, a more descriptive name.
huge_add_to_page_cache is global in scope, but only deals with hugetlb
pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.

Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: James Houghton &lt;jthoughton@google.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@linux.dev&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Prakash Sangappa &lt;prakash.sangappa@oracle.com&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization</title>
<updated>2022-10-03T21:03:16Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2022-09-14T22:18:03Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=3a47c54f09c4c89128d8f67d49296b1c25b317d0'/>
<id>urn:sha1:3a47c54f09c4c89128d8f67d49296b1c25b317d0</id>
<content type='text'>
Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues.  Revert the code and go back to only taking
the semaphore in huge_pmd_share during the fault path.

Keep the code that takes i_mmap_rwsem in write mode before calling
try_to_unmap, as this is required if huge_pmd_unshare is called.

NOTE: Reverting this code does expose the following race condition.

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   &lt;------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

It is unknown if the above race was ever experienced by a user.  It was
discovered via code inspection when the issue was initially addressed.

In subsequent patches, a new synchronization mechanism will be added to
coordinate pmd sharing and eliminate this race.

Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: James Houghton &lt;jthoughton@google.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@linux.dev&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Prakash Sangappa &lt;prakash.sangappa@oracle.com&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race</title>
<updated>2022-10-03T21:03:16Z</updated>
<author>
<name>Mike Kravetz</name>
<email>mike.kravetz@oracle.com</email>
</author>
<published>2022-09-14T22:18:02Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=188a39725ad7ded2d13e752a1a620152b0750175'/>
<id>urn:sha1:188a39725ad7ded2d13e752a1a620152b0750175</id>
<content type='text'>
Patch series "hugetlb: Use new vma lock for huge pmd sharing
synchronization", v2.

hugetlb fault scalability regressions have recently been reported [1]. 
This is not the first such report, as regressions were also noted when
commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") was added [2] in v5.7.  At that time, a proposal to
address the regression was suggested [3] but went nowhere.

The regression and the benefit of this patch series are not evident when
using the vm_scalability benchmark reported in [2] on a recent kernel.
Results from running:
"./usemem -n 48 --prealloc --prefault -O -U 3448054972"

                    48 sample Avg (all kernels next-20220913)
unmodified             revert i_mmap_sema locking   vma sema locking (series)
-----------------------------------------------------------------------------
498150 KB/s            501934 KB/s                  504793 KB/s

The recent regression report [1] notes page fault and fork latency of
shared hugetlb mappings.  To measure this, I created two simple programs:
1) map a shared hugetlb area, write fault all pages, unmap area
   Do this in a continuous loop to measure faults per second
2) map a shared hugetlb area, write fault a few pages, fork and exit
   Do this in a continuous loop to measure forks per second
These programs were run on a 48 CPU VM with 320GB memory.  The shared
mapping size was 250GB.  For comparison, a single instance of each program
was run.  Then, multiple instances were run in parallel to introduce lock
contention.  Changing the locking scheme results in a significant
performance benefit.

test		instances	unmodified	revert		vma
--------------------------------------------------------------------------
faults per sec	1		393043		395680		389932
faults per sec  24		 71405		 81191		 79048
forks per sec   1		  2802		  2747		  2725
forks per sec   24		   439		   536		   500
Combined faults 24		  1621		 68070		 53662
Combined forks  24		   358		    67		   142

Combined test is when running both faulting program and forking program
simultaneously.
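
For reference, faulting test (1) amounts to a loop like the following
(a hypothetical reconstruction, not the actual test source; assumes 2MB
hugepages):

    #define _GNU_SOURCE
    #include &lt;sys/mman.h&gt;

    #define SZ (250UL &lt;&lt; 30)  /* 250GB shared hugetlb mapping */

    int main(void)
    {
            for (;;) {
                    char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                                   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB,
                                   -1, 0);
                    if (p == MAP_FAILED)
                            return 1;
                    /* write fault every 2MB hugepage */
                    for (unsigned long off = 0; off &lt; SZ; off += 2UL &lt;&lt; 20)
                            p[off] = 1;
                    munmap(p, SZ);  /* one iteration == one sample */
            }
    }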

Patches 1 and 2 of this series revert c0d0381ade79 and 87bf91d39bb5, which
depends on c0d0381ade79.  Acquisition of i_mmap_rwsem is still required in
the fault path to establish pmd sharing, so this is moved back to
huge_pmd_share.  With c0d0381ade79 reverted, this race is exposed:

Faulting thread                                 Unsharing thread
...                                                  ...
ptep = huge_pte_offset()
      or
ptep = huge_pte_alloc()
...
                                                i_mmap_lock_write
                                                lock page table
ptep invalid   &lt;------------------------        huge_pmd_unshare()
Could be in a previously                        unlock_page_table
sharing process or worse                        i_mmap_unlock_write
...
ptl = huge_pte_lock(ptep)
get/update pte
set_pte_at(pte, ptep)

Reverting 87bf91d39bb5 exposes races in page fault/file truncation.  When
the new vma lock is put to use in patch 8, this will handle the fault/file
truncation races.  This is explained in patch 9 where code associated with
these races is cleaned up.

Patches 3 - 5 restructure existing code in preparation for using the new
vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
this semaphore will be held in read mode for the duration of fault
processing, and held in write mode for unmap operations which may call
huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
synchronize huge pmd sharing.  However, it is only required in the fault
path when setting up sharing, and will be acquired in huge_pmd_share().

Patch 6 adds the new vma lock and all supporting routines, but does not
actually change code to use the new lock.

Patch 7 refactors code in preparation for using the new lock.  Patch 8
then adds code to make use of this new vma lock.  Unfortunately, the
fault code and truncate/hole punch code would naturally take locks in the
opposite order which could lead to deadlock.  Since the performance of
page faults is more important, the truncation/hole punch code is modified
to back out and take locks in the correct order if necessary.
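
A minimal sketch of that back-out (shape only; the real code lives in
hugetlb_unmap_file_folio and differs in detail):

    /*
     * Truncation holds the folio lock but must take the vma lock
     * first.  If trylock fails, drop everything, reacquire in the
     * fault path's order, and retry the unmap.
     */
    if (!hugetlb_vma_trylock_write(vma)) {
            folio_unlock(folio);
            i_mmap_unlock_write(mapping);
            hugetlb_vma_lock_write(vma);
            i_mmap_lock_write(mapping);
            folio_lock(folio);
    }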

[1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
[2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
[3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/


This patch (of 9):

Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
synchronization") added code to take i_mmap_rwsem in read mode for the
duration of fault processing.  However, this has been shown to cause
performance/scaling issues, so that code will be reverted.  Since the use
of i_mmap_rwsem to address page fault/truncate races depends on that code,
it must also be reverted.

In a subsequent patch, code will be added to detect the fault/truncate
race and back out operations as required.

Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz &lt;mike.kravetz@oracle.com&gt;
Reviewed-by: Miaohe Lin &lt;linmiaohe@huawei.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: "Aneesh Kumar K.V" &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Cc: Axel Rasmussen &lt;axelrasmussen@google.com&gt;
Cc: David Hildenbrand &lt;david@redhat.com&gt;
Cc: Davidlohr Bueso &lt;dave@stgolabs.net&gt;
Cc: James Houghton &lt;jthoughton@google.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Mina Almasry &lt;almasrymina@google.com&gt;
Cc: Muchun Song &lt;songmuchun@bytedance.com&gt;
Cc: Naoya Horiguchi &lt;naoya.horiguchi@linux.dev&gt;
Cc: Pasha Tatashin &lt;pasha.tatashin@soleen.com&gt;
Cc: Peter Xu &lt;peterx@redhat.com&gt;
Cc: Prakash Sangappa &lt;prakash.sangappa@oracle.com&gt;
Cc: Sven Schnelle &lt;svens@linux.ibm.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>vfs: open inside -&gt;tmpfile()</title>
<updated>2022-09-24T05:00:00Z</updated>
<author>
<name>Miklos Szeredi</name>
<email>mszeredi@redhat.com</email>
</author>
<published>2022-09-24T05:00:00Z</published>
<link rel='alternate' type='text/html' href='https://git.zx2c4.com/linux-dev/commit/?id=863f144f12add1f4eab80b70561a90857c524a8b'/>
<id>urn:sha1:863f144f12add1f4eab80b70561a90857c524a8b</id>
<content type='text'>
This is in preparation for adding tmpfile support to fuse, which requires
that tmpfile creation and opening be done as a single operation.

Replace the 'struct dentry *' argument of i_op-&gt;tmpfile with
'struct file *'.

Call finish_open_simple() as the last thing in -&gt;tmpfile() instances (it
may be omitted in the error case).

Change the d_tmpfile() argument to 'struct file *' as well, to make
callers more readable.
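
A minimal instance of the new signature might look like this (the 'foo'
filesystem and its foo_new_inode() helper are hypothetical):

    static int foo_tmpfile(struct user_namespace *mnt_userns,
                           struct inode *dir, struct file *file,
                           umode_t mode)
    {
            struct inode *inode = foo_new_inode(dir, mode);  /* assumed */

            if (IS_ERR(inode))
                    return PTR_ERR(inode);
            d_tmpfile(file, inode);  /* d_tmpfile() now takes the file */
            return finish_open_simple(file, 0);  /* open as the last step */
    }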

Reviewed-by: Christian Brauner (Microsoft) &lt;brauner@kernel.org&gt;
Signed-off-by: Miklos Szeredi &lt;mszeredi@redhat.com&gt;
</content>
</entry>
</feed>
