]> www.infradead.org Git - users/willy/pagecache.git/log
users/willy/pagecache.git
2 years agophyr: Add basic definition phyr
Matthew Wilcox (Oracle) [Sat, 2 Jul 2022 19:45:32 +0000 (15:45 -0400)]
phyr: Add basic definition

Add the data structure and the documentation.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agoscatterlist: Extract get_user_sgtable() from ib_umem_get()
Matthew Wilcox (Oracle) [Sat, 2 Jul 2022 19:14:01 +0000 (15:14 -0400)]
scatterlist: Extract get_user_sgtable() from ib_umem_get()

This is a nice simplification on its own, as well as being a preparatory
step for the phyr.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/debug: Convert __dump_page() to use a folio throughout
Matthew Wilcox (Oracle) [Sat, 2 Jul 2022 18:26:34 +0000 (14:26 -0400)]
mm/debug: Convert __dump_page() to use a folio throughout

We have to be a little careful here because a corrupt page may appear
to be a tail page, and that can trip some debugging assertions leading
to an infinite loop of dumping pages.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agoslub: Allocate frozen pages
Matthew Wilcox (Oracle) [Tue, 31 May 2022 13:36:14 +0000 (09:36 -0400)]
slub: Allocate frozen pages

Since slub does not use the page refcount, it can allocate and
free frozen pages, saving one atomic operation per free.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agoslab: Allocate frozen pages
Matthew Wilcox (Oracle) [Tue, 31 May 2022 03:02:44 +0000 (23:02 -0400)]
slab: Allocate frozen pages

Since slab does not use the page refcount, it can allocate and
free frozen pages, saving one atomic operation per free.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/mempolicy: Add alloc_frozen_pages()
Matthew Wilcox (Oracle) [Tue, 31 May 2022 03:00:54 +0000 (23:00 -0400)]
mm/mempolicy: Add alloc_frozen_pages()

Provide an interface to allocate pages from the page allocator without
incrementing their refcount.  This saves an atomic operation on free,
which may be beneficial to some users (eg slab).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/page_alloc: Add __alloc_frozen_pages()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:47:57 +0000 (16:47 -0400)]
mm/page_alloc: Add __alloc_frozen_pages()

Defer the initialisation of the page refcount to the new __alloc_pages()
wrapper and turn the old __alloc_pages() into __alloc_frozen_pages().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to end of __alloc_pages()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:38:29 +0000 (16:38 -0400)]
mm/page_alloc: Move set_page_refcounted() to end of __alloc_pages()

Remove some code duplication by calling set_page_refcounted() at the
end of __alloc_pages() instead of after each call that can allocate
a page.  That means that we free a frozen page if we've exceeded the
allowed memcg memory.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_slowpath()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:33:11 +0000 (16:33 -0400)]
mm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_slowpath()

In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_slowpath().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_direct_reclaim()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:27:38 +0000 (16:27 -0400)]
mm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_direct_reclaim()

In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_direct_reclaim().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_direct_compact()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:24:46 +0000 (16:24 -0400)]
mm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_direct_compact()

In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_direct_compact().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_may_oom()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:22:55 +0000 (16:22 -0400)]
mm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_may_oom()

In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_may_oom().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_cpuset_fallback()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:19:48 +0000 (16:19 -0400)]
mm/page_alloc: Move set_page_refcounted() to callers of __alloc_pages_cpuset_fallback()

In preparation for allocating frozen pages, stop initialising the page
refcount in __alloc_pages_cpuset_fallback().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to callers of get_page_from_freelist()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:11:50 +0000 (16:11 -0400)]
mm/page_alloc: Move set_page_refcounted() to callers of get_page_from_freelist()

In preparation for allocating frozen pages, stop initialising the page
refcount in get_page_from_freelist().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to callers of prep_new_page()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 20:05:45 +0000 (16:05 -0400)]
mm/page_alloc: Move set_page_refcounted() to callers of prep_new_page()

In preparation for allocating frozen pages, stop initialising the page
refcount in prep_new_page().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Move set_page_refcounted() to callers of post_alloc_hook()
Matthew Wilcox (Oracle) [Tue, 14 Jun 2022 19:56:10 +0000 (15:56 -0400)]
mm/page_alloc: Move set_page_refcounted() to callers of post_alloc_hook()

In preparation for allocating frozen pages, stop initialising
the page refcount in post_alloc_hook().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/page_alloc: Export free_frozen_pages() instead of free_unref_page()
Matthew Wilcox (Oracle) [Mon, 30 May 2022 22:14:32 +0000 (18:14 -0400)]
mm/page_alloc: Export free_frozen_pages() instead of free_unref_page()

This API makes more sense for slab to use and it works perfectly
well for swap.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/page_alloc: Rename free_the_page() to free_frozen_pages()
Matthew Wilcox (Oracle) [Mon, 30 May 2022 22:03:42 +0000 (18:03 -0400)]
mm/page_alloc: Rename free_the_page() to free_frozen_pages()

In preparation for making this function available outside page_alloc,
rename it to free_frozen_pages(), which fits better with the other
memory allocation/free functions.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
2 years agomm/page_alloc: Cache page_zone() result in free_unref_page()
Matthew Wilcox (Oracle) [Tue, 9 Aug 2022 15:46:56 +0000 (11:46 -0400)]
mm/page_alloc: Cache page_zone() result in free_unref_page()

Save 17 bytes of text by calculating page_zone() once instead of twice.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agokhugepaged: Convert collapse_file() to use folios more
Matthew Wilcox (Oracle) [Tue, 3 Jan 2023 04:52:56 +0000 (23:52 -0500)]
khugepaged: Convert collapse_file() to use folios more

The newly allocated hpage is now always referred to as a folio.
Removes a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agokhugepaged: Convert alloc_charge_hpage() to take a folio
Matthew Wilcox (Oracle) [Tue, 3 Jan 2023 04:43:38 +0000 (23:43 -0500)]
khugepaged: Convert alloc_charge_hpage() to take a folio

Convert collapse_huge_page() to use a folio throughout.  collapse_file()
is more complex, so convert only the minimum.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomemcg: Convert count_memcg_page_event() to count_memcg_folio_event()
Matthew Wilcox (Oracle) [Tue, 3 Jan 2023 04:19:20 +0000 (23:19 -0500)]
memcg: Convert count_memcg_page_event() to count_memcg_folio_event()

The only caller now has a folio so convert the function to take a folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agokhugepaged: Convert hpage_collapse_alloc_page() to hpage_collapse_alloc_folio()
Matthew Wilcox (Oracle) [Tue, 3 Jan 2023 04:16:40 +0000 (23:16 -0500)]
khugepaged: Convert hpage_collapse_alloc_page() to hpage_collapse_alloc_folio()

Return the folio directly instead of a boolean success.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/khugepaged: Allocate folios
Matthew Wilcox (Oracle) [Sat, 8 May 2021 03:40:19 +0000 (23:40 -0400)]
mm/khugepaged: Allocate folios

khugepaged only wants to deal in terms of folios, so switch to
using the folio allocation functions.  This eliminates the calls to
prep_transhuge_page() and saves dozens of bytes of text.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert deferred_split_huge_page() to deferred_split_folio()
Matthew Wilcox (Oracle) [Tue, 6 Sep 2022 13:39:24 +0000 (09:39 -0400)]
mm: Convert deferred_split_huge_page() to deferred_split_folio()

Now that both callers use a folio, pass the folio in and save a
call to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/huge_memory: Convert get_deferred_split_queue() to take a folio
Matthew Wilcox (Oracle) [Tue, 6 Sep 2022 14:26:39 +0000 (10:26 -0400)]
mm/huge_memory: Convert get_deferred_split_queue() to take a folio

Removes a few calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/huge_memory: Remove page_deferred_list()
Matthew Wilcox (Oracle) [Tue, 6 Sep 2022 12:57:28 +0000 (08:57 -0400)]
mm/huge_memory: Remove page_deferred_list()

Use folio->_deferred_list directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Move page->deferred_list to folio->_deferred_list
Matthew Wilcox (Oracle) [Tue, 6 Sep 2022 12:36:36 +0000 (08:36 -0400)]
mm: Move page->deferred_list to folio->_deferred_list

Remove the entire block of definitions for the second tail page,
and add the deferred list to the struct folio.  This actually moves
_deferred_list to a different offset in struct folio because I don't
see a need to include the padding.

This lets us use list_for_each_entry_safe() in deferred_split_scan()
and avoid a number of calls to compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agodoc: Correct struct folio kernel-doc
Matthew Wilcox (Oracle) [Tue, 3 Jan 2023 03:56:31 +0000 (22:56 -0500)]
doc: Correct struct folio kernel-doc

Insert appropriate public: and private: markers to make the generated
kernel-doc look right.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Remove 'First tail page' members from struct page
Matthew Wilcox (Oracle) [Mon, 5 Sep 2022 00:59:57 +0000 (20:59 -0400)]
mm: Remove 'First tail page' members from struct page

All former users now use the folio equivalents, so remove them from
the definition of struct page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agohugetlb: Remove uses of compound_dtor and compound_nr
Matthew Wilcox (Oracle) [Mon, 5 Sep 2022 00:02:59 +0000 (20:02 -0400)]
hugetlb: Remove uses of compound_dtor and compound_nr

Convert the entire file to use the folio equivalents.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert destroy_large_folio() to use folio_dtor
Matthew Wilcox (Oracle) [Sun, 4 Sep 2022 23:58:56 +0000 (19:58 -0400)]
mm: Convert destroy_large_folio() to use folio_dtor

Replace a use of compound_dtor.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert is_transparent_hugepage() to use a folio
Matthew Wilcox (Oracle) [Sun, 4 Sep 2022 23:57:49 +0000 (19:57 -0400)]
mm: Convert is_transparent_hugepage() to use a folio

Replace a use of page->compound_dtor with its folio equivalent.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert set_compound_page_dtor() and set_compound_order() to folios
Matthew Wilcox (Oracle) [Sun, 4 Sep 2022 23:52:05 +0000 (19:52 -0400)]
mm: Convert set_compound_page_dtor() and set_compound_order() to folios

Replace uses of compound_dtor, compound_order and compound_nr by
their folio equivalents.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Reimplement compound_nr()
Matthew Wilcox (Oracle) [Sun, 4 Sep 2022 23:36:34 +0000 (19:36 -0400)]
mm: Reimplement compound_nr()

Turn compound_nr() into a wrapper around folio_nr_pages().  Similarly
to compound_order(), casting the struct page directly to struct folio
preserves the existing behaviour, while calling page_folio() would change
the behaviour.  Move thp_nr_pages() down in the file so that compound_nr()
can be after folio_nr_pages().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Reimplement compound_order()
Matthew Wilcox (Oracle) [Sun, 4 Sep 2022 23:27:16 +0000 (19:27 -0400)]
mm: Reimplement compound_order()

Make compound_order() use struct folio.  It can't be turned into a wrapper
around folio_order() as a page can be turned into a tail page between
a check in compound_order() and the assertion in folio_test_large().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Remove head_compound_mapcount() and _ptr functions
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 22:08:57 +0000 (17:08 -0500)]
mm: Remove head_compound_mapcount() and _ptr functions

folio_mapcount_ptr(), compound_mapcount_ptr() and subpages_mapcount_ptr()
are all now unused.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert page_mapcount() to use folio_entire_mapcount()
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 20:30:44 +0000 (15:30 -0500)]
mm: Convert page_mapcount() to use folio_entire_mapcount()

Remove a use of head_compound_mapcount().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agohugetlb: Remove uses of folio_mapcount_ptr
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 20:22:11 +0000 (15:22 -0500)]
hugetlb: Remove uses of folio_mapcount_ptr

Use the entire_mapcount field directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm/debug: Remove call to head_compound_mapcount()
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 20:18:23 +0000 (15:18 -0500)]
mm/debug: Remove call to head_compound_mapcount()

Call folio_entire_mapcount() instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Use entire_mapcount in __page_dup_rmap()
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 20:16:03 +0000 (15:16 -0500)]
mm: Use entire_mapcount in __page_dup_rmap()

Remove the use of the compound_mapcount_ptr() wrapper, and add an
assertion that we're not passing a tail page if we're duplicating a PMD.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Use a folio in hugepage_add_anon_rmap() and hugepage_add_new_anon_rmap()
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 19:52:48 +0000 (14:52 -0500)]
mm: Use a folio in hugepage_add_anon_rmap() and hugepage_add_new_anon_rmap()

Remove uses of compound_mapcount_ptr()

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agopage_alloc: Use folio fields directly
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 19:41:32 +0000 (14:41 -0500)]
page_alloc: Use folio fields directly

Rmove the uses of compound_mapcount_ptr(), head_compound_mapcount()
and subpages_mapcount_ptr()

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Add folio_add_new_anon_rmap()
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 19:34:55 +0000 (14:34 -0500)]
mm: Add folio_add_new_anon_rmap()

In contrast to other rmap functions, page_add_new_anon_rmap() is always
called with a freshly allocated page.  That means it can't be called with
a tail page.  Turn page_add_new_anon_rmap() into folio_add_new_anon_rmap()
and add a page_add_new_anon_rmap() wrapper.  Callers can be converted
individually.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert page_add_file_rmap() to use a folio internally
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 19:02:10 +0000 (14:02 -0500)]
mm: Convert page_add_file_rmap() to use a folio internally

The API for page_add_file_rmap() needs to be page-based, because we can
add mappings of individual pages.  But inside the function, we want to
only call compound_head() once and then use the folio APIs instead of
the page APIs that each call compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert page_add_anon_rmap() to use a folio internally
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 19:02:10 +0000 (14:02 -0500)]
mm: Convert page_add_anon_rmap() to use a folio internally

The API for page_add_anon_rmap() needs to be page-based, because we can
add mappings of individual pages.  But inside the function, we want to
only call compound_head() once and then use the folio APIs instead of
the page APIs that each call compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert page_remove_rmap() to use a folio internally
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 18:48:39 +0000 (13:48 -0500)]
mm: Convert page_remove_rmap() to use a folio internally

The API for page_remove_rmap() needs to be page-based, because we can
remove mappings of pages individually.  But inside the function, we want
to only call compound_head() once and then use the folio APIs instead
of the page APIs that each call compound_head().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert total_compound_mapcount() to folio_total_mapcount()
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 18:20:58 +0000 (13:20 -0500)]
mm: Convert total_compound_mapcount() to folio_total_mapcount()

Instead of enforcing that the argument must be a head page by naming,
enforce it with the compiler by making it a folio.  Also rename the
counter in struct folio from _compound_mapcount to _entire_mapcount.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agodoc: Clarify refcount section by referring to folios & pages
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 18:11:50 +0000 (13:11 -0500)]
doc: Clarify refcount section by referring to folios & pages

Include the rename of subpages_mapcount to _nr_pages_mapped.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Convert head_subpages_mapcount() into folio_nr_pages_mapped()
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 17:55:34 +0000 (12:55 -0500)]
mm: Convert head_subpages_mapcount() into folio_nr_pages_mapped()

Calling this 'mapcount' is confusing since mapcount is usually the number
of times something is mapped; instead this is the number of mapped pages.
It's also better to enforce that this is a folio rather than a head page.

Move folio_nr_pages_mapped() into mm/internal.h since this is not
something we want device drivers or filesystems poking at.  Get rid of
folio_subpages_mapcount_ptr() and use folio->_nr_pages_mapped directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agomm: Remove folio_pincount_ptr() and head_compound_pincount()
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 15:38:14 +0000 (10:38 -0500)]
mm: Remove folio_pincount_ptr() and head_compound_pincount()

We can use folio->_pincount directly, since all users are guarded by
tests of compound/large.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agofilemap: i don't know if we even need this
Matthew Wilcox (Oracle) [Fri, 30 Dec 2022 11:46:46 +0000 (06:46 -0500)]
filemap: i don't know if we even need this

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
2 years agoMerge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 30 Dec 2022 00:57:29 +0000 (16:57 -0800)]
Merge tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:
 "Mostly just NVMe, but also a single fixup for BFQ for a regression
  that happened during the merge window. In detail:

   - NVMe pull requests via Christoph:
      - Fix doorbell buffer value endianness (Klaus Jensen)
      - Fix Linux vs NVMe page size mismatch (Keith Busch)
      - Fix a potential use memory access beyong the allocation limit
        (Keith Busch)
      - Fix a multipath vs blktrace NULL pointer dereference (Yanjun
        Zhang)
      - Fix various problems in handling the Command Supported and
        Effects log (Christoph Hellwig)
      - Don't allow unprivileged passthrough of commands that don't
        transfer data but modify logical block content (Christoph
        Hellwig)
      - Add a features and quirks policy document (Christoph Hellwig)
      - Fix some really nasty code that was correct but made smatch
        complain (Sagi Grimberg)

   - Use-after-free regression in BFQ from this merge window (Yu)"

* tag 'block-6.2-2022-12-29' of git://git.kernel.dk/linux:
  nvme-auth: fix smatch warning complaints
  nvme: consult the CSE log page for unprivileged passthrough
  nvme: also return I/O command effects from nvme_command_effects
  nvmet: don't defer passthrough commands with trivial effects to the workqueue
  nvmet: set the LBCC bit for commands that modify data
  nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
  nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition
  docs, nvme: add a feature and quirk policy document
  nvme-pci: update sqsize when adjusting the queue depth
  nvme: fix setting the queue depth in nvme_alloc_io_tag_set
  block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq
  nvme: fix multipath crash caused by flush request when blktrace is enabled
  nvme-pci: fix page size checks
  nvme-pci: fix mempool alloc size
  nvme-pci: fix doorbell buffer value endianness

2 years agoMerge tag 'io_uring-6.2-2022-12-29' of git://git.kernel.dk/linux
Linus Torvalds [Fri, 30 Dec 2022 00:48:21 +0000 (16:48 -0800)]
Merge tag 'io_uring-6.2-2022-12-29' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

 - Two fixes for mutex grabbing when the task state is != TASK_RUNNING
   (me)

 - Check for invalid opcode in io_uring_register() a bit earlier, to
   avoid going through the quiesce machinery just to return -EINVAL
   later in the process (me)

 - Fix for the uapi io_uring header, skipping including time_types.h
   when necessary (Stefan)

* tag 'io_uring-6.2-2022-12-29' of git://git.kernel.dk/linux:
  uapi:io_uring.h: allow linux/time_types.h to be skipped
  io_uring: check for valid register opcode earlier
  io_uring/cancel: re-grab ctx mutex after finishing wait
  io_uring: finish waiting before flushing overflow entries

2 years agoMerge tag 'linux-kselftest-kunit-fixes-6.2-rc2' of git://git.kernel.org/pub/scm/linux...
Linus Torvalds [Fri, 30 Dec 2022 00:43:25 +0000 (16:43 -0800)]
Merge tag 'linux-kselftest-kunit-fixes-6.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull KUnit fix from Shuah Khan:

 - alloc_string_stream_fragment() error path fix to free before
   returning a failure.

* tag 'linux-kselftest-kunit-fixes-6.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  kunit: alloc_string_stream_fragment error handling bug fix

2 years agoMerge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Linus Torvalds [Thu, 29 Dec 2022 18:56:13 +0000 (10:56 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm fixes from Paolo Bonzini:
 "Changes that were posted too late for 6.1, or after the release.

  x86:

   - several fixes to nested VMX execution controls

   - fixes and clarification to the documentation for Xen emulation

   - do not unnecessarily release a pmu event with zero period

   - MMU fixes

   - fix Coverity warning in kvm_hv_flush_tlb()

  selftests:

   - fixes for the ucall mechanism in selftests

   - other fixes mostly related to compilation with clang"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (41 commits)
  KVM: selftests: restore special vmmcall code layout needed by the harness
  Documentation: kvm: clarify SRCU locking order
  KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET
  KVM: x86/xen: Documentation updates and clarifications
  KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi
  KVM: x86/xen: Simplify eventfd IOCTLs
  KVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports
  KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly
  KVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page()
  KVM: Delete extra block of "};" in the KVM API documentation
  kvm: x86/mmu: Remove duplicated "be split" in spte.h
  kvm: Remove the unused macro KVM_MMU_READ_{,UN}LOCK()
  MAINTAINERS: adjust entry after renaming the vmx hyperv files
  KVM: selftests: Mark correct page as mapped in virt_map()
  KVM: arm64: selftests: Don't identity map the ucall MMIO hole
  KVM: selftests: document the default implementation of vm_vaddr_populate_bitmap
  KVM: selftests: Use magic value to signal ucall_alloc() failure
  KVM: selftests: Disable "gnu-variable-sized-type-not-at-end" warning
  KVM: selftests: Include lib.mk before consuming $(CC)
  KVM: selftests: Explicitly disable builtins for mem*() overrides
  ...

2 years agoMerge tag 'nvme-6.2-2022-12-29' of git://git.infradead.org/nvme into block-6.2
Jens Axboe [Thu, 29 Dec 2022 18:31:45 +0000 (11:31 -0700)]
Merge tag 'nvme-6.2-2022-12-29' of git://git.infradead.org/nvme into block-6.2

Pull NVMe fixes from Christoph:

"nvme fixes for Linux 6.2

 - fix various problems in handling the Command Supported and Effects log
   (Christoph Hellwig)
 - don't allow unprivileged passthrough of commands that don't transfer
   data but modify logical block content (Christoph Hellwig)
 - add a features and quirks policy document (Christoph Hellwig)
 - fix some really nasty code that was correct but made smatch complain
   (Sagi Grimberg)"

* tag 'nvme-6.2-2022-12-29' of git://git.infradead.org/nvme:
  nvme-auth: fix smatch warning complaints
  nvme: consult the CSE log page for unprivileged passthrough
  nvme: also return I/O command effects from nvme_command_effects
  nvmet: don't defer passthrough commands with trivial effects to the workqueue
  nvmet: set the LBCC bit for commands that modify data
  nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
  nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition
  docs, nvme: add a feature and quirk policy document

2 years agonvme-auth: fix smatch warning complaints
Sagi Grimberg [Sun, 25 Dec 2022 11:28:51 +0000 (13:28 +0200)]
nvme-auth: fix smatch warning complaints

When initializing auth context, there may be no secrets passed
by the user. Make return code explicit when returning successfully.

smatch warnings:
drivers/nvme/host/auth.c:950 nvme_auth_init_ctrl() warn: missing error code? 'ret'

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
2 years agonvme: consult the CSE log page for unprivileged passthrough
Christoph Hellwig [Tue, 13 Dec 2022 15:13:38 +0000 (16:13 +0100)]
nvme: consult the CSE log page for unprivileged passthrough

Commands like Write Zeros can change the contents of a namespaces without
actually transferring data.  To protect against this, check the Commands
Supported and Effects log is supported by the controller for any
unprivileg command passthrough and refuse unprivileged passthrough if the
command has any effects that can change data or metadata.

Note: While the Commands Support and Effects log page has only been
mandatory since NVMe 2.0, it is widely supported because Windows requires
it for any command passthrough from userspace.

Fixes: e4fbcf32c860 ("nvme: identify-namespace without CAP_SYS_ADMIN")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2 years agonvme: also return I/O command effects from nvme_command_effects
Christoph Hellwig [Wed, 21 Dec 2022 09:12:17 +0000 (10:12 +0100)]
nvme: also return I/O command effects from nvme_command_effects

To be able to use the Commands Supported and Effects Log for allowing
unprivileged passtrough, it needs to be corretly reported for I/O
commands as well.  Return the I/O command effects from
nvme_command_effects, and also add a default list of effects for the
NVM command set.  For other command sets, the Commands Supported and
Effects log is required to be present already.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2 years agonvmet: don't defer passthrough commands with trivial effects to the workqueue
Christoph Hellwig [Wed, 21 Dec 2022 08:51:19 +0000 (09:51 +0100)]
nvmet: don't defer passthrough commands with trivial effects to the workqueue

Mask out the "Command Supported" and "Logical Block Content Change" bits
and only defer execution of commands that have non-trivial effects to
the workqueue for synchronous execution.  This allows to execute admin
commands asynchronously on controllers that provide a Command Supported
and Effects log page, and will keep allowing to execute Write commands
asynchronously once command effects on I/O commands are taken into
account.

Fixes: c1fef73f793b ("nvmet: add passthru code to process commands")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2 years agonvmet: set the LBCC bit for commands that modify data
Christoph Hellwig [Mon, 12 Dec 2022 14:20:56 +0000 (15:20 +0100)]
nvmet: set the LBCC bit for commands that modify data

Write, Write Zeroes, Zone append and a Zone Reset through
Zone Management Send modify the logical block content of a namespace,
so make sure the LBCC bit is reported for them.

Fixes: b5d0b38c0475 ("nvmet: add Command Set Identifier support")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2 years agonvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it
Christoph Hellwig [Mon, 12 Dec 2022 14:20:04 +0000 (15:20 +0100)]
nvmet: use NVME_CMD_EFFECTS_CSUPP instead of open coding it

Use NVME_CMD_EFFECTS_CSUPP instead of open coding it and assign a
single value to multiple array entries instead of repeated assignments.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
2 years agonvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition
Christoph Hellwig [Wed, 21 Dec 2022 09:30:45 +0000 (10:30 +0100)]
nvme: fix the NVME_CMD_EFFECTS_CSE_MASK definition

3 << 16 does not generate the correct mask for bits 16, 17 and 18.
Use the GENMASK macro to generate the correct mask instead.

Fixes: 84fef62d135b ("nvme: check admin passthru command effects")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
2 years agodocs, nvme: add a feature and quirk policy document
Christoph Hellwig [Mon, 12 Dec 2022 10:09:55 +0000 (11:09 +0100)]
docs, nvme: add a feature and quirk policy document

This adds a document about what specification features are supported by
the Linux NVMe driver, and what qualifies for a quirk if an implementation
has problems following the specification.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>
2 years agoMerge branch 'kvm-late-6.1-fixes' into HEAD
Paolo Bonzini [Wed, 28 Dec 2022 11:26:36 +0000 (06:26 -0500)]
Merge branch 'kvm-late-6.1-fixes' into HEAD

x86:

* several fixes to nested VMX execution controls

* fixes and clarification to the documentation for Xen emulation

* do not unnecessarily release a pmu event with zero period

* MMU fixes

* fix Coverity warning in kvm_hv_flush_tlb()

selftests:

* fixes for the ucall mechanism in selftests

* other fixes mostly related to compilation with clang

2 years agoKVM: selftests: restore special vmmcall code layout needed by the harness
Paolo Bonzini [Wed, 30 Nov 2022 18:11:47 +0000 (13:11 -0500)]
KVM: selftests: restore special vmmcall code layout needed by the harness

Commit 8fda37cf3d41 ("KVM: selftests: Stuff RAX/RCX with 'safe' values
in vmmcall()/vmcall()", 2022-11-21) broke the svm_nested_soft_inject_test
because it placed a "pop rbp" instruction after vmmcall.  While this is
correct and mimics what is done in the VMX case, this particular test
expects a ud2 instruction right after the vmmcall, so that it can skip
over it in the L1 part of the test.

Inline a suitably-modified version of vmmcall() to restore the
functionality of the test.

Fixes: 8fda37cf3d41 ("KVM: selftests: Stuff RAX/RCX with 'safe' values in vmmcall()/vmcall()"
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221130181147.9911-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoDocumentation: kvm: clarify SRCU locking order
Paolo Bonzini [Wed, 28 Dec 2022 11:00:22 +0000 (06:00 -0500)]
Documentation: kvm: clarify SRCU locking order

Currently only the locking order of SRCU vs kvm->slots_arch_lock
and kvm->slots_lock is documented.  Extend this to kvm->lock
since Xen emulation got it terribly wrong.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET
Paolo Bonzini [Wed, 28 Dec 2022 10:33:41 +0000 (05:33 -0500)]
KVM: x86: fix deadlock for KVM_XEN_EVTCHN_RESET

While KVM_XEN_EVTCHN_RESET is usually called with no vCPUs running,
if that happened it could cause a deadlock.  This is due to
kvm_xen_eventfd_reset() doing a synchronize_srcu() inside
a kvm->lock critical section.

To avoid this, first collect all the evtchnfd objects in an
array and free all of them once the kvm->lock critical section
is over and th SRCU grace period has expired.

Reported-by: Michal Luczaj <mhal@rbox.co>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agouapi:io_uring.h: allow linux/time_types.h to be skipped
Stefan Metzmacher [Wed, 16 Nov 2022 20:25:24 +0000 (21:25 +0100)]
uapi:io_uring.h: allow linux/time_types.h to be skipped

include/uapi/linux/io_uring.h is synced 1:1 into
liburing:src/include/liburing/io_uring.h.

liburing has a configure check to detect the need for
linux/time_types.h. It can opt-out by defining
UAPI_LINUX_IO_URING_H_SKIP_LINUX_TIME_TYPES_H

Fixes: 78a861b94959 ("io_uring: add sync cancelation API through io_uring_register()")
Link: https://github.com/axboe/liburing/issues/708
Link: https://github.com/axboe/liburing/pull/709
Link: https://lore.kernel.org/io-uring/20221115212614.1308132-1-ammar.faizi@intel.com/T/#m9f5dd571cd4f6a5dee84452dbbca3b92ba7a4091
CC: Jens Axboe <axboe@kernel.dk>
Cc: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
Link: https://lore.kernel.org/r/7071a0a1d751221538b20b63f9160094fc7e06f4.1668630247.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoKVM: x86/xen: Documentation updates and clarifications
David Woodhouse [Mon, 26 Dec 2022 12:03:20 +0000 (12:03 +0000)]
KVM: x86/xen: Documentation updates and clarifications

Most notably, the KVM_XEN_EVTCHN_RESET feature had escaped documentation
entirely. Along with how to turn most stuff off on SHUTDOWN_soft_reset.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi
David Woodhouse [Mon, 26 Dec 2022 12:03:19 +0000 (12:03 +0000)]
KVM: x86/xen: Add KVM_XEN_INVALID_GPA and KVM_XEN_INVALID_GFN to uapi

These are (uint64_t)-1 magic values are a userspace ABI, allowing the
shared info pages and other enlightenments to be disabled. This isn't
a Xen ABI because Xen doesn't let the guest turn these off except with
the full SHUTDOWN_soft_reset mechanism. Under KVM, the userspace VMM is
expected to handle soft reset, and tear down the kernel parts of the
enlightenments accordingly.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-5-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/xen: Simplify eventfd IOCTLs
Michal Luczaj [Mon, 26 Dec 2022 12:03:18 +0000 (12:03 +0000)]
KVM: x86/xen: Simplify eventfd IOCTLs

Port number is validated in kvm_xen_setattr_evtchn().
Remove superfluous checks in kvm_xen_eventfd_assign() and
kvm_xen_eventfd_update().

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20221222203021.1944101-3-mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-4-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports
Paolo Bonzini [Mon, 26 Dec 2022 12:03:17 +0000 (12:03 +0000)]
KVM: x86/xen: Fix SRCU/RCU usage in readers of evtchn_ports

The evtchnfd structure itself must be protected by either kvm->lock or
SRCU. Use the former in kvm_xen_eventfd_update(), since the lock is
being taken anyway; kvm_xen_hcall_evtchn_send() instead is a reader and
does not need kvm->lock, and is called in SRCU critical section from the
kvm_x86_handle_exit function.

It is also important to use rcu_read_{lock,unlock}() in
kvm_xen_hcall_evtchn_send(), because idr_remove() will *not*
use synchronize_srcu() to wait for readers to complete.

Remove a superfluous if (kvm) check before calling synchronize_srcu()
in kvm_xen_eventfd_deassign() where kvm has been dereferenced already.

Co-developed-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly
David Woodhouse [Mon, 26 Dec 2022 12:03:16 +0000 (12:03 +0000)]
KVM: x86/xen: Use kvm_read_guest_virt() instead of open-coding it badly

In particular, we shouldn't assume that being contiguous in guest virtual
address space means being contiguous in guest *physical* address space.

In dropping the manual calls to kvm_mmu_gva_to_gpa_system(), also drop
the srcu_read_lock() that was around them. All call sites are reached
from kvm_xen_hypercall() which is called from the handle_exit function
with the read lock already held.

       536395260 ("KVM: x86/xen: handle PV timers oneshot mode")
       1a65105a5 ("KVM: x86/xen: handle PV spinlocks slowpath")

Fixes: 2fd6df2f2 ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page()
Michal Luczaj [Mon, 26 Dec 2022 12:03:15 +0000 (12:03 +0000)]
KVM: x86/xen: Fix memory leak in kvm_xen_write_hypercall_page()

Release page irrespectively of kvm_vcpu_write_guest() return value.

Suggested-by: Paul Durrant <paul@xen.org>
Fixes: 23200b7a30de ("KVM: x86/xen: intercept xen hypercalls if enabled")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20221220151454.712165-1-mhal@rbox.co>
Reviewed-by: Paul Durrant <paul@xen.org>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-1-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: Delete extra block of "};" in the KVM API documentation
Sean Christopherson [Wed, 7 Dec 2022 00:36:37 +0000 (00:36 +0000)]
KVM: Delete extra block of "};" in the KVM API documentation

Delete an extra block of code/documentation that snuck in when KVM's
documentation was converted to ReST format.

Fixes: 106ee47dc633 ("docs: kvm: Convert api.txt to ReST format")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221207003637.2041211-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agokvm: x86/mmu: Remove duplicated "be split" in spte.h
Lai Jiangshan [Wed, 7 Dec 2022 12:05:05 +0000 (20:05 +0800)]
kvm: x86/mmu: Remove duplicated "be split" in spte.h

"be split be split" -> "be split"

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20221207120505.9175-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agokvm: Remove the unused macro KVM_MMU_READ_{,UN}LOCK()
Lai Jiangshan [Wed, 7 Dec 2022 12:06:16 +0000 (20:06 +0800)]
kvm: Remove the unused macro KVM_MMU_READ_{,UN}LOCK()

No code is using KVM_MMU_READ_LOCK() or KVM_MMU_READ_UNLOCK().  They
used to be in virt/kvm/pfncache.c:

                KVM_MMU_READ_LOCK(kvm);
                retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
                KVM_MMU_READ_UNLOCK(kvm);

However, since 58cd407ca4c6 ("KVM: Fix multiple races in gfn=>pfn cache
refresh", 2022-05-25) the code is only relying on the MMU notifier's
invalidation count and sequence number.

Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20221207120617.9409-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoMAINTAINERS: adjust entry after renaming the vmx hyperv files
Lukas Bulwahn [Mon, 5 Dec 2022 08:20:44 +0000 (09:20 +0100)]
MAINTAINERS: adjust entry after renaming the vmx hyperv files

Commit a789aeba4196 ("KVM: VMX: Rename "vmx/evmcs.{ch}" to
"vmx/hyperv.{ch}"") renames the VMX specific Hyper-V files, but does not
adjust the entry in MAINTAINERS.

Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a
broken reference.

Repair this file reference in KVM X86 HYPER-V (KVM/hyper-v).

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Fixes: a789aeba4196 ("KVM: VMX: Rename "vmx/evmcs.{ch}" to "vmx/hyperv.{ch}"")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221205082044.10141-1-lukas.bulwahn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Mark correct page as mapped in virt_map()
Oliver Upton [Fri, 9 Dec 2022 01:53:02 +0000 (01:53 +0000)]
KVM: selftests: Mark correct page as mapped in virt_map()

The loop marks vaddr as mapped after incrementing it by page size,
thereby marking the *next* page as mapped. Set the bit in vpages_mapped
first instead.

Fixes: 56fc7732031d ("KVM: selftests: Fill in vm->vpages_mapped bitmap in virt_map() too")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Message-Id: <20221209015307.1781352-4-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: arm64: selftests: Don't identity map the ucall MMIO hole
Oliver Upton [Fri, 9 Dec 2022 01:53:04 +0000 (01:53 +0000)]
KVM: arm64: selftests: Don't identity map the ucall MMIO hole

Currently the ucall MMIO hole is placed immediately after slot0, which
is a relatively safe address in the PA space. However, it is possible
that the same address has already been used for something else (like the
guest program image) in the VA space. At least in my own testing,
building the vgic_irq test with clang leads to the MMIO hole appearing
underneath gicv3_ops.

Stop identity mapping the MMIO hole and instead find an unused VA to map
to it. Yet another subtle detail of the KVM selftests library is that
virt_pg_map() does not update vm->vpages_mapped. Switch over to
virt_map() instead to guarantee that the chosen VA isn't to something
else.

Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Message-Id: <20221209015307.1781352-6-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: document the default implementation of vm_vaddr_populate_bitmap
Paolo Bonzini [Mon, 12 Dec 2022 10:36:53 +0000 (05:36 -0500)]
KVM: selftests: document the default implementation of vm_vaddr_populate_bitmap

Explain the meaning of the bit manipulations of vm_vaddr_populate_bitmap.
These correspond to the "canonical addresses" of x86 and other
architectures, but that is not obvious.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Use magic value to signal ucall_alloc() failure
Sean Christopherson [Fri, 9 Dec 2022 20:55:44 +0000 (12:55 -0800)]
KVM: selftests: Use magic value to signal ucall_alloc() failure

Use a magic value to signal a ucall_alloc() failure instead of simply
doing GUEST_ASSERT().  GUEST_ASSERT() relies on ucall_alloc() and so a
failure puts the guest into an infinite loop.

Use -1 as the magic value, as a real ucall struct should never wrap.

Reported-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Disable "gnu-variable-sized-type-not-at-end" warning
Sean Christopherson [Tue, 13 Dec 2022 00:16:50 +0000 (00:16 +0000)]
KVM: selftests: Disable "gnu-variable-sized-type-not-at-end" warning

Disable gnu-variable-sized-type-not-at-end so that tests and libraries
can create overlays of variable sized arrays at the end of structs when
using a fixed number of entries, e.g. to get/set a single MSR.

It's possible to fudge around the warning, e.g. by defining a custom
struct that hardcodes the number of entries, but that is a burden for
both developers and readers of the code.

lib/x86_64/processor.c:664:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
                struct kvm_msrs header;
                                ^
lib/x86_64/processor.c:772:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
                struct kvm_msrs header;
                                ^
lib/x86_64/processor.c:787:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
                struct kvm_msrs header;
                                ^
3 warnings generated.

x86_64/hyperv_tlb_flush.c:54:18: warning: field 'hv_vp_set' with variable sized type 'struct hv_vpset'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
        struct hv_vpset hv_vp_set;
                        ^
1 warning generated.

x86_64/xen_shinfo_test.c:137:25: warning: field 'info' with variable sized type 'struct kvm_irq_routing'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
        struct kvm_irq_routing info;
                               ^
1 warning generated.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Include lib.mk before consuming $(CC)
Sean Christopherson [Tue, 13 Dec 2022 00:16:49 +0000 (00:16 +0000)]
KVM: selftests: Include lib.mk before consuming $(CC)

Include lib.mk before consuming $(CC) and document that lib.mk overwrites
$(CC) unless make was invoked with -e or $(CC) was specified after make
(which makes the environment override the Makefile).  Including lib.mk
after using it for probing, e.g. for -no-pie, can lead to weirdness.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Explicitly disable builtins for mem*() overrides
Sean Christopherson [Tue, 13 Dec 2022 00:16:48 +0000 (00:16 +0000)]
KVM: selftests: Explicitly disable builtins for mem*() overrides

Explicitly disable the compiler's builtin memcmp(), memcpy(), and
memset().  Because only lib/string_override.c is built with -ffreestanding,
the compiler reserves the right to do what it wants and can try to link the
non-freestanding code to its own crud.

  /usr/bin/x86_64-linux-gnu-ld: /lib/x86_64-linux-gnu/libc.a(memcmp.o): in function `memcmp_ifunc':
  (.text+0x0): multiple definition of `memcmp'; tools/testing/selftests/kvm/lib/string_override.o:
  tools/testing/selftests/kvm/lib/string_override.c:15: first defined here
  clang: error: linker command failed with exit code 1 (use -v to see invocation)

Fixes: 6b6f71484bf4 ("KVM: selftests: Implement memcmp(), memcpy(), and memset() for guest use")
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reported-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Probe -no-pie with actual CFLAGS used to compile
Sean Christopherson [Tue, 13 Dec 2022 00:16:47 +0000 (00:16 +0000)]
KVM: selftests: Probe -no-pie with actual CFLAGS used to compile

Probe -no-pie with the actual set of CFLAGS used to compile the tests,
clang whines about -no-pie being unused if the tests are compiled with
-static.

  clang: warning: argument unused during compilation: '-no-pie'
  [-Wunused-command-line-argument]

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Use proper function prototypes in probing code
Sean Christopherson [Tue, 13 Dec 2022 00:16:46 +0000 (00:16 +0000)]
KVM: selftests: Use proper function prototypes in probing code

Make the main() functions in the probing code proper prototypes so that
compiling the probing code with more strict flags won't generate false
negatives.

  <stdin>:1:5: error: function declaration isn’t a prototype [-Werror=strict-prototypes]

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-8-seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86
Sean Christopherson [Tue, 13 Dec 2022 00:16:45 +0000 (00:16 +0000)]
KVM: selftests: Rename UNAME_M to ARCH_DIR, fill explicitly for x86

Rename UNAME_M to ARCH_DIR and explicitly set it directly for x86.  At
this point, the name of the arch directory really doesn't have anything
to do with `uname -m`, and UNAME_M is unnecessarily confusing given that
its purpose is purely to identify the arch specific directory.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Fix a typo in x86-64's kvm_get_cpu_address_width()
Sean Christopherson [Tue, 13 Dec 2022 00:16:44 +0000 (00:16 +0000)]
KVM: selftests: Fix a typo in x86-64's kvm_get_cpu_address_width()

Fix a == vs. = typo in kvm_get_cpu_address_width() that results in
@pa_bits being left unset if the CPU doesn't support enumerating its
MAX_PHY_ADDR.  Flagged by clang's unusued-value warning.

lib/x86_64/processor.c:1034:51: warning: expression result unused [-Wunused-value]
                *pa_bits == kvm_cpu_has(X86_FEATURE_PAE) ? 36 : 32;

Fixes: 3bd396353d18 ("KVM: selftests: Add X86_FEATURE_PAE and use it calc "fallback" MAXPHYADDR")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Use pattern matching in .gitignore
Sean Christopherson [Tue, 13 Dec 2022 00:16:43 +0000 (00:16 +0000)]
KVM: selftests: Use pattern matching in .gitignore

Use pattern matching to exclude everything except .c, .h, .S, and .sh
files from Git.  Manually adding every test target has an absurd
maintenance cost, is comically error prone, and leads to bikeshedding
over whether or not the targets should be listed in alphabetical order.

Deliberately do not include the one-off assets, e.g. config, settings,
.gitignore itself, etc as Git doesn't ignore files that are already in
the repository.  Adding the one-off assets won't prevent mistakes where
developers forget to --force add files that don't match the "allowed".

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Fix divide-by-zero bug in memslot_perf_test
Sean Christopherson [Tue, 13 Dec 2022 00:16:42 +0000 (00:16 +0000)]
KVM: selftests: Fix divide-by-zero bug in memslot_perf_test

Check that the number of pages per slot is non-zero in get_max_slots()
prior to computing the remaining number of pages.  clang generates code
that uses an actual DIV for calculating the remaining, which causes a #DE
if the total number of pages is less than the number of slots.

  traps: memslot_perf_te[97611] trap divide error ip:4030c4 sp:7ffd18ae58f0
         error:0 in memslot_perf_test[401000+cb000]

Fixes: a69170c65acd ("KVM: selftests: memslot_perf_test: Report optimal memory slots")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Delete dead code in x86_64/vmx_tsc_adjust_test.c
Sean Christopherson [Tue, 13 Dec 2022 00:16:41 +0000 (00:16 +0000)]
KVM: selftests: Delete dead code in x86_64/vmx_tsc_adjust_test.c

Delete an unused struct definition in x86_64/vmx_tsc_adjust_test.c.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agoKVM: selftests: Define literal to asm constraint in aarch64 as unsigned long
Sean Christopherson [Tue, 13 Dec 2022 00:16:40 +0000 (00:16 +0000)]
KVM: selftests: Define literal to asm constraint in aarch64 as unsigned long

Define a literal '0' asm input constraint to aarch64/page_fault_test's
guest_cas() as an unsigned long to make clang happy.

  tools/testing/selftests/kvm/aarch64/page_fault_test.c:120:16: error:
    value size does not match register size specified by the constraint
    and modifier [-Werror,-Wasm-operand-widths]
                       :: "r" (0), "r" (TEST_DATA), "r" (guest_test_memory));
                               ^
  tools/testing/selftests/kvm/aarch64/page_fault_test.c:119:15: note:
    use constraint modifier "w"
                       "casal %0, %1, [%2]\n"
                              ^~
                              %w0

Fixes: 35c581015712 ("KVM: selftests: aarch64: Add aarch64/page_fault_test")
Cc: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2 years agokunit: alloc_string_stream_fragment error handling bug fix
YoungJun.park [Fri, 28 Oct 2022 14:42:41 +0000 (07:42 -0700)]
kunit: alloc_string_stream_fragment error handling bug fix

When it fails to allocate fragment, it does not free and return error.
And check the pointer inappropriately.

Fixed merge conflicts with
commit 618887768bb7 ("kunit: update NULL vs IS_ERR() tests")
Shuah Khan <skhan@linuxfoundation.org>

Signed-off-by: YoungJun.park <her0gyugyu@gmail.com>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
2 years agonvme-pci: update sqsize when adjusting the queue depth
Christoph Hellwig [Sun, 25 Dec 2022 10:32:32 +0000 (11:32 +0100)]
nvme-pci: update sqsize when adjusting the queue depth

Update the core sqsize field in addition to the PCIe-specific
q_depth field as the core tagset allocation helpers rely on it.

Fixes: 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Hugh Dickins <hughd@google.com>
Link: https://lore.kernel.org/r/20221225103234.226794-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agonvme: fix setting the queue depth in nvme_alloc_io_tag_set
Christoph Hellwig [Sun, 25 Dec 2022 10:32:31 +0000 (11:32 +0100)]
nvme: fix setting the queue depth in nvme_alloc_io_tag_set

While the CAP.MQES field in NVMe is a 0s based filed with a natural one
off, we also need to account for the queue wrap condition and fix undo
the one off again in nvme_alloc_io_tag_set.  This was never properly
done by the fabrics drivers, but they don't seem to care because there
is no actual physical queue that can wrap around, but it became a
problem when converting over the PCIe driver.  Also add back the
BLK_MQ_MAX_DEPTH check that was lost in the same commit.

Fixes: 0da7feaa5913 ("nvme-pci: use the tagset alloc/free helpers")
Reported-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Hugh Dickins <hughd@google.com>
Link: https://lore.kernel.org/r/20221225103234.226794-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoblock, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq
Yu Kuai [Mon, 26 Dec 2022 03:06:05 +0000 (11:06 +0800)]
block, bfq: fix uaf for bfqq in bfq_exit_icq_bfqq

Commit 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'")
will access 'bic->bfqq' in bic_set_bfqq(), however, bfq_exit_icq_bfqq()
can free bfqq first, and then call bic_set_bfqq(), which will cause uaf.

Fix the problem by moving bfq_exit_bfqq() behind bic_set_bfqq().

Fixes: 64dc8c732f5c ("block, bfq: fix possible uaf for 'bfqq->bic'")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20221226030605.1437081-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 years agoLinux 6.2-rc1
Linus Torvalds [Sun, 25 Dec 2022 21:41:39 +0000 (13:41 -0800)]
Linux 6.2-rc1