www.infradead.org Git - users/jedix/linux-maple.git/commitdiff
memcg/hugetlb: adding hugeTLB counters to memcg
author Joshua Hahn <joshua.hahnjy@gmail.com>
Mon, 28 Oct 2024 21:05:05 +0000 (14:05 -0700)
committer Andrew Morton <akpm@linux-foundation.org>
Fri, 1 Nov 2024 04:29:29 +0000 (21:29 -0700)

This patch introduces a new counter in memory.stat that tracks hugeTLB
usage, but only if hugeTLB usage is also accounted in memory.current.  The
counter is enabled the same way hugeTLB accounting is enabled: via the
memory_hugetlb_accounting mount option for cgroup v2.

1. Why is this patch necessary?
Currently, memcg hugeTLB accounting is an opt-in feature [1] that adds
hugeTLB usage to memory.current.  However, the metric is not reported in
memory.stat.  Given that users often interpret memory.stat as a breakdown
of the value reported in memory.current, the disparity between the two
reports can be confusing.  This patch solves this problem by including the
metric in memory.stat as well, but only if it is also reported in
memory.current (it would also be confusing if the value was reported in
memory.stat but not in memory.current).

Aside from the consistency between the two files, we also see benefits in
observability.  Userspace might be interested in the hugeTLB footprint of
cgroups for many reasons.  For instance, system admins might want to
verify that hugeTLB usage is distributed as expected across tasks: i.e.
memory-intensive tasks are using more hugeTLB pages than tasks that don't
consume a lot of memory, or are seen to fault frequently.  Note that this
is separate from wanting to inspect the distribution for limiting purposes
(in which case, the hugeTLB controller makes more sense).
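
As an illustration of the breakdown described in this section, here is a
minimal userspace sketch (not part of this patch) that looks up a key in
memory.stat-style text.  The helper name memcg_stat_lookup is hypothetical.
It assumes the flat "key value" per-line layout of memory.stat, and it
treats a missing "hugetlb" key as a sign that hugeTLB accounting is not
enabled.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not kernel code: find "key value" in flat-keyed
 * text such as the contents of a cgroup's memory.stat.
 * Returns 0 and stores the value on success, -1 if the key is absent.
 */
static int memcg_stat_lookup(const char *text, const char *key,
			     unsigned long long *value)
{
	size_t keylen = strlen(key);
	const char *line = text;

	while (line && *line) {
		/* The key must be followed by a space to avoid prefix matches. */
		if (!strncmp(line, key, keylen) && line[keylen] == ' ') {
			if (sscanf(line + keylen + 1, "%llu", value) == 1)
				return 0;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline */
	}
	return -1;
}
```

With the contents of memory.stat read into a buffer, a caller could compare
the "hugetlb" value against memory.current to see how much of the cgroup's
charge comes from hugeTLB pages.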

2. We already have a hugeTLB controller.  Why not use that?
It is true that the hugeTLB controller tracks the exact value that we
want.  In fact, by enabling the hugeTLB controller, we get all of the
observability benefits that I mentioned above, and users can check the
total hugeTLB usage, verify if it is distributed as expected, etc.

3.  Implementation Details:
In the alloc / free hugetlb functions, we call lruvec_stat_mod_folio
regardless of whether memcg accounts hugetlb.  mem_cgroup_commit_charge,
which is called from alloc_hugetlb_folio, will set memcg for the folio
only if the CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING cgroup mount option is
used, so lruvec_stat_mod_folio accounts per-memcg hugetlb counters only
if the feature is enabled.  Regardless of whether memcg accounts for
hugetlb, the newly added global counter is updated and shown in
/proc/vmstat.
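
The behaviour described above can be sketched with a toy userspace model.
This is not kernel code: every toy_* name is hypothetical, the real counter
is kept per node rather than in one global variable, and the real functions
take different arguments.  The sketch only shows that the stat update is
called unconditionally, while the per-memcg counter moves only when the
folio was given a memcg at charge time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, not kernel code. All names here are hypothetical. */
struct toy_memcg {
	long nr_hugetlb;
};

struct toy_folio {
	struct toy_memcg *memcg;	/* NULL when the folio is not charged */
};

/* Stands in for the counter that ends up in /proc/vmstat. */
static long toy_global_nr_hugetlb;

/* Models the charge commit: the folio gets a memcg only when accounting is on. */
static void toy_commit_charge(struct toy_folio *folio, struct toy_memcg *memcg,
			      bool hugetlb_accounting)
{
	folio->memcg = hugetlb_accounting ? memcg : NULL;
}

/* Models the stat update: always global, per-memcg only for charged folios. */
static void toy_stat_mod_folio(struct toy_folio *folio, long nr_pages)
{
	toy_global_nr_hugetlb += nr_pages;
	if (folio->memcg)
		folio->memcg->nr_hugetlb += nr_pages;
}
```

In this model the alloc path calls toy_stat_mod_folio() with a positive page
count and the free path calls it with the negated count, mirroring the two
lruvec_stat_mod_folio calls added in mm/hugetlb.c.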

The global counter is added because vmstat is the preferred framework
for cgroup stats.  It makes stat items consistent between global and
cgroups.  It also provides a per-node breakdown, which is useful.
Because it does not use cgroup-specific hooks, we also keep generic MM
code separate from memcg code.
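
For consumers of the global counter, a minimal sketch of reading it back
follows.  The helper name vmstat_hugetlb_bytes is hypothetical.  It assumes
that nr_hugetlb is reported in base pages (the diff adds
pages_per_huge_page(h) per folio), so the caller supplies the base page size
to get bytes.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not kernel code: convert the nr_hugetlb value found
 * in vmstat-style text from base pages to bytes.
 * Returns 0 on success, -1 if the counter is absent.
 */
static int vmstat_hugetlb_bytes(const char *text, unsigned long page_size,
				unsigned long long *bytes)
{
	static const char key[] = "nr_hugetlb ";
	const char *p = text;
	unsigned long long pages;

	while ((p = strstr(p, key)) != NULL) {
		/* Accept the match only at the start of a line. */
		if ((p == text || p[-1] == '\n') &&
		    sscanf(p + sizeof(key) - 1, "%llu", &pages) == 1) {
			*bytes = pages * page_size;
			return 0;
		}
		p += sizeof(key) - 1;
	}
	return -1;
}
```

A caller would typically pass the contents of /proc/vmstat and the value of
sysconf(_SC_PAGESIZE).  A return of -1 would suggest a kernel without this
patch or without CONFIG_HUGETLB_PAGE.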

That said, relying on the hugeTLB controller for this has 2 problems:
(a) The values are still not reported in memory.stat, which means the
    disparity between the memcg reports is still there.
(b) We cannot reasonably expect users to enable the hugeTLB controller
    just for the sake of hugeTLB usage reporting, especially if
    they don't have any use for hugeTLB usage enforcement [2].

[1] https://lore.kernel.org/all/20231006184629.155543-1-nphamcs@gmail.com/
[2] Of course, we can't make a new patch for every feature that can be
    duplicated. However, since the existing solution of enabling the
    hugeTLB controller is an imperfect solution that still leaves a
    discrepancy between memory.stat and memory.current, I think that it
    is reasonable to isolate the feature in this case.

Link: https://lkml.kernel.org/r/20241028210505.1950884-1-joshua.hahnjy@gmail.com
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Suggested-by: Nhat Pham <nphamcs@gmail.com>
Suggested-by: Shakeel Butt <shakeel.butt@linux.dev>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Chris Down <chris@chrisdown.name>
Michal Hocko <mhocko@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Documentation/admin-guide/cgroup-v2.rst
include/linux/mmzone.h
mm/hugetlb.c
mm/memcontrol.c
mm/vmstat.c

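diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 69af2173555fb6b7d8d536e70ed327ff30330ea3..bd7e81c2aa2b8700c44f5f834277c604648a8c55 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst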
@@ -1646,6 +1646,11 @@ The following nested keys are defined.
          pgdemote_khugepaged
                Number of pages demoted by khugepaged.
 
+         hugetlb
+               Amount of memory used by hugetlb pages. This metric only shows
+               up if hugetlb usage is accounted for in memory.current (i.e.
+               cgroup is mounted with the memory_hugetlb_accounting option).
+
   memory.numa_stat
        A read-only nested-keyed file which exists on non-root cgroups.
 
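diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5e8f567753bdd18922dbd8f143ad3f8f836994fc..b36124145a16f2778b0466a1a7fa73557f79f2ad 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h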
@@ -220,6 +220,9 @@ enum node_stat_item {
        PGDEMOTE_KSWAPD,
        PGDEMOTE_DIRECT,
        PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+       NR_HUGETLB,
+#endif
        NR_VM_NODE_STAT_ITEMS
 };
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 906294ac85dc85ca8535cce4b57a53d274d169b7..1caf29482da1c670e04ce2d8ce4dff01717c2831 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1925,6 +1925,7 @@ void free_huge_folio(struct folio *folio)
                                     pages_per_huge_page(h), folio);
        hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
                                          pages_per_huge_page(h), folio);
+       lruvec_stat_mod_folio(folio, NR_HUGETLB, -pages_per_huge_page(h));
        mem_cgroup_uncharge(folio);
        if (restore_reserve)
                h->resv_huge_pages++;
@@ -3093,6 +3094,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
        if (!memcg_charge_ret)
                mem_cgroup_commit_charge(folio, memcg);
+       lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
        mem_cgroup_put(memcg);
 
        return folio;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 53ac7a64a2b2f77d5f8c6738e66ad54c8528d644..fbd5f55b3b68c56ed7ae42cdac2c27bed0d43624 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -315,6 +315,9 @@ static const unsigned int memcg_node_stat_items[] = {
        PGDEMOTE_KSWAPD,
        PGDEMOTE_DIRECT,
        PGDEMOTE_KHUGEPAGED,
+#ifdef CONFIG_HUGETLB_PAGE
+       NR_HUGETLB,
+#endif
 };
 
 static const unsigned int memcg_stat_items[] = {
@@ -1355,6 +1358,9 @@ static const struct memory_stat memory_stats[] = {
        { "unevictable",                NR_UNEVICTABLE                  },
        { "slab_reclaimable",           NR_SLAB_RECLAIMABLE_B           },
        { "slab_unreclaimable",         NR_SLAB_UNRECLAIMABLE_B         },
+#ifdef CONFIG_HUGETLB_PAGE
+       { "hugetlb",                    NR_HUGETLB                      },
+#endif
 
        /* The memory events */
        { "workingset_refault_anon",    WORKINGSET_REFAULT_ANON         },
@@ -1450,6 +1456,11 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
        for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
                u64 size;
 
+#ifdef CONFIG_HUGETLB_PAGE
+               if (unlikely(memory_stats[i].idx == NR_HUGETLB) &&
+                   !(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_HUGETLB_ACCOUNTING))
+                       continue;
+#endif
                size = memcg_page_state_output(memcg, memory_stats[i].idx);
                seq_buf_printf(s, "%s %llu\n", memory_stats[i].name, size);
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7b62bfb19afabd9d5285166a878b32acf24237d6..22a294556b584ac30b244da2b9f50fc01d8d1788 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1273,6 +1273,9 @@ const char * const vmstat_text[] = {
        "pgdemote_kswapd",
        "pgdemote_direct",
        "pgdemote_khugepaged",
+#ifdef CONFIG_HUGETLB_PAGE
+       "nr_hugetlb",
+#endif
        /* system-wide enum vm_stat_item counters */
        "nr_dirty_threshold",
        "nr_dirty_background_threshold",