runsc: don't enable MADV_HUGEPAGE with direct compaction #11484
+285
−135
By default, private anonymous memory mappings in Linux use transparent
hugepages when available (/sys/kernel/mm/transparent_hugepage/enabled=always);
when hugepages are not available (typically due to fragmentation), such
mappings fall back to small pages instead
(/sys/kernel/mm/transparent_hugepage/defrag=madvise). See Linux's
Documentation/admin-guide/mm/transhuge.rst.
In gVisor, application private anonymous memory cannot generally be backed by
host private anonymous memory, since on many platforms, applications and the
sentry (which also needs to map application memory) run in different host
processes. Instead, application private anonymous memory is backed by a host
memfd, managed by pgalloc.MemoryFile. By default, Linux memfds never use
transparent hugepages
(/sys/kernel/mm/transparent_hugepage/shmem_enabled=never); thus
runsc/hostsettings sets shmem_enabled=advise, allowing pgalloc to request
transparent hugepages using madvise(MADV_HUGEPAGE).
However, since the default value of /sys/kernel/mm/transparent_hugepage/defrag
is madvise, MADV_HUGEPAGE has the unintended side effect of enabling direct
compaction: when hugepages are not available, the host kernel will attempt to
form free hugepages by defragmenting small pages. This can be very expensive,
which is why (as described above) Linux doesn't do it by default.
Thus:
- When MADV_HUGEPAGE is required for application THP, only enable it if
  /sys/kernel/mm/transparent_hugepage/defrag specifies an operating mode that
  does not result in more direct compaction than a normal private anonymous
  mapping would.

- Adjust /sys/kernel/mm/transparent_hugepage/defrag accordingly in
  runsc/hostsettings.
FUTURE_COPYBARA_INTEGRATE_REVIEW=#11473 from Champ-Goblem:shim-add-cgroup-v2-metrics-support b602afb