db: shared compaction concurrency limit across multiple Pebble instances #3813

Open
sumeerbhola opened this issue Jul 31, 2024 · 2 comments · May be fixed by #3880
sumeerbhola (Collaborator) commented Jul 31, 2024

In CockroachDB multi-store deployments, especially with large numbers of stores (8 or more), the CPU consumed by compactions can be significant. We still need a per-store compaction concurrency limit, since disk bandwidth is a per-store resource, but we should additionally have a shared compaction concurrency limiter across all stores on a node.

This shared limiter should fairly adjudicate which compaction gets to run next, based on a score of how important it is (see the sketch after this list). For example:

  • level-score-driven compactions could be compared based on their score, additionally taking into account the number of sub-levels if the compaction is out of L0.
  • a read-driven compaction in one store is strictly less important than a score-driven compaction in another store.
  • delete-only and move compactions are cheap, so they need not respect this concurrency limit.
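
Purely as a hedged sketch of what such a shared limiter could look like (none of the type or function names below are Pebble's actual API; they are illustrative only), a node-wide limiter might keep a priority queue of waiting compactions and hand out slots in the order described above:

```go
package compactionsketch

import (
	"container/heap"
	"sync"
)

// waitingCompaction is what a store would submit when it wants to run a
// compaction. Fields are hypothetical stand-ins for the scoring inputs above.
type waitingCompaction struct {
	scoreDriven bool          // score-driven vs. read-driven
	l0Sublevels int           // >0 only for a compaction out of L0
	score       float64       // level score
	grant       chan struct{} // closed when a slot is granted
}

// higherPriority orders waiting compactions: any score-driven compaction
// beats any read-driven one, then more L0 sub-levels, then a higher score.
func higherPriority(a, b *waitingCompaction) bool {
	if a.scoreDriven != b.scoreDriven {
		return a.scoreDriven
	}
	if a.l0Sublevels != b.l0Sublevels {
		return a.l0Sublevels > b.l0Sublevels
	}
	return a.score > b.score
}

// waitQueue is a heap keyed by higherPriority.
type waitQueue []*waitingCompaction

func (q waitQueue) Len() int           { return len(q) }
func (q waitQueue) Less(i, j int) bool { return higherPriority(q[i], q[j]) }
func (q waitQueue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *waitQueue) Push(x any)        { *q = append(*q, x.(*waitingCompaction)) }
func (q *waitQueue) Pop() any {
	old := *q
	x := old[len(old)-1]
	*q = old[:len(old)-1]
	return x
}

// sharedLimiter caps concurrently running score- and read-driven compactions
// across all stores on a node. Delete-only and move compactions would bypass
// it entirely.
type sharedLimiter struct {
	mu      sync.Mutex
	limit   int
	running int
	waiting waitQueue
}

// acquire blocks until the compaction is granted a slot.
func (l *sharedLimiter) acquire(w *waitingCompaction) {
	l.mu.Lock()
	if l.running < l.limit {
		l.running++
		l.mu.Unlock()
		return
	}
	w.grant = make(chan struct{})
	heap.Push(&l.waiting, w)
	l.mu.Unlock()
	<-w.grant
}

// release frees a slot, transferring it to the highest-priority waiter if any.
func (l *sharedLimiter) release() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.waiting.Len() > 0 {
		close(heap.Pop(&l.waiting).(*waitingCompaction).grant)
		return
	}
	l.running--
}
```

Such a limiter would sit alongside, not replace, the existing per-store `opts.MaxConcurrentCompactions` limit.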

Jira issue: PEBBLE-230

Epic CRDB-41111

sumeerbhola (Collaborator, Author) commented:

NB: this is orthogonal to #1329, which would adjust the node-level concurrency limit based on the available CPU. We would still need a limiter component that decides which compaction, among the queue of potential compactions, gets to run next.

nicktrav moved this from Incoming to Backlog in [Deprecated] Storage on Aug 6, 2024
itsbilal (Contributor) commented Aug 6, 2024

Cockroach-side issue, if we explore a solution higher up the stack: cockroachdb/cockroach#74697

anish-shanbhag added a commit to anish-shanbhag/pebble that referenced this issue Aug 28, 2024

This change adds a new compaction pool which enforces a global max
compaction concurrency in a multi-store configuration. Each Pebble store
(i.e. an instance of *DB) still maintains its own per-store compaction
concurrency, which is controlled by `opts.MaxConcurrentCompactions`.
However, in a multi-store configuration, disk I/O is a per-store resource
while CPU is shared across stores. A significant portion of compaction
work is CPU-intensive, so this ensures that excessive compactions don't
interrupt foreground CPU tasks even if the disks are capable of handling
the additional throughput from those compactions.

The shared compaction concurrency only applies to automatic compactions.
This means that delete-only compactions are excluded because they are
expected to be cheap, as are flushes because they should never be
blocked.

Fixes: cockroachdb#3813
Informs: cockroachdb/cockroach#74697
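
To make the interaction between the two limits concrete, here is a hedged sketch (hypothetical names; not the API from the commit above) of a pool that lets an automatic compaction start only when both the store's own limit and the node-wide limit have headroom:

```go
package compactionsketch

import "sync"

// compactionPool composes a per-store limit with a node-wide limit. Only
// automatic compactions consult it; delete-only compactions and flushes
// would skip it, as described in the commit message above.
type compactionPool struct {
	mu          sync.Mutex
	globalLimit int         // node-wide max concurrent automatic compactions
	globalUsed  int
	storeLimit  int         // stand-in for opts.MaxConcurrentCompactions
	perStore    map[int]int // storeID -> running automatic compactions
}

func newCompactionPool(globalLimit, storeLimit int) *compactionPool {
	return &compactionPool{
		globalLimit: globalLimit,
		storeLimit:  storeLimit,
		perStore:    make(map[int]int),
	}
}

// tryStartAutomatic reports whether an automatic compaction may start on the
// given store, i.e. whether both limits have headroom.
func (p *compactionPool) tryStartAutomatic(storeID int) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.perStore[storeID] >= p.storeLimit || p.globalUsed >= p.globalLimit {
		return false
	}
	p.perStore[storeID]++
	p.globalUsed++
	return true
}

// finish releases the slots held by a completed automatic compaction.
func (p *compactionPool) finish(storeID int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.perStore[storeID]--
	p.globalUsed--
}
```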
sumeerbhola self-assigned this on Jan 14, 2025
sumeerbhola added a commit to sumeerbhola/pebble that referenced this issue Jan 27, 2025
CompactionScheduler is an interface that encompasses (a) our current
compaction scheduling behavior, (b) compaction scheduling in a multi-store
setting that adds a per-node limit in addition to the per-store limit and
prioritizes across stores, and (c) compaction scheduling that includes (b)
and is additionally aware of resource usage, so it can prioritize across
stores and across other long-lived work besides compactions (e.g. range
snapshot reception).

CompactionScheduler calls into DB and the DB calls into the
CompactionScheduler. This requires some care in specification of the
synchronization expectations, to avoid deadlock. For the most part, the
complexity is borne by the CompactionScheduler -- see the code comments
for details.

ConcurrencyLimitScheduler is an implementation for (a), and is paired with
a single DB. It has no knowledge of delete-only compactions, so we have
redefined the meaning of Options.MaxConcurrentCompactions, as discussed
in the code comment.

CompactionScheduler has some related interfaces/structs:
- CompactionGrantHandle is used to report the start and end of the
  compaction, and to periodically report written bytes and CPU consumption.
  In the implementation of CompactionGrantHandle provided by CockroachDB's
  AC component, the CPU consumption will use the grunning package.
- WaitingCompaction is a struct used to prioritize the DB's compaction
  relative to other long-lived work (including compactions by other DBs).
  makeWaitingCompaction is a helper that constructs this struct.

For integrating the CompactionScheduler with DB, there are a number of
changes:
- The entry paths for requesting that a compaction be scheduled are reduced
  to one, by removing DB.maybeScheduleCompactionPicker. The only remaining
  path is DB.maybeScheduleCompaction.
- versionSet.{curCompactionConcurrency,pickedCompactionCache} are added
  to satisfy the interface expected by CompactionScheduler. Specifically,
  pickedCompactionCache allows us to safely cache a pickedCompaction
  that cannot be run. There is some commentary on the worst-case waste
  in compaction picking -- with the default ConcurrencyLimitScheduler
  on average there should be no wastage.
- versionSet.curCompactionConcurrency and DB.mu.compact.manualLen are two
  atomic variables introduced to implement DB.GetAllowedWithoutPermission,
  which allows the DB to adjust what minimum compaction concurrency it
  desires based on the backlog of automatic and manual compactions. The
  encoded logic is meant to be equivalent to our current logic.

The CompactionSlot and CompactionLimiter introduced in a recent PR are
deleted. CompactionGrantHandle is analogous to CompactionSlot, and allows for
measuring CPU usage, since the implementation of CompactionScheduler in AC
will explicitly monitor usage and capacity. CompactionScheduler is analogous to
CompactionLimiter. CompactionLimiter had a non-queueing interface in
that it never called into the DB. This worked since the only events that
allowed another compaction to run were also ones that caused another
call to maybeScheduleCompaction. This is not true when a
CompactionScheduler is scheduling across multiple DBs, or managing a
compaction and other long-lived work (snapshot reception), since something
unrelated to the DB can cause resources to become available to run a
compaction.

There is a partial implementation of a resource-aware scheduler in
https://github.com/sumeerbhola/cockroach/tree/long_lived_granter/pkg/util/admission/admit_long.

Informs cockroachdb#3813, cockroachdb/cockroach#74697, cockroachdb#1329
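
To make the shape of these pieces easier to follow, here is an illustrative Go rendering of the interfaces named above (the method sets and signatures are approximations for explanation, not the definitive API from the PR):

```go
package compactionsketch

// WaitingCompaction carries enough information for the scheduler to order one
// DB's waiting compaction against other long-lived work (including other DBs'
// compactions). Field names here are approximate.
type WaitingCompaction struct {
	Optional bool    // lower-priority work, e.g. read-driven compactions
	Priority int     // coarse class of the compaction
	Score    float64 // finer-grained ordering within a class
}

// CompactionGrantHandle is handed to a compaction when it is granted
// permission to run; the compaction reports its start and end, and
// periodically reports written bytes and CPU consumption.
type CompactionGrantHandle interface {
	Started()
	CumulativeStats(writtenBytes uint64, cpuNanos int64)
	Done()
}

// DBForCompaction is the scheduler's view of a single DB (store).
type DBForCompaction interface {
	// GetAllowedWithoutPermission returns the minimum compaction concurrency
	// the DB wants based on its backlog of automatic and manual compactions.
	GetAllowedWithoutPermission() int
	// GetWaitingCompaction reports whether the DB has a compaction it wants
	// to run, and how important that compaction is.
	GetWaitingCompaction() (bool, WaitingCompaction)
	// Schedule asks the DB to start its waiting compaction using the handle;
	// it returns false if the DB no longer has anything to run.
	Schedule(CompactionGrantHandle) bool
}

// CompactionScheduler decides, across registered DBs and other long-lived
// work, which compaction gets to run next.
type CompactionScheduler interface {
	Register(db DBForCompaction)
	Unregister(db DBForCompaction)
	// TrySchedule is called by a DB when it has a waiting compaction; the
	// scheduler either grants it immediately or calls back via Schedule when
	// resources free up.
	TrySchedule(db DBForCompaction) (bool, CompactionGrantHandle)
}
```

In this framing, ConcurrencyLimitScheduler would be the single-DB implementation of CompactionScheduler covering case (a).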