[Feature Request] Hot-plug vCPUs #2609

Open
3 tasks done
jeromegn opened this issue Jun 3, 2021 · 8 comments
Labels
Priority: Low · Roadmap: New Request · Status: Parked

Comments

@jeromegn (Contributor) commented Jun 3, 2021

Feature Request

We'd like to be able to add (and remove) vCPUs on running Firecracker microVMs. It appears to be possible to do this with KVM: the examples I've seen define the maximum number of vCPUs a VM may use as well as the number it actually boots with, and additional vCPUs can then be added at runtime via virsh.

https://www.unixarena.com/2015/12/linux-kvm-how-to-add-remove-vcpu-to-guest-on-fly.html/

This would allow us to add a "burst" feature for when CPU usage spikes.

Describe the desired solution

An API to modify a running microVM's vCPU count.

This should notify the guest VM of the change:

Dec 16 12:48:28 UA-KVM1 kernel: CPU1 has been hot-added
Dec 16 12:48:28 UA-KVM1 kernel: SMP alternatives: switching to SMP code
Dec 16 12:48:57 UA-KVM1 kernel: smpboot: Booting Node 0 Processor 1 APIC 0x1
Dec 16 12:48:57 UA-KVM1 kernel: kvm-clock: cpu 1, msr 0:3ff87041, secondary cpu clock
Dec 16 12:48:57 UA-KVM1 kernel: TSC synchronization [CPU#0 -> CPU#1]:
Dec 16 12:48:57 UA-KVM1 kernel: Measured 906183720569 cycles TSC warp between CPUs, turning off TSC clock.
Dec 16 12:48:57 UA-KVM1 kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed
Dec 16 12:48:57 UA-KVM1 kernel: KVM setup async PF for cpu 1
Dec 16 12:48:57 UA-KVM1 kernel: kvm-stealtime: cpu 1, msr 3fd0d240
Dec 16 12:48:57 UA-KVM1 kernel: microcode: CPU1 sig=0x206c1, pf=0x1, revision=0x1
Dec 16 12:48:57 UA-KVM1 kernel: Will online and init hotplugged CPU: 1
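For illustration only, here is a minimal sketch of what driving such an API could look like from a client, speaking HTTP over the Firecracker API Unix socket. The /hotplug/vcpu endpoint and the {"add": 2} payload are hypothetical and not part of the current Firecracker API; this only shows the shape of the desired control-plane call.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that speaks HTTP over a Unix domain socket,
    which is how the Firecracker API server is normally reached."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self) -> None:
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


conn = UnixHTTPConnection("/tmp/firecracker.socket")
# Hypothetical endpoint and body -- not part of the released Firecracker API.
conn.request(
    "PUT",
    "/hotplug/vcpu",
    body=json.dumps({"add": 2}),
    headers={"Content-Type": "application/json"},
)
resp = conn.getresponse()
print(resp.status, resp.read().decode())
```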

Describe possible alternatives

We could give every Firecracker microVM access to all cores and use cgroups to limit actual scheduling time. This is not great, though, as it might create a lot of CPU steal; we prefer to give out full cores when possible.
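For reference, a rough sketch of that cgroup-based alternative, assuming cgroup v2 and a hypothetical per-microVM cgroup path; the quota written to cpu.max could later be raised to implement a "burst":

```python
import pathlib

# Hypothetical cgroup v2 path that the Firecracker process was placed in.
VM_CGROUP = pathlib.Path("/sys/fs/cgroup/machine/firecracker-vm1")


def set_cpu_quota(cgroup: pathlib.Path, cpus: float, period_us: int = 100_000) -> None:
    """Limit the cgroup to `cpus` worth of CPU time via cpu.max.

    cpu.max takes "<quota_us> <period_us>"; quota = cpus * period.
    """
    quota_us = int(cpus * period_us)
    (cgroup / "cpu.max").write_text(f"{quota_us} {period_us}")


# Allow the equivalent of 2 full cores now; raise the quota later to burst.
set_cpu_quota(VM_CGROUP, 2.0)
```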

Checks

  • [x] Have you searched the Firecracker Issues database for similar requests?
  • [x] Have you read all the existing relevant Firecracker documentation?
  • [x] Have you read and understood Firecracker's core tenets?
@KarthikNedunchezhiyan (Contributor) commented:

@jeromegn, is vertical scaling the only solution to the problem you are trying to fix? Would horizontal scaling, e.g. multiple instances, not help? Just curious to understand the use case.

@raduiliescu (Contributor) commented:

Hi @jeromegn!

You might also want to look at the CPU online/offline kernel feature - https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-devices-system-cpu.
On the downside, you will always need to start the microVM with the maximum number of vCPUs, and you will need an agent inside the microVM to write the values under /sys, but the overhead of keeping CPUs offline is lower than in the cgroup case.
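As a minimal sketch (not an existing tool), such an in-guest agent could toggle CPUs through the documented sysfs hotplug files; this needs root inside the guest, and CPU 0 normally cannot be offlined:

```python
import pathlib

CPU_SYSFS = pathlib.Path("/sys/devices/system/cpu")


def set_cpu_online(cpu: int, online: bool) -> None:
    """Toggle a secondary CPU via the kernel's hotplug sysfs interface.

    Writes "1"/"0" to /sys/devices/system/cpu/cpu<N>/online.
    """
    (CPU_SYSFS / f"cpu{cpu}" / "online").write_text("1" if online else "0")


# Example: the guest boots with 4 vCPUs but keeps only CPUs 0-1 enabled,
# then "bursts" by bringing CPUs 2 and 3 online when load spikes.
for cpu in (2, 3):
    set_cpu_online(cpu, True)
```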

@jeromegn (Contributor, Author) commented Jun 3, 2021

@KarthikNedunchezhiyan we need to support a large variety of workloads. More VMs isn't always the solution, but we're already doing that.

@raduiliescu thanks! That could work, but our users have root access to the VM and could bring up any number of CPUs.

@AlexandruCihodaru (Contributor) commented:

We need to think about this for a bit; we will get back to you.

@dianpopa added the Priority: Low and Roadmap: New Request labels on Jun 30, 2021
@wearyzen added the Status: Parked label on Nov 13, 2023
@wearyzen (Contributor) commented Nov 13, 2023

Marking this as parked right now, but we will keep tracking it as part of our roadmap and consider it alongside related work.

@zulinx86 moved this to Researching in Firecracker Roadmap on Jul 1, 2024
@zulinx86 added the Status: WIP label and removed the Status: Parked label on Jul 1, 2024
@roypat added the Status: Parked label and removed the Status: WIP label on Nov 13, 2024
@roypat (Contributor) commented Nov 13, 2024

Hey all,
Just to leave a quick update here: we've prototyped vCPU hotplug in the feature/vcpu-hotplug feature branch. The result of performance testing there was that the end-to-end latency of hotplugging and onlining an additional vCPU was ~30ms per vCPU, which is prohibitively expensive for the post-restore scenario (where we want to be able to resume guest execution within single-digit milliseconds). As such, we're moving this ticket back to the backlog for now :(

@wjordan commented Nov 24, 2024

Thanks for the update on this. Just to clarify, 30ms per vCPU sounds like an acceptable amount of latency for a vCPU hotplug process for most use cases, but it sounds like there's an issue in the prototype that requires re-adding previously-hotplugged vCPUs after a snapshot-restore, which is why this latency applies to the post-restore scenario.

In the linked branch, it sounds like the issue has something to do with the vCPU config not persisting after a restore:

### Snapshotting
vCPU hot-plugging after snapshot restore is currently disabled. This is due to
the vCPU config not persisting after restore. Trying to hot-plug vCPUs after a
snapshot restore will return `HotplugVcpuError::RestoredFromSnapshot` until this
has been fixed.

Is this understanding correct? Does the next step in the implementation involve updating the snapshot state format to include the relevant vCPU hotplug info in the microVM state, or was there some other issue in the implementation that requires vCPU hotplug latency in the snapshot restore process?

@roypat (Contributor) commented Nov 25, 2024

Hi Will,

> Just to clarify, 30ms per vCPU sounds like an acceptable amount of latency for a vCPU hotplug process for most use cases, but it sounds like there's an issue in the prototype that requires re-adding previously-hotplugged vCPUs after a snapshot-restore, which is why this latency applies to the post-restore scenario.

We only implemented pre-snapshot hotplugging back in July to get an idea of the latencies involved, but for the use case we were looking at, we actually want to hotplug vCPUs only after restore (i.e. no hotplug ever happens before a snapshot is taken).

> Is this understanding correct? Does the next step in the implementation involve updating the snapshot state format to include the relevant vCPU hotplug info in the microVM state, or was there some other issue in the implementation that requires vCPU hotplug latency in the snapshot restore process?

Essentially, the entire ACPI hotplug device would need to be persisted in the snapshot, but generally there shouldn't be much of an issue with that.

Projects
Status: Researching
9 participants