Post

Eventlet removal - Gazpacho status

Eventlet removal - Gazpacho status

Bah, it was a long time. I published my last update 10 months ago. I’m sorry. It was not the plan. But as any real plan, mine was also died soon after meeting the enemy, Time.

Due to the size of the time gap to cover I won’t try to make this as a chronological update, but rather a status report focusing where we are at the end of the Gazpacho release. Then, I will tell some selected stories from the past one and a half cycles to show not just the result but a bit of the journey as well.

Hang on this will be a long one.

Nova’s Eventlet status at Gazpacho (2026.1) release

One can say that we are almost done. (I love famous last words.) If I look at it in a certain angle and from a great distance then sure, we are almost done. You can take the Gazpacho release, configure your services to run with native threading and boom, no more Eventlet in you nova deployment… Almost

Our APIs (nova-api, nova-metadata) runs with native threading by default now as well as the nova-scheduler. So you don’t even need to enable native threading mode there. But if you encounter issues then please report them to us as a bug, and you can still switch back to Eventlet mode while we are fixing those nasty bugs.

The nova-conductor and nova-compute still runs with Eventlet by default as this is our first release where these services gained the capability to run without Eventlet. Still you can follow our guide and switch them to native threading. Please do that in pre-production first. Especially nova-compute can cause surprises, so we ask for caution and we are eager for your feedback.

And last, our console proxies are only working with Eventlet at the time of Gazpacho. There is work underway to change that, but the release deadline hit before we can finish it.

We are also doing testing in multiple fronts. Our nova-next CI job runs all services, that supports it, in native threading mode and the tempest results are green and stable. We also added a job, nova-alt-configuration that keeps testing all our services with the deprecated Eventlet support. The rest of the CI jobs using each service’s default mode during testing. This includes the jobs run by other OpenStack projects. This means that as of Gazpacho the greater OpenStack project already gates with native threaded nova-api, nova-metadata, and nova-scheduler without apparent issues. We are planning to change our default concurrency mode to native threading for nova-conductor and nova-compute too early in the Hibiscus cycle.

Making our functional tests running with Eventlet is a task in progress. But the majority of the unit tests are running in both modes in separate jobs. But note that there is a short list of unit test excludes that are still disabled in native threading mode. For sure we want this list shrink to zero soon, but we are not there yet.

At the end of the Flamingo release we made nice progress about scale testing our APIs and the nova-scheduler in the upstream gate with Rally, but this work was the one that needed to stop in Gazpacho to make room for the translation on nova-compute. Never the less we have some guidance how to scale these services in native threading mode in our doc.

And that is just the visible parts that are not done yet. During Gazpacho we needed to make shortcuts and accumulated technical debt intentionally to make it possible to release nova-compute with native support. These code crufts, and hacks needs to be cleaned up in the near future. Also as soon as we hit the time to drop the support of the Eventlet mode we will have a series of refactorings to remove all the complications currently needed to support both modes.

The numbers

In Gazpacho we landed 59 commits for Eventlet removal which shows that we upped our game compared to Flamingo where this number was 39.

I tried to generate the bar chart of the number of remaining Eventlet mentions in the codebase, but the numbers do not line up nicely. In general we have more code lines now that referring to “eventlet” than when we started. In reality however those mentions are a lot more centralized. We moved the majority of our Eventlet logic to nova.utils, but that file now has a lot more Eventlet specific lines. Also we need to test those utils, so we have a bunch of Eventlet specific test cases, hence the overall numbers are way up.

Instead it make sense to look the files that are importing Eventlet:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
$ egrep -l -R -e "import eventlet|from eventlet" ./nova
./nova/virt/libvirt/host.py
./nova/monkey_patch.py
./nova/utils.py
./nova/console/websocketproxy.py
./nova/tests/unit/virt/libvirt/test_host.py
./nova/tests/unit/virt/libvirt/test_driver.py
./nova/tests/unit/virt/libvirt/volume/test_mount.py
./nova/tests/unit/virt/vmwareapi/test_driver_api.py
./nova/tests/unit/virt/disk/mount/test_nbd.py
./nova/tests/unit/test_hacking.py
./nova/tests/unit/test_utils.py
./nova/tests/unit/storage/test_rbd.py
./nova/tests/fixtures/notifications.py
./nova/tests/fixtures/nova.py
  • nova.virt.libvirt.host needed two distinct ibvirt event handling implementations one for eventlet and one for native threading.
  • nova.monkey_patch and nova.utils are our main code to handle selecting and using the proper concurrency primitives, so that the rest of the codebase does not need to care.
  • nova.console.websocketproxy is our main remaining Eventlet only codepath to translate to native threading.
  • nova.test.fixtures hides most of our unit tests from the concurrency mode they are running with. The notification fixture here is probably an oversight due to not having our functional tests running in both modes yet.
  • nova.tests.unit holds all of our unit tests and some modules have significantly different behavior in the two concurrency modes, so there are mode specific test cases there depending still on Eventlet.

The journey

Adapt our timeline to realities

Based on the progress in Flamingo we raised the flag to the TC that the Eventlet removal timeline of OpenStack needs to be revised to make it more realistic for Nova. This went smoother than I anticipated and led to some extra clarification about the relationship between this work and our SLURP upgrade support.

We need more Executors

For me the biggest learning came from realizing how dependent our codebase is to different type of Executors. At a first glance an Executor is just a good old thread pool. A bunch of worker threads waiting for new tasks to execute while providing Future objects for the callers to gather the results. When we translated the APIs and the scheduler we already used them happily as an abstraction over GreenThreads and native Threads. It was nice and easy.

Sure we needed to be careful about the size of our Executors. The native worker threads are a lot more expensive beasts than the GreenThreads. In the past we had Executors defined to allow thousands of GreenThreads by default. That would be very memory consuming in native threading. This led to some tweaking of our default configuration in native threading mode. Make sure you read the related release notes before you hit new limits.

But size is just an integer. (Another famous last world.) The nova-compute ups the game with two very different needs regarding Executors:

Here is a task, but give me some time to decide if I really need it

At least the use case here is simple to explain. We have VMs, those VMs can be hard rebooted via the API. We implement that by telling libvirt to reboot the VM. Simple see. But.

We also want to detect when a VM stops itself, or crashes. Libvirt happily sends us events about the VM power state changes. So we listen to the STOPPED event and update the VM power state when we receive it.

Now put the two behaviors together. During hard reboot the VM stops, libvirt sends the STOPPED event, then the VM starts, libvirt sends the RESUMED event. This event handling happens concurrently with nova’s hard reboot codepath that also sets the power state of the VM in the DB. It is easy to see how racy that two concurrent power state updates can be. Also updating the DB twice during a hard reboot is simply a waste of DB and message bus capacity. So for a long time nova had the logic that when a libvirt STOPPED event arrived, it created a power state update task, but did not immediately run it. Instead it told Eventlet to run it after a couple seconds of delay. And if a RESUMED event arrived in the meantime then nova cancelled the power state update task. No race, no unnecessary DB update during hard reboot, but if the VM crashed suddenly, it’s power state will updated in the DB eventually.

The catch is, Eventlet allows cancelling a task that was started with a delay specified, native threading does not have such capabilities out of the box.

The naïve implementation with native threading is adding a sleep before the power state update function in the task. But that is not good. A task running in a native thread cannot be cancelled after it started running, even if “running” here actually means executing time.sleep(). In the other hand Executors have a tasks queue to handle the case where there are more tasks than free workers. Tasks that are queued waiting (an another form of “sleeping”) for workers can be cancelled. This gave us a hint how to solve this. Do not execute time.sleep() but instead wait in a queue.

This lead to our first, but probably not our last, Executor customization: StaticallyDelayingCancellableTaskExecutorWrapper. The enterprisy names tells everything: It is a wrapper for Executors. (As the mantra says: Prefer composition over inheritance.) It supports cancelling tasks while those tasks are delayed. The statical delay here is just a simplification. In our use case every task needs to be delayed with the same statically defined delay. You will see this simplifies the implementation.

So this is how it works:

  • It gets an normal Executor from the outside to use to really run tasks
  • It has a queue of delayed tasks waiting to be executed after the delay or ignored if it is cancelled before the delay runs out.
  • It has an extra thread that detects when the delay runs out and executes the task if it is not yet cancelled.

There are couple of complications.

The client expect a Future object when it submits a task. Normally the Future object is created by the real Executor when a task is submitted. The whole point in this wrapper is that it does not submit the task right away for execution, so it needs to create and return its own Future object to the client to satisfy the Executor interface, and then make sure that this Future from the wrapper is connected to the real Future returned by the real Executor after the task is submitted after the delay. Fortunately the Future class exposes the set_result() and set_exception() functions to manipulate the internal state of a Future. So connecting the two Futures is as simple as wrapping the original task into:

1
2
3
4
5
6
7
8
9
10
11
    @staticmethod
    def _task_wrapper(task) -> None:
        """This wraps the original task so when it finishes in the real
        executor the result of the task can be copied to the Future object
        already returned to our caller from submit_with_delay(). So
        the caller can get the result or exception from the task.
        """
        try:
            task.future.set_result(task.fn(*task.args, **task.kwargs))
        except BaseException as e:
            task.future.set_exception(e)

The Future class also exposes the set_running_or_notify_cancel() function that allows our Executor wrapper to atomically check if the client cancelled the task, otherwise it sets the task to running state and therefore marks it non-cancellable any more.

After all this setup the logic of the scheduler thread in the wrapper can be summarized as the following:

  1. Take the oldest task from the queue, if the queue empty wait on it.
  2. Wait until the delay runs out for the task.
  3. Check if the task is not cancelled and set it to running. If cancelled go to 1.
  4. Wrap the task to connect it’s result to the already returned Future and submit it to the real Executor. Then go to 1.

Of course there are a bit more error handling in the code and also we need to handle clean shutdown if requested.

Note, it is only valid to take the oldest task from the queue and blindly wait for its deadline because we know that every task has the same amount of delay (and because we don’t believe in time travel), so the oldest task in the queue needs to be executed next. In the general case when each task can define its own delay we would need a priority queue and a bit more logic to handle when a new task arrives which has higher priority (i.e. closer start time) than the currently highest prio task. As a followup we want to develop this generic implementation so others can use it too.

I want a single pool but separate concurrency caps on different task types

This is other “exotic” use case for our Executors in nova-compute. But first we needed to hit a bug to even figure out we have a problem to solve.

The bug was detected in the vmware CI, probably because it run tempest with more parallelism. We saw that VM boots hang until they timed out while setting up the networking for the VM. It turned out that it was a deadlock on our default Executor that was used both to run the VM build logic in a task and the network setup for that VM in a separate child task. As we touched before our Executor sizes are limited in native threading. The default executor was limited to 5 workers max. When 5 VM build arrived at the same time, each grabbed a worker to build the VM. So the Executor became full. Then each build task submitted a network setup child task to the same default Executor, moved on to do some other logic, when that was done then started waiting for the network setup child task. As the Executor was full with the build tasks, the network child tasks was never executed and the build task waited forever. We have logs when an Executor starts queueing up tasks, this helped figure out what happened.

This problem taught us that we should care about parent, child task relationship if they share the same Executor. If the child is not a fire and forget task, but a task the parent will wait for, then sharing an Executor can lead to deadlocks. We had the same bug forever in nova, it was just very hard to produce it as the Executor size in Eventlet mode was 1000, so it would required 1000 parallel build request to arrive to the same compute to cause trouble.

So we moved the build task to its own long_task executor and kept the default executor for fire and forget tasks and child tasks that are not spawning grandchild tasks. Then we saw that build is not the only long running tasks that used the default Executor. We had snapshot as well. Then further looking revealed that even though these task types shared an Executor, the default one, and will share the new long_task Executor they have individual concurrency limits implemented by bounded semaphores.

The nova-compute can be configured to limit then number of concurrent VM lifecycle operations of a given type. E.g. by default nova-compute allowed 10 parallel VM builds, 1 parallel outgoing live migration, and 5 parallel VM snapshots. Originally it implemented the build and snapshot case with a single GreenThread pool with 1000 workers and two bounded semaphores one for build and one for snapshot. As GreenThreads are cheap we did not have to worry about having a list of tasks holding up a GreenThread each while waiting on the semaphore for their execution slot. This does not work with native Threading. We cannot waste one thread for each waiting operation where there could be hundreds of them.

One can ask why don’t we keep them in the message bus waiting. The problem there is that the same topic is used for other, non-capped operations, so we have to process the messages from the topic to find those operations.

A simplest solution would be to have a separate Executor for each operation type with the configured cap used as the size of the Executor. This way we would relay on the queue in the Executor to hold the tasks waiting for a slot. We rejected this idea as it creates too many independent Executors to manage, also it wastes resources as workers, even if they are idle, cannot be shared across different operation types.

Right now we went with a hybrid approach. The live migration is kept as a separate executor as it has a very different limit than the other two. The build and the snapshot operations are now forced to share the same cap in native threading mode and therefore implemented by the single shared long_task Executor.

In the long run another Executor wrapper can be developed that maintains a per task type cap within a shared executor by having tasks queues and scheduling queued tasks to slots when another tasks finishes.

Also in the future we might want to come up with a way to track parent-child task relationships and prevent submitting the child task to the parent’s executor. Right now only code review prevents us to re-introduce such a deadlock prone situation.

When unit testing is harder than tempest

This was an interesting realization of mine. Even though in retrospect it is easy to see why making tempest work with the native threaded service was easier than passing the whole unit test job with the same. Once manual testing of the native threading mode in devstack showed that the service works tempest showed limited amount of surprises. In the other hand we still have unit tests today that randomly hangs or produces strange SQLAlchemy issues when run with native threading and therefore disabled in the native threaded tox job. In retrospect I have the following understanding explaining this:

  • The tempest tests are pure back box tests, eventlet is “just” a refactoring, the black box does not care about internals.
  • Our unit tests are many cases testing implementation of the unit not the behavior of the unit making them fragile during refactoring. In some cases unit tests are trying to cover concurrent behaviors with clever injection of locks and events. As Eventlet executes tasks until completion or explicit yielding some tests were relying on this run to completion semantic too which does not exists in native Threading.
  • Some of our unit test doing too much. E.g. a big chunk of our unit tests depend on a DB being provided to the test case.

It will be interesting to see how hard it will be to mak our functional test passing and stable with native threading as this test suits sits between tempest and unit test.

Scale testing

One of the key change during Eventlet removal is the cost of concurrency. A native thread cost a lot more memory than a GreenThread. This forced us to limit our Executor sizes to reasonable numbers as our old defaults were simply too much. But what is a reasonable default? It is very hard to tell analytically, so we decided to try to find it via testing.

In the upstream CI we have limited resources (CPU cores and RAM), so we cannot the big deployment to find out how they scale when the concurrency mode is changed. So we have to make some compromise.

The most resource consuming part of our deployment are compute nodes and VMs created on them by nova. So if we accept that we cannot scale test nova-compute then we can limit our resource usage. Fortunately nova already provides a way to simulate most of the heavy lifting of a nova-compute service by implementing a fake virt driver. That driver does not spin up VMs on the hypervisor it only has an in-memory state, and implements the happy path for the compute operations. After a bit of tweaking in devstack we could create 3 cells and 20 compute nodes within a single CI worker machine. This allows us to scale test at least our APIs, scheduler, conductor and even the nova-compute’s virt driver independent pieces like the ComputeManager and the ResourceTracker.

Another complication in a our CI that two consecutive job runs will observe significantly different performance from the CI worker machine. It is because we use multiple cloud providers each having different performance and ever changing load situation. So comparing perf-scale results across multiple job runs is not feasible as we cannot control the underlying machines. Instead we run the same performance tests twice in the same job, once with Eventlet and once with native threading, then we compare the results within the same job.

We started using Rally to implement performance test scenarios. And created and ansible role that can collect and compare the perf data: CPU and memory usage and test execution time.

This was ongoing work in Flamingo but due to limited time we stopped it during Gazpacho to refocus on nova-compute itself. Now that nova-compute is done we want to go back to these trial and take them further. The end goal would be a job that at least runs periodically and fails if the native threading performance starts differing significantly from the Eventlet one.

This post is licensed under CC BY 4.0 by the author.