Request for input: Extended event channel support

The following has been posted on the xen-devel and xen-users mailing lists.

Executive summary

The number of event channels available for dom0 is currently one of the biggest limitations on scaling up the number of VMs which can be created on a single system. There are two alternative implementations we could choose, one of which is ready now, the other of which is potentially technically superior, but will not be ready for the 4.3 release.
The core question we need to ask the community: How important is lifting the event channel scalability limit to 4.3? Will waiting until 4.4 limit the uptake of the Xen platform?
Read on for a deeper technical description of the issue and the various solutions.

The issue

The existing event channel implementation for PV guests is a 2-level bit array. This limits the total number of event channels to word_size ^ 2, which is 1024 for 32-bit guests and 4096 for 64-bit guests.
This sounds like a lot, until you consider that in a typical system, each VM needs 4 or more event channels in domain 0. This means that for a 32-bit dom0, there is a theoretical maximum of 256 guests — and in practice it’s more like 180 or so, because of event channels required for other things. XenServer already has customers using VDI that require more VMs than this.
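The arithmetic above can be checked with a short sketch. The function names are ours, for illustration only, and the figure of 4 event channels per guest is the lower bound quoted above rather than a measured value.

```python
def max_event_channels(word_bits, levels):
    """Capacity of a multi-level bit array: each level fans out by one word of bits."""
    return word_bits ** levels


def max_guests(word_bits, levels, channels_per_guest=4):
    """Theoretical guest limit if every guest costs dom0 a fixed number of channels."""
    return max_event_channels(word_bits, levels) // channels_per_guest


for bits in (32, 64):
    print(bits, max_event_channels(bits, 2), max_guests(bits, 2))
# 32 1024 256
# 64 4096 1024
```

These are upper bounds only; as noted above, channels used for other purposes bring the practical limit for a 32-bit dom0 down to around 180 guests.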

The dilemma

When we began the 4.3 release cycle, this was one of the items we identified as a key feature we needed to get for 4.3. Wei Liu started work on an extension of the existing implementation, allowing 3 levels of event channels. The draft of this is ready, and just needs the last bit of polishing and bug-chasing before it can be accepted.
However, several months ago, David Vrabel came up with an alternate design which in theory was more scalable, based on queues of linked lists (which we have internally been calling “FIFO” for short). David has been working on the implementation since, and has a draft prototype; but it’s in no shape to be included in 4.3.
There are some things that are attractive about the second solution, including the flexible assignment of interrupt priorities, ease of scalability, and potentially even the FIFO nature of the interrupt delivery.
The question at hand, then, is whether to take what we have in the 3-level implementation for 4.3, or wait to see how the FIFO implementation turns out (taking either it or the 3-level implementation in 4.4).

The solution in hand: 3-level event channels

The basic idea behind 3-level event channels is to extend the existing 2-level implementation to 3 levels. Going to 3 levels would give us 32k event channels for 32-bit, and 256k for 64-bit.
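To make the extra level concrete, here is a minimal sketch of how a port number could be split into one index per level. The function split_port and its layout are our own illustration, assuming each level is indexed by one machine word of bits; it is not taken from Wei's patches.

```python
def split_port(port, word_bits, levels=3):
    """Split an event channel port into per-level indices, top level first.

    Hypothetical layout: every level selects one of `word_bits` entries
    in the level below it.
    """
    if not 0 <= port < word_bits ** levels:
        raise ValueError("port out of range for this many levels")
    indices = []
    for _ in range(levels):
        indices.append(port % word_bits)
        port //= word_bits
    return tuple(reversed(indices))


print(32 ** 3, 64 ** 3)            # 32768 262144, i.e. 32k and 256k
print(split_port(1024, 32))        # (1, 0, 0): first port beyond the 2-level range
print(split_port(32 ** 3 - 1, 32)) # (31, 31, 31): last port on 32-bit
```

Under this layout, ports that fit in two levels keep a top-level index of 0, which is one way to see why the extension is a natural continuation of the existing scheme.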
One of the advantages of this method is that since it is similar to the existing method, the general concepts and race conditions are fairly well understood and tested.
One of the disadvantages that this method inherits from the 2-level event channels is the lack of priority. In the initial implementation of event channels, priority was handled by event channel order: scans for events always started at 0 and went upwards. However, this was not very scalable, as lower-numbered events could easily completely lock out higher-numbered events; and frequently “lower-numbered” simply meant “created earlier”. Event channels were forced into a priority even if one was not wanted.
So the implementation was tweaked, so that scans don’t start at 0, but continue where the last event left off. This made it so that earlier events were not prioritized and removed the starvation issue, but at the cost of removing all event priorities. Certain events, like the timer event, are special-cased to be always checked, but this is something of a hack and not very scalable or flexible.
One thing that should be noted is that adding the extra level is envisioned only to be used by guests that need the extended event channel space, such as dom0 and driver domains; domUs will continue to use the 2-level version.
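The difference between the two scanning strategies can be seen in a toy simulation. This is a simplified model with a flat list of pending flags, not the real bitmap code; the names next_pending and simulate are ours, and the "busy" ports are assumed to re-raise their event immediately after being handled.

```python
def next_pending(pending, start):
    """Return the first pending port at or after `start`, wrapping around once."""
    n = len(pending)
    for offset in range(n):
        port = (start + offset) % n
        if pending[port]:
            return port
    return None


def simulate(busy_ports, num_ports, rounds, resume_after_last):
    """Handle one event per round; busy ports are pending again every round."""
    handled = []
    start = 0
    for _ in range(rounds):
        pending = [port in busy_ports for port in range(num_ports)]
        port = next_pending(pending, start)
        if port is None:
            break
        handled.append(port)
        # Either restart at 0 (original scheme) or continue after the last event.
        start = (port + 1) % num_ports if resume_after_last else 0
    return handled


print(simulate({1, 5}, 8, 4, resume_after_last=False))  # [1, 1, 1, 1]
print(simulate({1, 5}, 8, 4, resume_after_last=True))   # [1, 5, 1, 5]
```

Starting at 0 every time lets port 1 starve port 5; resuming after the last event serves both, but no port can be favoured any more.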

The solution close at hand: FIFO event channels

The FIFO solution makes event delivery a matter of adding items to a highly structured linked list. The number of event channels for the interface design has a theoretical maximum of 2^28; the current implementation is limited to 2^17, which is over 100,000. The number is the same for both 32-bit and 64-bit kernels.
One of the key design advantages of the FIFO is the ability to assign an arbitrary priority to any event. There are 16 priorities available; one queue for each priority. Higher-priority queues are handled before lower-priority queues, but events within a queue are handled in FIFO order.
Another potential advantage is the FIFO ordering. With the current event channel implementation, one can construct scenarios where even with events of the same priority, clusters of events can lock out others based on where they are or the number of them. FIFO solves this by handling events within the same priority strictly in the order in which they were raised. It’s not clear yet, however, whether this has a measurable impact on performance.
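The delivery order described above can be modelled in a few lines. This is a conceptual sketch only: it uses ordinary Python deques rather than the shared-memory linked lists of David's design, the class and method names are ours, and treating priority 0 as the highest is an assumption made for the example.

```python
from collections import deque

NUM_PRIORITIES = 16  # one queue per priority, as described above


class FifoEventQueues:
    """Toy model: 16 FIFO queues, drained strictly in priority order."""

    def __init__(self):
        self.queues = [deque() for _ in range(NUM_PRIORITIES)]
        self.queued = set()  # ports currently sitting on some queue

    def raise_event(self, port, priority):
        if not 0 <= priority < NUM_PRIORITIES:
            raise ValueError("priority out of range")
        if port in self.queued:
            return  # already pending; do not queue it twice
        self.queued.add(port)
        self.queues[priority].append(port)

    def next_event(self):
        # Assumption: index 0 is the highest priority.
        for queue in self.queues:
            if queue:
                port = queue.popleft()
                self.queued.discard(port)
                return port
        return None


q = FifoEventQueues()
q.raise_event(10, priority=7)
q.raise_event(11, priority=7)
q.raise_event(12, priority=0)
print([q.next_event() for _ in range(4)])  # [12, 10, 11, None]
```

Port 12 jumps ahead because of its priority, while ports 10 and 11 share a priority and are delivered in the order they were raised.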
One of the potential disadvantages of the FIFO solution is the amount of memory that it requires to be mapped into the Xen address space. The FIFO solution requires an entire word per event channel; a reasonably configured system might have up to 128 Xen-mapped pages per dom0 or domU. On the other hand, this number can be scaled at a fine-grained level, and limited by the toolstack; a typical domU would require only one page mapped in the hypervisor.
By comparison, the 3-level solution requires only two bits per event channel. Any domain using the extra level would require exactly 16 pages for 64-bit domains, and 2 pages for 32-bit domains. We would expect this to include dom0 and any driver domains, while domUs would continue using 2-level event channels (and thus require no extra pages to be mapped).
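The page counts for both designs follow from simple arithmetic, sketched below. The sketch assumes 4 KiB pages and a 32-bit word per FIFO event channel; the word size is our inference from the 128-page figure above, not something stated explicitly.

```python
PAGE_BYTES = 4096  # assumption: 4 KiB pages


def pages_needed(channels, bits_per_channel):
    """Pages required to hold per-channel state, rounded up to whole pages."""
    total_bits = channels * bits_per_channel
    bits_per_page = PAGE_BYTES * 8
    return -(-total_bits // bits_per_page)  # ceiling division


print(pages_needed(2 ** 17, 32))  # FIFO at the current 2^17 limit: 128
print(pages_needed(64 ** 3, 2))   # 3-level, 64-bit domain: 16
print(pages_needed(32 ** 3, 2))   # 3-level, 32-bit domain: 2
print(pages_needed(1024, 32))     # FIFO domU limited to 1024 channels: 1
```

The last line shows why a typical domU would need only one page under FIFO: with these assumptions a single page covers 1024 event channels.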

Considerations

There are a number of additional considerations to take into account.
The first is that the hypervisor maintainers have made it clear that once 3-level event channels are accepted, FIFO will have a higher bar to clear for acceptance. That is, if we wait for the 4.4 timeframe before choosing one to accept, then FIFO will only need to be marginally preferable to 3-level to be accepted. However, if we accept the 3-level implementation for 4.3, then FIFO will need to demonstrate that it is significantly better than 3-level in order to be accepted.
We are not yet aware of any companies that are blocked on this feature. Citrix XenServer clients using Citrix’s VDI solution need to be able to run more than 200 guests; however, because XenServer controls both the kernel and hypervisor side, they can introduce temporary changes that are not backwards- or forwards-compatible to work around the limitation, and so are not blocked. Oracle and SuSE have not indicated that this is a feature they are in dire need of. Most cloud deployments that we know of, even extremely large ones like Amazon or Rackspace, use large numbers of relatively inexpensive computers, and so typically do not need to run more than 200 VMs per physical host.
Another factor to consider is that we are considering attempting a shorter release cadence for 4.4 — 6 months or possibly less. That means that the impact of delaying the event channel scalability feature will be reduced.

What we need to know

What we’re missing in order to make an informed decision is voices from the community: If we delay the event channel scalability feature until 4.4, how likely is this to be an issue? Are there current users or potential users of Xen who need to be able to scale past 200 VMs on a single host, and who would end up choosing another hypervisor if this feature were delayed?
Please respond with any feedback on the xen-devel or xen-users mailing list.
Thank you for your time and input.
-George Dunlap,
4.3 Release manager
