libxl event API improvements

Over the past few months we have been working on improving the API for the libxl library. libxl is to become the base layer for all Xen toolstacks. We intend the version of libxl in Xen 4.2 to have a stable interface, with which we will maintain backward compatibility for some time to come.
The Xen 4.1 libxl API had some awkward features. In particular, it was difficult for daemons such as libvirt’s libvirtd and XCP/XenServer to deal correctly with long-running operations and to get information about events such as domain death. For example, the wait-for-domain-death facility did not tell you which domain had died! And many of the functions would block a whole event-loop-based process while a long-running operation completed. The new arrangements are intended to support everything from the simple xl command line utility to event-callback-based daemons such as libvirtd, and also to be convenient for use in multithreaded programs.
This has required a lot of behind-the-scenes infrastructure, which insulates libxl code implementing specific VM operations from the need to know about the calling toolstack’s concurrency model. As I write this, the changes are already in the xen-unstable.hg tree undergoing testing, and we are putting the finishing touches to the APIs.
We are pleased to see that work is already underway to glue these new arrangements into libvirt, and that the shape of the interface we have provided seems good for libvirt, since that was one of the primary users we had in mind. For example the new event registration callbacks in libxl can be directly mapped to libvirt’s event loop registration facilities.

Some tricky problems

We have had to do some things that the Unix API doesn’t make easy.
One exciting task was that we wanted to provide simple synchronous versions of many long-running calls, where the “natural” approach (used inside the library for every long-running operation) is a callback-on-completion model. So the library has a set of internal machinery which can, if desired, run its own event loop until a particular operation completes. Alternatively, it can deliver operation completion via a callback into the calling toolstack, or as an event data structure to be retrieved by the calling toolstack’s event loop.
And all of this needed to be thread-safe so that programs which use threading for concurrency can use what appear to be simple synchronous calls.
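To make the shape of this concrete, here is a minimal sketch of a synchronous call layered over a callback-on-completion operation. It is not libxl code: every name in it (async_op, op_start, op_poll_once, op_run_sync) is made up for illustration, the "operation" simply completes when one byte arrives on an fd, and the sketch makes no attempt at thread safety.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical names throughout; none of these are libxl identifiers. */
typedef void completion_cb(void *user, int rc);

struct async_op {
    int fd;              /* the toy operation completes when fd is readable */
    completion_cb *cb;   /* called exactly once, on completion */
    void *user;          /* opaque pointer handed back to the callback */
    int done;
};

/* Asynchronous entry point: record the callback and return immediately. */
static void op_start(struct async_op *op, int fd, completion_cb *cb, void *user)
{
    op->fd = fd;
    op->cb = cb;
    op->user = user;
    op->done = 0;
}

/* One iteration of a tiny event loop: wait for the fd, then fire the callback. */
static int op_poll_once(struct async_op *op, int timeout_ms)
{
    struct pollfd pfd = { .fd = op->fd, .events = POLLIN };
    int r = poll(&pfd, 1, timeout_ms);
    if (r < 0)
        return errno == EINTR ? 0 : -1;   /* interrupted: let the caller retry */
    if (r > 0 && !op->done) {
        unsigned char byte;
        int rc = read(op->fd, &byte, 1) == 1 ? (int)byte : -1;
        op->done = 1;
        op->cb(op->user, rc);
    }
    return 0;
}

/* State shared between the synchronous wrapper and its private callback. */
struct sync_state { int finished; int rc; };

static void sync_done(void *user, int rc)
{
    struct sync_state *st = user;
    st->finished = 1;
    st->rc = rc;
}

/* Synchronous wrapper: start the async operation, then run the event loop
 * ourselves until our own callback says the operation has completed. */
static int op_run_sync(int fd)
{
    struct async_op op;
    struct sync_state st = { 0, 0 };
    op_start(&op, fd, sync_done, &st);
    while (!st.finished)
        if (op_poll_once(&op, -1) < 0)
            return -1;
    return st.rc;
}
```

A caller with its own event loop would use op_start with its own callback and drive the polling itself; a simple caller just uses op_run_sync and never sees the callback at all.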
Another of the key difficulties is related to subprocesses. We’re trying to write a library which:

  • needs to fork occasional subprocesses
  • sometimes needs to wait for those subprocesses
  • may need to live in a program which forks its own subprocesses for other reasons
  • may need to live in a multithreaded program

This is quite challenging. We need to negotiate with the calling toolstack about the ownership of SIGCHLD, so that there can be one place in the whole program which reaps children. We also need to do a rather complicated dance around the creation of fds (particularly pipes) which we don’t want to be inherited by other processes which might be forked off. Our solution involves using pthread_atfork handlers to implement a “no forking” lock, which allows us to prevent any thread in the entire program from forking until we have regularised the state of all our fds.
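The “no forking” lock idea can be sketched roughly as follows. This is an illustration under my own assumptions rather than the actual libxl implementation: the names no_fork_lock, nofork_init and pipe_cloexec_nofork are hypothetical. The prepare handler registered with pthread_atfork takes a mutex before any fork in the process, so code which holds that mutex while creating a pipe and marking it close-on-exec cannot have a fork happen in the middle.

```c
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical names; a sketch of the technique, not libxl's code. */
static pthread_mutex_t no_fork_lock = PTHREAD_MUTEX_INITIALIZER;

/* Runs in the forking thread just before fork(): blocks while anyone
 * else holds the lock, i.e. while fds are in an irregular state. */
static void atfork_prepare(void) { pthread_mutex_lock(&no_fork_lock); }

/* Run after fork() in the parent and in the child respectively. */
static void atfork_parent(void) { pthread_mutex_unlock(&no_fork_lock); }
static void atfork_child(void)  { pthread_mutex_unlock(&no_fork_lock); }

/* Register the handlers once, early in the life of the library. */
static int nofork_init(void)
{
    return pthread_atfork(atfork_prepare, atfork_parent, atfork_child);
}

/* Create a pipe with both ends close-on-exec. Holding the lock means no
 * thread can fork between pipe() and the fcntl() calls, so no child can
 * inherit the fds before FD_CLOEXEC has been set. */
static int pipe_cloexec_nofork(int fds[2])
{
    int rc = -1;
    pthread_mutex_lock(&no_fork_lock);
    if (pipe(fds) == 0) {
        if (fcntl(fds[0], F_SETFD, FD_CLOEXEC) == 0 &&
            fcntl(fds[1], F_SETFD, FD_CLOEXEC) == 0) {
            rc = 0;
        } else {
            close(fds[0]);
            close(fds[1]);
        }
    }
    pthread_mutex_unlock(&no_fork_lock);
    return rc;
}
```

Note that this only protects against forks which go through fork() itself, since that is what triggers the pthread_atfork handlers.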
One of the most awkward wrinkles was our need to allocate and use ptys. The only portable function for doing this is the openpty family. However, openpty might fork, and so is not usable in any program which might install a SIGCHLD handler. That effectively precludes its use in any general daemon. Our answer to the conundrum? We use our explicitly negotiated child process handling machinery to fork a child, whose only purpose is to call openpty and send us the resulting file descriptors using fd passing. This works surprisingly well.
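For readers who have not met fd passing before, the trick looks roughly like this. Again this is a sketch under assumptions, not the libxl code: the function names are hypothetical, it calls fork and waitpid directly where the real thing would go through the negotiated child handling machinery, and it assumes Linux/glibc headers (openpty lives in <pty.h> there, and older glibc versions need linking with -lutil).

```c
#include <pty.h>          /* openpty on Linux/glibc; other systems differ */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Control-message buffer big enough for two fds, suitably aligned. */
union fd_ctl { struct cmsghdr align; char buf[CMSG_SPACE(2 * sizeof(int))]; };

/* Send two fds over a Unix socket as SCM_RIGHTS ancillary data. */
static int send_fds(int sock, const int fds[2])
{
    char dummy = 0;   /* at least one byte of ordinary data must be sent */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(2 * sizeof(int));
    memcpy(CMSG_DATA(c), fds, 2 * sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the two fds; the kernel installs fresh descriptors for us. */
static int recv_fds(int sock, int fds[2])
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    memset(&msg, 0, sizeof msg);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS ||
        c->cmsg_len != CMSG_LEN(2 * sizeof(int)))
        return -1;
    memcpy(fds, CMSG_DATA(c), 2 * sizeof(int));
    return 0;
}

/* Fork a child whose only job is to call openpty and pass back the fds. */
static int openpty_via_child(int *master, int *slave)
{
    int sv[2], fds[2], status;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {                       /* child */
        close(sv[0]);
        if (openpty(&fds[0], &fds[1], NULL, NULL, NULL) < 0)
            _exit(1);
        _exit(send_fds(sv[1], fds) == 0 ? 0 : 1);
    }
    close(sv[1]);                         /* parent */
    int rc = recv_fds(sv[0], fds);
    close(sv[0]);
    /* Reaped directly here for brevity only. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (rc < 0)
        return -1;
    *master = fds[0];
    *slave = fds[1];
    return 0;
}
```

Whatever openpty does internally (forking included) happens inside the short-lived child, so the calling process only ever sees one child that it created and reaps itself.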

A modest proposal

There are some lessons for the Unix API here. Some of them (for example, difficulties surrounding CLOEXEC) are quite difficult to fix in a backward compatible way. But there is one obvious improvement that would have made everything a lot easier:
There needs to be a way for a library to spawn child processes in a way that is invisible to the rest of the program. One possibility would be an fdfork call which returns an fd, where the generated child is invisible to waitpid and generates no signals, but can instead be waited on by selecting or polling on the fd.

Conclusion

Overall this has been a challenging exercise – but I like a challenge. I think we have managed to isolate most of the complexity into specialist event handling machinery so that the actual implementations of libxl’s major functionality don’t need to deal with it. In particular it should be easy to write code without thread-safety errors; event handling logic errors are still a possibility but are easier to spot in review and easier to debug.
The provision of a good, long-term supported, interface for libxl is one major plank of the transition plan away from the obsolete and deprecated Xend.
We hope the result is infrastructure and an API which will stand us in good stead as Xen evolves.
