Service startup notifications

It is easy for a process supervision suite to know when a service that was up is now down: the long-lived process implementing the service has died. The supervisor, running as the daemon's parent, is notified immediately via SIGCHLD. When this happens, s6-supervise sends a 'd' event to its ./event fifodir, so every subscriber knows that the service is down. All is well.
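
As a rough illustration of why death detection is the easy half, here is a minimal Python sketch of a parent that spawns a child and learns its exit status from the kernel. It is not how s6-supervise is implemented: it blocks in wait() rather than reacting to SIGCHLD, and the name spawn_and_reap and the printed 'd' line are hypothetical stand-ins for the real event sent to ./event.

```python
import subprocess
import sys


def spawn_and_reap(argv):
    """Run argv as a child process and block until it dies.

    Returns the child's exit code. Only the parent can reap the child
    this way, which is why a supervisor runs as the daemon's parent.
    """
    child = subprocess.Popen(argv)
    return child.wait()


if __name__ == "__main__":
    # Hypothetical stand-in for a daemon: a child that exits with status 3.
    status = spawn_and_reap([sys.executable, "-c", "import sys; sys.exit(3)"])
    # A real supervisor would now notify its subscribers; here we only print.
    print("d (child exited with status %d)" % status)
```

The parent needs no cooperation from the child to learn that it died, which is exactly what is missing in the startup direction.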

It is much trickier for a process supervision suite to know when a service that was down is now up. The supervisor forks and execs the daemon, and knows when the exec has succeeded; but after that point, it's all up to the daemon itself. Some daemons do a lot of initialization work before they're actually ready to serve, and it is impossible for the supervisor to know exactly when the service is really ready. s6-supervise sends a 'u' event to its ./event fifodir when it successfully spawns the daemon, but any subscriber reacting to 'u' is subject to a race condition - the service provided by the daemon may not be ready yet.

Reliable startup notifications need support from the daemons themselves. Daemons should notify the outside world when the service they are providing is reliably up - because only they know when it is the case.

s6 provides two ways for daemons to perform startup notification.

  1. Daemons can use the ftrigw_notify() function, provided in the ftrigw library. This is extremely simple and efficient, but requires specific s6 support in the daemon.
  2. Daemons can write a line to a file descriptor of their choice, then close that file descriptor, when they're ready to serve. This is a generic mechanism that some daemons already implement, and it does not require anything s6-specific in the daemon's code. The administrator can then run the daemon under s6-notifywhenup, which will catch the daemon's message and notify all the subscribers with a 'U' event, meaning that the service is now up.

    Note that there is still a small race condition remaining: if the daemon writes a line then instantly dies, and the supervisor picks up the death before the s6-notifywhenup program picks up the line, it is possible for the event sequence written to the fifodir to be wrong - 'd' before 'U'. This should be extremely rare, but with this method the race condition is unavoidable. The only way to be absolutely race-free is to have the daemon perform its readiness notification itself, as in the first method, which requires s6-specific support in the daemon.
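
The second method can be sketched from both sides. The Python below is a minimal illustration, not the s6-notifywhenup implementation: notify_ready and wait_for_ready_line are hypothetical names, the message text "ready" is arbitrary, and the choice of file descriptor is left to whoever runs the daemon. The daemon side writes one line and closes the descriptor; the watcher side reads until the first newline, and treats end-of-file before any newline as the daemon never having announced readiness.

```python
import os


def notify_ready(fd, message=b"ready"):
    """Daemon side: write one newline-terminated line to fd, then close it."""
    data = message + b"\n"
    try:
        while data:
            written = os.write(fd, data)
            data = data[written:]
    finally:
        os.close(fd)


def wait_for_ready_line(fd):
    """Watcher side: block until a full line arrives on fd.

    Returns the line without its newline, or None if the writer closed
    the descriptor before completing a line.
    """
    buf = b""
    while b"\n" not in buf:
        chunk = os.read(fd, 4096)
        if not chunk:  # end-of-file before any newline
            return None
        buf += chunk
    return buf.split(b"\n", 1)[0]


if __name__ == "__main__":
    # A pipe stands in for the descriptor shared by daemon and watcher.
    read_end, write_end = os.pipe()
    notify_ready(write_end)
    line = wait_for_ready_line(read_end)
    os.close(read_end)
    print(line)
```

In a real deployment the watcher would send the 'U' event once the line arrives; the sketch stops at detecting the line, and it does nothing about the 'd' before 'U' race described above.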

The second method should really be implemented in every long-running program providing a service. When that is not the case, it is impossible to provide reliable startup notifications, and subscribers must then make do with the unreliable 'u' events provided by s6-supervise.