s6-rc: service management concepts

The foundations for a solid design

Service states, machine states

The job of a service manager is to bring the machine from one state, the current state, to another, the wanted state, either at boot time or at the administrator's request. The process by which the machine moves from the current state to the wanted state is called a transition.

The state of a machine is defined by the services that are running on it. A service can be in one of two states: up or down. Some service managers like to define other states, such as "started" or "failed", but these are not real states as seen by an external user: a web browser does not care whether the web server has been "started" or has "failed", all it sees is whether it is up or down.

(The previous sentence is not totally accurate. What a web browser sees is whether the web server is up and ready: readiness is defined by the ability for a service to... provide service. A service can be up but not ready yet when it is in the process of initializing itself. We will explore readiness in more detail later; for now, you can consider that up means up and ready, unless explicitly stated otherwise.)

The machine's current state is a set of service states. For instance, at boot time, the machine's current state is "all the services are down", and the machine's wanted state is "a certain set of services are up". (We name this certain set of services the top bundle; more on that later.)

Transitions

Since a machine state is a set of service states, a machine transition is a set of service transitions, each from its current state to its wanted state. If the machine is bringing a set of services up, it is called an up transition, and every service in the set undergoes an up transition; if the machine is bringing a set of services down, it is called a down transition, and every service in the set undergoes a down transition. Note that every possible machine transition can be seen as a down transition followed by an up transition. Being able to reason separately about sets of down transitions and sets of up transitions is a very useful property, which we will make heavy use of.
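To make the decomposition concrete, here is a minimal Python sketch that models a machine state as the set of services that are up, and splits an arbitrary machine transition into a down transition and an up transition. The function name split_transition and the service names are hypothetical, chosen only for illustration; this is not how s6-rc stores or computes its state.

```python
def split_transition(current_up, wanted_up):
    """Split a machine transition into (to_stop, to_start).

    Both arguments are sets of names of services that are up.
    Services present in both states are left untouched.
    """
    current_up, wanted_up = set(current_up), set(wanted_up)
    to_stop = current_up - wanted_up    # the down transition
    to_start = wanted_up - current_up   # the up transition
    return to_stop, to_start


# Hypothetical example: swap a web server for a database.
current = {"syslog", "sshd", "httpd"}
wanted = {"syslog", "sshd", "database"}
to_stop, to_start = split_transition(current, wanted)
print(sorted(to_stop))   # ['httpd']
print(sorted(to_start))  # ['database']
```

At boot time the current state is the empty set, so the down transition is empty and the whole machine transition is an up transition.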

A service transition can succeed, in which case the machine's current state changes, getting closer to the wanted state, or it can fail. When it fails, what the service manager does depends on certain factors:

  • If the failure can be identified as permanent, then attempting the transition again is pointless. In that case the service transition fails permanently, which means the machine state transition fails as well: the machine will never reach its wanted state. That does not mean other service transitions stop; they continue, and the machine state ends up as close as possible to the wanted state, but it will not reach it, and the user is informed of the failure.
  • If the failure can be identified as temporary, then the transition can be retried. The delay between two attempts, as well as the maximum number of attempts, depends on what the administrator has configured for the service: it is the retry policy. If the transition has still not succeeded after the defined maximum number of attempts, then the failure becomes permanent and the user is informed.

The way to identify permanent and temporary failures depends on the service, and is configured as part of the retry policy.
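The failure handling described above can be sketched as a small retry loop. This is only an illustration under stated assumptions: the RetryPolicy fields, the convention that an attempt returns an exit code with 0 meaning success, and the idea of listing "permanent" exit codes are all hypothetical, and are not taken from s6-rc.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryPolicy:
    # All three fields are hypothetical names for this sketch.
    max_attempts: int = 3
    delay_seconds: float = 1.0
    # Exit codes that the administrator declares as "do not retry".
    permanent_codes: frozenset = frozenset({2})


def run_transition(attempt, policy, sleep=time.sleep):
    """Run one service transition under a retry policy.

    `attempt` is a callable returning an exit code (0 means success).
    Returns True on success, False once the failure is permanent.
    """
    for n in range(1, policy.max_attempts + 1):
        code = attempt()
        if code == 0:
            return True
        if code in policy.permanent_codes:
            return False          # identified as permanent: retrying is pointless
        if n < policy.max_attempts:
            sleep(policy.delay_seconds)
    return False                  # attempts exhausted: the failure becomes permanent
```

In both failing cases the caller only sees False; a real service manager would also report to the user which service failed and why.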

As a special engineering note, which is unsatisfying from a theoretical point of view (because it makes our concepts asymmetrical) but vital where real-life services are concerned, let us mention right away that down transitions should never fail. Except in very specific, very rare cases, it should always be possible to successfully stop a service: as far as services are concerned, death is always an out. Allowing down transitions to fail leads to ridiculous issues like systemd being unable to shut down a system. This should never happen: when a user wants their system off, they want it off, and fighting against that will only cause frustration and plug-pulling.

Parallelism

A traditional serial service manager performs all transitions one after another, in a sequence; this is not efficient, because if a transition spends some time waiting, or even doing CPU-intensive computations on one core while other cores are available, then time is wasted if other transitions could be taking place during that time. A good service manager is able to perform transitions in parallel, to make the best use of the machine's available resources.

In order to perform transitions in parallel, the service manager must know which transitions are independent (so they can be performed at the same time without influencing one another) and which ones can only be done in sequence. That means that the administrator must provide the service manager with a list of dependencies between services.

Dependencies

At a very basic level, a dependency from service B to service A means that B can only be up when A is up; and so, B should only be brought up once A is already up. For instance, a web server should only be brought up when the database hosting its content is itself up.

A service C that has nothing to do with A or B can be brought up whenever — in particular, it can be brought up in parallel with A or B, without being bound by their state in any way.

If a service D depends on B, and A depends on D, then the dependencies are invalid: since B already depends on A, there is a dependency cycle, D → B → A → D. This configuration must be rejected by the service manager.

On the other hand, if D and E both depend on B, and F depends on both D and E, it is not a cycle, and it is acceptable: the service manager will first bring A up, then B, then D and E in parallel, then F once both D and E are up.
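The start order in this example can be reproduced with a short Python sketch that groups services into successive "waves", where every service in a wave has all of its dependencies already up. The function name up_waves and the dictionary representation are assumptions made for illustration, not s6-rc's actual algorithm or data format.

```python
def up_waves(deps):
    """Group services into waves that can each be started in parallel.

    `deps` maps a service name to the set of services it depends on.
    Raises ValueError if no progress is possible (a dependency cycle).
    """
    remaining = {svc: set(d) for svc, d in deps.items()}
    # Services that are only mentioned as dependencies have no dependencies.
    for d in deps.values():
        for svc in d:
            remaining.setdefault(svc, set())

    waves, up = [], set()
    while remaining:
        ready = sorted(svc for svc, d in remaining.items() if d <= up)
        if not ready:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        up.update(ready)
        for svc in ready:
            del remaining[svc]
    return waves


# The services from the text: B needs A, D and E need B, F needs D and E,
# and C is unrelated to all of them.
deps = {"A": set(), "B": {"A"}, "C": set(), "D": {"B"}, "E": {"B"}, "F": {"D", "E"}}
print(up_waves(deps))  # [['A', 'C'], ['B'], ['D', 'E'], ['F']]
```

Waves are a simplification: they make C wait for nothing but also make every service wait for the slowest member of the previous wave. A service manager that starts each service as soon as its own dependencies are up gets more parallelism out of the same graph.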

This shows that the acceptable structure for a list of dependencies is a directed acyclic graph, or DAG. When we talk about the list of dependencies, we should say the dependency DAG, but it is a bit hermetic, so we'll just talk about the dependency graph.
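Checking that a list of dependencies really is a DAG amounts to looking for a cycle. The following Python sketch does so with a depth-first search and reports the offending cycle so that it can be shown to the administrator. The names find_cycle and validate are hypothetical; this illustrates the general technique, not the check s6-rc implements.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if acyclic.

    `deps` maps a service name to the set of services it depends on.
    The returned list starts and ends with the same service.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(svc):
        color[svc] = GREY
        path.append(svc)
        for dep in sorted(deps.get(svc, ())):
            state = color.get(dep, WHITE)
            if state == GREY:                      # back to a service on the path
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[svc] = BLACK
        return None

    for svc in sorted(deps):
        if color.get(svc, WHITE) == WHITE:
            cycle = visit(svc)
            if cycle:
                return cycle
    return None


def validate(deps):
    """Raise ValueError if the dependency graph contains a cycle."""
    cycle = find_cycle(deps)
    if cycle:
        raise ValueError("dependency cycle: " + " -> ".join(cycle))


# The invalid configuration from the text: B needs A, D needs B, A needs D.
bad = {"A": {"D"}, "B": {"A"}, "D": {"B"}}
print(find_cycle(bad))  # ['A', 'D', 'B', 'A']
```

The reported cycle is the same one as in the text, only written starting from A instead of D.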

One of the most important aspects of a service manager is validation of the dependency graph. If the dependency graph is invalid, then the service manager cannot do its job of bringing services up or down in the proper order. If this validation happens at boot time, when the service manager starts, and the graph happens to be invalid, then what should the service manager do?

Boot time is the worst possible time to detect errors, especially in low-level software such as a service manager, because the machine is not fully operational yet and the administrator may not have many tools to fix the problem. In particular, if the network services are started by the service manager, dependency graph validation happens before the network is operational, and if it fails, the machine has no network. Nobody wants that.

Consequently, dependency graph validation must be done before boot time. A service set must be checked and validated while the machine is already running and functional, before it is rebooted. Once a service set has been checked, it must be possible to guarantee that the machine can boot on it.

This is why a service manager must have both offline tools and online tools, and keep two separate sets of services: the live set and the working set.

Live set, working set

(The prototype version of s6-rc uses the concept of service databases; there is one live service database and all the others are, implicitly, working service databases. We change the terminology here, at the same time as we refine the concept.)