From b092e9d469727f6d271961c7dca7cf8925987af1 Mon Sep 17 00:00:00 2001 From: Laurent Bercot Date: Thu, 29 Jul 2021 12:17:05 +0000 Subject: Add design documentation files Signed-off-by: Laurent Bercot --- doc/design/concepts.html | 251 +++++++++++++++++++++++++++++++++++++++++++++++ doc/design/index.html | 66 +++++++++++++ doc/design/services.html | 203 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 520 insertions(+) create mode 100644 doc/design/concepts.html create mode 100644 doc/design/index.html create mode 100644 doc/design/services.html diff --git a/doc/design/concepts.html b/doc/design/concepts.html new file mode 100644 index 0000000..b5f1268 --- /dev/null +++ b/doc/design/concepts.html @@ -0,0 +1,251 @@ + + + + + + + + s6-rc: service management concepts + + + + + + +
+ + + + + + + + +
+
+

s6-rc: service management concepts

+

The foundations for a solid design

+
+ +
+ +

+

+ +

Table of contents

+ + + +

Service states, machine states

+ +

+ The job of a service manager is to bring the machine from one state, the + current state, to another, the wanted state, + either at boot time or at the administrator's request. The process by which + the machine moves from the current state to the wanted state + is called a transition. +

+ +

+ The state of a machine is defined by the services that are running on it. + A service can have two states: up or down. Some service + managers like to define other states, such as "started" or "failed", but + these are not real states as seen by an external user: a web browser does + not care whether the web server has been "started" or has "failed", all it + sees is whether it is up or down. +

+ +

+ (The previous sentence is not totally accurate. What a web browser sees is + whether the web server is up and ready: readiness is defined by the + ability for a service to... provide service. A service can be up but + not ready yet when it is in the process of initializing itself. We + will explore readiness in more detail later; for now, you can consider that + up means up and ready, unless explicitly stated otherwise.) +

+ +

+ The machine's current state is a set of service states. For instance, + at boot time, the machine's current state is "all the services are + down", and the machine's wanted state is "a certain set of + services are up". (We name this certain set of services the + top bundle; more on that later.) +

+ +

Transitions

+ +

+ Since a machine state is a set of service states, as a direct consequence, + a machine's transition is a set of service transitions from + their current state to their wanted state. If the machine is + bringing a set of services up, it is called an up transition, and + every service in the set undergoes an up transition; + if the machine is bringing a set of services down, then it is called a down + transition, and services in the set undergo a down transition as well. + Note that every possible machine transition can be seen as a down transition + followed by an up transition, and being able to reason separately about sets of + down transitions and sets of up transitions is a very useful + property that we will make heavy use of. +
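As a minimal sketch of that decomposition, assume a machine state is modelled as the set of names of the services that are up. The function name split_transition and the service names are hypothetical; nothing here is s6-rc's actual interface.

```python
def split_transition(current, wanted):
    """Split a machine transition into a down set and an up set.

    Both arguments are sets holding the names of the services that are up.
    """
    down = current - wanted   # up now, but not wanted: bring these down
    up = wanted - current     # wanted, but not up yet: bring these up
    return down, up


current = {"sshd", "httpd", "ntpd"}
wanted = {"sshd", "ntpd", "postgres"}
down, up = split_transition(current, wanted)
print(sorted(down))  # ['httpd']
print(sorted(up))    # ['postgres']
```

Services present in both sets (sshd and ntpd here) undergo no transition at all.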

+ +

+ A service transition can succeed, in which case the machine's current state + changes, getting closer to the wanted state, or it can fail. + When it fails, what the service manager does depends on certain factors: +

+ +
    +
  • + If the failure can be identified as permanent, then attempting the transition + again is pointless. In that case the transition permanently fails, which means that + the machine state transition fails: the machine will never reach its wanted + state. That does not mean other service transitions stop; they continue, and the + machine state ends up as close as possible to the wanted state, but it will + not reach it, and the user is informed of the failure. +
  • + +
  • + If the failure can be identified as temporary, then the transition can be + retried. The delay between two attempts, as well as the maximum number of attempts, + depends on what the administrator has configured for the service: it is the + retry policy. If the transition has still not succeeded after the defined + maximum number of attempts, then the failure becomes permanent and the user is + informed. +
  • +
+ +

+ The way to identify permanent and temporary failures depends on the service, and is + configured as part of the retry policy. +
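The failure handling described above could be sketched as follows. This is an illustration under stated assumptions, not the s6-rc implementation: the exception classes, the RetryPolicy fields and run_transition are all hypothetical names, and the permanent/temporary distinction is modelled simply by which exception an attempt raises.

```python
import time
from dataclasses import dataclass
from typing import Callable


class TemporaryFailure(Exception):
    """Raised by a transition attempt that may succeed if retried."""


class PermanentFailure(Exception):
    """Raised when retrying the transition is pointless."""


@dataclass
class RetryPolicy:
    max_attempts: int = 3        # hypothetical field: maximum number of attempts
    delay_seconds: float = 0.0   # hypothetical field: delay between two attempts


def run_transition(attempt: Callable[[], None], policy: RetryPolicy) -> bool:
    """Return True if the transition succeeded, False if it failed permanently."""
    for n in range(1, policy.max_attempts + 1):
        try:
            attempt()
            return True
        except PermanentFailure:
            return False                      # retrying is pointless
        except TemporaryFailure:
            if n < policy.max_attempts:
                time.sleep(policy.delay_seconds)
    return False  # attempts exhausted: the failure becomes permanent
```

A caller would pass the service's transition as attempt and report a False result to the user, while the other service transitions carry on.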

+ +

+ As a special engineering note, that is unsatisfying from a theoretical point of view + (because it makes our concepts asymmetrical) but vital where real-life services + are concerned, let us mention right away that down transitions should never + fail. Except in very specific, very rare cases, it should always be possible + to successfully stop a service: as far as services are concerned, death is always + an out. Allowing down transitions to fail leads to ridiculous issues like + systemd being unable to + shut down a system. This should never happen: when a user wants their system off, + they want it off, and fighting against that will only cause frustration and + plug-pulling. +

+ +

Parallelism

+ +

+ A traditional serial service manager performs all transitions one after + another, in a sequence; this is not efficient, because if a transition spends some + time waiting, or even doing CPU-intensive computations on one core while other cores + are available, then time is wasted if other transitions could be taking place during + that time. A good service manager is able to perform transitions in parallel, + to make the best use of the machine's available resources. +

+ +

+ In order to perform transitions in parallel, the service manager must know what + transitions are independent (so they can be performed at the same time without + influencing one another) and which ones can only be done in a sequence. That means + that the administrator must provide the service manager with a list of + dependencies between services. +

+ +

Dependencies

+ +

+ At a very basic level, a dependency from service B to service A + means that B can only be up when A is up; and so, + B should only be brought up once A is already up. For instance, a web + server should only be brought up when the database hosting its content is itself up. +

+ +

+ A service C that has nothing to do with A or B can be brought + up whenever — in particular, it can be brought up in parallel with A or + B, without being bound by their state in any way. +

+ +

+ If a service D depends on B, and A depends on D, then + the dependencies are invalid: there is a dependency cycle, + D → B → A → D. This configuration must + be rejected by the service manager. +
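Rejecting such a configuration requires finding the cycle. A minimal sketch using a depth-first search, assuming dependencies are given as a mapping from each service name to the services it depends on (the find_cycle name and this data layout are assumptions made for illustration):

```python
def find_cycle(deps):
    """Return a dependency cycle as a list of names, or None if there is none.

    deps maps each service name to the collection of services it depends on.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(node):
        color[node] = GREY
        path.append(node)
        for dep in sorted(deps.get(node, ())):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path: we have closed a cycle
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(deps):
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# B depends on A, D depends on B, A depends on D.
deps = {"B": ["A"], "D": ["B"], "A": ["D"]}
print(find_cycle(deps))  # ['A', 'D', 'B', 'A']
```

The reported path is the same cycle as the one above, only entered from A instead of D.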

+ +

+ On the other hand, if D and E both depend on B, and F + depends on both D and E, it is not a cycle, and it is acceptable: the + service manager will first bring A up, then B, then D and E + in parallel, then F once both D and E are up. +
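That ordering can be computed mechanically. The sketch below groups services into successive waves, where every service in a wave can be brought up in parallel; the start_waves name and the dictionary layout are assumptions for illustration. Grouping by waves is a simplification: a real service manager could start each service as soon as its own dependencies are up, without waiting for a whole wave to complete.

```python
def start_waves(deps):
    """Group services into waves; every wave can be started in parallel.

    deps maps each service name to the services it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    # Collect every service mentioned, including ones with no dependencies.
    remaining = {name: set(d) for name, d in deps.items()}
    for d in deps.values():
        for name in d:
            remaining.setdefault(name, set())

    waves = []
    up = set()
    while remaining:
        # A service is ready once everything it depends on is already up.
        ready = sorted(n for n, d in remaining.items() if d <= up)
        if not ready:
            raise ValueError("dependency cycle among: " + ", ".join(sorted(remaining)))
        waves.append(ready)
        up.update(ready)
        for n in ready:
            del remaining[n]
    return waves


# B depends on A; D and E depend on B; F depends on D and E.
deps = {"B": {"A"}, "D": {"B"}, "E": {"B"}, "F": {"D", "E"}}
print(start_waves(deps))  # [['A'], ['B'], ['D', 'E'], ['F']]
```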

+ +

+ This shows that the acceptable structure for a list of dependencies is a directed + acyclic graph, or DAG. When we talk about the list of dependencies, we should say + the dependency DAG, but it is a bit hermetic, so we'll just talk about the + dependency graph. +

+ +

+ One of the most important aspects of a service manager is validation of the dependency + graph. If the dependency graph is invalid, then the service manager cannot do its + job of bringing services up or down in the proper order. If this validation happens at + boot time, when the service manager starts, and the graph happens to be invalid, then what + should the service manager do? +

+ +

+ Boot time is the worst possible time to detect errors, especially in low-level + software such as a service manager, because the machine is not fully operational yet and + the administrator may not have many tools to fix the problem. In particular, if the + network services are started by the service manager, dependency graph validation happens + before the network is operational, and if it fails, the machine has no network. Nobody + wants that. +

+ +

+ Consequently, dependency graph validation must be done before boot time. + A service set must be checked and validated while the machine is already running and + functional, before it is rebooted. It must be possible to guarantee bootability + on a service set once it has been checked. +

+ +

+ This is why a service manager must have both offline tools and online + tools, and keep two separate sets of services: the live set and the + working set. +

+ +

Live set, working set

+ +

+ (The prototype version of s6-rc uses the concept of service databases; + there is one live service database and all the others are, implicitly, + working service databases. We change the terminology here, at the same time as + we refine the concept.) +

+ +
+
+
+ + + + diff --git a/doc/design/index.html b/doc/design/index.html new file mode 100644 index 0000000..42e8a7d --- /dev/null +++ b/doc/design/index.html @@ -0,0 +1,66 @@ + + + + + + + + the s6 ecosystem: the s6-rc service manager + + + + + + +
+ + + + + + + + +
+
+

s6-rc

+

A powerful and reliable service management engine

+
+ +
+ +

+

+ +

Table of contents

+ + + +

Service management concepts

+ +

+ This page explains a few essential concepts + taking part in the design of the s6-rc service manager. +

+ +
+
+
+ + + + diff --git a/doc/design/services.html b/doc/design/services.html new file mode 100644 index 0000000..b6275af --- /dev/null +++ b/doc/design/services.html @@ -0,0 +1,203 @@ + + + + + + + + s6-rc: services + + + + + + +
+ + + + + + + + +
+
+

s6-rc: services

+

The basic building block

+
+ +
+ +

+

+ +

Table of contents

+ + + +

Service types

+ +

+ In the most general sense, a service is a basic unit that can undergo a + transition; but not all services can be handled the same way. Services + are divided into several categories, which we call types; these + are the following. +

+ +
    +
  1. Longrun. + +

    + A longrun is the "traditional" definition of a service, + implemented by a long-lived process, a.k.a. a daemon. As a first + approximation, it means that when the daemon is alive, the service is up, + and when the daemon is not present, the service is down. Longruns are the + most common type of service, and the main reason why it's a good thing for + a service manager to work in tandem with a process supervisor: the details + of keeping the daemon alive, surveying its readiness, etc. are delegated + to the process supervisor, which abstracts some complexity away from the + service manager. +

    +
  2. + +
  3. Oneshot. + +

    + A oneshot is a service that represents a state change in the + machine, but that does not need a daemon because the state is maintained by + the kernel. For instance, "mounting a filesystem" and "setting a sysctl" are + oneshots: the service is considered up when the filesystem is mounted + or the sysctl has been performed, and down when the filesystem is + unmounted or the sysctl has its default value. Note that it's generally + meaningless to revert a sysctl (and in most cases it's also a bad idea to try + and unmount filesystems before the very end of a shutdown procedure), so it is + quite common for the down transition of a oneshot to be a nop: after + the first time the service has been brought up, the state basically never + changes. +

    + +

    + Longruns and oneshots are collectively called atomic + services. They are the core service types, the ones that actually do the + work. Other service types are just convenience tools around them. +

    +
  4. + +
  5. External. + +

    + An external is a service that is not handled by the + service manager itself, but by a system that is external to it. It is a way for + the service manager to delegate complex subsystems to other programs such as a + network manager. The service manager does not know how to perform transitions + for an external; it knows nothing about it but its name. +

    + +

    + It is impossible to set the wanted state of an external: such + a service has to be triggered entirely outside of the service manager. All the + service manager does is receive events that inform it of the external's current + state. +

    + +

    + Consequently, an external does not have any dependencies. It is, however, + possible for a service to depend on an external: that is their intended use, + gating the transition of another service on the reception of an external event. +

    +
  6. + +
  7. Bundle. + +

    + A bundle is a pseudo-service representing a set of services: it is used + to implement service conjunction (AND). When a + bundle is wanted up, it means that all the services it + contains are wanted up. A bundle's current state is up + if all the services it contains are up, and it is down otherwise. +

    +

    + However, when a bundle is wanted down, it also means that all + (and not just one!) of the services it contains are wanted down, so take + care when explicitly bringing down bundles. +

    +
  8. + +
  9. Virtual. + +

    + A virtual is a pseudo-service representing a set of services, but used for + disjunction (OR) instead: instead of meaning "all the services in the set", it means + "one of the services in the set". A virtual's current state is up + if at least one of the services it represents is up, and down + otherwise. +

    +
  10. +
      + +
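The current-state rules for the two pseudo-service types, bundles and virtuals, can be written down in a couple of lines. This is a sketch only: the function names are hypothetical, and the set of services that are currently up is assumed to be available as a Python set.

```python
def bundle_is_up(members, up_services):
    """A bundle is up only if all the services it contains are up (AND)."""
    return all(m in up_services for m in members)


def virtual_is_up(members, up_services):
    """A virtual is up if at least one service it represents is up (OR)."""
    return any(m in up_services for m in members)


up_services = {"sshd"}
members = ["sshd", "httpd"]
print(bundle_is_up(members, up_services))   # False: httpd is down
print(virtual_is_up(members, up_services))  # True: sshd is up
```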

      Dynamic instantiation

      + +


      + + + +
    1. Dynamically instantiated longrun. + +

      + A dynamically instantiated longrun, or DIL, is a template for + an indeterminate number of longruns that all follow the same model, + and that differ by one parameter, the instance name. They are used + to implement sets of similar services that the user will want to start on + demand: for instance, a set of gettys. A DIL is identified by an + @ at the end of the service name; anything that follows the @ + is the instance parameter. For instance, getty@ can be the name + of the DIL spawning the gettys, and getty@tty2 can be a + dynamic instance of getty@ with tty2 as the instance + parameter. +

      + +

      + (It is possible to define a regular, static (as opposed to dynamically + instantiated), getty@tty1 service even if + the getty@ DIL exists: in that case, getty@tty1 will always + refer to the static service and it will be impossible to spawn a getty@ + instance with tty1 as an instance parameter. This can be a good way to + ensure that specific "instances" are special-cased.) +

      + +

      + However, DILs have a strong limitation: only dynamically instantiated services + can depend on them, and only with the same instance parameter. In other + words: B cannot depend on A@, only B@ can depend on + A@, and that means that for any x, B@x depends on + A@x. +

      +
    2. + +
+
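The naming rules for DILs lend themselves to a short sketch. Everything here is hypothetical (the resolve and instance_dependencies names, the tuple results, and the choice to split at the first @); it only illustrates that a static service takes precedence over a template, and that B@x depends on A@x.

```python
def resolve(name, static_services, templates):
    """Resolve a service name.

    Returns ("static", name), ("instance", template, param), or None.
    """
    if name in static_services:
        return ("static", name)           # a static service always wins
    head, sep, param = name.partition("@")  # assumption: split at the first @
    template = head + sep
    if sep and param and template in templates:
        return ("instance", template, param)
    return None


def instance_dependencies(template, param, template_deps):
    """Dependencies of template+param: each template dependency, same parameter."""
    return [dep + param for dep in template_deps.get(template, ())]


static_services = {"getty@tty1"}
templates = {"getty@"}
print(resolve("getty@tty2", static_services, templates))  # ('instance', 'getty@', 'tty2')
print(resolve("getty@tty1", static_services, templates))  # ('static', 'getty@tty1')
print(instance_dependencies("B@", "x", {"B@": ["A@"]}))   # ['A@x']
```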
+
+ + + + -- cgit v1.2.3