summaryrefslogtreecommitdiff
path: root/doc/design/concepts.html
diff options
context:
space:
mode:
Diffstat (limited to 'doc/design/concepts.html')
-rw-r--r--doc/design/concepts.html251
1 files changed, 251 insertions, 0 deletions
diff --git a/doc/design/concepts.html b/doc/design/concepts.html
new file mode 100644
index 0000000..b5f1268
--- /dev/null
+++ b/doc/design/concepts.html
@@ -0,0 +1,251 @@
+<!doctype html>
+<html lang="en">
+<head>
+ <meta charset="utf-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <meta name="description" content="s6-rc: service management concepts">
+ <meta name="keywords" content="s6-rc service management concepts dependencies unix administration root laurent bercot ska skarnet supervision init system boot service systemd alternative" />
+ <title>s6-rc: service management concepts</title>
+ <link rel="stylesheet" href="/css/pure/pure-min.css">
+ <!-- <link rel="stylesheet" href="https://unpkg.com/purecss@2.0.5/build/pure-min.css" integrity="sha384-G9DpmGxRIF6tpgbrkZVcZDeIomEU22LgTguPAI739bbKytjPE/kHTK5YxjJAAEXC" crossorigin="anonymous"> -->
+ <link rel="stylesheet" href="/layouts/side-menu/styles.css">
+</head>
+<body>
+
+<div id="layout">
+ <!-- Menu toggle -->
+ <a href="#menu" id="menuLink" class="menu-link">
+ <!-- Hamburger icon -->
+ <span></span>
+ </a>
+
+ <div id="menu">
+ <div class="pure-menu">
+ <a class="pure-menu-heading" style="text-transform:none;" href="/">skarnet.com</a>
+ <ul class="pure-menu-list">
+ <li class="pure-menu-item"> <a href="/" class="pure-menu-link">Home</a> </li>
+ <li class="pure-menu-item"> <a href="/projects/" class="pure-menu-link">Projects</a> </li>
+ <li class="pure-menu-item"> <a href="/contact/" class="pure-menu-link">Contact</a> </li>
+ <li class="pure-menu-item"> <a href="//skarnet.org/" class="pure-menu-link">skarnet.org</a> </li>
+ </ul>
+ </div>
+ </div>
+
+ <div id="main">
+ <div class="header">
+ <h1> s6-rc: service management concepts </h1>
+ <h2> The foundations for a solid design </h2>
+ </div>
+
+ <div class="content">
+
+ <p>
+ </p>
+
+ <h2 class="content-subhead" id="toc"> Table of contents </h2>
+
+ <ul>
+ <li> <a href="#toc">Table of contents</a> </li>
+ <li> <a href="#states">Service states, machine states</a> </li>
+ <li> <a href="#transitions">Transitions</a> </li>
+ <li> <a href="#dependencies">Dependencies</a> </li>
+ <li> <a href="#servicesets">Live set, working set</a> </li>
+ </ul>
+
+ <h2 class="content-subhead" id="states"> Service states, machine states </h2>
+
+ <p>
+ The job of a service manager is to bring the machine from one state, the
+ <em>current state</em>, to another, the <em>wanted state</em>,
+ either at boot time or at the administrator's request. The process by which
+ the machine moves from the <em>current state</em> to the <em>wanted state</em>
+ is called a <em>transition</em>.
+ </p>
+
+ <p>
+ The state of a machine is defined by the services that are running on it.
+ A service can have two states: <em>up</em> or <em>down</em>. Some service
+ managers like to define other states, such as "started" or "failed", but
+ these are not real states as seen by an external user: a web browser does
+ not care whether the web server has been "started" or has "failed", all it
+ sees is whether it is <em>up</em> or <em>down</em>.
+ </p>
+
+ <p>
+ (The previous sentence is not totally accurate. What a web browser sees is
+ whether the web server is <em>up and ready</em>: readiness is defined by the
+ ability for a service to... provide service. A service can be <em>up</em> but
+ not <em>ready</em> yet when it is in the process of initializing itself. We
+ will explore readiness in more detail later; for now, you can consider that
+ <em>up</em> means <em>up and ready</em>, unless explicitly stated otherwise.)
+ </p>
+
+ <p>
+ The machine's <em>current state</em> is a set of service states. For instance,
+ at boot time, the machine's <em>current state</em> is "all the services are
+ <em>down</em>", and the machine's <em>wanted state</em> is "a certain set of
+ services are <em>up</em>". (We name this certain set of services the
+ <em>top bundle</em>; more on that later.)
+ </p>
+
+ <h2 class="content-subhead" id="transitions"> Transitions </h2>
+
+ <p>
+ Since a machine state is a set of service state, as a direct consequence,
+ a machine's <em>transition</em> is a set of service <em>transitions</em> from
+ their <em>current state</em> to their <em>wanted state</em>. If the machine is
+ bringing a set of services up, it is called an <em>up transition</em> &mdash; and
+ every service in the set undergoes an <em>up transition</em>;
+ if the machine is bringing a set of services down, then it is called a <em>down
+ transition</em>, and services in the set undergo a <em>down transition</em> as well.
+ Note that every possible machine transition can be seen as a <em>down transition</em>
+ followed by an <em>up transition</em>, and being able to reason separately on sets of
+ <em>down transitions</em> and on sets of <em>up transitions</em> is a very useful
+ property, that we will make heavy use of.
+ </p>
+
+ <p>
+ A service transition can succeed, in which case the machine's <em>current state</em>
+ changes, getting closer to the <em>wanted state</em>, or it can fail.
+ When it fails, what the service manager does depends on certain factors:
+ </p>
+
+ <ul>
+ <li>
+ If the failure can be identified as <em>permanent</em>, then attempting the transition
+ again is pointless. In which case the transition permanently fails, and that means
+ the machine state transition fails - the machine will never reach its <em>wanted
+ state</em>. That does not mean other service transitions stop; they continue, and the
+ machine state ends up as close as possible to the <em>wanted state</em>, but it will
+ not reach it, and the user is informed of the failure.
+ </li>
+
+ <li>
+ If the failure can be identified as <em>temporary</em>, then the transition can be
+ retried. The delay between two attempts, as well as the maximum number of attempts,
+ depends on what the administrator has configured for the service: it is the
+ <em>retry policy</em>. If the transition has still not succeeded after the defined
+ maximum number of attempts, then the failure becomes permanent and the user is
+ informed.
+ </li>
+ </ul>
+
+ <p>
+ The way to identify permanent and temporary failures depends on the service, and are
+ configured as part of the <em>retry policy</em>.
+ </p>
+
+ <p>
+ As a special engineering note, that is unsatisfying from a theoretical point of view
+ (because it makes our concepts asymmetrical) but <em>vital</em> where real-life services
+ are concerned, let us mention right away that <strong>down transitions should never
+ fail</strong>. Except in very specific, very rare cases, it should always be possible
+ to successfully stop a service: as far as services are concerned, <em>death is always
+ an out</em>. Allowing down transitions to fail leads to ridiculous issues like
+ <a href="https://github.com/systemd/systemd/issues/12967">systemd being unable to
+ shutdown a system</a>. This should never happen: when a user wants their system off,
+ <em>they want it off</em>, and fighting against that will only cause frustration and
+ plug-pulling.
+ </p>
+
+ <h3 class="content-subsubhead"> Parallelism </h3>
+
+ <p>
+ A traditional serial service manager performs all <em>transitions</em> one after
+ another, in a sequence; this is not efficient, because if a transition spends some
+ time waiting, or even doing CPU-intensive computations on one core while other cores
+ are available, then time is wasted if other transitions could be taking place during
+ that time. A good service manager is able to perform transitions <em>in parallel</em>,
+ to make the best use of the machine's available resources.
+ </p>
+
+ <p>
+ In order to perform transitions in parallel, the service manager must know what
+ transitions are independent (so they can be performed at the same time without
+ influencing one another) and which ones can only be done in a sequence. That means
+ that the administrator must provide the service manager with a list of
+ <em>dependencies</em> between services.
+ </p>
+
+ <h2 class="content-subhead" id="dependencies"> Dependencies </h2>
+
+ <p>
+ At a very basic level, a <em>dependency</em> from service <tt>B</tt> to service <tt>A</tt>
+ means that <tt>B</tt> can only be <em>up</em> when <tt>A</tt> is <em>up</em>; and so,
+ <tt>B</tt> should only be brought up once <tt>A</tt> is already up. For instance, a web
+ server should only be brought up when the database hosting its content is itself up.
+ </p>
+
+ <p>
+ A service <tt>C</tt> that has nothing to do with <tt>A</tt> or <tt>B</tt> can be brought
+ up whenever &mdash; in particular, it can be brought up in parallel with <tt>A</tt> or
+ <tt>B</tt>, without being bound by their state in any way.
+ </p>
+
+ <p>
+ If a service <tt>D</tt> depends on <tt>B</tt>, and <tt>A</tt> depends on <tt>D</tt>, then
+ the dependencies are invalid: there is a <em>dependency cycle</em>,
+ <tt>D</tt> &rarr; <tt>B</tt> &rarr; <tt>A</tt> &rarr; <tt>D</tt>. This configuration must
+ be rejected by the service manager.
+ </p>
+
+ <p>
+ On the other hand, if <tt>D</tt> and <tt>E</tt> both depend on <tt>B</tt>, and <tt>F</tt>
+ depends on both <tt>D</tt> and <tt>E</tt>, it is not a cycle, and it is acceptable: the
+ service manager will first bring <tt>A</tt> up, then <tt>B</tt>, then <tt>D</tt> and <tt>E</tt>
+ in parallel, then <tt>F</tt> once both <tt>D</tt> and <tt>E</tt> are up.
+ </p>
+
+ <p>
+ This shows that the acceptable structure for a list of dependencies is a <em>directed
+ acyclic graph</em>, or DAG. When we talk about the list of dependencies, we should say
+ <em>the dependency DAG</em>, but it is a bit hermetic, so we'll just talk about the
+ <em>dependency graph</em>.
+ </p>
+
+ <p>
+ One of the most important aspects of a service manager is validation of the <em>dependency
+ graph</em>. If the depdendency graph is invalid, then the service manager cannot do its
+ jobs of bringing services up or down in the proper order. If this validation happens at
+ boot time, when the service manager starts, and the graph happens to be invalid, then what
+ should the service manager do?
+ </p>
+
+ <p>
+ Boot time is the <em>worst</em> possible time to detect errors, especially in low-level
+ software such as a service manager, because the machine is not fully operational yet and
+ the administrator may not have many tools to fix the problem. In particular, if the
+ network services are started by the service manager, dependency graph validation happens
+ before the network is operational, and if it fails, the machine has no network. Nobody
+ wants that.
+ </p>
+
+ <p>
+ Consequently, dependency graph validation must be done <em>before</em> boot time.
+ A service set must be checked and validated while the machine is already running and
+ functional, before it is rebooted. It must be possible to <em>guarantee bootability</em>
+ on a service set once it has been checked.
+ </p>
+
+ <p>
+ This is why a service manager must have both <em>offline tools</em> and <em>online
+ tools</em>, and keep two separate sets of services: the <em>live set</em> and the
+ <em>working set</em>.
+ </p>
+
+ <h2 class="content-subhead" id="servicesets"> Live set, working set </h2>
+
+ <p>
+ (The prototype version of s6-rc uses the concept of <em>service databases</em>;
+ there is one <em>live service database</em> and all the others are, implicitly,
+ <em>working service databases</em>. We change the terminology here, at the same time
+ we refine the concept).
+ </p>
+
+ </div>
+ </div>
+</div>
+
+<script src="/js/ui.js"></script>
+</body>
+</html>