summaryrefslogtreecommitdiff
path: root/doc/design/concepts.html
blob: b5f1268f73850c515eb40cf1e4e5068ffdf40043 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <meta name="description" content="s6-rc: service management concepts">
    <meta name="keywords" content="s6-rc service management concepts dependencies unix administration root laurent bercot ska skarnet supervision init system boot service systemd alternative" />
    <title>s6-rc: service management concepts</title>
    <link rel="stylesheet" href="/css/pure/pure-min.css">
    <!-- <link rel="stylesheet" href="https://unpkg.com/purecss@2.0.5/build/pure-min.css" integrity="sha384-G9DpmGxRIF6tpgbrkZVcZDeIomEU22LgTguPAI739bbKytjPE/kHTK5YxjJAAEXC" crossorigin="anonymous"> -->
    <link rel="stylesheet" href="/layouts/side-menu/styles.css">
</head>
<body>

<div id="layout">
  <!-- Menu toggle -->
  <a href="#menu" id="menuLink" class="menu-link">
    <!-- Hamburger icon -->
    <span></span>
  </a>

  <div id="menu">
    <div class="pure-menu">
      <a class="pure-menu-heading" style="text-transform:none;" href="/">skarnet.com</a>
      <ul class="pure-menu-list">
        <li class="pure-menu-item"> <a href="/" class="pure-menu-link">Home</a> </li>
        <li class="pure-menu-item"> <a href="/projects/" class="pure-menu-link">Projects</a> </li>
        <li class="pure-menu-item"> <a href="/contact/" class="pure-menu-link">Contact</a> </li>
        <li class="pure-menu-item"> <a href="//skarnet.org/" class="pure-menu-link">skarnet.org</a> </li>
      </ul>
    </div>
  </div>

  <div id="main">
    <div class="header">
      <h1> s6-rc: service management concepts </h1>
        <h2> The foundations for a solid design </h2>
    </div>

    <div class="content">

      <p>
      </p>

      <h2 class="content-subhead" id="toc"> Table of contents </h2>

      <ul>
        <li> <a href="#toc">Table of contents</a> </li>
        <li> <a href="#states">Service states, machine states</a> </li>
        <li> <a href="#transitions">Transitions</a> </li>
        <li> <a href="#dependencies">Dependencies</a> </li>
        <li> <a href="#servicesets">Live set, working set</a> </li>
      </ul>

      <h2 class="content-subhead" id="states"> Service states, machine states </h2>

      <p>
        The job of a service manager is to bring the machine from one state, the
        <em>current state</em>, to another, the <em>wanted state</em>,
        either at boot time or at the administrator's request. The process by which
        the machine moves from the <em>current state</em> to the <em>wanted state</em>
        is called a <em>transition</em>.
      </p>

      <p>
        The state of a machine is defined by the services that are running on it.
        A service can have two states: <em>up</em> or <em>down</em>. Some service
        managers like to define other states, such as "started" or "failed", but
        these are not real states as seen by an external user: a web browser does
        not care whether the web server has been "started" or has "failed", all it
        sees is whether it is <em>up</em> or <em>down</em>.
      </p>

      <p>
        (The previous sentence is not totally accurate. What a web browser sees is
        whether the web server is <em>up and ready</em>: readiness is defined by the
        ability for a service to... provide service. A service can be <em>up</em> but
        not <em>ready</em> yet when it is in the process of initializing itself. We
        will explore readiness in more detail later; for now, you can consider that
        <em>up</em> means <em>up and ready</em>, unless explicitly stated otherwise.)
      </p>

      <p>
        The machine's <em>current state</em> is a set of service states. For instance,
        at boot time, the machine's <em>current state</em> is "all the services are
        <em>down</em>", and the machine's <em>wanted state</em> is "a certain set of
        services are <em>up</em>". (We name this certain set of services the
        <em>top bundle</em>; more on that later.)
      </p>

      <h2 class="content-subhead" id="transitions"> Transitions </h2>

      <p>
        Since a machine state is a set of service state, as a direct consequence,
        a machine's <em>transition</em> is a set of service <em>transitions</em> from
        their <em>current state</em> to their <em>wanted state</em>. If the machine is
        bringing a set of services up, it is called an <em>up transition</em> &mdash; and
        every service in the set undergoes an <em>up transition</em>;
        if the machine is bringing a set of services down, then it is called a <em>down
        transition</em>, and services in the set undergo a <em>down transition</em> as well.
        Note that every possible machine transition can be seen as a <em>down transition</em>
        followed by an <em>up transition</em>, and being able to reason separately on sets of
        <em>down transitions</em> and on sets of <em>up transitions</em> is a very useful
        property, that we will make heavy use of.
      </p>

      <p>
        A service transition can succeed, in which case the machine's <em>current state</em>
        changes, getting closer to the <em>wanted state</em>, or it can fail.
        When it fails, what the service manager does depends on certain factors:
      </p>

      <ul>
       <li>
        If the failure can be identified as <em>permanent</em>, then attempting the transition
        again is pointless. In which case the transition permanently fails, and that means
        the machine state transition fails - the machine will never reach its <em>wanted
        state</em>. That does not mean other service transitions stop; they continue, and the
        machine state ends up as close as possible to the <em>wanted state</em>, but it will
        not reach it, and the user is informed of the failure.
       </li>

       <li>
        If the failure can be identified as <em>temporary</em>, then the transition can be
        retried. The delay between two attempts, as well as the maximum number of attempts,
        depends on what the administrator has configured for the service: it is the
        <em>retry policy</em>. If the transition has still not succeeded after the defined
        maximum number of attempts, then the failure becomes permanent and the user is
        informed.
       </li>
      </ul>

      <p>
        The way to identify permanent and temporary failures depends on the service, and are
        configured as part of the <em>retry policy</em>. 
      </p>

      <p>
        As a special engineering note, that is unsatisfying from a theoretical point of view
        (because it makes our concepts asymmetrical) but <em>vital</em> where real-life services
        are concerned, let us mention right away that <strong>down transitions should never
        fail</strong>. Except in very specific, very rare cases, it should always be possible
        to successfully stop a service: as far as services are concerned, <em>death is always
        an out</em>. Allowing down transitions to fail leads to ridiculous issues like
        <a href="https://github.com/systemd/systemd/issues/12967">systemd being unable to
        shutdown a system</a>. This should never happen: when a user wants their system off,
        <em>they want it off</em>, and fighting against that will only cause frustration and
        plug-pulling.
      </p>

      <h3 class="content-subsubhead"> Parallelism </h3>

      <p>
        A traditional serial service manager performs all <em>transitions</em> one after
        another, in a sequence; this is not efficient, because if a transition spends some
        time waiting, or even doing CPU-intensive computations on one core while other cores
        are available, then time is wasted if other transitions could be taking place during
        that time. A good service manager is able to perform transitions <em>in parallel</em>,
        to make the best use of the machine's available resources.
      </p>

      <p>
        In order to perform transitions in parallel, the service manager must know what
        transitions are independent (so they can be performed at the same time without
        influencing one another) and which ones can only be done in a sequence. That means
        that the administrator must provide the service manager with a list of
        <em>dependencies</em> between services.
      </p>

      <h2 class="content-subhead" id="dependencies"> Dependencies </h2>

      <p>
        At a very basic level, a <em>dependency</em> from service <tt>B</tt> to service <tt>A</tt>
        means that <tt>B</tt> can only be <em>up</em> when <tt>A</tt> is <em>up</em>; and so,
        <tt>B</tt> should only be brought up once <tt>A</tt> is already up. For instance, a web
        server should only be brought up when the database hosting its content is itself up.
      </p>

      <p>
        A service <tt>C</tt> that has nothing to do with <tt>A</tt> or <tt>B</tt> can be brought
        up whenever &mdash; in particular, it can be brought up in parallel with <tt>A</tt> or
        <tt>B</tt>, without being bound by their state in any way.
      </p>

      <p>
        If a service <tt>D</tt> depends on <tt>B</tt>, and <tt>A</tt> depends on <tt>D</tt>, then
        the dependencies are invalid: there is a <em>dependency cycle</em>,
        <tt>D</tt> &rarr; <tt>B</tt> &rarr; <tt>A</tt> &rarr; <tt>D</tt>. This configuration must
        be rejected by the service manager.
      </p>

      <p>
        On the other hand, if <tt>D</tt> and <tt>E</tt> both depend on <tt>B</tt>, and <tt>F</tt>
        depends on both <tt>D</tt> and <tt>E</tt>, it is not a cycle, and it is acceptable: the
        service manager will first bring <tt>A</tt> up, then <tt>B</tt>, then <tt>D</tt> and <tt>E</tt>
        in parallel, then <tt>F</tt> once both <tt>D</tt> and <tt>E</tt> are up.
      </p>

      <p>
        This shows that the acceptable structure for a list of dependencies is a <em>directed
        acyclic graph</em>, or DAG. When we talk about the list of dependencies, we should say
        <em>the dependency DAG</em>, but it is a bit hermetic, so we'll just talk about the
        <em>dependency graph</em>.
      </p>

      <p>
        One of the most important aspects of a service manager is validation of the <em>dependency
        graph</em>. If the depdendency graph is invalid, then the service manager cannot do its
        jobs of bringing services up or down in the proper order. If this validation happens at
        boot time, when the service manager starts, and the graph happens to be invalid, then what
        should the service manager do?
      </p>

      <p>
        Boot time is the <em>worst</em> possible time to detect errors, especially in low-level
        software such as a service manager, because the machine is not fully operational yet and
        the administrator may not have many tools to fix the problem. In particular, if the
        network services are started by the service manager, dependency graph validation happens
        before the network is operational, and if it fails, the machine has no network. Nobody
        wants that.
      </p>

      <p>
        Consequently, dependency graph validation must be done <em>before</em> boot time.
        A service set must be checked and validated while the machine is already running and
        functional, before it is rebooted. It must be possible to <em>guarantee bootability</em>
        on a service set once it has been checked.
      </p>

      <p>
        This is why a service manager must have both <em>offline tools</em> and <em>online
        tools</em>, and keep two separate sets of services: the <em>live set</em> and the
        <em>working set</em>.
      </p>

      <h2 class="content-subhead" id="servicesets"> Live set, working set </h2>

      <p>
        (The prototype version of s6-rc uses the concept of <em>service databases</em>;
        there is one <em>live service database</em> and all the others are, implicitly,
        <em>working service databases</em>. We change the terminology here, at the same time
        we refine the concept).
      </p>

    </div>
  </div>
</div>

<script src="/js/ui.js"></script>
</body>
</html>