path: root/book-enea-nfv-core-installation-guide/doc/high_availability.xml
author     Miruna Paun <Miruna.Paun@enea.com>  2017-09-28 18:53:02 +0200
committer  Miruna Paun <Miruna.Paun@enea.com>  2017-09-28 18:53:02 +0200
commit     cc001420304566cd252f2c6323dec3a826a12954 (patch)
tree       299a124c12a96df1000e0b87753df9caef34b06a /book-enea-nfv-core-installation-guide/doc/high_availability.xml
parent     380e975b1b93e83705c8ed30197b1c23f8193814 (diff)
download   doc-enea-nfv-cc001420304566cd252f2c6323dec3a826a12954.tar.gz
Proofed entire installation guide, added all new needed images.
USERDOCAP-240 Signed-off-by: Miruna Paun <Miruna.Paun@enea.com>
Diffstat (limited to 'book-enea-nfv-core-installation-guide/doc/high_availability.xml')
-rw-r--r--  book-enea-nfv-core-installation-guide/doc/high_availability.xml  |  599
1 file changed, 311 insertions(+), 288 deletions(-)
diff --git a/book-enea-nfv-core-installation-guide/doc/high_availability.xml b/book-enea-nfv-core-installation-guide/doc/high_availability.xml
index e489101..93f6468 100644
--- a/book-enea-nfv-core-installation-guide/doc/high_availability.xml
+++ b/book-enea-nfv-core-installation-guide/doc/high_availability.xml
@@ -2,27 +2,24 @@
2<chapter id="high_availability"> 2<chapter id="high_availability">
3 <title>High Availability Guide</title> 3 <title>High Availability Guide</title>
4 4
5 <para>ENEA NFV Core 1.0 has been designed to provide high availability 5 <para>Enea NFV Core 1.0 has been designed to provide high availability
6 characteristics that are needed for developing and deploying telco-grade NFV 6 characteristics that are needed for developing and deploying telco-grade NFV
7 solutions on top of our OPNFV based platform.</para> 7 solutions on top of our OPNFV based platform. The High Availability subject
8 8 in general is very wide and still an important focus in both opensource
9 <para>The High Availability subject in general is very wide and still an 9 communities and the independent/proprietary solutions market.</para>
10 important focus in both opensource communities and independent/proprietary 10
11 solutions market. ENEA NFV Core 1.0 aims to initially leverage the efforts 11 <para>Enea NFV Core 1.0 aims to initially leverage the efforts in the
12 in the upstream OPNFV and OpenStack opensource projects, combining solutions 12 upstream OPNFV and OpenStack opensource projects, combining solutions from
13 from both worlds in an effort to provide flexibility and a wide enough use 13 both worlds in an effort to provide flexibility and use-case coverage. Enea
14 case coverage. ENEA has a long time expertise and proprietary solutions 14 has long term expertise and proprietary solutions addressing High
15 addressing High Availability for telco applications, which are subject to 15 Availability for telco applications, which are subject to integration with
16 integrating with the NFV based solutions, however the initial scope for ENEA 16 the NFV based solutions.</para>
17 NFV Core is to leverage as much as possible the OPNFV Reference Platform and
18 open source projects in general, such as it will be seen further ahead in
19 this chapter.</para>
20 17
21 <section id="levels"> 18 <section id="levels">
22 <title>High Availability Levels</title> 19 <title>High Availability Levels</title>
23 20
24 <para>The base for the feature set in ENEA NFV Core is divided into three 21 <para>The foundation for the feature set available in Enea NFV Core is
25 levels:</para> 22 divided into three levels:</para>
26 23
27 <itemizedlist> 24 <itemizedlist>
28 <listitem> 25 <listitem>
@@ -30,7 +27,7 @@
30 </listitem> 27 </listitem>
31 28
32 <listitem> 29 <listitem>
33 <para>NFV Platform HA</para> 30 <para>NFV Platform H.A.</para>
34 </listitem> 31 </listitem>
35 32
36 <listitem> 33 <listitem>
@@ -38,118 +35,126 @@
38 </listitem> 35 </listitem>
39 </itemizedlist> 36 </itemizedlist>
40 37
41 <para>The same division of levels of fault management can be seen in the 38 <para>The same division of levels for fault management can be seen in the
42 scope of the High Availability for OPNFV (Availability) project. OPNFV 39 scope of the High Availability for OPNFV ("Availability") project. OPNFV
43 also hosts the Doctor Project which is a fault management and maintenance 40 also hosts Doctor, a fault management and maintenance project designed to
44 project to develop and realize the consequent implementation for the OPNFV 41 develop and perform the consequent implementation of the OPNFV reference
45 reference platform.</para> 42 platform. These two projects complement each other.</para>
46 43
47 <para>These two projects complement each other.</para> 44 <para>The Availability project addresses H.A. requirements and solutions
48 45 from the perspective of the three levels mentioned above. It produces high
49 <para>The Availability project addresses HA requirement and solutions from 46 level requirements and API definitions for High Availability for OPNFV, a
50 the perspective of the three levels mentioned above and produces high 47 H.A. Gap Analysis Report for OpenStack and more recently, works on
51 level requirements and API definitions for High Availability of OPNFV, HA 48 optimizing existing OPNFV test frameworks, such as Yardstick, developing
52 Gap Analysis Report for OpenStack and more recently works on optimizing 49 test cases which realize H.A.-specific use-cases and scenarios derived
53 existing OPNFV test frameworks, such as Yardstick, and develops test cases 50 from the H.A. requirements.</para>
54 which realize HA specific use cases and scenarios such as derived from the 51
55 HA requirements.</para> 52 <para>The Doctor project aims to build fault management and maintenance
56 53 framework for the high availability of Network Services, on top of a
57 <para>The Doctor Project on the other hand aims to build fault management 54 virtualized infrastructure. The key feature is immediate notification of
58 and maintenance framework for high availability of Network Services on top 55 unavailability of virtualized resources from VIM, to process recovery of
59 of virtualized infrastructure; the key feature is immediate notification 56 VNFs on them. </para>
60 of unavailability of virtualized resources from VIM, to process recovery 57
61 of VNFs on them. The Doctor project has also collaborated with the 58 <para>The Doctor project has also collaborated with the Availability
62 Availability project on identifying gaps in upstream project, mainly 59 project on identifying gaps in upstream projects, such as but not
63 OpenStack but not exclusive, and has worked towards implementing missing 60 exclusively OpenStack. It has also worked towards implementing missing
64 features or improving the functionality, one good example being the Aodh 61 features and improving functionality, with a good example being the Aodh
65 event based alarms, which allows for fast notifications when certain 62 event based alarms, which allow for fast notifications when certain
66 predefined events occur. The Doctor project also produced an architecture 63 predefined events occur. </para>
67 design and a reference implementation based on opensource components, 64
68 which will be presented later on in this document.</para> 65 <para>The Doctor project also produced an architectural design and a
66 reference implementation based on opensource components, which will be
67 presented later on in this document.</para>
69 </section> 68 </section>
70 69
71 <section id="doctor_arch"> 70 <section id="doctor_arch">
72 <title>Doctor Architecture</title> 71 <title>Doctor Architecture</title>
73 72
74 <para>The Doctor documentation shows the detailed architecture for Fault 73 <para>The Doctor project documentation shows the detailed architecture for
75 Management and NFVI Maintenance . The two are very similar so we will 74 Fault Management and NFVI Maintenance. As the two are quite similar to each
76 focus on the Fault Management.</para> 75 other, the focus in the following sections shall remain on Fault
76 Management.</para>
77 77
78 <para>The architecture specifies a set of functional blocks:</para> 78 <para>The architecture specifies a set of functional blocks:</para>
79 79
80 <itemizedlist> 80 <itemizedlist>
81 <listitem> 81 <listitem>
82 <para>Monitor - monitors the virtualized infrastructure capturing 82 <para><emphasis role="bold">Monitor</emphasis> - monitors the
83 fault events in the Software and Hardware; for this particular 83 virtualized infrastructure, capturing fault events in software and
84 component we chose Zabbix which is integrated into the platform by 84 hardware. For this component we choose <emphasis
85 means of the Fuel Zabbix Plugin, available upstream.</para> 85 role="bold">Zabbix</emphasis> which is integrated into the platform
86 through the Fuel Zabbix Plugin, available upstream.</para>
86 </listitem> 87 </listitem>
87 88
88 <listitem> 89 <listitem>
89 <para>Inspector - this component is able to receive notifications from 90 <para><emphasis role="bold">Inspector</emphasis> - this component
90 Monitor components and also OpenStack core components, which allows it 91 receives notifications from Monitor components and OpenStack core
91 to create logic relationships between entities, identify affected 92 components, allowing it to create logical relationships between
92 resources when faults occur, and communicates with Controllers to 93 entities, identify affected resources when faults occur, and to
93 update the states of the virtual and physical resources. For this 94 communicate with Controllers in order to update the states of the
94 component ENEA NFV Core 1.0 makes use of Vitrage , an OpenStack 95 virtual and physical resources.</para>
95 related project used for Root Cause Analysis, which has been adapted 96
96 to server as a Doctor Inspector. The integration into the platform is 97 <para>For this component Enea NFV Core 1.0 makes use of Vitrage, an
97 realized with the help of a Fuel Plugin which has been developed 98 OpenStack related project used for Root Cause Analysis. The
98 internally by ENEA.</para> 99 integration into the platform is done with the help of a Fuel Plugin
100 which has been developed internally by Enea.</para>
99 </listitem> 101 </listitem>
100 102
101 <listitem> 103 <listitem>
102 <para>Controller - OpenStack core components act as Controllers, which 104 <para><emphasis role="bold">Controller - </emphasis>OpenStack core
103 are responsible for maintaining the resource map between physical and 105 components act as Controllers responsible for maintaining the resource
104 virtual resources, they accept update requests from the Inspector and 106 map between physical and virtual resources. They accept update
105 are responsible for sending failure event notifications to the 107 requests from the Inspector and are responsible for sending failure
106 Notifier. Components such as Nova, Neutron, Glance, Heat act as 108 event notifications to the Notifier. Components such as Nova, Neutron,
107 Controllers in the Doctor Architecture.</para> 109 Glance, and Heat, act as Controllers in the Doctor
110 Architecture.</para>
108 </listitem> 111 </listitem>
109 112
110 <listitem> 113 <listitem>
111 <para>Notifier - the focus of this component is on selecting and 114 <para><emphasis role="bold">Notifier</emphasis> - the focus of this
112 aggregating failure events received from the controller based on 115 component is on selecting and aggregating failure events received from
113 policies mandated by the Consumer. The role of the Notifier is 116 the controller, based on policies mandated by the Consumer. The role
114 accomplished by the Aodh component in OpenStack.</para> 117 of the Notifier is filled by the Aodh component in OpenStack.</para>
115 </listitem> 118 </listitem>
116 </itemizedlist> 119 </itemizedlist>
117 120
118 <para>Besides the Doctor components there are a couple other blocks 121 <para>Alongside the Doctor components, there are a few other blocks
119 mentioned in the architecture:</para> 122 mentioned:</para>
120 123
121 <itemizedlist> 124 <itemizedlist>
122 <listitem> 125 <listitem>
123 <para>Administrator - this represents the human role of administrating 126 <para><emphasis role="bold">Administrator</emphasis> - this represents
124 the platform by means of dedicated interfaces, either visual 127 the human role of administrating the platform by means of dedicated
125 dashboards, like OpenStack Horizon or Fuel Dashboard, or via CLI 128 interfaces. These can be visual dashboards like OpenStack Horizon or
126 tools, like the OpenStack unified CLI that can be accessed 129 Fuel Dashboard, or via CLI tools like the OpenStack unified CLI, that
127 traditionally from one of the servers that act as OpenStack Controller 130 can be accessed from one of the servers that act as OpenStack
128 nodes. In the case of ENEA NFV Core 1.0, the Administrator can also 131 Controller nodes. </para>
129 access the Zabbix dashboard for doing further configurations. The same 132
133 <para>In Enea NFV Core 1.0 the Administrator can also access the
134 Zabbix dashboard to perform supplementary configurations. The same
130 applies for the Vitrage tool, which comes with its own Horizon 135 applies for the Vitrage tool, which comes with its own Horizon
131 dashboard which enables the user to visually inspect the faults 136 dashboard, enabling the user to visually inspect the faults reported
132 reported by the monitoring tools and also creates visual 137 by the monitoring tools through visual representations of the virtual
133 representations of the virtual and physical resources, the 138 and physical resources, the relationships between them and the fault
134 relationships between them and the fault correlation. For Vitrage, 139 correlation. </para>
135 users will usually want to configure additional usecases and describe 140
136 relationships between components, via template files written in yaml 141 <para>For Vitrage, users will usually want to configure additional
137 format. More information about using Vitrage will be presented in a 142 use-cases and describe relationships between components via template
138 following section.</para> 143 files written in <literal>yaml</literal> format.</para>
139 </listitem> 144 </listitem>
140 145
141 <listitem> 146 <listitem>
142 <para>Consumer - this block is vaguely described in the Doctor 147 <para><emphasis role="bold">Consumer</emphasis> - this block is
143 Architecture and it's out of its scope. Doctor only deals with fault 148 vaguely described in the Doctor Architecture and is out of its current
144 detection and management, making sure faults are handled as soon as 149 scope. Doctor only deals with fault detection and management, but
145 possible after detection, identifies affected virtual resources and 150 since the actual VNFs are managed, according to the ETSI architecture,
146 updates the states of them, but since the actual VNFs are managed, 151 by a different entity, Doctor does not deal with recovery actions of
147 according to the ETSI architecture, by a different entity, Doctor does 152 the VNFs. The role of the Consumer thus falls to that of a VNF Manager
148 not deal with recovery actions of the VNFs. The role of the Consumer 153 and Orchestrator.</para>
149 thus falls in the task of a VNF Manager and Orchestrator. ENEA NFV 154
150 Core 1.0 provides VNF management capabilities using Tacker, which is 155 <para>Enea NFV Core 1.0 provides VNF management capabilities using
151 an OpenStack project that implements a generic VNF Manager and 156 Tacker, which is an OpenStack project that implements a generic VNF
152 Orchestrator according to the ETSI MANO Architectural 157 Manager and Orchestrator, according to the ETSI MANO Architectural
153 Framework.</para> 158 Framework.</para>
154 </listitem> 159 </listitem>
155 </itemizedlist> 160 </itemizedlist>
@@ -170,12 +175,12 @@
170 175
171 <para>The architecture described in the Doctor project has been 176 <para>The architecture described in the Doctor project has been
172 demonstrated in various PoCs and demos, but always using sample 177 demonstrated in various PoCs and demos, but always using sample
173 components for either the consumer or the monitor. ENEA has worked with 178 components for either the consumer or the monitor. Enea has worked with
174 upstream projects, Doctor and Vitrage, to realize the goals of the 179 upstream projects Doctor and Vitrage, to realize the goals of the Doctor
175 Doctor project by using real components, as described before.</para> 180 project by using real components as described above.</para>
176 181
177 <para>The two pictures below show a typical fault management scenario, 182 <para>The two pictures below show a typical fault management
178 as described in the Doctor documentation.</para> 183 scenario:</para>
179 184
180 <mediaobject> 185 <mediaobject>
181 <imageobject> 186 <imageobject>
@@ -189,70 +194,81 @@
189 </imageobject> 194 </imageobject>
190 </mediaobject> 195 </mediaobject>
191 196
192 <para>ENEA NFV Core 1.0 uses the same approach described above, but it's 197 <para>Enea NFV Core 1.0 uses the same approach described above:</para>
193 worth going through each step and detail them.</para>
194 198
195 <orderedlist> 199 <orderedlist>
196 <listitem> 200 <listitem>
197 <para>When creating a VNF, the user will have to enable the 201 <para>When creating a VNF, the user will have to enable the
198 monitoring capabilities of Tacker, by passing a template which 202 monitoring capabilities of Tacker by passing a template, which
199 specifies that an alarm will be created when the VM represented by 203 specifies that an alarm will be created when the VM represented by
200 this VNF changes state. The support for alarm monitoring in Tacker 204 this VNF changes state. The support for alarm monitoring in Tacker
201 is captured in the Alarm Monitoring Framework spec in OpenStack 205 is detailed in the Alarm Monitoring Framework spec in the OpenStack
202 documentation. In a few words, Tacker should be able to create a VNF 206 documentation.</para>
203 and then create an Aodh alarm of type event which triggers when the 207
204 instance is in state ERROR. The action to take when this event 208 <para>Tacker should be able to create a VNF and then an Aodh alarm
205 triggers is to perform an HTTP call, to an URL managed by Tacker. As 209 of type event, triggerable when the instance is in a state of ERROR.
206 a result of this action, Tacker can detect when an instance has 210 When this event is triggered, an HTTP call is performed to a URL managed
207 failed (for whatever reasons) and will respawn it somewhere 211 by Tacker. As a result of this action, Tacker can detect when an
208 else.</para> 212 instance has failed (for whatever reason) and will respawn it
213 somewhere else.</para>
209 </listitem> 214 </listitem>
210 215
211 <listitem> 216 <listitem>
212 <para>The subscribe response in this case is an empty operation, the 217 <para>The subscribed response in this case is an empty operation,
213 Notifier (Aodh) only has to confirm that the alarm has been 218 the Notifier (Aodh) only has to confirm that the alarm has been
214 created.</para> 219 created.</para>
215 </listitem> 220 </listitem>
216 221
217 <listitem> 222 <listitem>
218 <para>The NFVI sends monitoring events for resources the VIM has 223 <para>The NFVI sends monitoring events for the resources the VIM has
219 been subscribed to. Note: this subscription message exchange between 224 been subscribed to. </para>
220 the VIM and NFVI is not shown in this message flow. This steps is 225
221 related to Vitrage's capability of receiving notifications from 226 <note>
222 OpenStack services, at this moment Vitrage supports notifications 227 <para>This subscription message exchange between the VIM and NFVI
223 from nova.host, nova.instances, nova.zone, cinder.volume, 228 is not shown in this message flow. This step is related to
224 neutron.network, neutron.port and heat.stack OpenStack 229 Vitrage's capability of receiving notifications from OpenStack
225 datasources.</para> 230 services. At this moment Vitrage supports notifications from
231 <literal>nova.host</literal>, <literal>nova.instances</literal>,
232 <literal>nova.zone</literal>, <literal>cinder.volume</literal>,
233 <literal>neutron.network</literal>,
234 <literal>neutron.port</literal> and <literal>heat.stack</literal>
235 OpenStack datasources.</para>
236 </note>
226 </listitem> 237 </listitem>
227 238
228 <listitem> 239 <listitem>
229 <para>This steps describes faults being detected by Zabbix which are 240 <para>This step describes faults detected by Zabbix which are sent
230 sent to the Inspector (Vitrage) as soon as detected, using a push 241 to the Inspector (Vitrage) as soon as detected. This is done using a
231 approach by means of sending an AMQP message to a dedicated message 242 push approach by means of sending an AMQP message to a dedicated
232 queue managed by Vitrage. For example, if nova-compute fails on one 243 message queue managed by Vitrage. For example, if
233 of the compute nodes, Zabbix will format a message specifying all 244 <literal>nova-compute</literal> fails on one of the compute nodes,
234 the needed details needed for processing the fault, e.g. a 245 Zabbix will format a message specifying all the needed details
235 timestamp, what host failed, what event occurred and others.</para> 246 required for processing the fault: a timestamp, what host failed,
247 what event occurred etc.</para>
236 </listitem> 248 </listitem>
237 249
238 <listitem> 250 <listitem>
239 <para>Database lookup to find the virtual resources affected by the 251 <para>This step shows database lookup geared to find the virtual
240 detected fault. In this step Vitrage will perform various 252 resources affected by the detected fault. Vitrage will perform
241 calculations to detect what virtual resources are affected by the 253 various calculations to detect what virtual resources are affected
242 raw failure presented by Zabbix. Vitrage can be configured via 254 by the raw failure presented by Zabbix. </para>
243 templates to correlate instances with the physical hosts they are 255
244 running on, so that if a compute node fails, then instances running 256 <para>Vitrage can be configured via templates to correlate instances
245 on that host will be affected. A typical usecase is to mark the 257 with the physical hosts they are running on, so that if a compute
246 compute node down (a.k.a mark_host_down) and update the states of 258 node fails, then instances running on that host will be affected. A
247 all instances running on them, by issuing Nova API calls for each of 259 typical use-case is to mark the compute node down
248 these instances. Step 5c) shows the Controller (Nova in this case) 260 (<literal>mark_host_down</literal>) and update the states of all
249 acting upon the state change of the instance and issues an event 261 instances running on them. This is done by issuing Nova API calls
250 alarm to Aodh.</para> 262 for each of these instances. </para>
263
264 <para>Step 5c. shows the Controller (Nova in this case) acting upon
265 the state change of the instance and issuing an event alarm to
266 Aodh.</para>
251 </listitem> 267 </listitem>
252 268
253 <listitem> 269 <listitem>
254 <para>The Notifier will acknowledge the alarm event request from 270 <para>The Notifier will acknowledge the alarm event request from
255 Nova and will trigger the alarm(s) created by Tacker in step 1). 271 Nova and will trigger the alarm(s) created by Tacker in step 1.
256 Since Tacker has configured the alarm to send an HTTP request, Aodh 272 Since Tacker has configured the alarm to send an HTTP request, Aodh
257 will perform that HTTP call at the URL managed by Tacker.</para> 273 will perform that HTTP call at the URL managed by Tacker.</para>
258 </listitem> 274 </listitem>
@@ -268,7 +284,7 @@
268 </listitem> 284 </listitem>
269 </orderedlist> 285 </orderedlist>
270 286
271 <note> 287 <note condition="hidden">
272 <para>The ENEA NFV Core 1.0 Pre-Release fully covers the required 288 <para>The ENEA NFV Core 1.0 Pre-Release fully covers the required
273 Doctor functionality only for the Vitrage and Zabbix 289 Doctor functionality only for the Vitrage and Zabbix
274 components.</para> 290 components.</para>
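As a quick check of the flow above, the event alarms created by Tacker in step 1 can be listed on a controller node. This is a hedged sketch only: the openrc path and the alarm ID are placeholders, and the actual alarm names depend on the VNFD that was onboarded.

# List the Aodh event alarms and inspect the one tied to the VNF; the
# alarm ID below is a placeholder, not a value from a real deployment.
root@node-1:~# source /root/openrc
root@node-1:~# aodh alarm list
root@node-1:~# aodh alarm show <alarm-id>
root@node-1:~# aodh alarm-history show <alarm-id>   # shows when and why the alarm fired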
@@ -280,27 +296,25 @@
280 296
281 <para>Vitrage supports Zabbix datasource by means of regularly polling 297 <para>Vitrage supports Zabbix datasource by means of regularly polling
282 the Zabbix agents, which need to be configured in advance. The Vitrage 298 the Zabbix agents, which need to be configured in advance. The Vitrage
283 plugin developed internally by ENEA can automatically configure Zabbix 299 plugin developed internally by Enea can automatically configure Zabbix
284 so that everything works as expected.</para> 300 so that everything works as expected. Polling however, is not fast
285 301 enough for a telco use-case, so it is necessary to configure push
286 <para>However, polling is not fast enough for a telco usecase, so it is 302 notifications for Zabbix. This requires manual configuration on one of
287 necessary to configure pushed notifications for Zabbix . This requires 303 the controller nodes, since Zabbix uses a centralized database which
288 manual configuration on one of the controller nodes, since Zabbix uses a 304 makes the configuration available on all the other nodes.</para>
289 centralized database which makes the configuration available on all the
290 other nodes.</para>
291 305
292 <para>The Zabbix configuration dashboard is available at the same IP 306 <para>The Zabbix configuration dashboard is available at the same IP
293 address where OpenStack can be reached, e.g. 307 address where OpenStack can be reached, e.g.
294 http://&lt;vip__zbx_vip_mgmt&gt;/zabbix.</para> 308 <literal>http://&lt;vip__zbx_vip_mgmt&gt;/zabbix</literal>.</para>
295 309
296 <para>To forward zabbix events to Vitrage a new media script needs to be 310 <para>To forward zabbix events to Vitrage, a new media script needs to
297 created and associated with a user. Follow the steps below as a Zabbix 311 be created and associated with a user. Follow the steps below as a
298 Admin user:</para> 312 Zabbix Admin user:</para>
299 313
300 <orderedlist> 314 <orderedlist>
301 <listitem> 315 <listitem>
302 <para>Create a new media type [Admininstration Media Types Create 316 <para>Create a new media type [Administration &gt; Media Types &gt;
303 Media Type]</para> 317 Create Media Type]</para>
304 318
305 <itemizedlist> 319 <itemizedlist>
306 <listitem> 320 <listitem>
@@ -312,7 +326,7 @@
312 </listitem> 326 </listitem>
313 327
314 <listitem> 328 <listitem>
315 <para>Script name: zabbix_vitrage.py</para> 329 <para>Script name: <filename>zabbix_vitrage.py</filename></para>
316 </listitem> 330 </listitem>
317 </itemizedlist> 331 </itemizedlist>
318 </listitem> 332 </listitem>
@@ -327,14 +341,15 @@
327 </listitem> 341 </listitem>
328 342
329 <listitem> 343 <listitem>
330 <para>Send to: rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/ 344 <para>Send to:
331 --- Vitrage message bus url (you need to search for this in 345 <literal>rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/
332 /etc/vitrage/vitrage.conf or /etc/nova/nova.conf 346 ---</literal> Vitrage message bus url (you need to search for
333 transport_url)</para> 347 this in <literal>/etc/vitrage/vitrage.conf or
348 /etc/nova/nova.conf transport_url</literal>)</para>
334 </listitem> 349 </listitem>
335 350
336 <listitem> 351 <listitem>
337 <para>When active: 1-7,00:00-24:00</para> 352 <para>When active: 1-7, 00:00-24:00</para>
338 </listitem> 353 </listitem>
339 354
340 <listitem> 355 <listitem>
@@ -348,8 +363,8 @@
348 </listitem> 363 </listitem>
349 364
350 <listitem> 365 <listitem>
351 <para>Configure Action [Configuration Actions Create Action 366 <para>Configure Action [Configuration &gt; Actions &gt; Create
352 Action]</para> 367 Action &gt; Action]</para>
353 368
354 <itemizedlist> 369 <itemizedlist>
355 <listitem> 370 <listitem>
@@ -361,19 +376,25 @@
361 </listitem> 376 </listitem>
362 377
363 <listitem> 378 <listitem>
364 <para>Default Message: host={HOST.NAME1} hostid={HOST.ID1} 379 <para>Default Message:</para>
365 hostip={HOST.IP1} triggerid={TRIGGER.ID} 380
366 description={TRIGGER.NAME} rawtext={TRIGGER.NAME.ORIG} 381 <programlisting>host={HOST.NAME1}
367 expression={TRIGGER.EXPRESSION} value={TRIGGER.VALUE} 382hostid={HOST.ID1}
368 priority={TRIGGER.NSEVERITY} lastchange={EVENT.DATE} 383hostip={HOST.IP1}
369 {EVENT.TIME}</para> 384triggerid={TRIGGER.ID}
385description={TRIGGER.NAME}
386rawtext={TRIGGER.NAME.ORIG}
387expression={TRIGGER.EXPRESSION}
388value={TRIGGER.VALUE}
389priority={TRIGGER.NSEVERITY}
390lastchange={EVENT.DATE} {EVENT.TIME}</programlisting>
370 </listitem> 391 </listitem>
371 </itemizedlist> 392 </itemizedlist>
372 </listitem> 393 </listitem>
373 394
374 <listitem> 395 <listitem>
375 <para>To send events add under the Conditions tab: 'Maintenance 396 <para>To send events add under the <literal>Conditions</literal>
376 status not in 'maintenance'".</para> 397 tab: "Maintenance status not in "maintenance"".</para>
377 </listitem> 398 </listitem>
378 399
379 <listitem> 400 <listitem>
@@ -391,32 +412,34 @@
391 </listitem> 412 </listitem>
392 </orderedlist> 413 </orderedlist>
393 414
394 <para>Using these instructions, Zabbix will call the zabbix_vitrage.py 415 <para>Using these instructions, Zabbix will call the
395 script, which is made readily available by the Fuel Vitrage Plugin, 416 <literal>zabbix_vitrage.py</literal> script, made readily available by
396 passing the arguments described in step 3). The zabbix_vitrage.py script 417 the Fuel Vitrage Plugin, to pass the arguments described in step 3. The
397 will then interpret the parameters and format an AMQP message will be 418 <literal>zabbix_vitrage.py</literal> script will then interpret the
398 sent to the vitrage.notifications queue, which is managed by the 419 parameters and format an AMQP message to be sent to the
420 <literal>vitrage.notifications</literal> queue, managed by the
399 vitrage-graph service.</para> 421 vitrage-graph service.</para>
400 </section> 422 </section>
401 423
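Before relying on the push path end to end, the media script can be exercised by hand from a controller node. The sketch below is illustrative only: the script location, the transport URL and the key=value payload are assumptions rather than values taken from a deployment, and the argument layout follows the common Zabbix media-script convention (send-to, subject, message).

# Send a hand-crafted event towards the vitrage.notifications queue and
# check that RabbitMQ sees it (every value below is a placeholder).
root@node-1:~# python /usr/local/bin/zabbix_vitrage.py \
    'rabbit://rabbit_user:rabbit_pass@192.168.0.2:5672/' \
    'zabbix test' \
    'host=node-4.domain.tld hostid=10105 hostip=192.168.0.6 triggerid=13546 description=test rawtext=test expression=test value=1 priority=3 lastchange=2017.09.28 12:00:00'
root@node-1:~# rabbitmqctl list_queues | grep vitrage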
402 <section id="vitrage_config"> 424 <section id="vitrage_config">
403 <title>Vitrage Configuration</title> 425 <title>Vitrage Configuration</title>
404 426
405 <para>The Vitrage team has been collaborating with OPNFV Doctor Project 427 <para>The Vitrage team has been collaborating with the OPNFV Doctor
406 in order to support Vitrage as an Inspector Component. The Doctor 428 project in order to support Vitrage as an Inspector Component. The
407 usecase for Vitrage is described in an OpenStack blueprint . 429 Doctor use-case for Vitrage is described in an OpenStack blueprint. Enea
408 Additionally, ENEA NFV Core has complemented Vitrage with the capability 430 NFV Core has complemented Vitrage with the ability to set the states of
409 of setting states of failed instances by implementing an action type in 431 failed instances by implementing an action type in Vitrage. This action
410 Vitrage which calls Nova APIs to set instances in error state. There is 432 calls Nova APIs to set instances in error state. An action type which
411 also an action type which allows fencing failed hosts.</para> 433 allows fencing failed hosts also exists.</para>
412 434
413 <para>In order to make use of these features, Vitrage supports 435 <para>In order to make use of these features, Vitrage supports
414 additional configurations via yaml templates that must be placed in 436 additional configurations via <literal>yaml</literal> templates that
415 /etc/vitrage/templates on the nodes have the Vitrage role.</para> 437 must be placed in <literal>/etc/vitrage/templates</literal> on the nodes
438 that have the Vitrage role.</para>
416 439
417 <para>The example below shows how to program Vitrage to mark failed 440 <para>The example below shows how to program Vitrage to mark failed
418 compute hosts as down and then to change the state of the instances to 441 compute hosts as down and then to change the state of the instances to
419 Error, by creating Vitrage deduced alarms.</para> 442 ERROR, by creating Vitrage deduced alarms.</para>
420 443
421 <programlisting>metadata: 444 <programlisting>metadata:
422 name: test_nova_mark_instance_err 445 name: test_nova_mark_instance_err
@@ -466,7 +489,7 @@ scenarios:
466 properties: 489 properties:
467 state: ERROR</programlisting> 490 state: ERROR</programlisting>
468 491
469 <para>For the action type of fencing a similar action item must be 492 <para>For the action type of fencing, a similar action item must be
470 added:</para> 493 added:</para>
471 494
472 <programlisting>- scenario: 495 <programlisting>- scenario:
@@ -477,8 +500,9 @@ scenarios:
477 action_target: 500 action_target:
478 target: host</programlisting> 501 target: host</programlisting>
479 502
480 <para>After a template is configured, it is required to restart the 503 <para>After a template is configured, a restart of the
481 vitrage-api and vitrage-graph services:</para> 504 <literal>vitrage-api</literal> and <literal>vitrage-graph</literal>
505 services is needed:</para>
482 506
483 <programlisting>root@node-6:~# systemctl restart vitrage-api 507 <programlisting>root@node-6:~# systemctl restart vitrage-api
484root@node-6:~# systemctl restart vitrage-graph</programlisting> 508root@node-6:~# systemctl restart vitrage-graph</programlisting>
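Assuming the python-vitrageclient CLI is available on the controller, the template can be validated before the restart and the resulting deduced alarms inspected afterwards; the template filename below is illustrative.

# Validate the yaml template, confirm Vitrage loaded it, then watch for
# the deduced alarms it defines (the file name is a placeholder).
root@node-6:~# vitrage template validate --path /etc/vitrage/templates/test_nova_mark_instance_err.yaml
root@node-6:~# vitrage template list
root@node-6:~# vitrage alarm list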
@@ -487,12 +511,12 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting>
487 <section id="vitrage_custom"> 511 <section id="vitrage_custom">
488 <title>Vitrage Customizations</title> 512 <title>Vitrage Customizations</title>
489 513
490 <para>ENEA NFV Core 1.0 has added custom features for Vitrage which 514 <para>Enea NFV Core 1.0 has added custom features for Vitrage which
491 allow two kinds of action:</para> 515 allow two kinds of actions:</para>
492 516
493 <orderedlist> 517 <orderedlist>
494 <listitem> 518 <listitem>
495 <para>Perform actions Northbound of the VIM</para> 519 <para>Perform actions Northbound of the VIM:</para>
496 520
497 <itemizedlist> 521 <itemizedlist>
498 <listitem> 522 <listitem>
@@ -500,23 +524,23 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting>
500 </listitem> 524 </listitem>
501 525
502 <listitem> 526 <listitem>
503 <para>Setting instance state to error in nova; this is used in 527 <para>Setting instance state to ERROR in nova. This is used in
504 conjunction with an alarm created by Tacker, as described 528 conjunction with an alarm created by Tacker, as described
505 before, should allow Tacker to detect when an instance is 529 before, and should allow Tacker to detect when an instance is
506 affected and take proper actions.</para> 530 affected and take proper actions.</para>
507 </listitem> 531 </listitem>
508 </itemizedlist> 532 </itemizedlist>
509 </listitem> 533 </listitem>
510 534
511 <listitem> 535 <listitem>
512 <para>Perform actions Southbound of the VIM.</para> 536 <para>Perform actions Southbound of the VIM:</para>
513 537
514 <para>Vitrage templates allow us to program fencing actions for 538 <para>Vitrage templates allow us to program fencing actions for
515 hosts with failed services. In the event of that systemd is unable 539 hosts with failed services. In the event that
516 to recover from a critical process or other type of sofware error 540 <literal>systemd</literal> is unable to recover from a critical
517 ocurs on Hardware supporting them, we can program a fencing of that 541 process or a type of sofware error ocurs on the Hardware supporting
518 Node which will perform a reboot thus attempting to recover a failed 542 them, the fencing of Node can be programmed, and it in turn will
519 node.</para> 543 perform a reboot, attempting to recover the failed node.</para>
520 </listitem> 544 </listitem>
521 </orderedlist> 545 </orderedlist>
522 </section> 546 </section>
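The effect of these actions can be reproduced by hand with the Nova CLI, which is useful when verifying that Tacker reacts as expected. The commands below are a hedged sketch: the instance UUID and host name are placeholders, and they only mimic the outcome, not the code path Vitrage uses internally.

# Mimic the northbound actions manually (UUID and host are placeholders):
root@node-1:~# nova reset-state <instance-uuid>             # set the instance state to ERROR
root@node-1:~# nova reset-state --active <instance-uuid>    # set it back to ACTIVE
root@node-1:~# nova service-force-down node-4.domain.tld nova-compute   # equivalent of mark_host_down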
@@ -529,48 +553,49 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting>
529 characteristics employ pacemaker for achieving highly available OpenStack 553 characteristics employ pacemaker for achieving highly available OpenStack
530 services. Traditionally pacemaker has been used for managing only the 554 services. Traditionally pacemaker has been used for managing only the
531 control plane services, so it can effectively provide redundancy and 555 control plane services, so it can effectively provide redundancy and
532 recovery for the Controller nodes only. One reason for this is that 556 recovery for the Controller nodes only. A reason for this is that
533 Controller nodes and Compute nodes essentially have very different High 557 Controller nodes and Compute nodes essentially have very different High
534 Availability requirements that need to be considered. Typically, for 558 Availability requirements that need to be considered. </para>
535 Controller nodes, the services that run on them are stateless, with few 559
536 exceptions, where only one instance of a given service is allowed, but for 560 <para>Typically, for Controller nodes, the services that run on them are
537 which redundancy is still desired, one good example being an AMQP service 561 stateless, with few exceptions, where only one instance of a given service
538 (e.g. RabbitMQ). Compute nodes HA requirements depend on the type of 562 is allowed, but for which redundancy is still desired. A good example
539 services that run on them, but typically it is desired that failures on 563 would be an AMQP service (e.g. RabbitMQ). Compute nodes H.A. requirements
540 these nodes is detected as soon as possible so that the instances that run 564 depend on the type of services that run on them, but typically it is
541 on them can be either migrated, resurrected or restarted. One other aspect 565 desired that failures on these nodes be detected as soon as possible so
542 is that sometimes failures on the physical hosts do not necessarily cause 566 that the instances that run on them can be either migrated, resurrected or
543 a failure on the services (VNFs), but having these services incapacitated 567 restarted. Sometimes failures on the physical hosts do not necessarily
544 can prevent accessing and controlling the services.</para> 568 cause a failure on the services (VNFs), but having these services
545 569 incapacitated can prevent access to and controlling the services.</para>
546 <para>So Controller High Availability is one subject which is in general 570
547 well understood and experimented with, and the base of achieving this is 571 <para>Controller High Availability is thus a subject generally well
548 Pacemaker using Corosync underneath.</para> 572 understood and experimented with, and the basis for this is Pacemaker
573 using Corosync underneath.</para>
549 574
550 <para>Extending the use of pacemaker to Compute nodes was thought as a 575 <para>Extending the use of pacemaker to Compute nodes was thought as a
551 possible solution for providing VNF high availability, but this turns out 576 possible solution for providing VNF high availability, but the problem
552 to be a problem which is not easy to solve. On one hand pacemaker as a 577 turned out to be more complicated. On one hand, pacemaker as a clustering
553 clustering tool can only scale properly up to limited number of nodes, 578 tool, can only scale properly up to a limited number of nodes, usually
554 usually less than 128. This poses a problem for large scale deployments 579 less than 128. This poses a problem for large scale deployments where
555 where hundreds of compute nodes are required. On the other hand, Compute 580 hundreds of compute nodes are required. On the other hand, Compute node
556 node HA requires other considerations and calls for specially designed 581 H.A. requires other considerations and calls for specially designed
557 solutions.</para> 582 solutions.</para>
558 583
559 <section id="pm_remote"> 584 <section id="pm_remote">
560 <title>Pacemaker Remote</title> 585 <title>Pacemaker Remote</title>
561 586
562 <para>As mentioned earlier, pacemaker and corosync do not scale well 587 <para>As mentioned earlier, pacemaker and corosync do not scale well
563 over a large cluster, because each node has to talk to everyone, 588 over a large cluster, since each node has to talk to every other,
564 essentially creating a mesh configuration. Some solution to this problem 589 essentially creating a mesh configuration. A solution to this problem
565 could be partitioning the cluster into smaller groups, but this solution 590 could be partitioning the cluster into smaller groups, but this has its
566 has its limitation and it's generally difficult to manage.</para> 591 limitations and it is generally difficult to manage. </para>
567 592
568 <para>A better solution is using pacemaker-remote, a feature of 593 <para>A better solution is using <literal>pacemaker-remote</literal>, a
569 pacemaker which allows extending the cluster beyond the usual limits by 594 feature of pacemaker, which allows for extending the cluster beyond the
570 using the pacemaker monitoring capabilities, essentially creating a new 595 usual limits by using the pacemaker monitoring capabilities. It
571 type of resource which enables adding light weight nodes to the cluster. 596 essentially creates a new type of resource which enables adding light
572 More information about pacemaker-remote can be found on the official 597 weight nodes to the cluster. More information about pacemaker-remote can
573 clusterlabs website.</para> 598 be found on the official clusterlabs website.</para>
574 599
575 <para>Please note that at this moment pacemaker remote must be 600 <para>Please note that at this moment pacemaker remote must be
576 configured manually after deployment. Here are the manual steps for 601 configured manually after deployment. Here are the manual steps for
@@ -578,13 +603,13 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting>
578 603
579 <orderedlist> 604 <orderedlist>
580 <listitem> 605 <listitem>
581 <para>Logon to the Fuel Master using the default credentials if not 606 <para>Log onto the Fuel Master using the default credentials, if
582 changed (root/r00tme)</para> 607 they have not been changed (root/r00tme).</para>
583 </listitem> 608 </listitem>
584 609
585 <listitem> 610 <listitem>
586 <para>Type fuel node to obtain the list of nodes, their roles and 611 <para>Type fuel node to obtain the list of nodes, their roles and
587 the IP addresses</para> 612 the IP addresses.</para>
588 613
589 <programlisting>[root@fuel ~]# fuel node 614 <programlisting>[root@fuel ~]# fuel node
590id | status | name | cluster | ip | mac | roles / 615id | status | name | cluster | ip | mac | roles /
@@ -604,10 +629,10 @@ controller, vitrage | | 1 | 1</programlisting>
604 </listitem> 629 </listitem>
605 630
606 <listitem> 631 <listitem>
607 <para>Each controller has a unique pacemaker authkey, we need to 632 <para>Each controller has a unique pacemaker authkey. One needs to
608 keep one an propagate it to the other servers. Assuming node-1, 633 be kept and propagated to the other servers. Assuming node-1, node-2
609 node-2 and node-3 are the controllers, execute the following from 634 and node-3 are the controllers, execute the following from the Fuel
610 the Fuel console:</para> 635 console:</para>
611 636
612 <programlisting>[root@fuel ~]# scp node-1:/etc/pacemaker/authkey . 637 <programlisting>[root@fuel ~]# scp node-1:/etc/pacemaker/authkey .
613[root@fuel ~]# scp authkey node-2:/etc/pacemaker/ 638[root@fuel ~]# scp authkey node-2:/etc/pacemaker/
@@ -619,7 +644,7 @@ controller, vitrage | | 1 | 1</programlisting>
619 644
620 <listitem> 645 <listitem>
621 <para>For each compute node, log on to it using the corresponding 646 <para>For each compute node, log on to it using the corresponding
622 IP.</para> 647 IP</para>
623 </listitem> 648 </listitem>
624 649
625 <listitem> 650 <listitem>
@@ -629,7 +654,7 @@ controller, vitrage | | 1 | 1</programlisting>
629 </listitem> 654 </listitem>
630 655
631 <listitem> 656 <listitem>
632 <para>Copy the authkey from the Fuel master and make sure the right 657 <para>Copy the authkey from the Fuel Master and make sure the right
633 permissions are set:</para> 658 permissions are set:</para>
634 659
635 <programlisting>[root@node-4:~]# cp authkey /etc/pacemaker 660 <programlisting>[root@node-4:~]# cp authkey /etc/pacemaker
@@ -637,21 +662,22 @@ controller, vitrage | | 1 | 1</programlisting>
637 </listitem> 662 </listitem>
638 663
639 <listitem> 664 <listitem>
640 <para>Add iptables rule for the default port (3121). Also save it to 665 <para>Add an iptables rule for the default port (3121). Save it also
641 /etc/iptables/rules.v4 to make it persistent:</para> 666 to <literal>/etc/iptables/rules.v4</literal> to make it
667 persistent:</para>
642 668
643 <programlisting>root@node-4:~# iptables -A INPUT -s 192.168.0.0/24 -p tcp -m multiport / 669 <programlisting>root@node-4:~# iptables -A INPUT -s 192.168.0.0/24 -p tcp -m multiport /
644--dports 3121 -m comment --comment "pacemaker_remoted from 192.168.0.0/24" -j ACCEPT </programlisting> 670--dports 3121 -m comment --comment "pacemaker_remoted from 192.168.0.0/24" -j ACCEPT</programlisting>
645 </listitem> 671 </listitem>
646 672
647 <listitem> 673 <listitem>
648 <para>Start the pacemaker-remote service</para> 674 <para>Start the pacemaker-remote service:</para>
649 675
650 <programlisting>[root@node-4:~]# systemctl start pacemaker-remote.service</programlisting> 676 <programlisting>[root@node-4:~]# systemctl start pacemaker-remote.service</programlisting>
651 </listitem> 677 </listitem>
652 678
653 <listitem> 679 <listitem>
654 <para>Log on one of the controller nodes and configure the 680 <para>Log onto one of the controller nodes and configure the
655 pacemaker-remote resources:</para> 681 pacemaker-remote resources:</para>
656 682
657 <programlisting>[root@node-1:~]# pcs resource create node-4.domain.tld remote 683 <programlisting>[root@node-1:~]# pcs resource create node-4.domain.tld remote
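A complete remote-node resource definition typically also carries the address of the remote host and a monitor operation. The lines below are a hedged sketch with illustrative values, not the exact command from the guide.

# Illustrative only - the address, intervals and node name are placeholders:
[root@node-1:~]# pcs resource create node-4.domain.tld remote \
    server=192.168.0.6 reconnect_interval=60 op monitor interval=20
[root@node-1:~]# pcs status | grep -A1 RemoteOnline   # remote nodes should be listed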
@@ -685,20 +711,21 @@ RemoteOnline: [ node-4.domain.tld node-5.domain.tld ]</programlisting>
685 <title>Pacemaker Fencing</title> 711 <title>Pacemaker Fencing</title>
686 712
687 <para>ENEA NFV Core 1.0 makes use of the fencing capabilities of 713 <para>ENEA NFV Core 1.0 makes use of the fencing capabilities of
688 Pacemaker to isolate faulty nodes and trigger recovery actions by means 714 pacemaker to isolate faulty nodes and trigger recovery actions by means
689 of power cycling the failed nodes. Fencing is configured by creating 715 of power cycling the failed nodes. Fencing is configured by creating
690 STONITH type resources for each of the servers in the cluster, both 716 <literal>STONITH</literal> type resources for each of the servers in the
691 Controller nodes and Compute nodes. The STONITH adapter for fencing the 717 cluster, both Controller nodes and Compute nodes. The
692 nodes is fence_ipmilan, which makes use of the IPMI capabilities of the 718 <literal>STONITH</literal> adapter for fencing the nodes is
693 Cavium ThunderX servers.</para> 719 <literal>fence_ipmilan</literal>, which makes use of the IPMI
720 capabilities of the ThunderX servers.</para>
694 721
695 <para>Here are the steps for enabling fencing capabilities in the 722 <para>Here are the steps for enabling fencing capabilities on a
696 cluster:</para> 723 cluster:</para>
697 724
698 <orderedlist> 725 <orderedlist>
699 <listitem> 726 <listitem>
700 <para>Logon to the Fuel Master using the default credentials if not 727 <para>Log onto the Fuel Master using the default credentials, if
701 changed (root/r00tme).</para> 728 they have not been changed (root/r00tme).</para>
702 </listitem> 729 </listitem>
703 730
704 <listitem> 731 <listitem>
@@ -719,18 +746,17 @@ id | status | name | cluster | ip | mac | roles
719 2 | ready | Untitled (8b:64) | 1 | 10.20.0.3 | 68:05:ca:46:8b:64 | / 746 2 | ready | Untitled (8b:64) | 1 | 10.20.0.3 | 68:05:ca:46:8b:64 | /
720controller, mongo, tacker | | 1 | 1 747controller, mongo, tacker | | 1 | 1
721 3 | ready | Untitled (8c:45) | 1 | 10.20.0.5 | 68:05:ca:46:8c:45 | / 748 3 | ready | Untitled (8c:45) | 1 | 10.20.0.5 | 68:05:ca:46:8c:45 | /
722controller, vitrage | | 1 | 1 749controller, vitrage | | 1 | 1</programlisting>
723</programlisting>
724 </listitem> 750 </listitem>
725 751
726 <listitem> 752 <listitem>
727 <para>Logon to each server to install additional packages:</para> 753 <para>Log onto each server to install additional packages:</para>
728 754
729 <programlisting>[root@node-1:~]# apt-get install fence-agents ipmitool</programlisting> 755 <programlisting>[root@node-1:~]# apt-get install fence-agents ipmitool</programlisting>
730 </listitem> 756 </listitem>
731 757
732 <listitem> 758 <listitem>
733 <para>Configure pacemaker fencing resources; this needs to be done 759 <para>Configure pacemaker fencing resources. This needs to be done
734 once on one of the controllers. The parameters will vary, depending 760 once on one of the controllers. The parameters will vary, depending
735 on the BMC addresses of each node and credentials.</para> 761 on the BMC addresses of each node and credentials.</para>
736 762
@@ -752,9 +778,9 @@ ipaddr=10.0.100.155 login=ADMIN passwd=ADMIN op monitor interval="60s"</programl
752 </listitem> 778 </listitem>
753 779
754 <listitem> 780 <listitem>
755 <para>Activate fencing by enabling stonith property in pacemaker (by 781 <para>Activate fencing by enabling the <literal>stonith</literal>
756 default it is disabled); this also needs to be done only once, on 782 property in pacemaker (disabled by default). This also needs to be
757 one of the controllers.</para> 783 done only once, on one of the controllers.</para>
758 784
759 <programlisting>[root@node-1:~]# pcs property set stonith-enabled=true</programlisting> 785 <programlisting>[root@node-1:~]# pcs property set stonith-enabled=true</programlisting>
760 </listitem> 786 </listitem>
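For reference, a full STONITH resource for a single node could look like the hedged sketch below. The resource name, BMC address and credentials are placeholders, and whether lanplus is required depends on the BMC.

# Illustrative fence resource for one node (all values are placeholders):
[root@node-1:~]# pcs stonith create ipmi-fence-node-4 fence_ipmilan \
    pcmk_host_list="node-4.domain.tld" ipaddr=10.0.100.155 \
    login=ADMIN passwd=ADMIN lanplus=true op monitor interval=60s
[root@node-1:~]# pcs stonith show                     # resources should be Started
[root@node-1:~]# pcs property show stonith-enabled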
@@ -767,28 +793,25 @@ ipaddr=10.0.100.155 login=ADMIN passwd=ADMIN op monitor interval="60s"</programl
767 793
768 <para>The OpenStack community has been working for some time on 794 <para>The OpenStack community has been working for some time on
769 identifying possible solutions for enabling High Availability for Compute 795 identifying possible solutions for enabling High Availability for Compute
770 nodes, although initially the subject of HA on compute node was very 796 nodes, after a period of belief that this subject was not something that
771 controversial as not being something that should concern the cloud 797 should concern the cloud platform. Over time it became obvious that even
772 platform. Over time it became obvious that even on a true cloud platform, 798 on a true cloud platform, where services are designed to run without being
773 where services are designed to run without being affected by the 799 affected by the availability of the cloud platform, fault management and
774 availability of the cloud platform, fault management and recovery is still 800 recovery are still very important and desirable. This is also the case for
775 very important and desirable. This is very much the case for NFV 801 NFV applications, where in the good tradition of telecom applications, the
776 applications, where, in the good tradition of telecom applications, the 802 operators must have complete engineering control over the resources they
777 operators must have complete engineering control over the resources it 803 own and manage.</para>
778 owns and manages.</para> 804
779 805 <para>The work for Compute node High Availability is captured in an
780 <para>The work for compute node high availability is captured in an
781 OpenStack user story and documented upstream, showing proposed solutions, 806 OpenStack user story and documented upstream, showing proposed solutions,
782 summit talks and presentations.</para> 807 summit talks and presentations. A number of these solutions make use of
783 808 OpenStack Resource Agents, which are a set of specialized pacemaker
784 <para>A number of these solutions make use of OpenStack Resource Agents, 809 resources capable of identifying failures in compute nodes and can perform
785 which are basically a set of specialized pacemaker resources which are 810 automatic evacuation of the instances affected by these failures.</para>
786 capable of identifying failures in compute nodes and can perform automatic
787 evacuation of the instances affected by these failures.</para>
788 811
789 <para>ENEA NFV Core 1.0 aims to validate and integrate this work and to 812 <para>ENEA NFV Core 1.0 aims to validate and integrate this work and to
790 make this feature available in the platform to be used as an alternative 813 make this feature available in the platform as an alternative to the
791 to the Doctor framework, where simple, autonomous recovery of the running 814 Doctor framework, where simple, autonomous recovery of running instances
792 instances is desired.</para> 815 is desired.</para>
793 </section> 816 </section>
794</chapter> \ No newline at end of file 817</chapter> \ No newline at end of file