author     Miruna Paun <Miruna.Paun@enea.com>   2017-09-28 18:53:02 +0200
committer  Miruna Paun <Miruna.Paun@enea.com>   2017-09-28 18:53:02 +0200
commit     cc001420304566cd252f2c6323dec3a826a12954 (patch)
tree       299a124c12a96df1000e0b87753df9caef34b06a /book-enea-nfv-core-installation-guide/doc/high_availability.xml
parent     380e975b1b93e83705c8ed30197b1c23f8193814 (diff)
download   doc-enea-nfv-cc001420304566cd252f2c6323dec3a826a12954.tar.gz
Proofed entire installation guide, added all new needed images.
USERDOCAP-240
Signed-off-by: Miruna Paun <Miruna.Paun@enea.com>
Diffstat (limited to 'book-enea-nfv-core-installation-guide/doc/high_availability.xml')
-rw-r--r--   book-enea-nfv-core-installation-guide/doc/high_availability.xml   599
1 file changed, 311 insertions, 288 deletions
diff --git a/book-enea-nfv-core-installation-guide/doc/high_availability.xml b/book-enea-nfv-core-installation-guide/doc/high_availability.xml
index e489101..93f6468 100644
--- a/book-enea-nfv-core-installation-guide/doc/high_availability.xml
+++ b/book-enea-nfv-core-installation-guide/doc/high_availability.xml
@@ -2,27 +2,24 @@ | |||
2 | <chapter id="high_availability"> | 2 | <chapter id="high_availability"> |
3 | <title>High Availability Guide</title> | 3 | <title>High Availability Guide</title> |
4 | 4 | ||
5 | <para>ENEA NFV Core 1.0 has been designed to provide high availability | 5 | <para>Enea NFV Core 1.0 has been designed to provide high availability |
6 | characteristics that are needed for developing and deploying telco-grade NFV | 6 | characteristics that are needed for developing and deploying telco-grade NFV |
7 | solutions on top of our OPNFV based platform.</para> | 7 | solutions on top of our OPNFV based platform. The High Availability subject |
8 | 8 | in general is very wide and still an important focus in both opensource | |
9 | <para>The High Availability subject in general is very wide and still an | 9 | communities and the independent/proprietary solutions market.</para> |
10 | important focus in both opensource communities and independent/proprietary | 10 | |
11 | solutions market. ENEA NFV Core 1.0 aims to initially leverage the efforts | 11 | <para>Enea NFV Core 1.0 aims to initially leverage the efforts in the |
12 | in the upstream OPNFV and OpenStack opensource projects, combining solutions | 12 | upstream OPNFV and OpenStack opensource projects, combining solutions from |
13 | from both worlds in an effort to provide flexibility and a wide enough use | 13 | both worlds in an effort to provide flexibility and use-case coverage. Enea |
14 | case coverage. ENEA has a long time expertise and proprietary solutions | 14 | has long term expertise and proprietary solutions addressing High |
15 | addressing High Availability for telco applications, which are subject to | 15 | Availability for telco applications, which are subject to integration with |
16 | integrating with the NFV based solutions, however the initial scope for ENEA | 16 | the NFV based solutions.</para> |
17 | NFV Core is to leverage as much as possible the OPNFV Reference Platform and | ||
18 | open source projects in general, such as it will be seen further ahead in | ||
19 | this chapter.</para> | ||
20 | 17 | ||
21 | <section id="levels"> | 18 | <section id="levels"> |
22 | <title>High Availability Levels</title> | 19 | <title>High Availability Levels</title> |
23 | 20 | ||
24 | <para>The base for the feature set in ENEA NFV Core is divided into three | 21 | <para>The foundation for the feature set available in Enea NFV Core is |
25 | levels:</para> | 22 | divided into three levels:</para> |
26 | 23 | ||
27 | <itemizedlist> | 24 | <itemizedlist> |
28 | <listitem> | 25 | <listitem> |
@@ -30,7 +27,7 @@ | |||
30 | </listitem> | 27 | </listitem> |
31 | 28 | ||
32 | <listitem> | 29 | <listitem> |
33 | <para>NFV Platform HA</para> | 30 | <para>NFV Platform H.A.</para> |
34 | </listitem> | 31 | </listitem> |
35 | 32 | ||
36 | <listitem> | 33 | <listitem> |
@@ -38,118 +35,126 @@ | |||
38 | </listitem> | 35 | </listitem> |
39 | </itemizedlist> | 36 | </itemizedlist> |
40 | 37 | ||
41 | <para>The same division of levels of fault management can be seen in the | 38 | <para>The same division of levels for fault management can be seen in the |
42 | scope of the High Availability for OPNFV (Availability) project. OPNFV | 39 | scope of the High Availability for OPNFV ("Availability") project. OPNFV |
43 | also hosts the Doctor Project which is a fault management and maintenance | 40 | also hosts Doctor, a fault management and maintenance project designed to |
44 | project to develop and realize the consequent implementation for the OPNFV | 41 | develop and realize the consequent implementation for the OPNFV reference |
45 | reference platform.</para> | 42 | platform. These two projects complement each other.</para> |
46 | 43 | ||
47 | <para>These two projects complement each other.</para> | 44 | <para>The Availability project addresses H.A. requirements and solutions |
48 | 45 | from the perspective of the three levels mentioned above. It produces high | |
49 | <para>The Availability project addresses HA requirement and solutions from | 46 | level requirements and API definitions for High Availability for OPNFV, a |
50 | the perspective of the three levels mentioned above and produces high | 47 | H.A. Gap Analysis Report for OpenStack and more recently, works on |
51 | level requirements and API definitions for High Availability of OPNFV, HA | 48 | optimizing existing OPNFV test frameworks, such as Yardstick, developing |
52 | Gap Analysis Report for OpenStack and more recently works on optimizing | 49 | test cases which realize H.A.-specific use-cases and scenarios derived |
53 | existing OPNFV test frameworks, such as Yardstick, and develops test cases | 50 | from the H.A. requirements.</para> |
54 | which realize HA specific use cases and scenarios such as derived from the | 51 | |
55 | HA requirements.</para> | 52 | <para>The Doctor project aims to build a fault management and maintenance |
56 | 53 | framework for the high availability of Network Services, on top of a | |
57 | <para>The Doctor Project on the other hand aims to build fault management | 54 | virtualized infrastructure. The key feature is immediate notification of |
58 | and maintenance framework for high availability of Network Services on top | 55 | unavailability of virtualized resources from VIM, to process recovery of |
59 | of virtualized infrastructure; the key feature is immediate notification | 56 | VNFs on them. </para> |
60 | of unavailability of virtualized resources from VIM, to process recovery | 57 | |
61 | of VNFs on them. The Doctor project has also collaborated with the | 58 | <para>The Doctor project has also collaborated with the Availability |
62 | Availability project on identifying gaps in upstream project, mainly | 59 | project on identifying gaps in upstream projects, such as but not |
63 | OpenStack but not exclusive, and has worked towards implementing missing | 60 | exclusively OpenStack. It has also worked towards implementing missing |
64 | features or improving the functionality, one good example being the Aodh | 61 | features and improving functionality, with a good example being the Aodh |
65 | event based alarms, which allows for fast notifications when certain | 62 | event based alarms, which allow for fast notifications when certain |
66 | predefined events occur. The Doctor project also produced an architecture | 63 | predefined events occur. </para> |
67 | design and a reference implementation based on opensource components, | 64 | |
68 | which will be presented later on in this document.</para> | 65 | <para>The Doctor project also produced an architectural design and a |
66 | reference implementation based on opensource components, which will be | ||
67 | presented later on in this document.</para> | ||
69 | </section> | 68 | </section> |
70 | 69 | ||
71 | <section id="doctor_arch"> | 70 | <section id="doctor_arch"> |
72 | <title>Doctor Architecture</title> | 71 | <title>Doctor Architecture</title> |
73 | 72 | ||
74 | <para>The Doctor documentation shows the detailed architecture for Fault | 73 | <para>The Doctor project documentation shows the detailed architecture for |
75 | Management and NFVI Maintenance . The two are very similar so we will | 74 | Fault Management and NFVI Maintenance. Since the two are quite similar, |
76 | focus on the Fault Management.</para> | 75 | the focus in the following sections shall remain on Fault |
76 | Management.</para> | ||
77 | 77 | ||
78 | <para>The architecture specifies a set of functional blocks:</para> | 78 | <para>The architecture specifies a set of functional blocks:</para> |
79 | 79 | ||
80 | <itemizedlist> | 80 | <itemizedlist> |
81 | <listitem> | 81 | <listitem> |
82 | <para>Monitor - monitors the virtualized infrastructure capturing | 82 | <para><emphasis role="bold">Monitor</emphasis> - monitors the |
83 | fault events in the Software and Hardware; for this particular | 83 | virtualized infrastructure, capturing fault events in software and |
84 | component we chose Zabbix which is integrated into the platform by | 84 | hardware. For this component we choose <emphasis |
85 | means of the Fuel Zabbix Plugin, available upstream.</para> | 85 | role="bold">Zabbix</emphasis> which is integrated into the platform |
86 | through the Fuel Zabbix Plugin, available upstream.</para> | ||
86 | </listitem> | 87 | </listitem> |
87 | 88 | ||
88 | <listitem> | 89 | <listitem> |
89 | <para>Inspector - this component is able to receive notifications from | 90 | <para><emphasis role="bold">Inspector</emphasis> - this component |
90 | Monitor components and also OpenStack core components, which allows it | 91 | receives notifications from Monitor components and OpenStack core |
91 | to create logic relationships between entities, identify affected | 92 | components, allowing it to create logical relationships between |
92 | resources when faults occur, and communicates with Controllers to | 93 | entities, identify affected resources when faults occur, and to |
93 | update the states of the virtual and physical resources. For this | 94 | communicate with Controllers in order to update the states of the |
94 | component ENEA NFV Core 1.0 makes use of Vitrage , an OpenStack | 95 | virtual and physical resources.</para> |
95 | related project used for Root Cause Analysis, which has been adapted | 96 | |
96 | to server as a Doctor Inspector. The integration into the platform is | 97 | <para>For this component Enea NFV Core 1.0 makes use of Vitrage, an |
97 | realized with the help of a Fuel Plugin which has been developed | 98 | OpenStack related project used for Root Cause Analysis. The |
98 | internally by ENEA.</para> | 99 | integration into the platform is done with the help of a Fuel Plugin |
100 | which has been developed internally by Enea.</para> | ||
99 | </listitem> | 101 | </listitem> |
100 | 102 | ||
101 | <listitem> | 103 | <listitem> |
102 | <para>Controller - OpenStack core components act as Controllers, which | 104 | <para><emphasis role="bold">Controller</emphasis> - OpenStack core |
103 | are responsible for maintaining the resource map between physical and | 105 | components act as Controllers responsible for maintaining the resource |
104 | virtual resources, they accept update requests from the Inspector and | 106 | map between physical and virtual resources. They accept update |
105 | are responsible for sending failure event notifications to the | 107 | requests from the Inspector and are responsible for sending failure |
106 | Notifier. Components such as Nova, Neutron, Glance, Heat act as | 108 | event notifications to the Notifier. Components such as Nova, Neutron, |
107 | Controllers in the Doctor Architecture.</para> | 109 | Glance, and Heat act as Controllers in the Doctor |
110 | Architecture.</para> | ||
108 | </listitem> | 111 | </listitem> |
109 | 112 | ||
110 | <listitem> | 113 | <listitem> |
111 | <para>Notifier - the focus of this component is on selecting and | 114 | <para><emphasis role="bold">Notifier</emphasis> - the focus of this |
112 | aggregating failure events received from the controller based on | 115 | component is on selecting and aggregating failure events received from |
113 | policies mandated by the Consumer. The role of the Notifier is | 116 | the controller, based on policies mandated by the Consumer. The role |
114 | accomplished by the Aodh component in OpenStack.</para> | 117 | of the Notifier is filled by the Aodh component in OpenStack.</para> |
115 | </listitem> | 118 | </listitem> |
116 | </itemizedlist> | 119 | </itemizedlist> |
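<para>As a quick, illustrative way of checking this mapping on a deployed
system, the Inspector and the Notifier can be queried from one of the
controller nodes. The commands below are only a sketch: they assume the
python-vitrageclient and python-aodhclient CLIs and an admin credentials
file are present on the controller, and the exact subcommand names may
vary slightly between releases:</para>

<programlisting>root@node-1:~# source /root/openrc
root@node-1:~# vitrage topology show     # entity graph maintained by the Inspector (Vitrage)
root@node-1:~# vitrage alarm list        # alarms currently known to Vitrage
root@node-1:~# aodh alarm list           # alarms handled by the Notifier (Aodh)</programlisting>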
117 | 120 | ||
118 | <para>Besides the Doctor components there are a couple other blocks | 121 | <para>Alongside the Doctor components, there are a few other blocks |
119 | mentioned in the architecture:</para> | 122 | mentioned:</para> |
120 | 123 | ||
121 | <itemizedlist> | 124 | <itemizedlist> |
122 | <listitem> | 125 | <listitem> |
123 | <para>Administrator - this represents the human role of administrating | 126 | <para><emphasis role="bold">Administrator</emphasis> - this represents |
124 | the platform by means of dedicated interfaces, either visual | 127 | the human role of administrating the platform by means of dedicated |
125 | dashboards, like OpenStack Horizon or Fuel Dashboard, or via CLI | 128 | interfaces. These can be visual dashboards like OpenStack Horizon or |
126 | tools, like the OpenStack unified CLI that can be accessed | 129 | Fuel Dashboard, or CLI tools like the OpenStack unified CLI, which |
127 | traditionally from one of the servers that act as OpenStack Controller | 130 | can be accessed from one of the servers that act as OpenStack |
128 | nodes. In the case of ENEA NFV Core 1.0, the Administrator can also | 131 | Controller nodes. </para> |
129 | access the Zabbix dashboard for doing further configurations. The same | 132 | |
133 | <para>In Enea NFV Core 1.0 the Administrator can also access the | ||
134 | Zabbix dashboard to perform supplementary configurations. The same | ||
130 | applies for the Vitrage tool, which comes with its own Horizon | 135 | applies for the Vitrage tool, which comes with its own Horizon |
131 | dashboard which enables the user to visually inspect the faults | 136 | dashboard, enabling the user to visually inspect the faults reported |
132 | reported by the monitoring tools and also creates visual | 137 | by the monitoring tools through visual representations of the virtual |
133 | representations of the virtual and physical resources, the | 138 | and physical resources, the relationships between them and the fault |
134 | relationships between them and the fault correlation. For Vitrage, | 139 | correlation. </para> |
135 | users will usually want to configure additional usecases and describe | 140 | |
136 | relationships between components, via template files written in yaml | 141 | <para>For Vitrage, users will usually want to configure additional |
137 | format. More information about using Vitrage will be presented in a | 142 | use-cases and describe relationships between components via template |
138 | following section.</para> | 143 | files written in <literal>yaml</literal> format.</para> |
139 | </listitem> | 144 | </listitem> |
140 | 145 | ||
141 | <listitem> | 146 | <listitem> |
142 | <para>Consumer - this block is vaguely described in the Doctor | 147 | <para><emphasis role="bold">Consumer</emphasis> - this block is |
143 | Architecture and it's out of its scope. Doctor only deals with fault | 148 | vaguely described in the Doctor Architecture and is out of its current |
144 | detection and management, making sure faults are handled as soon as | 149 | scope. Doctor only deals with fault detection and management, but |
145 | possible after detection, identifies affected virtual resources and | 150 | since the actual VNFs are managed, according to the ETSI architecture, |
146 | updates the states of them, but since the actual VNFs are managed, | 151 | by a different entity, Doctor does not deal with recovery actions of |
147 | according to the ETSI architecture, by a different entity, Doctor does | 152 | the VNFs. The role of the Consumer thus falls to that of a VNF Manager |
148 | not deal with recovery actions of the VNFs. The role of the Consumer | 153 | and Orchestrator.</para> |
149 | thus falls in the task of a VNF Manager and Orchestrator. ENEA NFV | 154 | |
150 | Core 1.0 provides VNF management capabilities using Tacker, which is | 155 | <para>Enea NFV Core 1.0 provides VNF management capabilities using |
151 | an OpenStack project that implements a generic VNF Manager and | 156 | Tacker, which is an OpenStack project that implements a generic VNF |
152 | Orchestrator according to the ETSI MANO Architectural | 157 | Manager and Orchestrator, according to the ETSI MANO Architectural |
153 | Framework.</para> | 158 | Framework.</para> |
154 | </listitem> | 159 | </listitem> |
155 | </itemizedlist> | 160 | </itemizedlist> |
@@ -170,12 +175,12 @@ | |||
170 | 175 | ||
171 | <para>The architecture described in the Doctor project has been | 176 | <para>The architecture described in the Doctor project has been |
172 | demonstrated in various PoCs and demos, but always using sample | 177 | demonstrated in various PoCs and demos, but always using sample |
173 | components for either the consumer or the monitor. ENEA has worked with | 178 | components for either the consumer or the monitor. Enea has worked with |
174 | upstream projects, Doctor and Vitrage, to realize the goals of the | 179 | upstream projects Doctor and Vitrage, to realize the goals of the Doctor |
175 | Doctor project by using real components, as described before.</para> | 180 | project by using real components as described above.</para> |
176 | 181 | ||
177 | <para>The two pictures below show a typical fault management scenario, | 182 | <para>The two pictures below show a typical fault management |
178 | as described in the Doctor documentation.</para> | 183 | scenario:</para> |
179 | 184 | ||
180 | <mediaobject> | 185 | <mediaobject> |
181 | <imageobject> | 186 | <imageobject> |
@@ -189,70 +194,81 @@ | |||
189 | </imageobject> | 194 | </imageobject> |
190 | </mediaobject> | 195 | </mediaobject> |
191 | 196 | ||
192 | <para>ENEA NFV Core 1.0 uses the same approach described above, but it's | 197 | <para>Enea NFV Core 1.0 uses the same approach described above:</para> |
193 | worth going through each step and detail them.</para> | ||
194 | 198 | ||
195 | <orderedlist> | 199 | <orderedlist> |
196 | <listitem> | 200 | <listitem> |
197 | <para>When creating a VNF, the user will have to enable the | 201 | <para>When creating a VNF, the user will have to enable the |
198 | monitoring capabilities of Tacker, by passing a template which | 202 | monitoring capabilities of Tacker by passing a template, which |
199 | specifies that an alarm will be created when the VM represented by | 203 | specifies that an alarm will be created when the VM represented by |
200 | this VNF changes state. The support for alarm monitoring in Tacker | 204 | this VNF changes state. The support for alarm monitoring in Tacker |
201 | is captured in the Alarm Monitoring Framework spec in OpenStack | 205 | is detailed in the Alarm Monitoring Framework spec in the OpenStack |
202 | documentation. In a few words, Tacker should be able to create a VNF | 206 | documentation.</para> |
203 | and then create an Aodh alarm of type event which triggers when the | 207 | |
204 | instance is in state ERROR. The action to take when this event | 208 | <para>Tacker should be able to create a VNF and then an Aodh alarm |
205 | triggers is to perform an HTTP call, to an URL managed by Tacker. As | 209 | of type event, triggerable when the instance is in a state of ERROR. |
206 | a result of this action, Tacker can detect when an instance has | 210 | When this event is triggered perform an HTTP call to a URL managed |
207 | failed (for whatever reasons) and will respawn it somewhere | 211 | by Tacker. As a result of this action, Tacker can detect when an |
208 | else.</para> | 212 | instance has failed (for whatever reason) and will respawn it |
213 | somewhere else.</para> | ||
209 | </listitem> | 214 | </listitem> |
210 | 215 | ||
211 | <listitem> | 216 | <listitem> |
212 | <para>The subscribe response in this case is an empty operation, the | 217 | <para>The subscribe response in this case is an empty operation; |
213 | Notifier (Aodh) only has to confirm that the alarm has been | 218 | the Notifier (Aodh) only has to confirm that the alarm has been |
214 | created.</para> | 219 | created.</para> |
215 | </listitem> | 220 | </listitem> |
216 | 221 | ||
217 | <listitem> | 222 | <listitem> |
218 | <para>The NFVI sends monitoring events for resources the VIM has | 223 | <para>The NFVI sends monitoring events for the resources the VIM has |
219 | been subscribed to. Note: this subscription message exchange between | 224 | been subscribed to. </para> |
220 | the VIM and NFVI is not shown in this message flow. This steps is | 225 | |
221 | related to Vitrage's capability of receiving notifications from | 226 | <note> |
222 | OpenStack services, at this moment Vitrage supports notifications | 227 | <para>This subscription message exchange between the VIM and NFVI |
223 | from nova.host, nova.instances, nova.zone, cinder.volume, | 228 | is not shown in this message flow. This step is related to |
224 | neutron.network, neutron.port and heat.stack OpenStack | 229 | Vitrage's capability of receiving notifications from OpenStack |
225 | datasources.</para> | 230 | services. At this moment Vitrage supports notifications from |
231 | <literal>nova.host</literal>, <literal>nova.instances</literal>, | ||
232 | <literal>nova.zone</literal>, <literal>cinder.volume</literal>, | ||
233 | <literal>neutron.network</literal>, | ||
234 | <literal>neutron.port</literal> and <literal>heat.stack</literal> | ||
235 | OpenStack datasources.</para> | ||
236 | </note> | ||
226 | </listitem> | 237 | </listitem> |
227 | 238 | ||
228 | <listitem> | 239 | <listitem> |
229 | <para>This steps describes faults being detected by Zabbix which are | 240 | <para>This step describes faults detected by Zabbix which are sent |
230 | sent to the Inspector (Vitrage) as soon as detected, using a push | 241 | to the Inspector (Vitrage) as soon as detected. This is done using a |
231 | approach by means of sending an AMQP message to a dedicated message | 242 | push approach by means of sending an AMQP message to a dedicated |
232 | queue managed by Vitrage. For example, if nova-compute fails on one | 243 | message queue managed by Vitrage. For example, if |
233 | of the compute nodes, Zabbix will format a message specifying all | 244 | <literal>nova-compute</literal> fails on one of the compute nodes, |
234 | the needed details needed for processing the fault, e.g. a | 245 | Zabbix will format a message specifying all the details |
235 | timestamp, what host failed, what event occurred and others.</para> | 246 | required for processing the fault: a timestamp, what host failed, |
247 | what event occurred etc.</para> | ||
236 | </listitem> | 248 | </listitem> |
237 | 249 | ||
238 | <listitem> | 250 | <listitem> |
239 | <para>Database lookup to find the virtual resources affected by the | 251 | <para>This step shows a database lookup to find the virtual |
240 | detected fault. In this step Vitrage will perform various | 252 | resources affected by the detected fault. Vitrage will perform |
241 | calculations to detect what virtual resources are affected by the | 253 | various calculations to detect what virtual resources are affected |
242 | raw failure presented by Zabbix. Vitrage can be configured via | 254 | by the raw failure presented by Zabbix. </para> |
243 | templates to correlate instances with the physical hosts they are | 255 | |
244 | running on, so that if a compute node fails, then instances running | 256 | <para>Vitrage can be configured via templates to correlate instances |
245 | on that host will be affected. A typical usecase is to mark the | 257 | with the physical hosts they are running on, so that if a compute |
246 | compute node down (a.k.a mark_host_down) and update the states of | 258 | node fails, then instances running on that host will be affected. A |
247 | all instances running on them, by issuing Nova API calls for each of | 259 | typical use-case is to mark the compute node down |
248 | these instances. Step 5c) shows the Controller (Nova in this case) | 260 | (<literal>mark_host_down</literal>) and update the states of all |
249 | acting upon the state change of the instance and issues an event | 261 | instances running on it. This is done by issuing Nova API calls |
250 | alarm to Aodh.</para> | 262 | for each of these instances. </para> |
263 | |||
264 | <para>Step 5c. shows the Controller (Nova in this case) acting upon | ||
265 | the state change of the instance and issuing an event alarm to | ||
266 | Aodh.</para> | ||
251 | </listitem> | 267 | </listitem> |
252 | 268 | ||
253 | <listitem> | 269 | <listitem> |
254 | <para>The Notifier will acknowledge the alarm event request from | 270 | <para>The Notifier will acknowledge the alarm event request from |
255 | Nova and will trigger the alarm(s) created by Tacker in step 1). | 271 | Nova and will trigger the alarm(s) created by Tacker in step 1. |
256 | Since Tacker has configured the alarm to send an HTTP request, Aodh | 272 | Since Tacker has configured the alarm to send an HTTP request, Aodh |
257 | will perform that HTTP call at the URL managed by Tacker.</para> | 273 | will perform that HTTP call at the URL managed by Tacker.</para> |
258 | </listitem> | 274 | </listitem> |
@@ -268,7 +284,7 @@ | |||
268 | </listitem> | 284 | </listitem> |
269 | </orderedlist> | 285 | </orderedlist> |
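<para>For reference, the event alarm that Tacker creates in step 1 can also
be built by hand with the Aodh CLI. The command below is only an
illustrative sketch: the instance UUID and the webhook URL are
placeholders, and the exact event-type and query syntax may differ between
OpenStack releases:</para>

<programlisting>root@node-1:~# aodh alarm create --name vnf-vm-error --type event \
  --event-type "compute.instance.update" \
  --query 'traits.instance_id=string::<instance-uuid>;traits.state=string::error' \
  --alarm-action 'http://<tacker-webhook-url>'</programlisting>

<para>When Nova emits the matching event for the instance, Aodh fires the
alarm and performs the HTTP call, which is how Tacker learns that the VNF
needs to be respawned.</para>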
270 | 286 | ||
271 | <note> | 287 | <note condition="hidden"> |
272 | <para>The ENEA NFV Core 1.0 Pre-Release fully covers the required | 288 | <para>The ENEA NFV Core 1.0 Pre-Release fully covers the required |
273 | Doctor functionality only for the Vitrage and Zabbix | 289 | Doctor functionality only for the Vitrage and Zabbix |
274 | components.</para> | 290 | components.</para> |
@@ -280,27 +296,25 @@ | |||
280 | 296 | ||
281 | <para>Vitrage supports Zabbix datasource by means of regularly polling | 297 | <para>Vitrage supports Zabbix datasource by means of regularly polling |
282 | the Zabbix agents, which need to be configured in advance. The Vitrage | 298 | the Zabbix agents, which need to be configured in advance. The Vitrage |
283 | plugin developed internally by ENEA can automatically configure Zabbix | 299 | plugin developed internally by Enea can automatically configure Zabbix |
284 | so that everything works as expected.</para> | 300 | so that everything works as expected. Polling, however, is not fast |
285 | 301 | enough for a telco use-case, so it is necessary to configure push | |
286 | <para>However, polling is not fast enough for a telco usecase, so it is | 302 | notifications for Zabbix. This requires manual configuration on one of |
287 | necessary to configure pushed notifications for Zabbix . This requires | 303 | the controller nodes, since Zabbix uses a centralized database which |
288 | manual configuration on one of the controller nodes, since Zabbix uses a | 304 | makes the configuration available on all the other nodes.</para> |
289 | centralized database which makes the configuration available on all the | ||
290 | other nodes.</para> | ||
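<para>For reference, a Zabbix polling datasource is declared in
<literal>/etc/vitrage/vitrage.conf</literal> roughly as shown below. The
values are placeholders and the exact option names may differ between
Vitrage releases; the Fuel Vitrage plugin normally writes this
configuration automatically:</para>

<programlisting>[datasources]
types = zabbix,nova.host,nova.instance,nova.zone,cinder.volume,neutron.network,neutron.port,heat.stack

[zabbix]
url = http://<vip__zbx_vip_mgmt>/zabbix
user = admin
password = zabbix
config_file = /etc/vitrage/zabbix_conf.yaml</programlisting>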
291 | 305 | ||
292 | <para>The Zabbix configuration dashboard is available at the same IP | 306 | <para>The Zabbix configuration dashboard is available at the same IP |
293 | address where OpenStack can be reached, e.g. | 307 | address where OpenStack can be reached, e.g. |
294 | http://<vip__zbx_vip_mgmt>/zabbix.</para> | 308 | <literal>http://<vip__zbx_vip_mgmt>/zabbix</literal>.</para> |
295 | 309 | ||
296 | <para>To forward zabbix events to Vitrage a new media script needs to be | 310 | <para>To forward Zabbix events to Vitrage, a new media script needs to |
297 | created and associated with a user. Follow the steps below as a Zabbix | 311 | be created and associated with a user. Follow the steps below as a |
298 | Admin user:</para> | 312 | Zabbix Admin user:</para> |
299 | 313 | ||
300 | <orderedlist> | 314 | <orderedlist> |
301 | <listitem> | 315 | <listitem> |
302 | <para>Create a new media type [Admininstration Media Types Create | 316 | <para>Create a new media type [Administration > Media Types > |
303 | Media Type]</para> | 317 | Create Media Type]</para> |
304 | 318 | ||
305 | <itemizedlist> | 319 | <itemizedlist> |
306 | <listitem> | 320 | <listitem> |
@@ -312,7 +326,7 @@ | |||
312 | </listitem> | 326 | </listitem> |
313 | 327 | ||
314 | <listitem> | 328 | <listitem> |
315 | <para>Script name: zabbix_vitrage.py</para> | 329 | <para>Script name: <filename>zabbix_vitrage.py</filename></para> |
316 | </listitem> | 330 | </listitem> |
317 | </itemizedlist> | 331 | </itemizedlist> |
318 | </listitem> | 332 | </listitem> |
@@ -327,14 +341,15 @@ | |||
327 | </listitem> | 341 | </listitem> |
328 | 342 | ||
329 | <listitem> | 343 | <listitem> |
330 | <para>Send to: rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/ | 344 | <para>Send to: |
331 | --- Vitrage message bus url (you need to search for this in | 345 | <literal>rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/</literal>, |
332 | /etc/vitrage/vitrage.conf or /etc/nova/nova.conf | 346 | i.e. the Vitrage message bus URL (you need to search for |
333 | transport_url)</para> | 347 | this in <literal>/etc/vitrage/vitrage.conf</literal> or |
348 | <literal>/etc/nova/nova.conf</literal> under <literal>transport_url</literal>)</para> |
334 | </listitem> | 349 | </listitem> |
335 | 350 | ||
336 | <listitem> | 351 | <listitem> |
337 | <para>When active: 1-7,00:00-24:00</para> | 352 | <para>When active: 1-7, 00:00-24:00</para> |
338 | </listitem> | 353 | </listitem> |
339 | 354 | ||
340 | <listitem> | 355 | <listitem> |
@@ -348,8 +363,8 @@ | |||
348 | </listitem> | 363 | </listitem> |
349 | 364 | ||
350 | <listitem> | 365 | <listitem> |
351 | <para>Configure Action [Configuration Actions Create Action | 366 | <para>Configure Action [Configuration > Actions > Create |
352 | Action]</para> | 367 | Action > Action]</para> |
353 | 368 | ||
354 | <itemizedlist> | 369 | <itemizedlist> |
355 | <listitem> | 370 | <listitem> |
@@ -361,19 +376,25 @@ | |||
361 | </listitem> | 376 | </listitem> |
362 | 377 | ||
363 | <listitem> | 378 | <listitem> |
364 | <para>Default Message: host={HOST.NAME1} hostid={HOST.ID1} | 379 | <para>Default Message:</para> |
365 | hostip={HOST.IP1} triggerid={TRIGGER.ID} | 380 | |
366 | description={TRIGGER.NAME} rawtext={TRIGGER.NAME.ORIG} | 381 | <programlisting>host={HOST.NAME1} |
367 | expression={TRIGGER.EXPRESSION} value={TRIGGER.VALUE} | 382 | hostid={HOST.ID1} |
368 | priority={TRIGGER.NSEVERITY} lastchange={EVENT.DATE} | 383 | hostip={HOST.IP1} |
369 | {EVENT.TIME}</para> | 384 | triggerid={TRIGGER.ID} |
385 | description={TRIGGER.NAME} | ||
386 | rawtext={TRIGGER.NAME.ORIG} | ||
387 | expression={TRIGGER.EXPRESSION} | ||
388 | value={TRIGGER.VALUE} | ||
389 | priority={TRIGGER.NSEVERITY} | ||
390 | lastchange={EVENT.DATE} {EVENT.TIME}</programlisting> | ||
370 | </listitem> | 391 | </listitem> |
371 | </itemizedlist> | 392 | </itemizedlist> |
372 | </listitem> | 393 | </listitem> |
373 | 394 | ||
374 | <listitem> | 395 | <listitem> |
375 | <para>To send events add under the Conditions tab: 'Maintenance | 396 | <para>To send events, add under the <literal>Conditions</literal> |
376 | status not in 'maintenance'".</para> | 397 | tab: "Maintenance status not in 'maintenance'".</para> |
377 | </listitem> | 398 | </listitem> |
378 | 399 | ||
379 | <listitem> | 400 | <listitem> |
@@ -391,32 +412,34 @@ | |||
391 | </listitem> | 412 | </listitem> |
392 | </orderedlist> | 413 | </orderedlist> |
393 | 414 | ||
394 | <para>Using these instructions, Zabbix will call the zabbix_vitrage.py | 415 | <para>Using these instructions, Zabbix will call the |
395 | script, which is made readily available by the Fuel Vitrage Plugin, | 416 | <literal>zabbix_vitrage.py</literal> script, made readily available by |
396 | passing the arguments described in step 3). The zabbix_vitrage.py script | 417 | the Fuel Vitrage Plugin, to pass the arguments described in step 3. The |
397 | will then interpret the parameters and format an AMQP message will be | 418 | <literal>zabbix_vitrage.py</literal> script will then interpret the |
398 | sent to the vitrage.notifications queue, which is managed by the | 419 | parameters and format an AMQP message to be sent to the |
420 | <literal>vitrage.notifications</literal> queue, managed by the | ||
399 | vitrage-graph service.</para> | 421 | vitrage-graph service.</para> |
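<para>A simple way to exercise this path end to end is to invoke the media
script by hand, with the same three arguments Zabbix passes to alert
scripts (send-to address, subject, message body). The example is a sketch
only; the script location and all field values are placeholders:</para>

<programlisting>root@node-1:~# python /usr/local/bin/zabbix_vitrage.py \
  'rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/' 'PROBLEM' \
  'host=node-4.domain.tld hostid=10105 hostip=192.168.0.5 triggerid=13500 description=nova-compute process is down rawtext=nova-compute process is down expression={node-4:proc.num[nova-compute].last()}=0 value=1 priority=4 lastchange=2017.09.28 12:00:00'</programlisting>

<para>If the wiring is correct, a corresponding entity change appears in
the Vitrage entity graph shortly afterwards.</para>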
400 | </section> | 422 | </section> |
401 | 423 | ||
402 | <section id="vitrage_config"> | 424 | <section id="vitrage_config"> |
403 | <title>Vitrage Configuration</title> | 425 | <title>Vitrage Configuration</title> |
404 | 426 | ||
405 | <para>The Vitrage team has been collaborating with OPNFV Doctor Project | 427 | <para>The Vitrage team has been collaborating with the OPNFV Doctor |
406 | in order to support Vitrage as an Inspector Component. The Doctor | 428 | project in order to support Vitrage as an Inspector Component. The |
407 | usecase for Vitrage is described in an OpenStack blueprint . | 429 | Doctor use-case for Vitrage is described in an OpenStack blueprint. Enea |
408 | Additionally, ENEA NFV Core has complemented Vitrage with the capability | 430 | NFV Core has complemented Vitrage with the ability to set the states of |
409 | of setting states of failed instances by implementing an action type in | 431 | failed instances by implementing an action type in Vitrage. This action |
410 | Vitrage which calls Nova APIs to set instances in error state. There is | 432 | calls Nova APIs to set instances in error state. An action type which |
411 | also an action type which allows fencing failed hosts.</para> | 433 | allows fencing failed hosts also exists.</para> |
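<para>The Nova side of these actions can be reproduced manually with the
Nova CLI, which is useful when verifying the behaviour. This is a sketch
only, using placeholder names and assuming admin credentials are
loaded:</para>

<programlisting>root@node-1:~# source /root/openrc
root@node-1:~# nova service-force-down node-4.domain.tld nova-compute   # mark the failed host as down
root@node-1:~# nova reset-state <instance-uuid>                         # set the instance state to ERROR (the default)</programlisting>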
412 | 434 | ||
413 | <para>In order to make use of these features, Vitrage supports | 435 | <para>In order to make use of these features, Vitrage supports |
414 | additional configurations via yaml templates that must be placed in | 436 | additional configurations via <literal>yaml</literal> templates that |
415 | /etc/vitrage/templates on the nodes have the Vitrage role.</para> | 437 | must be placed in <literal>/etc/vitrage/templates</literal> on the nodes that |
438 | have the Vitrage role.</para> | ||
416 | 439 | ||
417 | <para>The example below shows how to program Vitrage to mark failed | 440 | <para>The example below shows how to program Vitrage to mark failed |
418 | compute hosts as down and then to change the state of the instances to | 441 | compute hosts as down and then to change the state of the instances to |
419 | Error, by creating Vitrage deduced alarms.</para> | 442 | ERROR, by creating Vitrage deduced alarms.</para> |
420 | 443 | ||
421 | <programlisting>metadata: | 444 | <programlisting>metadata: |
422 | name: test_nova_mark_instance_err | 445 | name: test_nova_mark_instance_err |
@@ -466,7 +489,7 @@ scenarios: | |||
466 | properties: | 489 | properties: |
467 | state: ERROR</programlisting> | 490 | state: ERROR</programlisting> |
468 | 491 | ||
469 | <para>For the action type of fencing a similar action item must be | 492 | <para>For the action type of fencing, a similar action item must be |
470 | added:</para> | 493 | added:</para> |
471 | 494 | ||
472 | <programlisting>- scenario: | 495 | <programlisting>- scenario: |
@@ -477,8 +500,9 @@ scenarios: | |||
477 | action_target: | 500 | action_target: |
478 | target: host</programlisting> | 501 | target: host</programlisting> |
479 | 502 | ||
480 | <para>After a template is configured, it is required to restart the | 503 | <para>After a template is configured, a restart of the |
481 | vitrage-api and vitrage-graph services:</para> | 504 | <literal>vitrage-api</literal> and <literal>vitrage-graph</literal> |
505 | services is needed:</para> | ||
482 | 506 | ||
483 | <programlisting>root@node-6:~# systemctl restart vitrage-api | 507 | <programlisting>root@node-6:~# systemctl restart vitrage-api |
484 | root@node-6:~# systemctl restart vitrage-graph</programlisting> | 508 | root@node-6:~# systemctl restart vitrage-graph</programlisting> |
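<para>The template syntax can also be checked with the Vitrage CLI before
the services are restarted. The command below is a sketch; the file name is
a placeholder and it assumes the python-vitrageclient CLI is installed on
the node:</para>

<programlisting>root@node-6:~# vitrage template validate --path /etc/vitrage/templates/test_nova_mark_instance_err.yaml</programlisting>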
@@ -487,12 +511,12 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting> | |||
487 | <section id="vitrage_custom"> | 511 | <section id="vitrage_custom"> |
488 | <title>Vitrage Customizations</title> | 512 | <title>Vitrage Customizations</title> |
489 | 513 | ||
490 | <para>ENEA NFV Core 1.0 has added custom features for Vitrage which | 514 | <para>Enea NFV Core 1.0 has added custom features for Vitrage which |
491 | allow two kinds of action:</para> | 515 | allow two kinds of actions:</para> |
492 | 516 | ||
493 | <orderedlist> | 517 | <orderedlist> |
494 | <listitem> | 518 | <listitem> |
495 | <para>Perform actions Northbound of the VIM</para> | 519 | <para>Perform actions Northbound of the VIM:</para> |
496 | 520 | ||
497 | <itemizedlist> | 521 | <itemizedlist> |
498 | <listitem> | 522 | <listitem> |
@@ -500,23 +524,23 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting> | |||
500 | </listitem> | 524 | </listitem> |
501 | 525 | ||
502 | <listitem> | 526 | <listitem> |
503 | <para>Setting instance state to error in nova; this is used in | 527 | <para>Setting instance state to ERROR in nova. This is used in |
504 | conjunction with an alarm created by Tacker, as described | 528 | conjunction with an alarm created by Tacker, as described |
505 | before, should allow Tacker to detect when an instance is | 529 | before, and should allow Tacker to detect when an instance is |
506 | affected and take proper actions.</para> | 530 | affected and take proper actions.</para> |
507 | </listitem> | 531 | </listitem> |
508 | </itemizedlist> | 532 | </itemizedlist> |
509 | </listitem> | 533 | </listitem> |
510 | 534 | ||
511 | <listitem> | 535 | <listitem> |
512 | <para>Perform actions Southbound of the VIM.</para> | 536 | <para>Perform actions Southbound of the VIM:</para> |
513 | 537 | ||
514 | <para>Vitrage templates allow us to program fencing actions for | 538 | <para>Vitrage templates allow us to program fencing actions for |
515 | hosts with failed services. In the event of that systemd is unable | 539 | hosts with failed services. In the event that |
516 | to recover from a critical process or other type of sofware error | 540 | <literal>systemd</literal> is unable to recover a critical |
517 | ocurs on Hardware supporting them, we can program a fencing of that | 541 | process, or another type of software error occurs on the hardware supporting |
518 | Node which will perform a reboot thus attempting to recover a failed | 542 | them, the fencing of that node can be programmed, and it in turn will |
519 | node.</para> | 543 | perform a reboot, attempting to recover the failed node.</para> |
520 | </listitem> | 544 | </listitem> |
521 | </orderedlist> | 545 | </orderedlist> |
522 | </section> | 546 | </section> |
@@ -529,48 +553,49 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting> | |||
529 | characteristics employ pacemaker for achieving highly available OpenStack | 553 | characteristics employ pacemaker for achieving highly available OpenStack |
530 | services. Traditionally pacemaker has been used for managing only the | 554 | services. Traditionally pacemaker has been used for managing only the |
531 | control plane services, so it can effectively provide redundancy and | 555 | control plane services, so it can effectively provide redundancy and |
532 | recovery for the Controller nodes only. One reason for this is that | 556 | recovery for the Controller nodes only. A reason for this is that |
533 | Controller nodes and Compute nodes essentially have very different High | 557 | Controller nodes and Compute nodes essentially have very different High |
534 | Availability requirements that need to be considered. Typically, for | 558 | Availability requirements that need to be considered. </para> |
535 | Controller nodes, the services that run on them are stateless, with few | 559 | |
536 | exceptions, where only one instance of a given service is allowed, but for | 560 | <para>Typically, for Controller nodes, the services that run on them are |
537 | which redundancy is still desired, one good example being an AMQP service | 561 | stateless, with few exceptions, where only one instance of a given service |
538 | (e.g. RabbitMQ). Compute nodes HA requirements depend on the type of | 562 | is allowed, but for which redundancy is still desired. A good example |
539 | services that run on them, but typically it is desired that failures on | 563 | would be an AMQP service (e.g. RabbitMQ). Compute nodes H.A. requirements |
540 | these nodes is detected as soon as possible so that the instances that run | 564 | depend on the type of services that run on them, but typically it is |
541 | on them can be either migrated, resurrected or restarted. One other aspect | 565 | desired that failures on these nodes be detected as soon as possible so |
542 | is that sometimes failures on the physical hosts do not necessarily cause | 566 | that the instances that run on them can be either migrated, resurrected or |
543 | a failure on the services (VNFs), but having these services incapacitated | 567 | restarted. Sometimes failures on the physical hosts do not necessarily |
544 | can prevent accessing and controlling the services.</para> | 568 | cause a failure on the services (VNFs), but having these services |
545 | 569 | incapacitated can prevent access to and controlling the services.</para> | |
546 | <para>So Controller High Availability is one subject which is in general | 570 | |
547 | well understood and experimented with, and the base of achieving this is | 571 | <para>Controller High Availability is thus a subject generally well |
548 | Pacemaker using Corosync underneath.</para> | 572 | understood and experimented with, and the basis for this is Pacemaker |
573 | using Corosync underneath.</para> | ||
549 | 574 | ||
550 | <para>Extending the use of pacemaker to Compute nodes was thought as a | 575 | <para>Extending the use of pacemaker to Compute nodes was thought as a |
551 | possible solution for providing VNF high availability, but this turns out | 576 | possible solution for providing VNF high availability, but the problem |
552 | to be a problem which is not easy to solve. On one hand pacemaker as a | 577 | turned out to be more complicated. On one hand, pacemaker, as a clustering |
553 | clustering tool can only scale properly up to limited number of nodes, | 578 | tool, can only scale properly up to a limited number of nodes, usually |
554 | usually less than 128. This poses a problem for large scale deployments | 579 | less than 128. This poses a problem for large scale deployments where |
555 | where hundreds of compute nodes are required. On the other hand, Compute | 580 | hundreds of compute nodes are required. On the other hand, Compute node |
556 | node HA requires other considerations and calls for specially designed | 581 | H.A. requires other considerations and calls for specially designed |
557 | solutions.</para> | 582 | solutions.</para> |
558 | 583 | ||
559 | <section id="pm_remote"> | 584 | <section id="pm_remote"> |
560 | <title>Pacemaker Remote</title> | 585 | <title>Pacemaker Remote</title> |
561 | 586 | ||
562 | <para>As mentioned earlier, pacemaker and corosync do not scale well | 587 | <para>As mentioned earlier, pacemaker and corosync do not scale well |
563 | over a large cluster, because each node has to talk to everyone, | 588 | over a large cluster, since each node has to talk to every other, |
564 | essentially creating a mesh configuration. Some solution to this problem | 589 | essentially creating a mesh configuration. A solution to this problem |
565 | could be partitioning the cluster into smaller groups, but this solution | 590 | could be partitioning the cluster into smaller groups, but this has its |
566 | has its limitation and it's generally difficult to manage.</para> | 591 | limitations and it is generally difficult to manage. </para> |
567 | 592 | ||
568 | <para>A better solution is using pacemaker-remote, a feature of | 593 | <para>A better solution is using <literal>pacemaker-remote</literal>, a |
569 | pacemaker which allows extending the cluster beyond the usual limits by | 594 | feature of pacemaker, which allows for extending the cluster beyond the |
570 | using the pacemaker monitoring capabilities, essentially creating a new | 595 | usual limits by using the pacemaker monitoring capabilities. It |
571 | type of resource which enables adding light weight nodes to the cluster. | 596 | essentially creates a new type of resource which enables adding light |
572 | More information about pacemaker-remote can be found on the official | 597 | weight nodes to the cluster. More information about pacemaker-remote can |
573 | clusterlabs website.</para> | 598 | be found on the official clusterlabs website.</para> |
574 | 599 | ||
575 | <para>Please note that at this moment pacemaker remote must be | 600 | <para>Please note that at this moment pacemaker remote must be |
576 | configured manually after deployment. Here are the manual steps for | 601 | configured manually after deployment. Here are the manual steps for |
@@ -578,13 +603,13 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting> | |||
578 | 603 | ||
579 | <orderedlist> | 604 | <orderedlist> |
580 | <listitem> | 605 | <listitem> |
581 | <para>Logon to the Fuel Master using the default credentials if not | 606 | <para>Log onto the Fuel Master using the default credentials, if |
582 | changed (root/r00tme)</para> | 607 | they have not been changed (root/r00tme).</para> |
583 | </listitem> | 608 | </listitem> |
584 | 609 | ||
585 | <listitem> | 610 | <listitem> |
586 | <para>Type fuel node to obtain the list of nodes, their roles and | 611 | <para>Type fuel node to obtain the list of nodes, their roles and |
587 | the IP addresses</para> | 612 | the IP addresses.</para> |
588 | 613 | ||
589 | <programlisting>[root@fuel ~]# fuel node | 614 | <programlisting>[root@fuel ~]# fuel node |
590 | id | status | name | cluster | ip | mac | roles / | 615 | id | status | name | cluster | ip | mac | roles / |
@@ -604,10 +629,10 @@ controller, vitrage | | 1 | 1</programlisting> | |||
604 | </listitem> | 629 | </listitem> |
605 | 630 | ||
606 | <listitem> | 631 | <listitem> |
607 | <para>Each controller has a unique pacemaker authkey, we need to | 632 | <para>Each controller has a unique pacemaker authkey. One needs to |
608 | keep one an propagate it to the other servers. Assuming node-1, | 633 | be kept and propagated to the other servers. Assuming node-1, node-2 |
609 | node-2 and node-3 are the controllers, execute the following from | 634 | and node-3 are the controllers, execute the following from the Fuel |
610 | the Fuel console:</para> | 635 | console:</para> |
611 | 636 | ||
612 | <programlisting>[root@fuel ~]# scp node-1:/etc/pacemaker/authkey . | 637 | <programlisting>[root@fuel ~]# scp node-1:/etc/pacemaker/authkey . |
613 | [root@fuel ~]# scp authkey node-2:/etc/pacemaker/ | 638 | [root@fuel ~]# scp authkey node-2:/etc/pacemaker/ |
@@ -619,7 +644,7 @@ controller, vitrage | | 1 | 1</programlisting> | |||
619 | 644 | ||
620 | <listitem> | 645 | <listitem> |
621 | <para>For each compute node, log on to it using the corresponding | 646 | <para>For each compute node, log on to it using the corresponding |
622 | IP.</para> | 647 | IP.</para> |
623 | </listitem> | 648 | </listitem> |
624 | 649 | ||
625 | <listitem> | 650 | <listitem> |
@@ -629,7 +654,7 @@ controller, vitrage | | 1 | 1</programlisting> | |||
629 | </listitem> | 654 | </listitem> |
630 | 655 | ||
631 | <listitem> | 656 | <listitem> |
632 | <para>Copy the authkey from the Fuel master and make sure the right | 657 | <para>Copy the authkey from the Fuel Master and make sure the right |
633 | permissions are set:</para> | 658 | permissions are set:</para> |
634 | 659 | ||
635 | <programlisting>[root@node-4:~]# cp authkey /etc/pacemaker | 660 | <programlisting>[root@node-4:~]# cp authkey /etc/pacemaker |
@@ -637,21 +662,22 @@ controller, vitrage | | 1 | 1</programlisting> | |||
637 | </listitem> | 662 | </listitem> |
638 | 663 | ||
639 | <listitem> | 664 | <listitem> |
640 | <para>Add iptables rule for the default port (3121). Also save it to | 665 | <para>Add an iptables rule for the default port (3121). Save it also |
641 | /etc/iptables/rules.v4 to make it persistent:</para> | 666 | to <literal>/etc/iptables/rules.v4</literal> to make it |
667 | persistent:</para> | ||
642 | 668 | ||
643 | <programlisting>root@node-4:~# iptables -A INPUT -s 192.168.0.0/24 -p tcp -m multiport / | 669 | <programlisting>root@node-4:~# iptables -A INPUT -s 192.168.0.0/24 -p tcp -m multiport / |
644 | --dports 3121 -m comment --comment "pacemaker_remoted from 192.168.0.0/24" -j ACCEPT </programlisting> | 670 | --dports 3121 -m comment --comment "pacemaker_remoted from 192.168.0.0/24" -j ACCEPT</programlisting> |
645 | </listitem> | 671 | </listitem> |
646 | 672 | ||
647 | <listitem> | 673 | <listitem> |
648 | <para>Start the pacemaker-remote service</para> | 674 | <para>Start the pacemaker-remote service:</para> |
649 | 675 | ||
650 | <programlisting>[root@node-4:~]# systemctl start pacemaker-remote.service</programlisting> | 676 | <programlisting>[root@node-4:~]# systemctl start pacemaker-remote.service</programlisting> |
651 | </listitem> | 677 | </listitem> |
652 | 678 | ||
653 | <listitem> | 679 | <listitem> |
654 | <para>Log on one of the controller nodes and configure the | 680 | <para>Log onto one of the controller nodes and configure the |
655 | pacemaker-remote resources:</para> | 681 | pacemaker-remote resources:</para> |
656 | 682 | ||
657 | <programlisting>[root@node-1:~]# pcs resource create node-4.domain.tld remote | 683 | <programlisting>[root@node-1:~]# pcs resource create node-4.domain.tld remote |
@@ -685,20 +711,21 @@ RemoteOnline: [ node-4.domain.tld node-5.domain.tld ]</programlisting> | |||
685 | <title>Pacemaker Fencing</title> | 711 | <title>Pacemaker Fencing</title> |
686 | 712 | ||
687 | <para>ENEA NFV Core 1.0 makes use of the fencing capabilities of | 713 | <para>ENEA NFV Core 1.0 makes use of the fencing capabilities of |
688 | Pacemaker to isolate faulty nodes and trigger recovery actions by means | 714 | pacemaker to isolate faulty nodes and trigger recovery actions by means |
689 | of power cycling the failed nodes. Fencing is configured by creating | 715 | of power cycling the failed nodes. Fencing is configured by creating |
690 | STONITH type resources for each of the servers in the cluster, both | 716 | <literal>STONITH</literal> type resources for each of the servers in the |
691 | Controller nodes and Compute nodes. The STONITH adapter for fencing the | 717 | cluster, both Controller nodes and Compute nodes. The |
692 | nodes is fence_ipmilan, which makes use of the IPMI capabilities of the | 718 | <literal>STONITH</literal> adapter for fencing the nodes is |
693 | Cavium ThunderX servers.</para> | 719 | <literal>fence_ipmilan</literal>, which makes use of the IPMI |
720 | capabilities of the ThunderX servers.</para> | ||
694 | 721 | ||
695 | <para>Here are the steps for enabling fencing capabilities in the | 722 | <para>Here are the steps for enabling fencing capabilities on a |
696 | cluster:</para> | 723 | cluster:</para> |
697 | 724 | ||
698 | <orderedlist> | 725 | <orderedlist> |
699 | <listitem> | 726 | <listitem> |
700 | <para>Logon to the Fuel Master using the default credentials if not | 727 | <para>Log onto the Fuel Master using the default credentials, if |
701 | changed (root/r00tme).</para> | 728 | they have not been changed (root/r00tme).</para> |
702 | </listitem> | 729 | </listitem> |
703 | 730 | ||
704 | <listitem> | 731 | <listitem> |
@@ -719,18 +746,17 @@ id | status | name | cluster | ip | mac | roles | |||
719 | 2 | ready | Untitled (8b:64) | 1 | 10.20.0.3 | 68:05:ca:46:8b:64 | / | 746 | 2 | ready | Untitled (8b:64) | 1 | 10.20.0.3 | 68:05:ca:46:8b:64 | / |
720 | controller, mongo, tacker | | 1 | 1 | 747 | controller, mongo, tacker | | 1 | 1 |
721 | 3 | ready | Untitled (8c:45) | 1 | 10.20.0.5 | 68:05:ca:46:8c:45 | / | 748 | 3 | ready | Untitled (8c:45) | 1 | 10.20.0.5 | 68:05:ca:46:8c:45 | / |
722 | controller, vitrage | | 1 | 1 | 749 | controller, vitrage | | 1 | 1</programlisting> |
723 | </programlisting> | ||
724 | </listitem> | 750 | </listitem> |
725 | 751 | ||
726 | <listitem> | 752 | <listitem> |
727 | <para>Logon to each server to install additional packages:</para> | 753 | <para>Log onto each server to install additional packages:</para> |
728 | 754 | ||
729 | <programlisting>[root@node-1:~]# apt-get install fence-agents ipmitool</programlisting> | 755 | <programlisting>[root@node-1:~]# apt-get install fence-agents ipmitool</programlisting> |
730 | </listitem> | 756 | </listitem> |
731 | 757 | ||
732 | <listitem> | 758 | <listitem> |
733 | <para>Configure pacemaker fencing resources; this needs to be done | 759 | <para>Configure pacemaker fencing resources. This needs to be done |
734 | once on one of the controllers. The parameters will vary, depending | 760 | once on one of the controllers. The parameters will vary, depending |
735 | on the BMC addresses of each node and credentials.</para> | 761 | on the BMC addresses of each node and credentials.</para> |
736 | 762 | ||
@@ -752,9 +778,9 @@ ipaddr=10.0.100.155 login=ADMIN passwd=ADMIN op monitor interval="60s"</programl | |||
752 | </listitem> | 778 | </listitem> |
753 | 779 | ||
754 | <listitem> | 780 | <listitem> |
755 | <para>Activate fencing by enabling stonith property in pacemaker (by | 781 | <para>Activate fencing by enabling the <literal>stonith</literal> |
756 | default it is disabled); this also needs to be done only once, on | 782 | property in pacemaker (disabled by default). This also needs to be |
757 | one of the controllers.</para> | 783 | done only once, on one of the controllers.</para> |
758 | 784 | ||
759 | <programlisting>[root@node-1:~]# pcs property set stonith-enabled=true</programlisting> | 785 | <programlisting>[root@node-1:~]# pcs property set stonith-enabled=true</programlisting> |
760 | </listitem> | 786 | </listitem> |
@@ -767,28 +793,25 @@ ipaddr=10.0.100.155 login=ADMIN passwd=ADMIN op monitor interval="60s"</programl | |||
767 | 793 | ||
768 | <para>The OpenStack community has been working for some time on | 794 | <para>The OpenStack community has been working for some time on |
769 | identifying possible solutions for enabling High Availability for Compute | 795 | identifying possible solutions for enabling High Availability for Compute |
770 | nodes, although initially the subject of HA on compute node was very | 796 | nodes, after a period of belief that this subject was not something that |
771 | controversial as not being something that should concern the cloud | 797 | should concern the cloud platform. Over time it became obvious that even |
772 | platform. Over time it became obvious that even on a true cloud platform, | 798 | on a true cloud platform, where services are designed to run without being |
773 | where services are designed to run without being affected by the | 799 | affected by the availability of the cloud platform, fault management and |
774 | availability of the cloud platform, fault management and recovery is still | 800 | recovery are still very important and desirable. This is also the case for |
775 | very important and desirable. This is very much the case for NFV | 801 | NFV applications, where in the good tradition of telecom applications, the |
776 | applications, where, in the good tradition of telecom applications, the | 802 | operators must have complete engineering control over the resources they |
777 | operators must have complete engineering control over the resources it | 803 | own and manage.</para> |
778 | owns and manages.</para> | 804 | |
779 | 805 | <para>The work for Compute node High Availability is captured in an | |
780 | <para>The work for compute node high availability is captured in an | ||
781 | OpenStack user story and documented upstream, showing proposed solutions, | 806 | OpenStack user story and documented upstream, showing proposed solutions, |
782 | summit talks and presentations.</para> | 807 | summit talks and presentations. A number of these solutions make use of |
783 | 808 | OpenStack Resource Agents, which are a set of specialized pacemaker | |
784 | <para>A number of these solutions make use of OpenStack Resource Agents, | 809 | resources capable of identifying failures in compute nodes and performing |
785 | which are basically a set of specialized pacemaker resources which are | 810 | automatic evacuation of the instances affected by these failures.</para> |
786 | capable of identifying failures in compute nodes and can perform automatic | ||
787 | evacuation of the instances affected by these failures.</para> | ||
788 | 811 | ||
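<para>As an illustration of what such a setup looks like, the
evacuation-based approach typically adds an OpenStack resource agent such
as NovaEvacuate to the pacemaker cluster. The sketch below is not the
validated ENEA NFV Core configuration: the agent and parameter names come
from the upstream openstack-resource-agents project, and the Keystone URL
and credentials are placeholders:</para>

<programlisting>[root@node-1:~]# pcs resource create nova-evacuate ocf:openstack:NovaEvacuate \
  auth_url=http://<keystone-vip>:5000/v2.0 username=admin password=<password> \
  tenant_name=admin op monitor interval="60s"</programlisting>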
789 | <para>ENEA NFV Core 1.0 aims to validate and integrate this work and to | 812 | <para>ENEA NFV Core 1.0 aims to validate and integrate this work and to |
790 | make this feature available in the platform to be used as an alternative | 813 | make this feature available in the platform as an alternative to the |
791 | to the Doctor framework, where simple, autonomous recovery of the running | 814 | Doctor framework, where simple, autonomous recovery of running instances |
792 | instances is desired.</para> | 815 | is desired.</para> |
793 | </section> | 816 | </section> |
794 | </chapter> \ No newline at end of file | 817 | </chapter> \ No newline at end of file |