author     Miruna Paun <Miruna.Paun@enea.com>   2017-09-28 18:53:02 +0200
committer  Miruna Paun <Miruna.Paun@enea.com>   2017-09-28 18:53:02 +0200
commit     cc001420304566cd252f2c6323dec3a826a12954 (patch)
tree       299a124c12a96df1000e0b87753df9caef34b06a /book-enea-nfv-core-installation-guide/doc/high_availability.xml
parent     380e975b1b93e83705c8ed30197b1c23f8193814 (diff)
download   doc-enea-nfv-cc001420304566cd252f2c6323dec3a826a12954.tar.gz
Proofed entire installation guide, added all new needed images.
USERDOCAP-240
Signed-off-by: Miruna Paun <Miruna.Paun@enea.com>
Diffstat (limited to 'book-enea-nfv-core-installation-guide/doc/high_availability.xml')
-rw-r--r--   book-enea-nfv-core-installation-guide/doc/high_availability.xml   599
1 file changed, 311 insertions, 288 deletions
diff --git a/book-enea-nfv-core-installation-guide/doc/high_availability.xml b/book-enea-nfv-core-installation-guide/doc/high_availability.xml
index e489101..93f6468 100644
--- a/book-enea-nfv-core-installation-guide/doc/high_availability.xml
+++ b/book-enea-nfv-core-installation-guide/doc/high_availability.xml
@@ -2,27 +2,24 @@ | |||
2 | <chapter id="high_availability"> | 2 | <chapter id="high_availability"> |
3 | <title>High Availability Guide</title> | 3 | <title>High Availability Guide</title> |
4 | 4 | ||
5 | <para>ENEA NFV Core 1.0 has been designed to provide high availability | 5 | <para>Enea NFV Core 1.0 has been designed to provide high availability |
6 | characteristics that are needed for developing and deploying telco-grade NFV | 6 | characteristics that are needed for developing and deploying telco-grade NFV |
7 | solutions on top of our OPNFV based platform.</para> | 7 | solutions on top of our OPNFV based platform. The High Availability subject |
8 | 8 | in general is very wide and still an important focus in both opensource | |
9 | <para>The High Availability subject in general is very wide and still an | 9 | communities and the independent/proprietary solutions market.</para> |
10 | important focus in both opensource communities and independent/proprietary | 10 | |
11 | solutions market. ENEA NFV Core 1.0 aims to initially leverage the efforts | 11 | <para>Enea NFV Core 1.0 aims to initially leverage the efforts in the |
12 | in the upstream OPNFV and OpenStack opensource projects, combining solutions | 12 | upstream OPNFV and OpenStack opensource projects, combining solutions from |
13 | from both worlds in an effort to provide flexibility and a wide enough use | 13 | both worlds in an effort to provide flexibility and use-case coverage. Enea |
14 | case coverage. ENEA has a long time expertise and proprietary solutions | 14 | has long term expertise and proprietary solutions addressing High |
15 | addressing High Availability for telco applications, which are subject to | 15 | Availability for telco applications, which are subject to integration with |
16 | integrating with the NFV based solutions, however the initial scope for ENEA | 16 | the NFV based solutions.</para> |
17 | NFV Core is to leverage as much as possible the OPNFV Reference Platform and | ||
18 | open source projects in general, such as it will be seen further ahead in | ||
19 | this chapter.</para> | ||
20 | 17 | ||
21 | <section id="levels"> | 18 | <section id="levels"> |
22 | <title>High Availability Levels</title> | 19 | <title>High Availability Levels</title> |
23 | 20 | ||
24 | <para>The base for the feature set in ENEA NFV Core is divided into three | 21 | <para>The foundation for the feature set available in Enea NFV Core is |
25 | levels:</para> | 22 | divided into three levels:</para> |
26 | 23 | ||
27 | <itemizedlist> | 24 | <itemizedlist> |
28 | <listitem> | 25 | <listitem> |
@@ -30,7 +27,7 @@ | |||
30 | </listitem> | 27 | </listitem> |
31 | 28 | ||
32 | <listitem> | 29 | <listitem> |
33 | <para>NFV Platform HA</para> | 30 | <para>NFV Platform H.A.</para> |
34 | </listitem> | 31 | </listitem> |
35 | 32 | ||
36 | <listitem> | 33 | <listitem> |
@@ -38,118 +35,126 @@ | |||
38 | </listitem> | 35 | </listitem> |
39 | </itemizedlist> | 36 | </itemizedlist> |
40 | 37 | ||
41 | <para>The same division of levels of fault management can be seen in the | 38 | <para>The same division of levels for fault management can be seen in the |
42 | scope of the High Availability for OPNFV (Availability) project. OPNFV | 39 | scope of the High Availability for OPNFV ("Availability") project. OPNFV |
43 | also hosts the Doctor Project which is a fault management and maintenance | 40 | also hosts Doctor, a fault management and maintenance project designed to |
44 | project to develop and realize the consequent implementation for the OPNFV | 41 | develop and realize the consequent implementation for the OPNFV reference |
45 | reference platform.</para> | 42 | platform. These two projects complement each other.</para> |
46 | 43 | ||
47 | <para>These two projects complement each other.</para> | 44 | <para>The Availability project addresses H.A. requirements and solutions |
48 | 45 | from the perspective of the three levels mentioned above. It produces high | |
49 | <para>The Availability project addresses HA requirement and solutions from | 46 | level requirements and API definitions for High Availability for OPNFV, a |
50 | the perspective of the three levels mentioned above and produces high | 47 | H.A. Gap Analysis Report for OpenStack and more recently, works on |
51 | level requirements and API definitions for High Availability of OPNFV, HA | 48 | optimizing existing OPNFV test frameworks, such as Yardstick, developing |
52 | Gap Analysis Report for OpenStack and more recently works on optimizing | 49 | test cases which realize H.A.-specific use-cases and scenarios derived |
53 | existing OPNFV test frameworks, such as Yardstick, and develops test cases | 50 | from the H.A. requirements.</para> |
54 | which realize HA specific use cases and scenarios such as derived from the | 51 | |
55 | HA requirements.</para> | 52 | <para>The Doctor project aims to build a fault management and maintenance |
56 | 53 | framework for the high availability of Network Services, on top of a | |
57 | <para>The Doctor Project on the other hand aims to build fault management | 54 | virtualized infrastructure. The key feature is immediate notification of |
58 | and maintenance framework for high availability of Network Services on top | 55 | unavailability of virtualized resources from VIM, to process recovery of |
59 | of virtualized infrastructure; the key feature is immediate notification | 56 | VNFs on them. </para> |
60 | of unavailability of virtualized resources from VIM, to process recovery | 57 | |
61 | of VNFs on them. The Doctor project has also collaborated with the | 58 | <para>The Doctor project has also collaborated with the Availability |
62 | Availability project on identifying gaps in upstream project, mainly | 59 | project on identifying gaps in upstream projects, such as but not |
63 | OpenStack but not exclusive, and has worked towards implementing missing | 60 | exclusively OpenStack. It has also worked towards implementing missing |
64 | features or improving the functionality, one good example being the Aodh | 61 | features and improving functionality, with a good example being the Aodh |
65 | event based alarms, which allows for fast notifications when certain | 62 | event based alarms, which allow for fast notifications when certain |
66 | predefined events occur. The Doctor project also produced an architecture | 63 | predefined events occur. </para> |
67 | design and a reference implementation based on opensource components, | 64 | |
68 | which will be presented later on in this document.</para> | 65 | <para>The Doctor project also produced an architectural design and a |
66 | reference implementation based on opensource components, which will be | ||
67 | presented later on in this document.</para> | ||
69 | </section> | 68 | </section> |
70 | 69 | ||
71 | <section id="doctor_arch"> | 70 | <section id="doctor_arch"> |
72 | <title>Doctor Architecture</title> | 71 | <title>Doctor Architecture</title> |
73 | 72 | ||
74 | <para>The Doctor documentation shows the detailed architecture for Fault | 73 | <para>The Doctor project documentation shows the detailed architecture for |
75 | Management and NFVI Maintenance . The two are very similar so we will | 74 | Fault Management and NFVI Maintenance. Since the two are quite similar, |
76 | focus on the Fault Management.</para> | 75 | the focus in the following sections shall remain on Fault |
76 | Management.</para> | ||
77 | 77 | ||
78 | <para>The architecture specifies a set of functional blocks:</para> | 78 | <para>The architecture specifies a set of functional blocks:</para> |
79 | 79 | ||
80 | <itemizedlist> | 80 | <itemizedlist> |
81 | <listitem> | 81 | <listitem> |
82 | <para>Monitor - monitors the virtualized infrastructure capturing | 82 | <para><emphasis role="bold">Monitor</emphasis> - monitors the |
83 | fault events in the Software and Hardware; for this particular | 83 | virtualized infrastructure, capturing fault events in software and |
84 | component we chose Zabbix which is integrated into the platform by | 84 | hardware. For this component we choose <emphasis |
85 | means of the Fuel Zabbix Plugin, available upstream.</para> | 85 | role="bold">Zabbix</emphasis> which is integrated into the platform |
86 | through the Fuel Zabbix Plugin, available upstream.</para> | ||
86 | </listitem> | 87 | </listitem> |
87 | 88 | ||
88 | <listitem> | 89 | <listitem> |
89 | <para>Inspector - this component is able to receive notifications from | 90 | <para><emphasis role="bold">Inspector</emphasis> - this component |
90 | Monitor components and also OpenStack core components, which allows it | 91 | receives notifications from Monitor components and OpenStack core |
91 | to create logic relationships between entities, identify affected | 92 | components, allowing it to create logical relationships between |
92 | resources when faults occur, and communicates with Controllers to | 93 | entities, identify affected resources when faults occur, and to |
93 | update the states of the virtual and physical resources. For this | 94 | communicate with Controllers in order to update the states of the |
94 | component ENEA NFV Core 1.0 makes use of Vitrage , an OpenStack | 95 | virtual and physical resources.</para> |
95 | related project used for Root Cause Analysis, which has been adapted | 96 | |
96 | to server as a Doctor Inspector. The integration into the platform is | 97 | <para>For this component Enea NFV Core 1.0 makes use of Vitrage, an |
97 | realized with the help of a Fuel Plugin which has been developed | 98 | OpenStack related project used for Root Cause Analysis. The |
98 | internally by ENEA.</para> | 99 | integration into the platform is done with the help of a Fuel Plugin |
100 | which has been developed internally by Enea.</para> | ||
99 | </listitem> | 101 | </listitem> |
100 | 102 | ||
101 | <listitem> | 103 | <listitem> |
102 | <para>Controller - OpenStack core components act as Controllers, which | 104 | <para><emphasis role="bold">Controller</emphasis> - OpenStack core |
103 | are responsible for maintaining the resource map between physical and | 105 | components act as Controllers responsible for maintaining the resource |
104 | virtual resources, they accept update requests from the Inspector and | 106 | map between physical and virtual resources. They accept update |
105 | are responsible for sending failure event notifications to the | 107 | requests from the Inspector and are responsible for sending failure |
106 | Notifier. Components such as Nova, Neutron, Glance, Heat act as | 108 | event notifications to the Notifier. Components such as Nova, Neutron, |
107 | Controllers in the Doctor Architecture.</para> | 109 | Glance, and Heat act as Controllers in the Doctor |
110 | Architecture.</para> | ||
108 | </listitem> | 111 | </listitem> |
109 | 112 | ||
110 | <listitem> | 113 | <listitem> |
111 | <para>Notifier - the focus of this component is on selecting and | 114 | <para><emphasis role="bold">Notifier</emphasis> - the focus of this |
112 | aggregating failure events received from the controller based on | 115 | component is on selecting and aggregating failure events received from |
113 | policies mandated by the Consumer. The role of the Notifier is | 116 | the controller, based on policies mandated by the Consumer. The role |
114 | accomplished by the Aodh component in OpenStack.</para> | 117 | of the Notifier is filled by the Aodh component in OpenStack.</para> |
115 | </listitem> | 118 | </listitem> |
116 | </itemizedlist> | 119 | </itemizedlist> |
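<para>As a quick, illustrative way of checking this mapping on a deployed
system, the Inspector and the Notifier can be queried from one of the
controller nodes. The commands below are only a sketch: they assume the
python-vitrageclient and python-aodhclient CLIs and an admin credentials
file are present on the controller, and the exact subcommand names may
vary slightly between releases:</para>

<programlisting>root@node-1:~# source /root/openrc
root@node-1:~# vitrage topology show     # entity graph maintained by the Inspector (Vitrage)
root@node-1:~# vitrage alarm list        # alarms currently known to Vitrage
root@node-1:~# aodh alarm list           # alarms handled by the Notifier (Aodh)</programlisting>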
117 | 120 | ||
118 | <para>Besides the Doctor components there are a couple other blocks | 121 | <para>Alongside the Doctor components, there are a few other blocks |
119 | mentioned in the architecture:</para> | 122 | mentioned:</para> |
120 | 123 | ||
121 | <itemizedlist> | 124 | <itemizedlist> |
122 | <listitem> | 125 | <listitem> |
123 | <para>Administrator - this represents the human role of administrating | 126 | <para><emphasis role="bold">Administrator</emphasis> - this represents |
124 | the platform by means of dedicated interfaces, either visual | 127 | the human role of administrating the platform by means of dedicated |
125 | dashboards, like OpenStack Horizon or Fuel Dashboard, or via CLI | 128 | interfaces. These can be visual dashboards like OpenStack Horizon or |
126 | tools, like the OpenStack unified CLI that can be accessed | 129 | Fuel Dashboard, or CLI tools like the OpenStack unified CLI, which |
127 | traditionally from one of the servers that act as OpenStack Controller | 130 | can be accessed from one of the servers that act as OpenStack |
128 | nodes. In the case of ENEA NFV Core 1.0, the Administrator can also | 131 | Controller nodes. </para> |
129 | access the Zabbix dashboard for doing further configurations. The same | 132 | |
133 | <para>In Enea NFV Core 1.0 the Administrator can also access the | ||
134 | Zabbix dashboard to perform supplementary configurations. The same | ||
130 | applies for the Vitrage tool, which comes with its own Horizon | 135 | applies for the Vitrage tool, which comes with its own Horizon |
131 | dashboard which enables the user to visually inspect the faults | 136 | dashboard, enabling the user to visually inspect the faults reported |
132 | reported by the monitoring tools and also creates visual | 137 | by the monitoring tools through visual representations of the virtual |
133 | representations of the virtual and physical resources, the | 138 | and physical resources, the relationships between them and the fault |
134 | relationships between them and the fault correlation. For Vitrage, | 139 | correlation. </para> |
135 | users will usually want to configure additional usecases and describe | 140 | |
136 | relationships between components, via template files written in yaml | 141 | <para>For Vitrage, users will usually want to configure additional |
137 | format. More information about using Vitrage will be presented in a | 142 | use-cases and describe relationships between components via template |
138 | following section.</para> | 143 | files written in <literal>yaml</literal> format.</para> |
139 | </listitem> | 144 | </listitem> |
140 | 145 | ||
141 | <listitem> | 146 | <listitem> |
142 | <para>Consumer - this block is vaguely described in the Doctor | 147 | <para><emphasis role="bold">Consumer</emphasis> - this block is |
143 | Architecture and it's out of its scope. Doctor only deals with fault | 148 | vaguely described in the Doctor Architecture and is out of its current |
144 | detection and management, making sure faults are handled as soon as | 149 | scope. Doctor only deals with fault detection and management, but |
145 | possible after detection, identifies affected virtual resources and | 150 | since the actual VNFs are managed, according to the ETSI architecture, |
146 | updates the states of them, but since the actual VNFs are managed, | 151 | by a different entity, Doctor does not deal with recovery actions of |
147 | according to the ETSI architecture, by a different entity, Doctor does | 152 | the VNFs. The role of the Consumer thus falls to that of a VNF Manager |
148 | not deal with recovery actions of the VNFs. The role of the Consumer | 153 | and Orchestrator.</para> |
149 | thus falls in the task of a VNF Manager and Orchestrator. ENEA NFV | 154 | |
150 | Core 1.0 provides VNF management capabilities using Tacker, which is | 155 | <para>Enea NFV Core 1.0 provides VNF management capabilities using |
151 | an OpenStack project that implements a generic VNF Manager and | 156 | Tacker, which is an OpenStack project that implements a generic VNF |
152 | Orchestrator according to the ETSI MANO Architectural | 157 | Manager and Orchestrator, according to the ETSI MANO Architectural |
153 | Framework.</para> | 158 | Framework.</para> |
154 | </listitem> | 159 | </listitem> |
155 | </itemizedlist> | 160 | </itemizedlist> |
@@ -170,12 +175,12 @@ | |||
170 | 175 | ||
171 | <para>The architecture described in the Doctor project has been | 176 | <para>The architecture described in the Doctor project has been |
172 | demonstrated in various PoCs and demos, but always using sample | 177 | demonstrated in various PoCs and demos, but always using sample |
173 | components for either the consumer or the monitor. ENEA has worked with | 178 | components for either the consumer or the monitor. Enea has worked with |
174 | upstream projects, Doctor and Vitrage, to realize the goals of the | 179 | upstream projects Doctor and Vitrage, to realize the goals of the Doctor |
175 | Doctor project by using real components, as described before.</para> | 180 | project by using real components as described above.</para> |
176 | 181 | ||
177 | <para>The two pictures below show a typical fault management scenario, | 182 | <para>The two pictures below show a typical fault management |
178 | as described in the Doctor documentation.</para> | 183 | scenario:</para> |
179 | 184 | ||
180 | <mediaobject> | 185 | <mediaobject> |
181 | <imageobject> | 186 | <imageobject> |
@@ -189,70 +194,81 @@ | |||
189 | </imageobject> | 194 | </imageobject> |
190 | </mediaobject> | 195 | </mediaobject> |
191 | 196 | ||
192 | <para>ENEA NFV Core 1.0 uses the same approach described above, but it's | 197 | <para>Enea NFV Core 1.0 uses the same approach described above:</para> |
193 | worth going through each step and detail them.</para> | ||
194 | 198 | ||
195 | <orderedlist> | 199 | <orderedlist> |
196 | <listitem> | 200 | <listitem> |
197 | <para>When creating a VNF, the user will have to enable the | 201 | <para>When creating a VNF, the user will have to enable the |
198 | monitoring capabilities of Tacker, by passing a template which | 202 | monitoring capabilities of Tacker by passing a template, which |
199 | specifies that an alarm will be created when the VM represented by | 203 | specifies that an alarm will be created when the VM represented by |
200 | this VNF changes state. The support for alarm monitoring in Tacker | 204 | this VNF changes state. The support for alarm monitoring in Tacker |
201 | is captured in the Alarm Monitoring Framework spec in OpenStack | 205 | is detailed in the Alarm Monitoring Framework spec in the OpenStack |
202 | documentation. In a few words, Tacker should be able to create a VNF | 206 | documentation.</para> |
203 | and then create an Aodh alarm of type event which triggers when the | 207 | |
204 | instance is in state ERROR. The action to take when this event | 208 | <para>Tacker should be able to create a VNF and then an Aodh alarm |
205 | triggers is to perform an HTTP call, to an URL managed by Tacker. As | 209 | of type event, triggerable when the instance is in a state of ERROR. |
206 | a result of this action, Tacker can detect when an instance has | 210 | When this event is triggered perform an HTTP call to a URL managed |
207 | failed (for whatever reasons) and will respawn it somewhere | 211 | by Tacker. As a result of this action, Tacker can detect when an |
208 | else.</para> | 212 | instance has failed (for whatever reason) and will respawn it |
213 | somewhere else.</para> | ||
209 | </listitem> | 214 | </listitem> |
210 | 215 | ||
211 | <listitem> | 216 | <listitem> |
212 | <para>The subscribe response in this case is an empty operation, the | 217 | <para>The subscribe response in this case is an empty operation; |
213 | Notifier (Aodh) only has to confirm that the alarm has been | 218 | the Notifier (Aodh) only has to confirm that the alarm has been |
214 | created.</para> | 219 | created.</para> |
215 | </listitem> | 220 | </listitem> |
216 | 221 | ||
217 | <listitem> | 222 | <listitem> |
218 | <para>The NFVI sends monitoring events for resources the VIM has | 223 | <para>The NFVI sends monitoring events for the resources the VIM has |
219 | been subscribed to. Note: this subscription message exchange between | 224 | been subscribed to. </para> |
220 | the VIM and NFVI is not shown in this message flow. This steps is | 225 | |
221 | related to Vitrage's capability of receiving notifications from | 226 | <note> |
222 | OpenStack services, at this moment Vitrage supports notifications | 227 | <para>This subscription message exchange between the VIM and NFVI |
223 | from nova.host, nova.instances, nova.zone, cinder.volume, | 228 | is not shown in this message flow. This step is related to |
224 | neutron.network, neutron.port and heat.stack OpenStack | 229 | Vitrage's capability of receiving notifications from OpenStack |
225 | datasources.</para> | 230 | services. At this moment Vitrage supports notifications from |
231 | <literal>nova.host</literal>, <literal>nova.instances</literal>, | ||
232 | <literal>nova.zone</literal>, <literal>cinder.volume</literal>, | ||
233 | <literal>neutron.network</literal>, | ||
234 | <literal>neutron.port</literal> and <literal>heat.stack</literal> | ||
235 | OpenStack datasources.</para> | ||
236 | </note> | ||
226 | </listitem> | 237 | </listitem> |
227 | 238 | ||
228 | <listitem> | 239 | <listitem> |
229 | <para>This steps describes faults being detected by Zabbix which are | 240 | <para>This step describes faults detected by Zabbix which are sent |
230 | sent to the Inspector (Vitrage) as soon as detected, using a push | 241 | to the Inspector (Vitrage) as soon as detected. This is done using a |
231 | approach by means of sending an AMQP message to a dedicated message | 242 | push approach by means of sending an AMQP message to a dedicated |
232 | queue managed by Vitrage. For example, if nova-compute fails on one | 243 | message queue managed by Vitrage. For example, if |
233 | of the compute nodes, Zabbix will format a message specifying all | 244 | <literal>nova-compute</literal> fails on one of the compute nodes, |
234 | the needed details needed for processing the fault, e.g. a | 245 | Zabbix will format a message specifying all the details |
235 | timestamp, what host failed, what event occurred and others.</para> | 246 | required for processing the fault: a timestamp, what host failed, |
247 | what event occurred etc.</para> | ||
236 | </listitem> | 248 | </listitem> |
237 | 249 | ||
238 | <listitem> | 250 | <listitem> |
239 | <para>Database lookup to find the virtual resources affected by the | 251 | <para>This step shows a database lookup to find the virtual |
240 | detected fault. In this step Vitrage will perform various | 252 | resources affected by the detected fault. Vitrage will perform |
241 | calculations to detect what virtual resources are affected by the | 253 | various calculations to detect what virtual resources are affected |
242 | raw failure presented by Zabbix. Vitrage can be configured via | 254 | by the raw failure presented by Zabbix. </para> |
243 | templates to correlate instances with the physical hosts they are | 255 | |
244 | running on, so that if a compute node fails, then instances running | 256 | <para>Vitrage can be configured via templates to correlate instances |
245 | on that host will be affected. A typical usecase is to mark the | 257 | with the physical hosts they are running on, so that if a compute |
246 | compute node down (a.k.a mark_host_down) and update the states of | 258 | node fails, then instances running on that host will be affected. A |
247 | all instances running on them, by issuing Nova API calls for each of | 259 | typical use-case is to mark the compute node down |
248 | these instances. Step 5c) shows the Controller (Nova in this case) | 260 | (<literal>mark_host_down</literal>) and update the states of all |
249 | acting upon the state change of the instance and issues an event | 261 | instances running on it. This is done by issuing Nova API calls |
250 | alarm to Aodh.</para> | 262 | for each of these instances. </para> |
263 | |||
264 | <para>Step 5c. shows the Controller (Nova in this case) acting upon | ||
265 | the state change of the instance and issuing an event alarm to | ||
266 | Aodh.</para> | ||
251 | </listitem> | 267 | </listitem> |
252 | 268 | ||
253 | <listitem> | 269 | <listitem> |
254 | <para>The Notifier will acknowledge the alarm event request from | 270 | <para>The Notifier will acknowledge the alarm event request from |
255 | Nova and will trigger the alarm(s) created by Tacker in step 1). | 271 | Nova and will trigger the alarm(s) created by Tacker in step 1. |
256 | Since Tacker has configured the alarm to send an HTTP request, Aodh | 272 | Since Tacker has configured the alarm to send an HTTP request, Aodh |
257 | will perform that HTTP call at the URL managed by Tacker.</para> | 273 | will perform that HTTP call at the URL managed by Tacker.</para> |
258 | </listitem> | 274 | </listitem> |
@@ -268,7 +284,7 @@ | |||
268 | </listitem> | 284 | </listitem> |
269 | </orderedlist> | 285 | </orderedlist> |
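<para>For reference, the event alarm that Tacker creates in step 1 can also
be built by hand with the Aodh CLI. The command below is only an
illustrative sketch: the instance UUID and the webhook URL are
placeholders, and the exact event-type and query syntax may differ between
OpenStack releases:</para>

<programlisting>root@node-1:~# aodh alarm create --name vnf-vm-error --type event \
  --event-type "compute.instance.update" \
  --query 'traits.instance_id=string::<instance-uuid>;traits.state=string::error' \
  --alarm-action 'http://<tacker-webhook-url>'</programlisting>

<para>When Nova emits the matching event for the instance, Aodh fires the
alarm and performs the HTTP call, which is how Tacker learns that the VNF
needs to be respawned.</para>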
270 | 286 | ||
271 | <note> | 287 | <note condition="hidden"> |
272 | <para>The ENEA NFV Core 1.0 Pre-Release fully covers the required | 288 | <para>The ENEA NFV Core 1.0 Pre-Release fully covers the required |
273 | Doctor functionality only for the Vitrage and Zabbix | 289 | Doctor functionality only for the Vitrage and Zabbix |
274 | components.</para> | 290 | components.</para> |
@@ -280,27 +296,25 @@ | |||
280 | 296 | ||
281 | <para>Vitrage supports Zabbix datasource by means of regularly polling | 297 | <para>Vitrage supports Zabbix datasource by means of regularly polling |
282 | the Zabbix agents, which need to be configured in advance. The Vitrage | 298 | the Zabbix agents, which need to be configured in advance. The Vitrage |
283 | plugin developed internally by ENEA can automatically configure Zabbix | 299 | plugin developed internally by Enea can automatically configure Zabbix |
284 | so that everything works as expected.</para> | 300 | so that everything works as expected. Polling, however, is not fast |
285 | 301 | enough for a telco use-case, so it is necessary to configure push | |
286 | <para>However, polling is not fast enough for a telco usecase, so it is | 302 | notifications for Zabbix. This requires manual configuration on one of |
287 | necessary to configure pushed notifications for Zabbix . This requires | 303 | the controller nodes, since Zabbix uses a centralized database which |
288 | manual configuration on one of the controller nodes, since Zabbix uses a | 304 | makes the configuration available on all the other nodes.</para> |
289 | centralized database which makes the configuration available on all the | ||
290 | other nodes.</para> | ||
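<para>For reference, a Zabbix polling datasource is declared in
<literal>/etc/vitrage/vitrage.conf</literal> roughly as shown below. The
values are placeholders and the exact option names may differ between
Vitrage releases; the Fuel Vitrage plugin normally writes this
configuration automatically:</para>

<programlisting>[datasources]
types = zabbix,nova.host,nova.instance,nova.zone,cinder.volume,neutron.network,neutron.port,heat.stack

[zabbix]
url = http://<vip__zbx_vip_mgmt>/zabbix
user = admin
password = zabbix
config_file = /etc/vitrage/zabbix_conf.yaml</programlisting>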
291 | 305 | ||
292 | <para>The Zabbix configuration dashboard is available at the same IP | 306 | <para>The Zabbix configuration dashboard is available at the same IP |
293 | address where OpenStack can be reached, e.g. | 307 | address where OpenStack can be reached, e.g. |
294 | http://<vip__zbx_vip_mgmt>/zabbix.</para> | 308 | <literal>http://<vip__zbx_vip_mgmt>/zabbix</literal>.</para> |
295 | 309 | ||
296 | <para>To forward zabbix events to Vitrage a new media script needs to be | 310 | <para>To forward Zabbix events to Vitrage, a new media script needs to |
297 | created and associated with a user. Follow the steps below as a Zabbix | 311 | be created and associated with a user. Follow the steps below as a |
298 | Admin user:</para> | 312 | Zabbix Admin user:</para> |
299 | 313 | ||
300 | <orderedlist> | 314 | <orderedlist> |
301 | <listitem> | 315 | <listitem> |
302 | <para>Create a new media type [Admininstration Media Types Create | 316 | <para>Create a new media type [Administration > Media Types > |
303 | Media Type]</para> | 317 | Create Media Type]</para> |
304 | 318 | ||
305 | <itemizedlist> | 319 | <itemizedlist> |
306 | <listitem> | 320 | <listitem> |
@@ -312,7 +326,7 @@ | |||
312 | </listitem> | 326 | </listitem> |
313 | 327 | ||
314 | <listitem> | 328 | <listitem> |
315 | <para>Script name: zabbix_vitrage.py</para> | 329 | <para>Script name: <filename>zabbix_vitrage.py</filename></para> |
316 | </listitem> | 330 | </listitem> |
317 | </itemizedlist> | 331 | </itemizedlist> |
318 | </listitem> | 332 | </listitem> |
@@ -327,14 +341,15 @@ | |||
327 | </listitem> | 341 | </listitem> |
328 | 342 | ||
329 | <listitem> | 343 | <listitem> |
330 | <para>Send to: rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/ | 344 | <para>Send to: |
331 | --- Vitrage message bus url (you need to search for this in | 345 | <literal>rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/</literal>, |
332 | /etc/vitrage/vitrage.conf or /etc/nova/nova.conf | 346 | i.e. the Vitrage message bus URL (you need to search for |
333 | transport_url)</para> | 347 | this in <literal>/etc/vitrage/vitrage.conf</literal> or |
348 | <literal>/etc/nova/nova.conf</literal> under <literal>transport_url</literal>)</para> |
334 | </listitem> | 349 | </listitem> |
335 | 350 | ||
336 | <listitem> | 351 | <listitem> |
337 | <para>When active: 1-7,00:00-24:00</para> | 352 | <para>When active: 1-7, 00:00-24:00</para> |
338 | </listitem> | 353 | </listitem> |
339 | 354 | ||
340 | <listitem> | 355 | <listitem> |
@@ -348,8 +363,8 @@ | |||
348 | </listitem> | 363 | </listitem> |
349 | 364 | ||
350 | <listitem> | 365 | <listitem> |
351 | <para>Configure Action [Configuration Actions Create Action | 366 | <para>Configure Action [Configuration > Actions > Create |
352 | Action]</para> | 367 | Action > Action]</para> |
353 | 368 | ||
354 | <itemizedlist> | 369 | <itemizedlist> |
355 | <listitem> | 370 | <listitem> |
@@ -361,19 +376,25 @@ | |||
361 | </listitem> | 376 | </listitem> |
362 | 377 | ||
363 | <listitem> | 378 | <listitem> |
364 | <para>Default Message: host={HOST.NAME1} hostid={HOST.ID1} | 379 | <para>Default Message:</para> |
365 | hostip={HOST.IP1} triggerid={TRIGGER.ID} | 380 | |
366 | description={TRIGGER.NAME} rawtext={TRIGGER.NAME.ORIG} | 381 | <programlisting>host={HOST.NAME1} |
367 | expression={TRIGGER.EXPRESSION} value={TRIGGER.VALUE} | 382 | hostid={HOST.ID1} |
368 | priority={TRIGGER.NSEVERITY} lastchange={EVENT.DATE} | 383 | hostip={HOST.IP1} |
369 | {EVENT.TIME}</para> | 384 | triggerid={TRIGGER.ID} |
385 | description={TRIGGER.NAME} | ||
386 | rawtext={TRIGGER.NAME.ORIG} | ||
387 | expression={TRIGGER.EXPRESSION} | ||
388 | value={TRIGGER.VALUE} | ||
389 | priority={TRIGGER.NSEVERITY} | ||
390 | lastchange={EVENT.DATE} {EVENT.TIME}</programlisting> | ||
370 | </listitem> | 391 | </listitem> |
371 | </itemizedlist> | 392 | </itemizedlist> |
372 | </listitem> | 393 | </listitem> |
373 | 394 | ||
374 | <listitem> | 395 | <listitem> |
375 | <para>To send events add under the Conditions tab: 'Maintenance | 396 | <para>To send events, add under the <literal>Conditions</literal> |
376 | status not in 'maintenance'".</para> | 397 | tab: "Maintenance status not in 'maintenance'".</para> |
377 | </listitem> | 398 | </listitem> |
378 | 399 | ||
379 | <listitem> | 400 | <listitem> |
@@ -391,32 +412,34 @@ | |||
391 | </listitem> | 412 | </listitem> |
392 | </orderedlist> | 413 | </orderedlist> |
393 | 414 | ||
394 | <para>Using these instructions, Zabbix will call the zabbix_vitrage.py | 415 | <para>Using these instructions, Zabbix will call the |
395 | script, which is made readily available by the Fuel Vitrage Plugin, | 416 | <literal>zabbix_vitrage.py</literal> script, made readily available by |
396 | passing the arguments described in step 3). The zabbix_vitrage.py script | 417 | the Fuel Vitrage Plugin, to pass the arguments described in step 3. The |
397 | will then interpret the parameters and format an AMQP message will be | 418 | <literal>zabbix_vitrage.py</literal> script will then interpret the |
398 | sent to the vitrage.notifications queue, which is managed by the | 419 | parameters and format an AMQP message to be sent to the |
420 | <literal>vitrage.notifications</literal> queue, managed by the | ||
399 | vitrage-graph service.</para> | 421 | vitrage-graph service.</para> |
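<para>A simple way to exercise this path end to end is to invoke the media
script by hand, with the same three arguments Zabbix passes to alert
scripts (send-to address, subject, message body). The example is a sketch
only; the script location and all field values are placeholders:</para>

<programlisting>root@node-1:~# python /usr/local/bin/zabbix_vitrage.py \
  'rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/' 'PROBLEM' \
  'host=node-4.domain.tld hostid=10105 hostip=192.168.0.5 triggerid=13500 description=nova-compute process is down rawtext=nova-compute process is down expression={node-4:proc.num[nova-compute].last()}=0 value=1 priority=4 lastchange=2017.09.28 12:00:00'</programlisting>

<para>If the wiring is correct, a corresponding entity change appears in
the Vitrage entity graph shortly afterwards.</para>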
400 | </section> | 422 | </section> |
401 | 423 | ||
402 | <section id="vitrage_config"> | 424 | <section id="vitrage_config"> |
403 | <title>Vitrage Configuration</title> | 425 | <title>Vitrage Configuration</title> |
404 | 426 | ||
405 | <para>The Vitrage team has been collaborating with OPNFV Doctor Project | 427 | <para>The Vitrage team has been collaborating with the OPNFV Doctor |
406 | in order to support Vitrage as an Inspector Component. The Doctor | 428 | project in order to support Vitrage as an Inspector Component. The |
407 | usecase for Vitrage is described in an OpenStack blueprint . | 429 | Doctor use-case for Vitrage is described in an OpenStack blueprint. Enea |
408 | Additionally, ENEA NFV Core has complemented Vitrage with the capability | 430 | NFV Core has complemented Vitrage with the ability to set the states of |
409 | of setting states of failed instances by implementing an action type in | 431 | failed instances by implementing an action type in Vitrage. This action |
410 | Vitrage which calls Nova APIs to set instances in error state. There is | 432 | calls Nova APIs to set instances in error state. An action type which |
411 | also an action type which allows fencing failed hosts.</para> | 433 | allows fencing failed hosts also exists.</para> |
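<para>The Nova side of these actions can be reproduced manually with the
Nova CLI, which is useful when verifying the behaviour. This is a sketch
only, using placeholder names and assuming admin credentials are
loaded:</para>

<programlisting>root@node-1:~# source /root/openrc
root@node-1:~# nova service-force-down node-4.domain.tld nova-compute   # mark the failed host as down
root@node-1:~# nova reset-state <instance-uuid>                         # set the instance state to ERROR (the default)</programlisting>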
412 | 434 | ||
413 | <para>In order to make use of these features, Vitrage supports | 435 | <para>In order to make use of these features, Vitrage supports |
414 | additional configurations via yaml templates that must be placed in | 436 | additional configurations via <literal>yaml</literal> templates that |
415 | /etc/vitrage/templates on the nodes have the Vitrage role.</para> | 437 | must be placed in <literal>/etc/vitrage/templates</literal> on the nodes that |
438 | have the Vitrage role.</para> | ||
416 | 439 | ||
417 | <para>The example below shows how to program Vitrage to mark failed | 440 | <para>The example below shows how to program Vitrage to mark failed |
418 | compute hosts as down and then to change the state of the instances to | 441 | compute hosts as down and then to change the state of the instances to |
419 | Error, by creating Vitrage deduced alarms.</para> | 442 | ERROR, by creating Vitrage deduced alarms.</para> |
420 | 443 | ||
421 | <programlisting>metadata: | 444 | <programlisting>metadata: |
422 | name: test_nova_mark_instance_err | 445 | name: test_nova_mark_instance_err |
@@ -466,7 +489,7 @@ scenarios: | |||
466 | properties: | 489 | properties: |
467 | state: ERROR</programlisting> | 490 | state: ERROR</programlisting> |
468 | 491 | ||
469 | <para>For the action type of fencing a similar action item must be | 492 | <para>For the action type of fencing, a similar action item must be |
470 | added:</para> | 493 | added:</para> |
471 | 494 | ||
472 | <programlisting>- scenario: | 495 | <programlisting>- scenario: |
@@ -477,8 +500,9 @@ scenarios: | |||
477 | action_target: | 500 | action_target: |
478 | target: host</programlisting> | 501 | target: host</programlisting> |
479 | 502 | ||
480 | <para>After a template is configured, it is required to restart the | 503 | <para>After a template is configured, a restart of the |
481 | vitrage-api and vitrage-graph services:</para> | 504 | <literal>vitrage-api</literal> and <literal>vitrage-graph</literal> |
505 | services is needed:</para> | ||
482 | 506 | ||
483 | <programlisting>root@node-6:~# systemctl restart vitrage-api | 507 | <programlisting>root@node-6:~# systemctl restart vitrage-api |
484 | root@node-6:~# systemctl restart vitrage-graph</programlisting> | 508 | root@node-6:~# systemctl restart vitrage-graph</programlisting> |
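<para>The template syntax can also be checked with the Vitrage CLI before
the services are restarted. The command below is a sketch; the file name is
a placeholder and it assumes the python-vitrageclient CLI is installed on
the node:</para>

<programlisting>root@node-6:~# vitrage template validate --path /etc/vitrage/templates/test_nova_mark_instance_err.yaml</programlisting>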
@@ -487,12 +511,12 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting> | |||
487 | <section id="vitrage_custom"> | 511 | <section id="vitrage_custom"> |
488 | <title>Vitrage Customizations</title> | 512 | <title>Vitrage Customizations</title> |
489 | 513 | ||
490 | <para>ENEA NFV Core 1.0 has added custom features for Vitrage which | 514 | <para>Enea NFV Core 1.0 has added custom features for Vitrage which |
491 | allow two kinds of action:</para> | 515 | allow two kinds of actions:</para> |
492 | 516 | ||
493 | <orderedlist> | 517 | <orderedlist> |
494 | <listitem> | 518 | <listitem> |
495 | <para>Perform actions Northbound of the VIM</para> | 519 | <para>Perform actions Northbound of the VIM:</para> |
496 | 520 | ||
497 | <itemizedlist> | 521 | <itemizedlist> |
498 | <listitem> | 522 | <listitem> |
@@ -500,23 +524,23 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting> | |||
500 | </listitem> | 524 | </listitem> |
501 | 525 | ||
502 | <listitem> | 526 | <listitem> |
503 | <para>Setting instance state to error in nova; this is used in | 527 | <para>Setting instance state to ERROR in nova. This is used in |
504 | conjunction with an alarm created by Tacker, as described | 528 | conjunction with an alarm created by Tacker, as described |
505 | before, should allow Tacker to detect when an instance is | 529 | before, and should allow Tacker to detect when an instance is |
506 | affected and take proper actions.</para> | 530 | affected and take proper actions.</para> |
507 | </listitem> | 531 | </listitem> |
508 | </itemizedlist> | 532 | </itemizedlist> |
509 | </listitem> | 533 | </listitem> |
510 | 534 | ||
511 | <listitem> | 535 | <listitem> |
512 | <para>Perform actions Southbound of the VIM.</para> | 536 | <para>Perform actions Southbound of the VIM:</para> |
513 | 537 | ||
514 | <para>Vitrage templates allow us to program fencing actions for | 538 | <para>Vitrage templates allow us to program fencing actions for |
515 | hosts with failed services. In the event of that systemd is unable | 539 | hosts with failed services. In the event that |
516 | to recover from a critical process or other type of sofware error | 540 | <literal>systemd</literal> is unable to recover a critical |
517 | ocurs on Hardware supporting them, we can program a fencing of that | 541 | process, or another type of software error occurs on the hardware supporting |
518 | Node which will perform a reboot thus attempting to recover a failed | 542 | them, the fencing of that node can be programmed, and it in turn will |
519 | node.</para> | 543 | perform a reboot, attempting to recover the failed node.</para> |
520 | </listitem> | 544 | </listitem> |
521 | </orderedlist> | 545 | </orderedlist> |
522 | </section> | 546 | </section> |
@@ -529,48 +553,49 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting> | |||
529 | characteristics employ pacemaker for achieving highly available OpenStack | 553 | characteristics employ pacemaker for achieving highly available OpenStack |
530 | services. Traditionally pacemaker has been used for managing only the | 554 | services. Traditionally pacemaker has been used for managing only the |
531 | control plane services, so it can effectively provide redundancy and | 555 | control plane services, so it can effectively provide redundancy and |
532 | recovery for the Controller nodes only. One reason for this is that | 556 | recovery for the Controller nodes only. A reason for this is that |
533 | Controller nodes and Compute nodes essentially have very different High | 557 | Controller nodes and Compute nodes essentially have very different High |
534 | Availability requirements that need to be considered. Typically, for | 558 | Availability requirements that need to be considered. </para> |
535 | Controller nodes, the services that run on them are stateless, with few | 559 | |
536 | exceptions, where only one instance of a given service is allowed, but for | 560 | <para>Typically, for Controller nodes, the services that run on them are |
537 | which redundancy is still desired, one good example being an AMQP service | 561 | stateless, with few exceptions, where only one instance of a given service |
538 | (e.g. RabbitMQ). Compute nodes HA requirements depend on the type of | 562 | is allowed, but for which redundancy is still desired. A good example |
539 | services that run on them, but typically it is desired that failures on | 563 | would be an AMQP service (e.g. RabbitMQ). Compute nodes H.A. requirements |
540 | these nodes is detected as soon as possible so that the instances that run | 564 | depend on the type of services that run on them, but typically it is |
541 | on them can be either migrated, resurrected or restarted. One other aspect | 565 | desired that failures on these nodes be detected as soon as possible so |
542 | is that sometimes failures on the physical hosts do not necessarily cause | 566 | that the instances that run on them can be either migrated, resurrected or |
543 | a failure on the services (VNFs), but having these services incapacitated | 567 | restarted. Sometimes failures on the physical hosts do not necessarily |
544 | can prevent accessing and controlling the services.</para> | 568 | cause a failure on the services (VNFs), but having these services |
545 | 569 | incapacitated can prevent access to and controlling the services.</para> | |
546 | <para>So Controller High Availability is one subject which is in general | 570 | |
547 | well understood and experimented with, and the base of achieving this is | 571 | <para>Controller High Availability is thus a subject generally well |
548 | Pacemaker using Corosync underneath.</para> | 572 | understood and experimented with, and the basis for this is Pacemaker |
573 | using Corosync underneath.</para> | ||
549 | 574 | ||
550 | <para>Extending the use of pacemaker to Compute nodes was thought as a | 575 | <para>Extending the use of pacemaker to Compute nodes was thought as a |
551 | possible solution for providing VNF high availability, but this turns out | 576 | possible solution for providing VNF high availability, but the problem |
552 | to be a problem which is not easy to solve. On one hand pacemaker as a | 577 | turned out to be more complicated. On one hand, pacemaker, as a clustering |
553 | clustering tool can only scale properly up to limited number of nodes, | 578 | tool, can only scale properly up to a limited number of nodes, usually |
554 | usually less than 128. This poses a problem for large scale deployments | 579 | less than 128. This poses a problem for large scale deployments where |
555 | where hundreds of compute nodes are required. On the other hand, Compute | 580 | hundreds of compute nodes are required. On the other hand, Compute node |
556 | node HA requires other considerations and calls for specially designed | 581 | H.A. requires other considerations and calls for specially designed |
557 | solutions.</para> | 582 | solutions.</para> |
558 | 583 | ||
559 | <section id="pm_remote"> | 584 | <section id="pm_remote"> |
560 | <title>Pacemaker Remote</title> | 585 | <title>Pacemaker Remote</title> |
561 | 586 | ||
562 | <para>As mentioned earlier, pacemaker and corosync do not scale well | 587 | <para>As mentioned earlier, pacemaker and corosync do not scale well |
563 | over a large cluster, because each node has to talk to everyone, | 588 | over a large cluster, since each node has to talk to every other, |
564 | essentially creating a mesh configuration. Some solution to this problem | 589 | essentially creating a mesh configuration. A solution to this problem |
565 | could be partitioning the cluster into smaller groups, but this solution | 590 | could be partitioning the cluster into smaller groups, but this has its |
566 | has its limitation and it's generally difficult to manage.</para> | 591 | limitations and it is generally difficult to manage. </para> |
567 | 592 | ||
568 | <para>A better solution is using pacemaker-remote, a feature of | 593 | <para>A better solution is using <literal>pacemaker-remote</literal>, a |
569 | pacemaker which allows extending the cluster beyond the usual limits by | 594 | feature of pacemaker, which allows for extending the cluster beyond the |
570 | using the pacemaker monitoring capabilities, essentially creating a new | 595 | usual limits by using the pacemaker monitoring capabilities. It |
571 | type of resource which enables adding light weight nodes to the cluster. | 596 | essentially creates a new type of resource which enables adding light |
572 | More information about pacemaker-remote can be found on the official | 597 | weight nodes to the cluster. More information about pacemaker-remote can |
573 | clusterlabs website.</para> | 598 | be found on the official clusterlabs website.</para> |
574 | 599 | ||
575 | <para>Please note that at this moment pacemaker remote must be | 600 | <para>Please note that at this moment pacemaker remote must be |
576 | configured manually after deployment. Here are the manual steps for | 601 | configured manually after deployment. Here are the manual steps for |
@@ -578,13 +603,13 @@ root@node-6:~# systemctl restart vitrage-graph</programlisting> | |||
578 | 603 | ||
579 | <orderedlist> | 604 | <orderedlist> |
580 | <listitem> | 605 | <listitem> |
581 | <para>Logon to the Fuel Master using the default credentials if not | 606 | <para>Log onto the Fuel Master using the default credentials, if |
582 | changed (root/r00tme)</para> | 607 | they have not been changed (root/r00tme).</para> |
583 | </listitem> | 608 | </listitem> |
584 | 609 | ||
585 | <listitem> | 610 | <listitem> |
586 | <para>Type fuel node to obtain the list of nodes, their roles and | 611 | <para>Type fuel node to obtain the list of nodes, their roles and |
587 | the IP addresses</para> | 612 | the IP addresses.</para> |
588 | 613 | ||
589 | <programlisting>[root@fuel ~]# fuel node | 614 | <programlisting>[root@fuel ~]# fuel node |
590 | id | status | name | cluster | ip | mac | roles / | 615 | id | status | name | cluster | ip | mac | roles / |
@@ -604,10 +629,10 @@ controller, vitrage | | 1 | 1</programlisting> | |||
604 | </listitem> | 629 | </listitem> |
605 | 630 | ||
606 | <listitem> | 631 | <listitem> |
607 | <para>Each controller has a unique pacemaker authkey, we need to | 632 | <para>Each controller has a unique pacemaker authkey. One needs to |
608 | keep one an propagate it to the other servers. Assuming node-1, | 633 | be kept and propagated to the other servers. Assuming node-1, node-2 |
609 | node-2 and node-3 are the controllers, execute the following from | 634 | and node-3 are the controllers, execute the following from the Fuel |
610 | the Fuel console:</para> | 635 | console:</para> |
611 | 636 | ||
612 | <programlisting>[root@fuel ~]# scp node-1:/etc/pacemaker/authkey . | 637 | <programlisting>[root@fuel ~]# scp node-1:/etc/pacemaker/authkey . |
613 | [root@fuel ~]# scp authkey node-2:/etc/pacemaker/ | 638 | [root@fuel ~]# scp authkey node-2:/etc/pacemaker/ |
@@ -619,7 +644,7 @@ controller, vitrage | | 1 | 1</programlisting> | |||
619 | 644 | ||
620 | <listitem> | 645 | <listitem> |
621 | <para>For each compute node, log on to it using the corresponding | 646 | <para>For each compute node, log on to it using the corresponding |
622 | IP.</para> | 647 | IP.</para> |
623 | </listitem> | 648 | </listitem> |
624 | 649 | ||
625 | <listitem> | 650 | <listitem> |
@@ -629,7 +654,7 @@ controller, vitrage | | 1 | 1</programlisting> | |||
629 | </listitem> | 654 | </listitem> |
630 | 655 | ||
631 | <listitem> | 656 | <listitem> |
632 | <para>Copy the authkey from the Fuel master and make sure the right | 657 | <para>Copy the authkey from the Fuel Master and make sure the right |
633 | permissions are set:</para> | 658 | permissions are set:</para> |
634 | 659 | ||
635 | <programlisting>[root@node-4:~]# cp authkey /etc/pacemaker | 660 | <programlisting>[root@node-4:~]# cp authkey /etc/pacemaker |
@@ -637,21 +662,22 @@ controller, vitrage | | 1 | 1</programlisting> | |||
637 | </listitem> | 662 | </listitem> |
638 | 663 | ||
639 | <listitem> | 664 | <listitem> |
640 | <para>Add iptables rule for the default port (3121). Also save it to | 665 | <para>Add an iptables rule for the default port (3121). Save it also |
641 | /etc/iptables/rules.v4 to make it persistent:</para> | 666 | to <literal>/etc/iptables/rules.v4</literal> to make it |
667 | persistent:</para> | ||
642 | 668 | ||
643 | <programlisting>root@node-4:~# iptables -A INPUT -s 192.168.0.0/24 -p tcp -m multiport / | 669 | <programlisting>root@node-4:~# iptables -A INPUT -s 192.168.0.0/24 -p tcp -m multiport / |
644 | --dports 3121 -m comment --comment "pacemaker_remoted from 192.168.0.0/24" -j ACCEPT </programlisting> | 670 | --dports 3121 -m comment --comment "pacemaker_remoted from 192.168.0.0/24" -j ACCEPT</programlisting> |
645 | </listitem> | 671 | </listitem> |
646 | 672 | ||
647 | <listitem> | 673 | <listitem> |
648 | <para>Start the pacemaker-remote service</para> | 674 | <para>Start the pacemaker-remote service:</para> |
649 | 675 | ||
650 | <programlisting>[root@node-4:~]# systemctl start pacemaker-remote.service</programlisting> | 676 | <programlisting>[root@node-4:~]# systemctl start pacemaker-remote.service</programlisting> |
651 | </listitem> | 677 | </listitem> |
652 | 678 | ||
653 | <listitem> | 679 | <listitem> |
654 | <para>Log on one of the controller nodes and configure the | 680 | <para>Log onto one of the controller nodes and configure the |
655 | pacemaker-remote resources:</para> | 681 | pacemaker-remote resources:</para> |
656 | 682 | ||
657 | <programlisting>[root@node-1:~]# pcs resource create node-4.domain.tld remote | 683 | <programlisting>[root@node-1:~]# pcs resource create node-4.domain.tld remote |
@@ -685,20 +711,21 @@ RemoteOnline: [ node-4.domain.tld node-5.domain.tld ]</programlisting> | |||
685 | <title>Pacemaker Fencing</title> | 711 | <title>Pacemaker Fencing</title> |
686 | 712 | ||
687 | <para>ENEA NFV Core 1.0 makes use of the fencing capabilities of | 713 | <para>ENEA NFV Core 1.0 makes use of the fencing capabilities of |
688 | Pacemaker to isolate faulty nodes and trigger recovery actions by means | 714 | pacemaker to isolate faulty nodes and trigger recovery actions by means |
689 | of power cycling the failed nodes. Fencing is configured by creating | 715 | of power cycling the failed nodes. Fencing is configured by creating |
690 | STONITH type resources for each of the servers in the cluster, both | 716 | <literal>STONITH</literal> type resources for each of the servers in the |
691 | Controller nodes and Compute nodes. The STONITH adapter for fencing the | 717 | cluster, both Controller nodes and Compute nodes. The |
692 | nodes is fence_ipmilan, which makes use of the IPMI capabilities of the | 718 | <literal>STONITH</literal> adapter for fencing the nodes is |
693 | Cavium ThunderX servers.</para> | 719 | <literal>fence_ipmilan</literal>, which makes use of the IPMI |
720 | capabilities of the ThunderX servers.</para> | ||
694 | 721 | ||
695 | <para>Here are the steps for enabling fencing capabilities in the | 722 | <para>Here are the steps for enabling fencing capabilities on a |
696 | cluster:</para> | 723 | cluster:</para> |
697 | 724 | ||
698 | <orderedlist> | 725 | <orderedlist> |
699 | <listitem> | 726 | <listitem> |
700 | <para>Logon to the Fuel Master using the default credentials if not | 727 | <para>Log onto the Fuel Master using the default credentials, if |
701 | changed (root/r00tme).</para> | 728 | they have not been changed (root/r00tme).</para> |
702 | </listitem> | 729 | </listitem> |
703 | 730 | ||
704 | <listitem> | 731 | <listitem> |
@@ -719,18 +746,17 @@ id | status | name | cluster | ip | mac | roles | |||
719 | 2 | ready | Untitled (8b:64) | 1 | 10.20.0.3 | 68:05:ca:46:8b:64 | / | 746 | 2 | ready | Untitled (8b:64) | 1 | 10.20.0.3 | 68:05:ca:46:8b:64 | / |
720 | controller, mongo, tacker | | 1 | 1 | 747 | controller, mongo, tacker | | 1 | 1 |
721 | 3 | ready | Untitled (8c:45) | 1 | 10.20.0.5 | 68:05:ca:46:8c:45 | / | 748 | 3 | ready | Untitled (8c:45) | 1 | 10.20.0.5 | 68:05:ca:46:8c:45 | / |
722 | controller, vitrage | | 1 | 1 | 749 | controller, vitrage | | 1 | 1</programlisting> |
723 | </programlisting> | ||
724 | </listitem> | 750 | </listitem> |
725 | 751 | ||
726 | <listitem> | 752 | <listitem> |
727 | <para>Logon to each server to install additional packages:</para> | 753 | <para>Log onto each server to install additional packages:</para> |
728 | 754 | ||
729 | <programlisting>[root@node-1:~]# apt-get install fence-agents ipmitool</programlisting> | 755 | <programlisting>[root@node-1:~]# apt-get install fence-agents ipmitool</programlisting> |
730 | </listitem> | 756 | </listitem> |
731 | 757 | ||
732 | <listitem> | 758 | <listitem> |
733 | <para>Configure pacemaker fencing resources; this needs to be done | 759 | <para>Configure pacemaker fencing resources. This needs to be done |
734 | once on one of the controllers. The parameters will vary, depending | 760 | once on one of the controllers. The parameters will vary, depending |
735 | on the BMC addresses of each node and credentials.</para> | 761 | on the BMC addresses of each node and credentials.</para> |
736 | 762 | ||
@@ -752,9 +778,9 @@ ipaddr=10.0.100.155 login=ADMIN passwd=ADMIN op monitor interval="60s"</programl | |||
752 | </listitem> | 778 | </listitem> |
753 | 779 | ||
754 | <listitem> | 780 | <listitem> |
755 | <para>Activate fencing by enabling stonith property in pacemaker (by | 781 | <para>Activate fencing by enabling the <literal>stonith</literal> |
756 | default it is disabled); this also needs to be done only once, on | 782 | property in pacemaker (disabled by default). This also needs to be |
757 | one of the controllers.</para> | 783 | done only once, on one of the controllers.</para> |
758 | 784 | ||
759 | <programlisting>[root@node-1:~]# pcs property set stonith-enabled=true</programlisting> | 785 | <programlisting>[root@node-1:~]# pcs property set stonith-enabled=true</programlisting> |
760 | </listitem> | 786 | </listitem> |
@@ -767,28 +793,25 @@ ipaddr=10.0.100.155 login=ADMIN passwd=ADMIN op monitor interval="60s"</programl | |||
767 | 793 | ||
768 | <para>The OpenStack community has been working for some time on | 794 | <para>The OpenStack community has been working for some time on |
769 | identifying possible solutions for enabling High Availability for Compute | 795 | identifying possible solutions for enabling High Availability for Compute |
770 | nodes, although initially the subject of HA on compute node was very | 796 | nodes, after a period of belief that this subject was not something that |
771 | controversial as not being something that should concern the cloud | 797 | should concern the cloud platform. Over time it became obvious that even |
772 | platform. Over time it became obvious that even on a true cloud platform, | 798 | on a true cloud platform, where services are designed to run without being |
773 | where services are designed to run without being affected by the | 799 | affected by the availability of the cloud platform, fault management and |
774 | availability of the cloud platform, fault management and recovery is still | 800 | recovery are still very important and desirable. This is also the case for |
775 | very important and desirable. This is very much the case for NFV | 801 | NFV applications, where in the good tradition of telecom applications, the |
776 | applications, where, in the good tradition of telecom applications, the | 802 | operators must have complete engineering control over the resources they |
777 | operators must have complete engineering control over the resources it | 803 | own and manage.</para> |
778 | owns and manages.</para> | 804 | |
779 | 805 | <para>The work for Compute node High Availability is captured in an | |
780 | <para>The work for compute node high availability is captured in an | ||
781 | OpenStack user story and documented upstream, showing proposed solutions, | 806 | OpenStack user story and documented upstream, showing proposed solutions, |
782 | summit talks and presentations.</para> | 807 | summit talks and presentations. A number of these solutions make use of |
783 | 808 | OpenStack Resource Agents, which are a set of specialized pacemaker | |
784 | <para>A number of these solutions make use of OpenStack Resource Agents, | 809 | resources capable of identifying failures in compute nodes and performing |
785 | which are basically a set of specialized pacemaker resources which are | 810 | automatic evacuation of the instances affected by these failures.</para> |
786 | capable of identifying failures in compute nodes and can perform automatic | ||
787 | evacuation of the instances affected by these failures.</para> | ||
788 | 811 | ||
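<para>As an illustration of what such a setup looks like, the
evacuation-based approach typically adds an OpenStack resource agent such
as NovaEvacuate to the pacemaker cluster. The sketch below is not the
validated ENEA NFV Core configuration: the agent and parameter names come
from the upstream openstack-resource-agents project, and the Keystone URL
and credentials are placeholders:</para>

<programlisting>[root@node-1:~]# pcs resource create nova-evacuate ocf:openstack:NovaEvacuate \
  auth_url=http://<keystone-vip>:5000/v2.0 username=admin password=<password> \
  tenant_name=admin op monitor interval="60s"</programlisting>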
789 | <para>ENEA NFV Core 1.0 aims to validate and integrate this work and to | 812 | <para>ENEA NFV Core 1.0 aims to validate and integrate this work and to |
790 | make this feature available in the platform to be used as an alternative | 813 | make this feature available in the platform as an alternative to the |
791 | to the Doctor framework, where simple, autonomous recovery of the running | 814 | Doctor framework, where simple, autonomous recovery of running instances |
792 | instances is desired.</para> | 815 | is desired.</para> |
793 | </section> | 816 | </section> |
794 | </chapter> \ No newline at end of file | 817 | </chapter> \ No newline at end of file |