High Availability Guide
Enea NFV Core 1.0 has been designed to provide the high availability
characteristics needed for developing and deploying telco-grade NFV
solutions on top of our OPNFV-based platform. High availability in general
is a very wide subject and remains an important focus of both open source
communities and the independent/proprietary solutions market.
Enea NFV Core 1.0 initially aims to leverage the efforts of the
upstream OPNFV and OpenStack open source projects, combining solutions from
both worlds to provide flexibility and use-case coverage. Enea also has
long-term expertise and proprietary solutions addressing high availability
for telco applications, which are candidates for integration with the
NFV-based solutions.
High Availability Levels
The foundation for the feature set available in Enea NFV Core is
divided into three levels:
Hardware Fault
NFV Platform H.A.
VNF High Availability
The same division of levels for fault management can be seen in the
scope of the High Availability for OPNFV ("Availability") project. OPNFV
also hosts Doctor, a fault management and maintenance project which
develops the corresponding requirements and their implementation for the
OPNFV reference platform. These two projects complement each other.
The Availability project addresses H.A. requirements and solutions
from the perspective of the three levels mentioned above. It produces high
level requirements and API definitions for High Availability for OPNFV and
an H.A. Gap Analysis Report for OpenStack; more recently, it works on
optimizing existing OPNFV test frameworks, such as Yardstick, by
developing test cases which realize H.A.-specific use-cases and scenarios
derived from the H.A. requirements.
The Doctor project aims to build a fault management and maintenance
framework for the high availability of Network Services, on top of a
virtualized infrastructure. Its key feature is immediate notification by
the VIM when virtualized resources become unavailable, so that recovery of
the VNFs running on them can be triggered.
The Doctor project has also collaborated with the Availability
project on identifying gaps in upstream projects such as, but not
exclusively, OpenStack. It has also worked on implementing missing
features and improving functionality; a good example is the Aodh
event-based alarms, which allow fast notification when certain predefined
events occur.
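For illustration, an event alarm of the kind described can be created
directly with the Aodh CLI. This is a minimal sketch; the alarm name,
instance ID and callback URL are placeholder values:
# Event alarm that fires when a given instance transitions to the
# error state; Aodh then issues an HTTP call to the given callback URL.
aodh alarm create --name instance-error-alarm --type event \
  --event-type "compute.instance.update" \
  --query 'traits.instance_id=string::4b30c9b9-0723-4dfc-a1e3-85e368f14f14;traits.state=string::error' \
  --alarm-action 'http://10.0.6.42:9890/alarm-callback'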
The Doctor project also produced an architectural design and a
reference implementation based on open source components, which is
presented later in this document.
Doctor Architecture
The Doctor project documentation shows the detailed architecture for
Fault Management and NFVI Maintenance. Since the two are quite similar,
the following sections focus on Fault Management.
The architecture specifies a set of functional blocks:
Monitor - monitors the
virtualized infrastructure, capturing fault events in software and
hardware. For this component Enea NFV Core 1.0 uses Zabbix, which is
integrated into the platform through the Fuel Zabbix Plugin, available
upstream.
Inspector - this component
receives notifications from Monitor components and OpenStack core
components, allowing it to create logical relationships between
entities, identify affected resources when faults occur, and to
communicate with Controllers in order to update the states of the
virtual and physical resources.
For this component Enea NFV Core 1.0 makes use of Vitrage, an
OpenStack project used for Root Cause Analysis. The integration into
the platform is done with the help of a Fuel plugin developed
internally by Enea.
Controller - OpenStack core
components act as Controllers responsible for maintaining the resource
map between physical and virtual resources. They accept update
requests from the Inspector and are responsible for sending failure
event notifications to the Notifier. Components such as Nova, Neutron,
Glance, and Heat, act as Controllers in the Doctor
Architecture.
Notifier - this component
selects and aggregates failure events received from the Controller,
based on policies mandated by the Consumer. The role of the Notifier
is filled by the Aodh component in OpenStack.
Alongside the Doctor components, there are a few other blocks
mentioned:
Administrator - this represents
the human role of administrating the platform by means of dedicated
interfaces. These can be visual dashboards like OpenStack Horizon or
Fuel Dashboard, or via CLI tools like the OpenStack unified CLI, that
can be accessed from one of the servers that act as OpenStack
Controller nodes.
In Enea NFV Core 1.0 the Administrator can also access the
Zabbix dashboard to perform supplementary configurations. The same
applies for the Vitrage tool, which comes with its own Horizon
dashboard, enabling the user to visually inspect the faults reported
by the monitoring tools through visual representations of the virtual
and physical resources, the relationships between them and the fault
correlation.
For Vitrage, users will usually want to configure additional
use-cases and describe relationships between components via template
files written in YAML format.
Consumer - this block is
only vaguely described in the Doctor architecture and is out of its
current scope. Doctor deals solely with fault detection and
management; since the actual VNFs are managed, according to the ETSI
architecture, by a different entity, Doctor does not deal with
recovery actions for the VNFs. The role of the Consumer thus falls to
a VNF Manager and Orchestrator.
Enea NFV Core 1.0 provides VNF management capabilities using
Tacker, an OpenStack project that implements a generic VNF Manager
and Orchestrator according to the ETSI MANO Architectural
Framework.
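As a point of reference, onboarding and instantiating a VNF with the
Tacker CLI of this OpenStack generation looks roughly as follows; the
descriptor file and the names used are hypothetical:
# Onboard a VNF descriptor (a TOSCA template), then instantiate it
tacker vnfd-create --vnfd-file sample-vnfd.yaml sample-vnfd
tacker vnf-create --vnfd-name sample-vnfd sample-vnf
# Verify the resulting VNF status
tacker vnf-list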
The functional block overview in the picture below has been
complemented to show the components used for realizing the Doctor
architecture:
Doctor Fault Management
The architecture described in the Doctor project has been
demonstrated in various PoCs and demos, but always using sample
components for either the consumer or the monitor. Enea has worked with
the upstream Doctor and Vitrage projects to realize the goals of the
Doctor project using real components, as described above.
The two pictures below show a typical fault management
scenario:
Enea NFV Core 1.0 uses the same approach described above:
When creating a VNF, the user must enable the
monitoring capabilities of Tacker by passing a template which
specifies that an alarm will be created when the VM represented by
this VNF changes state. The support for alarm monitoring in Tacker
is detailed in the Alarm Monitoring Framework spec in the OpenStack
documentation; a sketch of such a policy is shown after this
message flow.
Tacker creates the VNF and then an Aodh alarm of type event,
triggered when the instance enters the ERROR state. When the alarm
fires, it performs an HTTP call to a URL managed by Tacker. As a
result of this action, Tacker can detect when an instance has failed
(for whatever reason) and will respawn it somewhere else.
The subscribed response in this case is an empty operation;
the Notifier (Aodh) only has to confirm that the alarm has been
created.
The NFVI sends monitoring events for the resources to which the
VIM has subscribed.
The subscription message exchange between the VIM and NFVI
is not shown in this message flow. This step relates to
Vitrage's capability of receiving notifications from OpenStack
services. At this moment Vitrage supports notifications from the
nova.host, nova.instance,
nova.zone, cinder.volume,
neutron.network,
neutron.port and heat.stack
OpenStack datasources.
This step covers faults detected by Zabbix, which are sent
to the Inspector (Vitrage) as soon as they are detected. This is done
using a push approach, by sending an AMQP message to a dedicated
message queue managed by Vitrage. For example, if
nova-compute fails on one of the compute nodes,
Zabbix will format a message specifying all the details required for
processing the fault: a timestamp, which host failed, what event
occurred, etc.
This step shows the database lookup geared to finding the virtual
resources affected by the detected fault. Vitrage performs
various calculations to determine which virtual resources are affected
by the raw failure reported by Zabbix.
Vitrage can be configured via templates to correlate instances
with the physical hosts they are running on, so that if a compute
node fails, the instances running on that host are considered
affected. A typical use-case is to mark the compute node down
(mark_host_down) and update the states of all
instances running on it. This is done by issuing Nova API calls
for each of these instances.
Step 5c shows the Controller (Nova in this case) acting upon
the state change of the instance and issuing an event alarm to
Aodh.
The Notifier acknowledges the alarm event request from
Nova and triggers the alarm(s) created by Tacker in step 1.
Since Tacker has configured the alarm to send an HTTP request, Aodh
performs that HTTP call to the URL managed by Tacker.
The Consumer (Tacker) will react to the HTTP call and perform
the action configured by the user (e.g. respawn the VNF).
The action is sent to the Controller (Nova) so that the VNF is
recreated.
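The monitoring template referenced in step 1 is declared in the VNFD
as a TOSCA policy. Below is a minimal sketch modeled on the Alarm
Monitoring Framework spec; the policy and trigger names are
illustrative, and the exact condition keys vary between Tacker
releases:
policies:
  - vdu1_monitoring_policy:
      type: tosca.policies.tacker.Alarming
      triggers:
        vdu_down_respawning:
          # Aodh evaluates the condition and calls back into Tacker,
          # which then executes the respawn action
          event_type:
            type: tosca.events.resource.utilization
            implementation: ceilometer
          metrics: cpu_util
          condition:
            threshold: 50
            comparison_operator: gt
            period: 600
            evaluations: 1
            method: avg
          actions: [respawn]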
The Enea NFV Core 1.0 Pre-Release fully covers the required
Doctor functionality only for the Vitrage and Zabbix
components.
Zabbix Configuration for Push Notifications
Vitrage supports a Zabbix datasource by means of regularly polling
the Zabbix agents, which need to be configured in advance. The Vitrage
Fuel plugin developed internally by Enea can automatically configure
Zabbix so that everything works as expected. Polling, however, is not
fast enough for a telco use-case, so it is necessary to configure push
notifications for Zabbix. This requires manual configuration on one of
the controller nodes; since Zabbix uses a centralized database, the
configuration becomes available on all the other nodes.
The Zabbix configuration dashboard is available at the same IP
address where OpenStack can be reached, e.g.
http://10.0.6.42/zabbix.
To forward Zabbix events to Vitrage, a new media script needs to
be created and associated with a user. Follow the steps below as a
Zabbix Admin user:
Create a new media type [Administration > Media Types >
Create Media Type]
Name: Vitrage Notifications
Type: Script
Script name: zabbix_vitrage.py
Modify the media for the Admin user [Administration >
Users]
Type: Vitrage Notifications
Send to:
rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/
--- the Vitrage message bus URL (it can be found as
transport_url in /etc/vitrage/vitrage.conf or
/etc/nova/nova.conf)
When active: 1-7, 00:00-24:00
Use if severity: (all)
Status: Enabled
Configure Action [Configuration > Actions > Create
Action > Action]
Name: Forward to Vitrage
Default Subject: {TRIGGER.STATUS}
Default Message:
host={HOST.NAME1}
hostid={HOST.ID1}
hostip={HOST.IP1}
triggerid={TRIGGER.ID}
description={TRIGGER.NAME}
rawtext={TRIGGER.NAME.ORIG}
expression={TRIGGER.EXPRESSION}
value={TRIGGER.VALUE}
priority={TRIGGER.NSEVERITY}
lastchange={EVENT.DATE} {EVENT.TIME}
To send events, add the condition "Maintenance status not in
maintenance" under the Conditions tab.
Finally, add an operation:
Send to Users: Admin
Send only to: Vitrage Notifications
With these settings in place, Zabbix will call the
zabbix_vitrage.py script, made readily available by
the Fuel Vitrage Plugin, passing it the arguments described in step 3.
The zabbix_vitrage.py script will then interpret the
parameters and format an AMQP message to be sent to the
vitrage.notifications queue, managed by the
vitrage-graph service.
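The forwarding chain can be exercised by invoking the media script by
hand with the same three arguments Zabbix passes to it (recipient,
subject, message body). This is a sketch for manual testing only; the
host details below are placeholders mirroring the Default Message
fields from step 3:
python zabbix_vitrage.py 'rabbit://rabbit_user:rabbit_pass@127.0.0.1:5672/' 'PROBLEM' \
'host=node-4.domain.tld
hostid=4
hostip=10.20.0.6
triggerid=13578
description=Nova Compute process is not running
rawtext=Nova Compute process is not running on {HOST.NAME}
expression={node-4:proc.num[nova-compute].last(0)}=0
value=1
priority=4
lastchange=2017.08.24 12:00:00'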
Vitrage Configuration
The Vitrage team has been collaborating with the OPNFV Doctor
project in order to support Vitrage as an Inspector component. The
Doctor use-case for Vitrage is described in an OpenStack blueprint. Enea
NFV Core has complemented Vitrage with the ability to set the states of
failed instances, by implementing a new Vitrage action type which calls
the Nova API to set instances to the ERROR state. An action type which
allows fencing failed hosts also exists.
In order to make use of these features, Vitrage supports
additional configuration via YAML templates, which
must be placed in /etc/vitrage/templates on the nodes
that have the Vitrage role.
The example below shows how to program Vitrage to mark failed
compute hosts as down and then change the state of their instances to
ERROR, by creating Vitrage deduced alarms.
metadata:
  name: test_nova_mark_instance_err
  description: test description
definitions:
  entities:
    - entity:
        category: ALARM
        type: zabbix
        rawtext: Nova Compute process is not running on {HOST.NAME}
        template_id: zabbix_alarm
    - entity:
        category: RESOURCE
        type: nova.host
        template_id: host
    - entity:
        category: RESOURCE
        type: nova.instance
        template_id: instance
  relationships:
    - relationship:
        source: zabbix_alarm
        relationship_type: on
        target: host
        template_id: nova_process_not_running
    - relationship:
        source: host
        target: instance
        relationship_type: contains
        template_id: host_contains_instance
scenarios:
  - scenario:
      condition: nova_process_not_running and host_contains_instance
      actions:
        - action:
            action_type: mark_down
            action_target:
              target: host
        - action:
            action_type: set_instance_state
            action_target:
              target: instance
        - action:
            action_type: set_state
            action_target:
              target: instance
            properties:
              state: ERROR
For the fence action type, a similar scenario must be
added:
- scenario:
    condition: critical_problem_on_host
    actions:
      - action:
          action_type: fence
          action_target:
            target: host
After a template is configured, a restart of the
vitrage-api and vitrage-graph
services is needed:
root@node-6:~# systemctl restart vitrage-api
root@node-6:~# systemctl restart vitrage-graph
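If python-vitrageclient is installed, the template syntax can also be
checked before the restart; this sketch assumes admin credentials have
been sourced on the node:
root@node-6:~# vitrage template validate --path /etc/vitrage/templates/test_nova_mark_instance_err.yaml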
Vitrage Customizations
Enea NFV Core 1.0 has added custom features to Vitrage which
allow two kinds of actions:
Perform actions Northbound of the VIM:
Nova force host down on the compute node
Setting the instance state to ERROR in Nova. This is used in
conjunction with an alarm created by Tacker, as described
before, and allows Tacker to detect when an instance is
affected and take the proper actions. The equivalent Nova CLI
calls are sketched after this list.
Perform actions Southbound of the VIM:
Vitrage templates allow us to program fencing actions for
hosts with failed services. If
systemd is unable to recover a critical
process, or a software error occurs on the hardware supporting
it, the node can be programmed to be fenced; it will then be
power cycled in an attempt to recover the failed node.
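For reference, the effect of the two Northbound actions corresponds to
the following Nova CLI calls, which the custom Vitrage action types
issue through the Nova API; the host and instance names are
placeholders:
# Force the compute service on the failed host down (mark_host_down)
nova service-force-down node-4.domain.tld nova-compute
# Put an affected instance into the ERROR state
nova reset-state my-vnf-instance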
Pacemaker High Availability
Many of the OpenStack solutions which offer high availability
characteristics employ Pacemaker for achieving highly available OpenStack
services. Traditionally, Pacemaker has been used for managing only the
control plane services, so it can effectively provide redundancy and
recovery for the Controller nodes only. One reason for this is that
Controller nodes and Compute nodes have essentially very different high
availability requirements that need to be considered.
Typically, the services that run on Controller nodes are
stateless, with a few exceptions where only one instance of a given
service is allowed but redundancy is still desired; a good example is an
AMQP service (e.g. RabbitMQ). Compute node H.A. requirements
depend on the type of services that run on them, but typically it is
desired that failures on these nodes be detected as soon as possible, so
that the instances running on them can be migrated, resurrected or
restarted. Failures on the physical hosts do not necessarily
cause a failure of the services (VNFs) themselves, but having the hosts
incapacitated can prevent access to, and control of, those services.
Controller high availability is thus a subject generally well
understood and experimented with, and its basis is Pacemaker with
Corosync underneath.
Extending the use of Pacemaker to Compute nodes was considered as a
possible solution for providing VNF high availability, but the problem
turned out to be more complicated. On one hand, Pacemaker, as a clustering
tool, can only scale properly up to a limited number of nodes, usually
fewer than 128. This poses a problem for large scale deployments, where
hundreds of compute nodes are required. On the other hand, Compute node
H.A. requires other considerations and calls for specially designed
solutions.
Pacemaker Remote
As mentioned earlier, Pacemaker and Corosync do not scale well
over a large cluster, since each node has to talk to every other node,
essentially creating a mesh configuration. One solution to this problem
could be partitioning the cluster into smaller groups, but this has its
limitations and is generally difficult to manage.
A better solution is pacemaker-remote, a
feature of Pacemaker which allows extending the cluster beyond the
usual limits by using the Pacemaker monitoring capabilities. It
essentially creates a new type of resource which enables adding
lightweight nodes to the cluster. More information about pacemaker-remote
can be found on the official ClusterLabs website.
Please note that at this moment pacemaker-remote must be
configured manually after deployment. The manual steps for doing so
are:
Log onto the Fuel Master using the default credentials, if
they have not been changed (root/r00tme).
Type fuel node to obtain the list of nodes, their roles and
their IP addresses:
[root@fuel ~]# fuel node
id | status | name | cluster | ip | mac | roles /
| pending_roles | online | group_id
---+--------+------------------+---------+-----------+-------------------+----------/
-----------------+---------------+--------+---------
1 | ready | Untitled (8c:d4) | 1 | 10.20.0.4 | 68:05:ca:46:8c:d4 | ceph-osd,/
controller | | 1 | 1
4 | ready | Untitled (8c:c2) | 1 | 10.20.0.6 | 68:05:ca:46:8c:c2 | ceph-osd,/
compute | | 1 | 1
5 | ready | Untitled (8c:c9) | 1 | 10.20.0.7 | 68:05:ca:46:8c:c9 | ceph-osd,/
compute | | 1 | 1
2 | ready | Untitled (8b:64) | 1 | 10.20.0.3 | 68:05:ca:46:8b:64 | /
controller, mongo, tacker | | 1 | 1
3 | ready | Untitled (8c:45) | 1 | 10.20.0.5 | 68:05:ca:46:8c:45 | /
controller, vitrage | | 1 | 1
Each controller has a unique Pacemaker authkey. One of them needs
to be kept and propagated to the other servers. Assuming node-1, node-2
and node-3 are the controllers, execute the following from the Fuel
console:
[root@fuel ~]# scp node-1:/etc/pacemaker/authkey .
[root@fuel ~]# scp authkey node-2:/etc/pacemaker/
[root@fuel ~]# scp authkey node-3:/etc/pacemaker/
[root@fuel ~]# scp authkey node-4:~
[root@fuel ~]# scp authkey node-5:~
For each compute node, log onto it using the corresponding
IP address.
Install the required packages:
root@node-4:~# apt-get install pacemaker-remote resource-agents crmsh
Copy the authkey from the Fuel Master and make sure the right
permissions are set:
[root@node-4:~]# cp authkey /etc/pacemaker
[root@node-4:~]# chown root:haclient /etc/pacemaker/authkey
Add an iptables rule for the default port (3121). Save it also
to /etc/iptables/rules.v4 to make it
persistent:
root@node-4:~# iptables -A INPUT -s 192.168.0.0/24 -p tcp -m multiport \
--dports 3121 -m comment --comment "pacemaker_remoted from 192.168.0.0/24" -j ACCEPT
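One way to persist the rule is to dump the active ruleset to the file
mentioned above:
root@node-4:~# iptables-save > /etc/iptables/rules.v4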
Start the pacemaker-remote service:
[root@node-4:~]# systemctl start pacemaker-remote.service
Log onto one of the controller nodes and configure the
pacemaker-remote resources:
[root@node-1:~]# pcs resource create node-4.domain.tld remote
[root@node-1:~]# pcs constraint location node-4.domain.tld prefers \
node-1.domain.tld=100 node-2.domain.tld=100 node-3.domain.tld=100
[root@node-1:~]# pcs constraint location node-4.domain.tld avoids node-5.domain.tld
[root@node-1:~]# pcs resource create node-5.domain.tld remote
[root@node-1:~]# pcs constraint location node-5.domain.tld prefers \
node-1.domain.tld=100 node-2.domain.tld=100 node-3.domain.tld=100
[root@node-1:~]# pcs constraint location node-5.domain.tld avoids node-4.domain.tld
Remote nodes should now appear online:
[root@node-1:~]# pcs status
Cluster name: OpenStack
Last updated: Thu Aug 24 12:00:21 2017 Last change: Thu Aug 24 11:57:32 2017 /
by root via cibadmin on node-1.domain.tld
Stack: corosync
Current DC: node-1.domain.tld (version 1.1.14-70404b0) - partition with quorum
5 nodes and 78 resources configured
Online: [ node-1.domain.tld node-2.domain.tld node-3.domain.tld ]
RemoteOnline: [ node-4.domain.tld node-5.domain.tld ]
Pacemaker Fencing
Enea NFV Core 1.0 makes use of the fencing capabilities of
Pacemaker to isolate faulty nodes and trigger recovery actions by
power cycling the failed nodes. Fencing is configured by creating
STONITH type resources for each of the servers in the
cluster, both Controller nodes and Compute nodes. The
STONITH adapter used for fencing the nodes is
fence_ipmilan, which makes use of the IPMI
capabilities of the ThunderX servers.
Here are the steps for enabling fencing capabilities on a
cluster:
Log onto the Fuel Master using the default credentials, if
they have not been changed (root/r00tme).
Type fuel node to obtain the list of nodes, their roles and
the IP addresses:
[root@fuel ~]# fuel node
id | status | name | cluster | ip | mac | roles /
| pending_roles | online | group_id
---+--------+------------------+---------+-----------+-------------------+----------/
-----------------+---------------+--------+---------
1 | ready | Untitled (8c:d4) | 1 | 10.20.0.4 | 68:05:ca:46:8c:d4 | ceph-osd,/
controller | | 1 | 1
4 | ready | Untitled (8c:c2) | 1 | 10.20.0.6 | 68:05:ca:46:8c:c2 | ceph-osd,/
compute | | 1 | 1
5 | ready | Untitled (8c:c9) | 1 | 10.20.0.7 | 68:05:ca:46:8c:c9 | ceph-osd,/
compute | | 1 | 1
2 | ready | Untitled (8b:64) | 1 | 10.20.0.3 | 68:05:ca:46:8b:64 | /
controller, mongo, tacker | | 1 | 1
3 | ready | Untitled (8c:45) | 1 | 10.20.0.5 | 68:05:ca:46:8c:45 | /
controller, vitrage | | 1 | 1
Log onto each server to install additional packages:
[root@node-1:~]# apt-get install fence-agents ipmitool
Configure the Pacemaker fencing resources. This needs to be done
only once, on one of the controllers. The parameters will vary
depending on the BMC address and credentials of each node.
[root@node-1:~]# crm configure primitive ipmi-fencing-node-1 \
stonith:fence_ipmilan params pcmk_host_list="node-1.domain.tld" \
ipaddr=10.0.100.151 login=ADMIN passwd=ADMIN op monitor interval="60s"
[root@node-1:~]# crm configure primitive ipmi-fencing-node-2 \
stonith:fence_ipmilan params pcmk_host_list="node-2.domain.tld" \
ipaddr=10.0.100.152 login=ADMIN passwd=ADMIN op monitor interval="60s"
[root@node-1:~]# crm configure primitive ipmi-fencing-node-3 \
stonith:fence_ipmilan params pcmk_host_list="node-3.domain.tld" \
ipaddr=10.0.100.153 login=ADMIN passwd=ADMIN op monitor interval="60s"
[root@node-1:~]# crm configure primitive ipmi-fencing-node-4 \
stonith:fence_ipmilan params pcmk_host_list="node-4.domain.tld" \
ipaddr=10.0.100.154 login=ADMIN passwd=ADMIN op monitor interval="60s"
[root@node-1:~]# crm configure primitive ipmi-fencing-node-5 \
stonith:fence_ipmilan params pcmk_host_list="node-5.domain.tld" \
ipaddr=10.0.100.155 login=ADMIN passwd=ADMIN op monitor interval="60s"
Activate fencing by enabling the stonith
property in Pacemaker (disabled by default). This also needs to be
done only once, on one of the controllers:
[root@node-1:~]# pcs property set stonith-enabled=true
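The fencing setup can be verified by listing the STONITH resources
and, if desired, by fencing a node on purpose. The node name below is
an example, and note that this will power cycle the node:
[root@node-1:~]# pcs stonith show
[root@node-1:~]# pcs stonith fence node-5.domain.tld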
OpenStack Resource Agents
The OpenStack community has been working for some time on
identifying possible solutions for enabling high availability for Compute
nodes, after a period in which this subject was believed not to be a
concern of the cloud platform. Over time it became obvious that even
on a true cloud platform, where services are designed to run unaffected
by the availability of the platform itself, fault management and
recovery are still very important and desirable. This is also the case for
NFV applications, where, in the good tradition of telecom applications,
operators must have complete engineering control over the resources they
own and manage.
The work on Compute node high availability is captured in an
OpenStack user story and documented upstream, with proposed solutions,
summit talks and presentations. A number of these solutions make use of
the OpenStack Resource Agents, a set of specialized Pacemaker
resources which can identify failures in compute nodes and perform
automatic evacuation of the instances affected by these failures.
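For illustration, the NovaEvacuate agent from the
openstack-resource-agents repository is typically registered as a
cluster resource along these lines; the Keystone endpoint and
credentials are placeholders:
# One active nova-evacuate resource processes evacuation of failed hosts
[root@node-1:~]# pcs resource create nova-evacuate ocf:openstack:NovaEvacuate \
auth_url=http://10.0.6.42:5000/v2.0 username=admin password=admin \
tenant_name=admin op monitor interval=10s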
Enea NFV Core 1.0 aims to validate and integrate this work and to
make this feature available in the platform as an alternative to the
Doctor framework, for cases where simple, autonomous recovery of running
instances is desired.