2.2.5. Detailed architecture and interface specification

This section describes a detailed implementation plan based on the high-level architecture introduced in Section 3. Section 5.1 describes the functional blocks of the Doctor architecture, followed by a high-level message flow in Section 5.2. Section 5.3 maps selected existing open source components to the building blocks of the Doctor architecture; the selection is based on the maturity of the components and on the gap analysis in Section 4. Sections 5.4 and 5.5 specify the related northbound interface and the related information elements. Finally, Section 5.6 provides a first set of blueprints that address selected gaps and are required to realize the functionalities of the Doctor project.

2.2.5.1. Functional Blocks

This section introduces the functional blocks that form the VIM. OpenStack was selected as the candidate for implementation. Inside the VIM, four building blocks are defined (see Figure 6).

Functional blocks

2.2.5.1.1. Monitor

The Monitor module has the responsibility for monitoring the virtualized infrastructure. There are already many existing tools and services (e.g. Zabbix) to monitor different aspects of hardware and software resources which can be used for this purpose.

2.2.5.1.2. Inspector

The Inspector module has the ability a) to receive various failure notifications regarding physical resource(s) from Monitor module(s), b) to find the affected virtual resource(s) by querying the resource map in the Controller, and c) to update the state of the virtual resource (and physical resource).

The Inspector has drivers for different types of events and resources to integrate any type of Monitor and Controller modules. It also uses a failure policy database to decide on the failure selection and aggregation from raw events. This failure policy database is configured by the Administrator.

The reason for separation of the Inspector and Controller modules is to make the Controller focus on simple operations by avoiding a tight integration of various health check mechanisms into the Controller.

2.2.5.1.3. Controller

The Controller is responsible for maintaining the resource map (i.e. the mapping from physical resources to virtual resources), accepting update requests for the resource state(s) (exposed as a provider API), and sending all failure events regarding virtual resources to the Notifier. Optionally, the Controller can force the state of a given physical resource to down in the resource map when it receives a failure notification from the Inspector for that physical resource. The Controller also recalculates the capacity of the NFVI when it receives a failure notification for a physical resource.

In a real-world deployment, the VIM may have several controllers, one for each resource type, such as Nova, Neutron and Cinder in OpenStack. Each controller maintains a database of virtual and physical resources which shall be the master source for resource information inside the VIM.
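As a minimal sketch of the resource map idea, the following Python class keeps a host-to-virtual-resource mapping and lets a caller force a host to "down". The class and method names are hypothetical and are not taken from Nova, Neutron, or Cinder. Setting affected virtual resources to "error" is also an assumption made only for illustration.

```python
class ResourceMap:
    """Toy resource map: physical host -> virtual resources placed on it."""

    def __init__(self):
        self._vms_by_host = {}  # host ID -> set of virtual resource IDs
        self._state = {}        # resource ID -> state string

    def add_host(self, host_id):
        self._vms_by_host.setdefault(host_id, set())
        self._state.setdefault(host_id, "normal")

    def place(self, host_id, vm_id):
        self.add_host(host_id)
        self._vms_by_host[host_id].add(vm_id)
        self._state[vm_id] = "normal"

    def affected_virtual_resources(self, host_id):
        return sorted(self._vms_by_host.get(host_id, ()))

    def force_down(self, host_id):
        """Mark a host as down and return the virtual resources it affects."""
        self._state[host_id] = "down"
        affected = self.affected_virtual_resources(host_id)
        for vm_id in affected:
            self._state[vm_id] = "error"  # assumed state for affected VMs
        return affected

    def state(self, resource_id):
        return self._state[resource_id]

    def usable_hosts(self):
        """Hosts still counted as capacity after failures are applied."""
        return sorted(h for h in self._vms_by_host if self._state[h] == "normal")
```

Recomputing `usable_hosts()` after `force_down()` loosely corresponds to the capacity recalculation described above.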

2.2.5.1.4. Notifier

The Notifier focuses on selecting and aggregating failure events received from the Controller based on policies mandated by the Consumer. To this end, it allows the Consumer to subscribe to alarms regarding virtual resources using a method such as an API endpoint. After receiving a fault event from a Controller, it notifies the Consumer of the fault according to the alarm configuration that the Consumer defined earlier.

To reduce complexity of the Controller, it is a good approach for the Controllers to emit all notifications without any filtering mechanism and have another service (i.e. Notifier) handle those notifications properly. This is the general philosophy of notifications in OpenStack. Note that a fault message consumed by the Notifier is different from the fault message received by the Inspector; the former message is related to virtual resources which are visible to users with relevant ownership, whereas the latter is related to raw devices or small entities which should be handled with an administrator privilege.

The northbound interface between the Notifier and the Consumer/Administrator is specified in Detailed northbound interface specification.

2.2.5.2. Sequence

2.2.5.2.1. Fault Management

The detailed work flow for fault management is as follows (see also Figure 7):

1. Request to subscribe to monitor specific virtual resources. A query filter can be used to narrow down the alarms the Consumer wants to be informed about.
2. Each subscription request is acknowledged with a subscribe response message. The response message contains information about the subscribed virtual resources, in particular whether a subscribed virtual resource is in “alarm” state.
3. The NFVI sends monitoring events for resources the VIM has been subscribed to. Note: this subscription message exchange between the VIM and NFVI is not shown in this message flow.
4. Event correlation, fault detection and aggregation in VIM.
5. Database lookup to find the virtual resources affected by the detected fault.
6. Fault notification to Consumer.
7. The Consumer switches to standby configuration (STBY).
8. Instructions to the VIM requesting certain actions to be performed on the affected resources, for example to migrate/update/terminate specific resource(s). After receiving such instructions, the VIM executes the requested action, e.g. it migrates or terminates a virtual resource.

Query request from Consumer to VIM to get information about the current status of a resource.
Response to the query request with information about the current status of the queried resource. In case the resource is in “fault” state, information about the related fault(s) is returned.

In order to allow for quick reaction to failures, the time interval between fault detection in step 3 and the corresponding recovery actions in steps 7 and 8 shall be less than 1 second.

Fault management work flow

Fault management scenario

Figure 8 shows a more detailed message flow (Steps 4 to 6) between the four building blocks introduced in Functional Blocks.

4. The Monitor observes a fault in the NFVI and reports the raw fault to the Inspector. The Inspector filters and aggregates the faults using pre-configured failure policies.
5. a) The Inspector queries the Resource Map to find the virtual resources affected by the raw fault in the NFVI. b) The Inspector updates the state of the affected virtual resources in the Resource Map. c) The Controller observes a change of the virtual resource state and informs the Notifier about the state change and the related alarm(s). Alternatively, the Inspector may directly inform the Notifier about it.
6. The Notifier performs another filtering and aggregation of the changes and alarms based on the pre-configured alarm configuration. Finally, a fault notification is sent northbound to the Consumer.
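Steps 4 to 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the module-level tables, the function names `inspect` and `notify`, and the use of "error" as the new virtual resource state are all hypothetical and do not come from any OpenStack component.

```python
# Hypothetical in-memory stand-ins for the failure policy and the Resource Map.
FAILURE_POLICY = {"compute.host.down"}          # raw event types treated as faults
RESOURCE_MAP = {"compute-1": ["vm-a", "vm-b"]}  # host -> virtual resources
VM_STATE = {"vm-a": "normal", "vm-b": "normal"}


def inspect(raw_event):
    """Steps 4 and 5a/5b: filter the raw event, then find and update affected VMs."""
    if raw_event["type"] not in FAILURE_POLICY:
        return []
    affected = RESOURCE_MAP.get(raw_event["details"]["hostname"], [])
    for vm_id in affected:
        VM_STATE[vm_id] = "error"  # assumed target state
    return affected


def notify(affected, subscriptions):
    """Step 6: build one notification per subscription that matches any affected VM."""
    delivered = []
    for subscription_id, wanted in subscriptions.items():
        matched = [vm_id for vm_id in affected if vm_id in wanted]
        if matched:
            delivered.append((subscription_id, matched))
    return delivered
```

For example, a "compute.host.down" event for "compute-1" marks both virtual machines and produces a notification only for subscriptions that include one of them.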

2.2.5.2.2. NFVI Maintenance

NFVI maintenance work flow

The detailed work flow for NFVI maintenance is shown in Figure 9 and has the following steps. Note that steps 1, 2, and 5 to 8a in the NFVI maintenance work flow are very similar to the steps in the fault management work flow and share a similar implementation plan in Release 1.

1. Subscribe to fault/maintenance notifications.
2. Response to subscribe request.
3. Maintenance trigger received from Administrator.
4. VIM switches NFVI resources to “maintenance” state. This means, for example, that they should not be used for further allocation/migration requests.
5. Database lookup to find the virtual resources affected by the detected maintenance operation.
6. Maintenance notification to Consumer.
7. The Consumer switches to standby configuration (STBY).
8. Instructions from Consumer to VIM requesting certain recovery actions to be performed (step 8a). After receiving such instructions, the VIM executes the requested action in order to empty the physical resources (step 8b).
9. Maintenance response from VIM to inform the Administrator that the physical machines have been emptied (or the operation resulted in an error state).
10. The Administrator coordinates and executes the maintenance operation/work on the NFVI.

Query request from Administrator to VIM to get information about the current state of a resource.
Response to the query request with information about the current state of the queried resource(s). In case the resource is in “maintenance” state, information about the related maintenance operation is returned.

NFVI Maintenance scenario

Figure 10 shows a more detailed message flow (Steps 3 to 6 and 9) between the four building blocks introduced in Section 5.1.

3. The Administrator sends a StateChange request to the Controller residing in the VIM.
4. The Controller queries the Resource Map to find the virtual resources affected by the planned maintenance operation.
5. a) The Controller updates the state of the affected virtual resources in the Resource Map database. b) The Controller informs the Notifier about the virtual resources that will be affected by the maintenance operation.
6. A maintenance notification is sent northbound to the Consumer.

…

9. The Controller informs the Administrator after the physical resources have been freed.
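The Controller side of this maintenance flow can be sketched as follows. The function names and the plain-dictionary resource map are assumptions for illustration only. The notification payload reuses the VirtualResourceInfo element names from this specification.

```python
def handle_state_change(resource_map, states, host_id, new_state="maintenance"):
    """Steps 3 to 6: switch a host to the new state and list the affected VMs.

    resource_map: dict mapping host ID -> list of virtual resource IDs
    states:       dict mapping resource ID -> state string (updated in place)
    """
    states[host_id] = new_state
    affected = list(resource_map.get(host_id, []))
    for vm_id in affected:
        states[vm_id] = new_state
    # Payload handed to the Notifier for the maintenance notification.
    return {
        "VirtualResourceInfo": [
            {"VirtualResourceID": vm_id, "VirtualResourceState": new_state}
            for vm_id in affected
        ]
    }


def host_is_empty(resource_map, host_id):
    """Step 9 precondition: no virtual resource is left on the host."""
    return not resource_map.get(host_id)
```

In this sketch, the response to the Administrator in step 9 would be sent once `host_is_empty` returns True for every requested host.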

2.2.5.3. Information elements

This section introduces all attributes and information elements used in the messages exchanged on the northbound interfaces between the VIM and the NFVO and VNFM.

Note: The information elements will be aligned with current work in ETSI NFV IFA working group.

Simple information elements:

SubscriptionID (Identifier): identifies a subscription to receive fault or maintenance notifications.
NotificationID (Identifier): identifies a fault or maintenance notification.
VirtualResourceID (Identifier): identifies a virtual resource affected by a fault or a maintenance action of the underlying physical resource.
PhysicalResourceID (Identifier): identifies a physical resource affected by a fault or maintenance action.
VirtualResourceState (String): state of a virtual resource, e.g. “normal”, “maintenance”, “down”, “error”.
PhysicalResourceState (String): state of a physical resource, e.g. “normal”, “maintenance”, “down”, “error”.
VirtualResourceType (String): type of the virtual resource, e.g. “virtual machine”, “virtual memory”, “virtual storage”, “virtual CPU”, or “virtual NIC”.
FaultID (Identifier): identifies the related fault in the underlying physical resource. This can be used to correlate different fault notifications caused by the same fault in the physical resource.
FaultType (String): Type of the fault. The allowed values for this parameter depend on the type of the related physical resource. For example, a resource of type “compute hardware” may have faults of type “CPU failure”, “memory failure”, “network card failure”, etc.
Severity (Integer): value expressing the severity of the fault. The higher the value, the more severe the fault.
MinSeverity (Integer): value used in filter information elements. Only faults with a severity higher than the MinSeverity value will be notified to the Consumer.
EventTime (Datetime): Time when the fault was observed.
EventStartTime and EventEndTime (Datetime): Datetime range that can be used in a FaultQueryFilter to narrow down the faults to be queried.
ProbableCause (String): information about the probable cause of the fault.
CorrelatedFaultID (Identifier): list of other faults correlated to this fault.
isRootCause (Boolean): Parameter indicating if this fault is the root cause of other correlated faults. If TRUE, then the faults listed in the parameter CorrelatedFaultID are caused by this fault.
FaultDetails (Key-value pair): provides additional information about the fault, e.g. information about the threshold, monitored attributes, indication of the trend of the monitored parameter.
FirmwareVersion (String): current version of the firmware of a physical resource.
HypervisorVersion (String): current version of a hypervisor.
ZoneID (Identifier): Identifier of the resource zone. A resource zone is the logical separation of physical and software resources in an NFVI deployment for physical isolation, redundancy, or administrative designation.
Metadata (Key-value pair): provides additional information of a physical resource in maintenance/error state.

Complex information elements (see also UML diagrams in Figure 13 and Figure 14):

VirtualResourceInfoClass:
- VirtualResourceID [1] (Identifier)
- VirtualResourceState [1] (String)
- Faults [0..*] (FaultClass): For each resource, all faults including detailed information about the faults are provided.
FaultClass: The parameters of the FaultClass are partially based on ETSI TS 132 111-2 (V12.1.0) [*], which is specifying fault management in 3GPP, in particular describing the information elements used for alarm notifications.
- FaultID [1] (Identifier)
- FaultType [1] (String)
- Severity [1] (Integer)
- EventTime [1] (Datetime)
- ProbableCause [1] (String)
- CorrelatedFaultID [0..*] (Identifier)
- FaultDetails [0..*] (Key-value pair)

[*]	http://www.etsi.org/deliver/etsi_ts/132100_132199/13211102/12.01.00_60/ts_13211102v120100p.pdf

SubscribeFilterClass
- VirtualResourceType [0..*] (String)
- VirtualResourceID [0..*] (Identifier)
- FaultType [0..*] (String)
- MinSeverity [0..1] (Integer)
FaultQueryFilterClass: narrows down the FaultQueryRequest, for example it limits the query to certain physical resources, a certain zone, a given fault type/severity/cause, or a specific FaultID.
- VirtualResourceType [0..*] (String)
- VirtualResourceID [0..*] (Identifier)
- FaultType [0..*] (String)
- MinSeverity [0..1] (Integer)
- EventStartTime [0..1] (Datetime)
- EventEndTime [0..1] (Datetime)
PhysicalResourceStateClass:
- PhysicalResourceID [1] (Identifier)
- PhysicalResourceState [1] (String): mandates the new state of the physical resource.
- Metadata [0..*] (Key-value pair)
PhysicalResourceInfoClass:
- PhysicalResourceID [1] (Identifier)
- PhysicalResourceState [1] (String)
- FirmwareVersion [0..1] (String)
- HypervisorVersion [0..1] (String)
- ZoneID [0..1] (Identifier)
- Metadata [0..*] (Key-value pair)
StateQueryFilterClass: narrows down a StateQueryRequest, for example it limits the query to certain physical resources, a certain zone, or a given resource state (e.g., only resources in “maintenance” state).
- PhysicalResourceID [1] (Identifier)
- PhysicalResourceState [1] (String)
- ZoneID [0..1] (Identifier)
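To make the cardinalities above concrete, the following sketch models a few of the complex information elements as Python dataclasses. The Python names and the choice of `str` for identifiers are assumptions; only the element names in the comments come from this specification. Cardinality [1] becomes a required field, [0..1] an Optional field, and [0..*] a list (or a dict for key-value pairs).

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Fault:  # FaultClass
    fault_id: str
    fault_type: str
    severity: int
    event_time: datetime
    probable_cause: str
    correlated_fault_ids: List[str] = field(default_factory=list)
    fault_details: Dict[str, str] = field(default_factory=dict)


@dataclass
class VirtualResourceInfo:  # VirtualResourceInfoClass
    virtual_resource_id: str
    virtual_resource_state: str
    faults: List[Fault] = field(default_factory=list)


@dataclass
class SubscribeFilter:  # SubscribeFilterClass
    virtual_resource_types: List[str] = field(default_factory=list)
    virtual_resource_ids: List[str] = field(default_factory=list)
    fault_types: List[str] = field(default_factory=list)
    min_severity: Optional[int] = None

    def matches(self, resource_id: str, fault: Fault) -> bool:
        """Empty lists and a missing MinSeverity are treated as 'no constraint'.

        VirtualResourceType is not checked here because this sketch does not
        model resource types.
        """
        if self.virtual_resource_ids and resource_id not in self.virtual_resource_ids:
            return False
        if self.fault_types and fault.fault_type not in self.fault_types:
            return False
        # Only faults with a severity higher than MinSeverity are notified.
        if self.min_severity is not None and fault.severity <= self.min_severity:
            return False
        return True
```

Treating an empty filter as "match everything" is a design choice of this sketch; the specification does not state how an empty filter is to be interpreted.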

2.2.5.4. Detailed northbound interface specification

This section specifies the northbound interfaces for fault management and NFVI maintenance between the VIM on one end and the Consumer and the Administrator on the other. For each interface, all messages and related information elements are provided.

Note: The interface definition will be aligned with current work in ETSI NFV IFA working group.

All of the interfaces described below are produced by the VIM and consumed by the Consumer or Administrator.

2.2.5.4.1. Fault management interface

This interface allows the VIM to notify the Consumer about a virtual resource that is affected by a fault, either within the virtual resource itself or by the underlying virtualization infrastructure. The messages on this interface are shown in Figure 13 and explained in detail in the following subsections.

Note: The information elements used in this section are described in detail in Section 5.4.

Fault management NB I/F messages

2.2.5.4.1.1. SubscribeRequest (Consumer -> VIM)

Subscription from Consumer to VIM to be notified about faults of specific resources. The faults to be notified about can be narrowed down using a subscribe filter.

Parameters:

SubscribeFilter [1] (SubscribeFilterClass): Optional information to narrow down the faults that shall be notified to the Consumer, for example limit to specific VirtualResourceID(s), severity, or cause of the alarm.

2.2.5.4.1.2. SubscribeResponse (VIM -> Consumer)

Response to a subscribe request message including information about the subscribed resources, in particular if they are in “fault/error” state.

Parameters:

SubscriptionID [1] (Identifier): Unique identifier for the subscription. It can be used to delete or update the subscription.
VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional information about the subscribed resources, i.e., a list of the related resources, the current state of the resources, etc.

2.2.5.4.1.3. FaultNotification (VIM -> Consumer)

Notification about a virtual resource that is affected by a fault, either within the virtual resource itself or by the underlying virtualization infrastructure. After receiving this notification, the Consumer decides on the optimal action to resolve the fault. This includes actions like switching to a hot standby virtual resource, migrating the faulty virtual resource to another physical machine, or terminating the faulty virtual resource and instantiating a new virtual resource in order to provide a new hot standby resource. In some use cases, the Consumer can leave virtual resources on a failed host to be booted up again after the fault is recovered. Existing resource management interfaces and messages between the Consumer and the VIM can be used for those actions, and there is no need to define additional actions on the Fault Management Interface.

Parameters:

NotificationID [1] (Identifier): Unique identifier for the notification.
VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of faulty resources with detailed information about the faults.

2.2.5.4.1.4. FaultQueryRequest (Consumer -> VIM)

Request to find out about active alarms at the VIM. A FaultQueryFilter can be used to narrow down the alarms returned in the response message.

Parameters:

FaultQueryFilter [1] (FaultQueryFilterClass): narrows down the FaultQueryRequest, for example it limits the query to certain physical resources, a certain zone, a given fault type/severity/cause, or a specific FaultID.

2.2.5.4.1.5. FaultQueryResponse (VIM -> Consumer)

List of active alarms at the VIM matching the FaultQueryFilter specified in the FaultQueryRequest.

Parameters:

VirtualResourceInfo [0..*] (VirtualResourceInfoClass): List of faulty resources. For each resource all faults including detailed information about the faults are provided.
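The way a FaultQueryFilter narrows the response can be illustrated with a small function. The function name, the dictionary representation of a fault, and the inclusive handling of EventStartTime and EventEndTime are assumptions of this sketch; the specification does not say whether the time bounds are inclusive.

```python
from datetime import datetime


def query_faults(active_faults, min_severity=None, start=None, end=None):
    """Return the faults that pass the optional MinSeverity and time-range filters.

    active_faults: list of dicts with at least 'Severity' and 'EventTime' keys
    start, end:    EventStartTime / EventEndTime, treated as inclusive bounds
    """
    result = []
    for fault in active_faults:
        # Only faults with a severity higher than MinSeverity are returned.
        if min_severity is not None and fault["Severity"] <= min_severity:
            continue
        if start is not None and fault["EventTime"] < start:
            continue
        if end is not None and fault["EventTime"] > end:
            continue
        result.append(fault)
    return result
```

Calling `query_faults(faults)` without any filter argument returns all active faults, which corresponds to a request whose optional filter fields are all left out.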

2.2.5.4.2. NFVI maintenance

The NFVI maintenance interface Consumer-VIM allows the Consumer to subscribe to maintenance notifications provided by the VIM. The related maintenance interface Administrator-VIM allows the Administrator to issue maintenance requests to the VIM, i.e. to request the VIM to take appropriate actions to empty physical machine(s) so that maintenance operations can be executed on them. The interface also allows the Administrator to query the state of physical machines, e.g. to get details on the current status of a maintenance operation such as a firmware update.

The messages defined in these northbound interfaces are shown in Figure 14 and described in detail in the following subsections.

NFVI maintenance NB I/F messages

2.2.5.4.2.1. SubscribeRequest (Consumer -> VIM)

Subscription from Consumer to VIM to be notified about maintenance operations for specific virtual resources. The resources to be informed about can be narrowed down using a subscribe filter.

Parameters:

SubscribeFilter [1] (SubscribeFilterClass): Information to narrow down the maintenance notifications that shall be sent to the Consumer, for example limit to specific virtual resource type(s).

2.2.5.4.2.2. SubscribeResponse (VIM -> Consumer)

Response to a subscribe request message, including information about the subscribed virtual resources, in particular if they are in “maintenance” state.

Parameters:

SubscriptionID [1] (Identifier): Unique identifier for the subscription. It can be used to delete or update the subscription.
VirtualResourceInfo [0..*] (VirtualResourceInfoClass): Provides additional information about the subscribed virtual resource(s), e.g., the ID, type and current state of the resource(s).

2.2.5.4.2.3. MaintenanceNotification (VIM -> Consumer)

Notification about a physical resource switched to “maintenance” state. After receiving this notification, the Consumer decides on the optimal action to address it, e.g., to switch to the standby (STBY) configuration.

Parameters:

VirtualResourceInfo [1..*] (VirtualResourceInfoClass): List of virtual resources where the state has been changed to maintenance.

2.2.5.4.2.4. StateChangeRequest (Administrator -> VIM)

Request to change the state of a list of physical resources, e.g. to “maintenance” state, in order to prepare them for a planned maintenance operation.

Parameters:

PhysicalResourceState [1..*] (PhysicalResourceStateClass)

2.2.5.4.2.5. StateChangeResponse (VIM -> Administrator)

Response message to inform the Administrator that the requested resources are now in maintenance state (or the operation resulted in an error) and the maintenance operation(s) can be executed.

Parameters:

PhysicalResourceInfo [1..*] (PhysicalResourceInfoClass)

2.2.5.4.2.6. StateQueryRequest (Administrator -> VIM)

Request from the Administrator for information about physical machine(s), e.g. their state (“normal”, “maintenance”), firmware version, hypervisor version, or the update status of firmware and hypervisor. It can be used to check progress during a firmware update and to confirm the result afterwards. A filter can be used to narrow down the resources returned in the response message.

Parameters:

StateQueryFilter [1] (StateQueryFilterClass): narrows down the StateQueryRequest, for example it limits the query to certain physical resources, a certain zone, or a given resource state.

2.2.5.4.2.7. StateQueryResponse (VIM -> Administrator)

List of physical resources matching the filter specified in the StateQueryRequest.

Parameters:

PhysicalResourceInfo [0..*] (PhysicalResourceInfoClass): List of physical resources. For each resource, information about the current state, the firmware version, etc. is provided.

2.2.5.4.3. NFV IFA, OPNFV Doctor and AODH alarms

This section compares the alarm interfaces of ETSI NFV IFA with the specifications of this document and the alarm class of AODH.

ETSI NFV specifies an interface for alarms from virtualised resources in ETSI GS NFV-IFA 005 [ENFV]. The interface specifies an Alarm class and two notifications plus operations to query alarm instances and to subscribe to the alarm notifications.

The specification in this document has a structure that is very similar to the ETSI NFV specifications. The notifications differ in that an alarm notification in the NFV interface defines a single fault for a single resource, while the notification specified in this document can contain multiple faults for multiple resources. The Doctor specification lacks the detailed time stamps of the NFV specification, which are essential for synchronization of the alarm list using the query operation. The detailed time stamps are also of value in the event and alarm history DBs.

AODH defines a base class for alarms, not for the notifications. This means that some of the dynamic attributes of the ETSI NFV alarm type, like alarmRaisedTime, are not applicable to the AODH alarm class but are attributes of the actual notifications. (Description of these attributes will be added later.) The AODH alarm class lacks some attributes present in the NFV specification, namely fault details and correlated alarms. Instead, the AODH alarm class has attributes for actions, rules, and user and project IDs.

ETSI NFV Alarm Type	OPNFV Doctor Requirement Specs	AODH Event Alarm Notification	Description / Comment	Recommendations
alarmId	FaultId	alarm_id	Identifier of an alarm.	-
-	-	alarm_name	Human readable alarm name.	May be added in ETSI NFV Stage 3.
managedObjectId	VirtualResourceId	(reason)	Identifier of the affected virtual resource is part of the AODH reason parameter.	-
-	-	user_id, project_id	User and project identifiers.	May be added in ETSI NFV Stage 3.
alarmRaisedTime	-	-	Timestamp when alarm was raised.	To be added to Doctor and AODH. May be derived (e.g. in a shimlayer) from the AODH alarm history.
alarmChangedTime	-	-	Timestamp when alarm was changed/updated.	see above
alarmClearedTime	-	-	Timestamp when alarm was cleared.	see above
eventTime	-	-	Timestamp when alarm was first observed by the Monitor.	see above
-	EventTime	generated	Timestamp of the Notification.	Update parameter name in Doctor spec. May be added in ETSI NFV Stage 3.
state: E.g. Fired, Updated, Cleared	VirtualResourceState: E.g. normal, down, maintenance, error	current: ok, alarm, insufficient_data	ETSI NFV IFA 005/006 lists example alarm states.	Maintenance state is missing in AODH. List of alarm states will be specified in ETSI NFV Stage 3.
perceivedSeverity: E.g. Critical, Major, Minor, Warning, Indeterminate, Cleared	Severity (Integer)	Severity: low (default), moderate, critical	ETSI NFV IFA 005/006 lists example perceived severity values.	List of alarm states will be specified in ETSI NFV Stage 3. OPNFV: Severity (Integer): update the OPNFV Doctor specification to an Enum. perceivedSeverity=Indeterminate: remove the value Indeterminate in IFA and map undefined values to “minor” severity, or add the value indeterminate in AODH and make it the default value. perceivedSeverity=Cleared: remove the value Cleared in IFA, as the information about a cleared alarm can be derived from the alarm state parameter, or add the value cleared in AODH and set a rule that the severity is “cleared” when the state is ok.
faultType	FaultType	event_type in reason_data	Type of the fault, e.g. “CPU failure” of a compute resource, in machine interpretable format.	OpenStack Alarming (Aodh) can use a fuzzy matching with wildcard string, “compute.cpu.failure”.
N/A	N/A	type = “event”	Type of the notification. For fault notifications the type in AODH is “event”.	-
probableCause	ProbableCause	-	Probable cause of the alarm.	May be provided (e.g. in a shimlayer) based on Vitrage topology awareness / root-cause-analysis.
isRootCause	IsRootCause	-	Boolean indicating whether the fault is the root cause of other faults.	see above
correlatedAlarmId	CorrelatedFaultId	-	List of IDs of correlated faults.	see above
faultDetails	FaultDetails	-	Additional details about the fault/alarm.	FaultDetails information element will be specified in ETSI NFV Stage 3.
-	-	action, previous	Additional AODH alarm related parameters.	-

Table: Comparison of alarm attributes

The primary area of improvement should be alignment of the perceived severity. This is important for a quick and accurate evaluation of the alarm. AODH should therefore also support the X.733 values Critical, Major, Minor, Warning and Indeterminate.

The detailed time stamps (raised, changed, cleared) which are essential for synchronizing the alarm list using a query operation should be added to the Doctor specification.

Another area that needs alignment is the so-called alarm state in NFV. Here, however, we must consider what can be an attribute of the notification and what should be a property of the alarm instance. This will be analyzed later.

2.2.5.5. Detailed southbound interface specification

This section specifies the southbound interfaces for fault management between the Monitors and the Inspector. Although southbound interfaces should be flexible enough to handle various events from different types of Monitors, we define a unified event API in order to improve interoperability between the Monitors and the Inspector. This does not limit the implementation of Monitor and Inspector, as these could be extended to support failures from intelligent inspection such as prediction.

Note: The interface definition will be aligned with current work in ETSI NFV IFA working group.

2.2.5.5.1. Fault event interface

This interface allows the Monitors to notify the Inspector about an event which was captured by the Monitor and may affect resources managed in the VIM.

2.2.5.5.1.1. EventNotification

Event notification including a fault description. The subject of this notification is an event, not specifically a fault or an error. This allows us to use a generic event format or a framework built outside of the Doctor project. The parameters below shall be mandatory, but keys in ‘Details’ can be optional.

Parameters:

Time [1]: Datetime when the fault was observed in the Monitor.
Type [1]: Type of event that will be used to process correlation in Inspector.
Details [0..1]: Additional information in key-value pair style. Keys shall be defined depending on the Type of the event.

E.g.:

{
    'event': {
        'time': '2016-04-12T08:00:00',
        'type': 'compute.host.down',
        'details': {
            'hostname': 'compute-1',
            'source': 'sample_monitor',
            'cause': 'link-down',
            'severity': 'critical',
            'status': 'down',
            'monitor_id': 'monitor-1',
            'monitor_event_id': '123',
        }
    }
}

Optional parameters in ‘Details’:

Hostname: the hostname on which the event occurred.
Source: the display name of the reporter of this event. This is not limited to monitors; another entity such as ‘KVM’ can be specified.
Cause: description of the cause of this event, which could be different from the type of this event.
Severity: the severity of this event set by the monitor.
Status: the status of the target object in which the error occurred.
MonitorID: the ID of the monitor sending this event.
MonitorEventID: the ID of the event in the monitor. This can be used by the operator when tracking the monitor log.
RelatedTo: an array of IDs related to this event.
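A receiver of EventNotification messages would typically check the mandatory parameters before processing. The following is a minimal sketch under stated assumptions: the function name `validate_event` is hypothetical, and parsing the time as ISO 8601 is an assumption based on the format used in the example message, not a requirement stated in this specification.

```python
from datetime import datetime

MANDATORY_KEYS = ("time", "type")


def validate_event(message):
    """Return the inner event dict if it carries the mandatory parameters."""
    event = message["event"]
    missing = [key for key in MANDATORY_KEYS if key not in event]
    if missing:
        raise ValueError("missing mandatory parameter(s): " + ", ".join(missing))
    # Raises ValueError if the timestamp is not in ISO 8601 format (assumed format).
    datetime.fromisoformat(event["time"])
    # 'details' and all of its keys are optional.
    event.setdefault("details", {})
    return event
```

Keys inside 'details' are deliberately not validated here, since they depend on the event type.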

A bulk API can also be provided to receive multiple events in a single HTTP POST message by using the ‘events’ wrapper as follows:

{
    'events': [
        {
            'event': {
                'time': '2016-04-12T08:00:00',
                'type': 'compute.host.down',
                'details': {},
            }
        },
        {
            'event': {
                'time': '2016-04-12T08:00:00',
                'type': 'compute.host.nic.error',
                'details': {},
            }
        }
    ]
}
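As a sketch of how a receiver might treat single and bulk messages uniformly, the helper below flattens either form into a list of event dictionaries. The function name `unpack_events` is hypothetical. Accepting list entries both with and without an inner 'event' key is an assumption, made because the exact shape of the list entries is not fixed by the text.

```python
import json


def unpack_events(payload):
    """Return a flat list of event dicts from a single or a bulk message."""
    if "events" in payload:
        entries = payload["events"]
    else:
        entries = [payload]
    # Accept entries with or without the inner 'event' wrapper (an assumption).
    return [entry.get("event", entry) for entry in entries]


bulk = {
    "events": [
        {"event": {"time": "2016-04-12T08:00:00", "type": "compute.host.down", "details": {}}},
        {"event": {"time": "2016-04-12T08:00:00", "type": "compute.host.nic.error", "details": {}}},
    ]
}
body = json.dumps(bulk)  # request body for the HTTP POST
```

Note that `json.dumps` emits double-quoted JSON, which differs from the single-quoted notation used in the examples of this section.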

2.2.5.6. Blueprints

This section lists a first set of blueprints that have been proposed by the Doctor project to the open source community. Further blueprints addressing other gaps identified in Section 4 will be submitted at a later stage of the OPNFV project. In this section the following definitions are used:

“Event” is a message emitted by other OpenStack services such as Nova and Neutron and is consumed by the “Notification Agents” in Ceilometer.
“Notification” is a message generated by a “Notification Agent” in Ceilometer based on an “event” and is delivered to the “Collectors” in Ceilometer that store those notifications (as “sample”) to the Ceilometer “Databases”.

2.2.5.6.1. Instance State Notification (Ceilometer) [†]

The Doctor project is planning to handle “events” and “notifications” regarding Resource Status: Instance State, Port State, Host State, etc. Currently, Ceilometer already receives “events” to identify the state of those resources, but it does not handle and store them yet. This is why we also need a new event definition to capture those resource states from “events” created by other services.

This BP proposes to add a new compute notification state to handle events from an instance (server) from nova. It also creates a new meter “instance.state” in OpenStack.

[†]	https://etherpad.opnfv.org/p/doctor_bps

2.2.5.6.2. Event Publisher for Alarm (Ceilometer) [‡]

Problem statement:

The existing “Alarm Evaluator” in OpenStack Ceilometer periodically queries/polls the databases in order to check all alarms independently from other processes. This adds delay to the fault notification sent to the Consumer, whereas one requirement of Doctor is to react to faults as fast as possible.

The existing message flow is shown in Figure 12: after receiving an “event”, a “notification agent” (i.e. “event publisher”) sends a “notification” to a “Collector”. The “Collector” collects the notifications and updates the Ceilometer “Meter” database, which stores information about the “sample” captured from the original “event”. The “Alarm Evaluator” periodically polls this database, querying the “Meter” database based on each alarm configuration.

Implementation plan in Ceilometer architecture

In the current Ceilometer implementation, there is no way to trigger the “Alarm Evaluator” directly when a new “event” is received. The “Alarm Evaluator” only finds out that it needs to fire a new notification to the Consumer when it polls the database.
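The cost of the polling-based approach can be made concrete with a small calculation. The sketch below assumes an idealized evaluator that polls at fixed intervals starting at t = 0 and ignores processing time; the function name and the 60-second interval in the usage note are illustrative values, not Ceilometer settings.

```python
import math


def polling_delay(event_time: float, interval: float) -> float:
    """Seconds between an event and the next polling cycle that can see it.

    Assumes polling cycles run at t = 0, interval, 2 * interval, ... and
    that an event arriving exactly on a cycle boundary is seen by that cycle.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    next_poll = math.ceil(event_time / interval) * interval
    return next_poll - event_time
```

With an illustrative interval of 60 seconds, an event arriving one second after a polling cycle (`polling_delay(61.0, 60.0)`) waits 59 seconds before the evaluator can notice it, so the worst-case added delay approaches the full polling interval.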

Change/feature request:

This BP proposes to add a new “event publisher for alarm”, which bypasses several steps in Ceilometer in order to avoid the polling-based approach of the existing Alarm Evaluator, which makes notifications to users slow. See Figure 12.

After receiving an “(alarm) event” by listening on the Ceilometer message queue (“notification bus”), the new “event publisher for alarm” immediately hands a “notification” about this event to a new Ceilometer component, the “Notification-driven alarm evaluator”, which is proposed in a separate BP (see Section 5.6.3).

Note, the term “publisher” refers to an entity in the Ceilometer architecture (it is a “notification agent”). It offers the capability to provide notifications to other services outside of Ceilometer, but it is also used to deliver notifications to other Ceilometer components (e.g. the “Collectors”) via the Ceilometer “notification bus”.

Implementation detail

“Event publisher for alarm” is part of Ceilometer.
The standard AMQP message queue is used with a new topic string.
No new interfaces have to be added to Ceilometer.
“Event publisher for alarm” can be configured by the Administrator of Ceilometer to be used as a “Notification Agent” in addition to the existing “Notifier”.
Existing alarm mechanisms of Ceilometer can be used, allowing users to configure how to distribute the “notifications” transformed from “events”; for example, there is an option controlling whether an ongoing alarm is re-issued or not (“repeat_actions”).
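The publisher described above can be sketched as follows. This is an illustration under stated assumptions, not Ceilometer code: the `NotificationBus` class is an in-process stand-in for the AMQP “notification bus”, the topic name `alarm.all` is a hypothetical value for the new topic string, and the notification layout is invented for the example.

```python
import queue
from collections import defaultdict


class NotificationBus:
    """In-process stand-in for the AMQP "notification bus" (assumption)."""

    def __init__(self):
        self._topics = defaultdict(queue.Queue)

    def publish(self, topic: str, message: dict) -> None:
        self._topics[topic].put(message)

    def drain(self, topic: str) -> list:
        """Return and remove all messages currently queued on a topic."""
        messages = []
        q = self._topics[topic]
        while not q.empty():
            messages.append(q.get_nowait())
        return messages


ALARM_TOPIC = "alarm.all"  # hypothetical name for the new topic string


class EventPublisherForAlarm:
    """Forwards each received event to the alarm topic without polling."""

    def __init__(self, bus: NotificationBus):
        self._bus = bus

    def on_event(self, event: dict) -> None:
        # Hypothetical notification layout, built directly from the event.
        notification = {
            "event_type": event["event_type"],
            "resource_id": event["payload"]["resource_id"],
            "payload": event["payload"],
        }
        self._bus.publish(ALARM_TOPIC, notification)
```

Because the publisher only adds a topic on the existing bus, a consumer of that topic receives the notification as soon as the event arrives, which is the property the blueprint is after.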

[‡]	https://etherpad.opnfv.org/p/doctor_bps

2.2.5.6.3. Notification-driven alarm evaluator (Ceilometer) [§]¶

Problem statement:

Change/feature request:

This BP proposes to add an alternative “Notification-driven Alarm Evaluator” for Ceilometer that receives “notifications” sent by the “Event Publisher for Alarm” described in the other BP. Once this new “Notification-driven Alarm Evaluator” receives a “notification”, it finds the “alarm” configurations that may relate to the “notification” by querying the “alarm” database with keys such as the resource ID. It then evaluates each alarm with the information in that “notification”.

After the alarm evaluation, it fires the alarm notification to the Consumer in the same way as the existing “alarm evaluator” does. Similar to the existing Alarm Evaluator, this new “Notification-driven Alarm Evaluator” aggregates and correlates different alarms, which are then provided northbound to the Consumer via the OpenStack “Alarm Notifier”. The user/administrator can register the alarm configuration via the existing Ceilometer API [¶], and can thereby configure whether to set an alarm and where to send the alarms.

Implementation detail

The new “Notification-driven Alarm Evaluator” is part of Ceilometer.
Most of the existing source code of the “Alarm Evaluator” can be re-used to implement this BP.
No additional application logic is needed.
It will access the Ceilometer Databases just like the existing “Alarm evaluator”.
Only the polling-based approach will be replaced by a listener for “notifications” provided by the “Event Publisher for Alarm” on the Ceilometer “notification bus”.
No new interfaces have to be added to Ceilometer.
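The evaluation step can be sketched as below. This is a simplified illustration, not the Ceilometer evaluator: the `Alarm` fields other than `repeat_actions`, the single-value trigger rule, and the in-memory index standing in for the “alarm” database query are all assumptions, and only transitions into the alarm state fire the notifier here.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    alarm_id: str
    resource_id: str
    trigger_value: str           # hypothetical rule: alarm when value matches
    repeat_actions: bool = False
    state: str = "ok"


class NotificationDrivenAlarmEvaluator:
    """Evaluates alarms when a notification arrives instead of polling."""

    def __init__(self, alarms, notifier):
        # In-memory index standing in for the "alarm" database lookup
        # keyed by resource ID.
        self._by_resource = {}
        for alarm in alarms:
            self._by_resource.setdefault(alarm.resource_id, []).append(alarm)
        self._notifier = notifier

    def on_notification(self, notification: dict) -> None:
        related = self._by_resource.get(notification["resource_id"], [])
        for alarm in related:
            matches = notification["value"] == alarm.trigger_value
            new_state = "alarm" if matches else "ok"
            changed = new_state != alarm.state
            alarm.state = new_state
            # Fire on a transition into "alarm", or on every matching
            # notification when repeat_actions is set.
            if new_state == "alarm" and (changed or alarm.repeat_actions):
                self._notifier(alarm, notification)
```

In this sketch, two identical fault notifications for the same resource fire an alarm with `repeat_actions=False` once and an alarm with `repeat_actions=True` twice.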

[§]	https://etherpad.opnfv.org/p/doctor_bps

[¶]	https://wiki.openstack.org/wiki/Ceilometer/Alerting

2.2.5.6.4. Report host fault to update server state immediately (Nova) [#]¶

Problem statement:

Nova's state change for a failed or unreachable host is slow and does not reliably indicate whether the host is down. This might cause the same server instance to run twice if action is taken to evacuate the instance to another host.
The Nova state of server(s) on a failed host does not change, but remains active and running. This gives the user false information about the server state.
The VIM northbound interface notification of host faults towards the VNFM and NFVO should be in line with the OpenStack state. This fault notification is a Telco requirement defined in ETSI and will be implemented by the OPNFV Doctor project.
An OpenStack user cannot take HA actions quickly and reliably by trusting the server state and host state.

Proposed change:

A new API is needed for the Admin to state that a host is down. This API is used to mark the services running on the host as down, to reflect the real situation.

An example on a compute node:

When the compute node is up and running:

vm_state: active power_state: running
nova-compute state: up status: enabled

When the compute node goes down and the new API is called to state that the host is down:

vm_state: stopped power_state: shutdown
nova-compute state: down status: enabled
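The state change in this example can be modelled as a small sketch. The class and function names below are hypothetical and do not correspond to Nova code or to the proposed API itself; only the state values are taken from the example above.

```python
from dataclasses import dataclass


@dataclass
class ComputeService:
    host: str
    state: str = "up"
    status: str = "enabled"


@dataclass
class Server:
    name: str
    host: str
    vm_state: str = "active"
    power_state: str = "running"


def mark_host_down(service: ComputeService, servers: list) -> None:
    """Apply the state change from the example to a failed host."""
    service.state = "down"  # status is left as "enabled"
    for server in servers:
        # Only servers hosted on the failed node are affected.
        if server.host == service.host:
            server.vm_state = "stopped"
            server.power_state = "shutdown"
```

Servers on other hosts keep their state, so a consumer reading server state after the call sees only the instances on the failed host as stopped.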

Alternatives:

There is no attractive alternative to having an external tool detect the different host faults. For such a tool to exist, Nova needs a new API for reporting faults. Currently, some kind of workaround must be implemented, because the states cannot be trusted or obtained from OpenStack fast enough.

[#]	https://blueprints.launchpad.net/nova/+spec/update-server-state-immediately
