2.2.4. Gap analysis in upstream projects
This section presents the gaps identified in existing VIM platforms. The
analysis focused on the features and requirements specified in Section 3.3.
2.2.4.1. VIM Northbound Interface
2.2.4.1.2. Maintenance Notification
- Type: ‘missing’
- Description
- To-be
- The VIM has to notify the VIM user of the unavailability of virtual resources
caused by NFVI maintenance.
- Also, the following conditions/requirements have to be met:
- The VIM should accept a maintenance message from the administrator and mark
the target physical resource as “in maintenance”.
- Only the owner of a virtual resource hosted on the target physical resource
can receive the notification, which can trigger some process for the
applications running on that virtual resource (e.g. cutting off the VM).
- As-is
- OpenStack: None
- AWS (just for study)
- Gap
- VIM user cannot receive maintenance notifications.
- Solved by
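The two To-be conditions above (mark the physical resource “in maintenance”, then notify only the owners of the virtual resources it hosts) can be sketched with a toy in-memory VIM model. This is a minimal illustration under stated assumptions: the classes `VirtualResource` and `Vim`, the method `start_maintenance`, and the notification payload fields are all hypothetical and do not correspond to any OpenStack API.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class VirtualResource:
    vm_id: str
    owner: str   # tenant/project that owns the VM
    host: str    # physical host the VM runs on


@dataclass
class Vim:
    """Hypothetical toy VIM model; not an OpenStack interface."""
    resources: list
    in_maintenance: set = field(default_factory=set)

    def start_maintenance(self, host: str) -> dict:
        """Mark `host` as in maintenance and build one notice per affected owner."""
        self.in_maintenance.add(host)
        affected = defaultdict(list)
        for r in self.resources:
            if r.host == host:
                affected[r.owner].append(r.vm_id)
        # Owners with no VM on the host are absent, so they receive nothing.
        return {owner: {"event": "maintenance", "host": host, "vms": sorted(vms)}
                for owner, vms in affected.items()}
```

Returning one notice per owner, rather than broadcasting, is what enforces the “only the owner” condition.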
2.2.4.2. VIM Southbound Interface
2.2.4.2.1. Normalization of data collection models
- Type: ‘missing’
- Description
- To-be
- A normalized data format needs to be created to cope with the many data
models from different monitoring solutions.
- As-is
- Data can be collected from many places (e.g. Zabbix, Nagios, Cacti,
Zenoss). Although each solution establishes its own data models, no common
data abstraction models exist in OpenStack.
- Gap
- A normalized data format does not exist.
- Solved by
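A normalized data format amounts to one common record type plus one adapter per monitoring tool. The sketch below assumes a hypothetical `Sample` record and two illustrative adapters: one for Nagios-style performance data (`label=value[UOM];warn;crit;min;max`) and one for a Zabbix item dictionary assumed to carry `key_`, `lastvalue`, `lastclock` and optionally `units`. It is not a proposed standard, only an illustration of the idea.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical normalized record shared by all collectors."""
    source: str     # monitoring tool that produced the data
    host: str
    metric: str
    value: float
    unit: str
    timestamp: int  # seconds since the epoch


# label (optionally quoted) = numeric value, followed by an optional unit.
_PERF = re.compile(r"^'?([^'=]+)'?=([-0-9.]+)([a-zA-Z%]*)")


def from_nagios_perfdata(host, perfdata, timestamp):
    """Parse Nagios-style perfdata; warn/crit/min/max fields are ignored.

    Simplification: items are split on whitespace, so quoted labels
    containing spaces are not supported.
    """
    samples = []
    for item in perfdata.split():
        m = _PERF.match(item)
        if m:
            label, value, unit = m.groups()
            samples.append(Sample("nagios", host, label, float(value), unit, timestamp))
    return samples


def from_zabbix_item(host, item):
    """Map an assumed Zabbix item dict (key_, lastvalue, lastclock) to a Sample."""
    return Sample("zabbix", host, item["key_"], float(item["lastvalue"]),
                  item.get("units", ""), int(item["lastclock"]))
```

Downstream fault management then consumes only `Sample` objects, regardless of which tool collected the data.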
2.2.4.3. OpenStack
2.2.4.3.1. Ceilometer
OpenStack offers a telemetry service, Ceilometer, for collecting measurements of
the utilization of physical and virtual resources [CEIL]. Ceilometer can
collect a number of metrics across multiple OpenStack components, watch for
variations, and trigger alarms based on the collected data.
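As an illustration of “trigger alarms based on the collected data”, the sketch below evaluates a threshold alarm over consecutive evaluation periods, loosely in the spirit of Ceilometer threshold alarms and their `ok` / `alarm` / `insufficient data` states. The function name, parameters and bucketing rule are assumptions for illustration only; this is not the Ceilometer alarm API or its exact evaluation logic.

```python
import operator
from statistics import mean

# Comparison operators named like threshold-alarm operators (lt, le, gt, ge).
_OPS = {"lt": operator.lt, "le": operator.le, "gt": operator.gt, "ge": operator.ge}


def evaluate_threshold_alarm(samples, threshold, comparison="gt",
                             period=60, evaluation_periods=3, now=0):
    """Return 'alarm', 'ok' or 'insufficient data' for (timestamp, value) samples.

    The window [now - period * evaluation_periods, now) is split into
    consecutive buckets of `period` seconds; the mean of each bucket is
    compared with `threshold`, and the alarm fires only if every bucket breaches.
    """
    compare = _OPS[comparison]
    stats = []
    for i in range(evaluation_periods):
        end = now - i * period
        start = end - period
        bucket = [value for ts, value in samples if start <= ts < end]
        if not bucket:
            return "insufficient data"
        stats.append(mean(bucket))
    return "alarm" if all(compare(s, threshold) for s in stats) else "ok"
```

Requiring every period to breach suppresses alarms on a single transient spike, at the cost of slower detection.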
2.2.4.3.1.1. Scalability of fault aggregation
- Type: ‘scalability issue’
- Description
- To-be
- Be able to scale to a large deployment, where thousands of monitoring
events per second need to be analyzed.
- As-is
- Performance issues arise when scaling to medium-sized deployments.
- Gap
- Ceilometer seems to be unsuitable for monitoring medium- and large-scale
NFVI deployments.
- Solved by
2.2.4.3.1.2. Monitoring of hardware and software
- Type: ‘missing (lack of functionality)’
- Description
- To-be
- OpenStack (as the VIM) should monitor the various hardware and software
components in the NFVI via Ceilometer in order to handle faults on them.
- OpenStack may provide monitoring functionality itself and can also be
integrated with third-party monitoring tools.
- OpenStack needs to be able to detect the faults listed in the Annex.
- As-is
- For each deployment of OpenStack, the operator is responsible for
configuring monitoring tools with relevant scripts or plugins in order to
monitor hardware and software.
- OpenStack Ceilometer does not monitor hardware and software to capture
faults.
- Gap
- Ceilometer is not able to detect and handle all faults listed in the Annex.
- Solved by
2.2.4.3.2. Nova
OpenStack Nova [NOVA] is a mature, widely known and widely used component in
OpenStack cloud deployments. It is the main part of an
“infrastructure-as-a-service” system, providing a cloud computing fabric
controller that supports a wide diversity of virtualization and container
technologies.
Over the past years, Nova has proven to be highly available and
fault-tolerant. Besides its own API, it also provides a compatibility API for
Amazon EC2.
2.2.4.3.2.1. Correct states when compute host is down
- Type: ‘missing (lack of functionality)’
- Description
- To-be
- The API shall support changing the VM power state in case the host has failed.
- The API shall support changing the nova-compute state.
- There could be a single API call to change the states of all VMs belonging
to a specific host.
- Support external systems that monitor the infrastructure and resources and
are able to call the API quickly and reliably.
- Resource states shall be reliable so that correlation actions can be fast
and automated.
- The user shall be able to read states from OpenStack and trust that they are
correct.
- As-is
- When a VM goes down due to a host hardware, host OS or hypervisor failure,
nothing happens in OpenStack. The VMs of a crashed host/hypervisor are
reported to be live and OK through the OpenStack API.
- The nova-compute state might change too slowly, or the state is not reliable
if VMs are also expected to be down. As a result, VMs can still be scheduled
to a failed host, and the slow state change blocks evacuation.
- Gap
- OpenStack does not change its states fast and reliably enough.
- The API does not allow an external system to change states such that the
states can be trusted as reliable (i.e. once the external system has fenced
the failed host).
- The user can neither read all the states from OpenStack nor trust that they
are correct.
- Solved by
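The To-be behaviour can be made concrete with a small in-memory state model: an external monitor fences a failed host, then makes one call that marks the compute service down and sets every VM on that host to an error state, after which the scheduler no longer offers that host. Everything here (`ComputeHost`, `StateModel`, `mark_host_down`, the `"error"` state string) is a hypothetical sketch of the requirement, not the Nova API.

```python
from dataclasses import dataclass, field


@dataclass
class ComputeHost:
    name: str
    service_up: bool = True
    vms: dict = field(default_factory=dict)  # vm_id -> state


class StateModel:
    """Hypothetical in-memory model of the To-be behaviour; not the Nova API."""

    def __init__(self, hosts):
        self.hosts = {h.name: h for h in hosts}

    def mark_host_down(self, host_name, fenced):
        """Single call an external monitor makes after fencing a failed host."""
        if not fenced:
            # States are only trustworthy once the host is known to be isolated.
            raise ValueError("host must be fenced before its state is changed")
        host = self.hosts[host_name]
        host.service_up = False
        for vm_id in host.vms:
            host.vms[vm_id] = "error"

    def schedulable_hosts(self):
        """Hosts whose compute service is up and may receive new VMs."""
        return sorted(n for n, h in self.hosts.items() if h.service_up)

    def vm_state(self, vm_id):
        for h in self.hosts.values():
            if vm_id in h.vms:
                return h.vms[vm_id]
        raise KeyError(vm_id)
```

Tying the state change to a prior fencing step is what lets users trust the reported states.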
2.2.4.3.2.2. Evacuate VMs in Maintenance mode
- Type: ‘missing’
- Description
- To-be
- When maintenance mode for a compute host is set, trigger VM evacuation to
available compute nodes before bringing the host down for maintenance.
- As-is
- When a compute node is set to maintenance mode, OpenStack schedules
evacuation of all VMs to available compute nodes only if the in-maintenance
compute node runs the XenAPI or VMware ESX hypervisor. Other hypervisors
(e.g. KVM) are not supported and, hence, guest VMs will likely stop running
due to maintenance actions the administrator may perform (e.g. hardware
upgrades, OS updates).
- Gap
- The Nova libvirt hypervisor driver does not implement automatic evacuation of
guest VMs when compute nodes are set to maintenance mode
($ nova host-update --maintenance enable <hostname>).
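The missing step, choosing a target for each guest VM before the host goes down, can be sketched as a greedy placement plan. This is a toy illustration under assumptions: capacity is modelled as free vCPUs only, hosts already in maintenance are excluded, and the function `plan_evacuation` is hypothetical. It is not the Nova scheduler or the XenAPI/VMware driver behaviour.

```python
def plan_evacuation(maint_host, vms, free_vcpus, in_maintenance):
    """Greedy evacuation plan for the VMs on `maint_host` (hypothetical helper).

    vms:            vm_id -> (current host, vCPUs required)
    free_vcpus:     host -> free vCPUs
    in_maintenance: hosts that must not receive VMs
    Returns vm_id -> target host; raises RuntimeError if a VM cannot be placed.
    """
    free = {h: c for h, c in free_vcpus.items()
            if h != maint_host and h not in in_maintenance}
    # Place the largest VMs first so they are not starved by smaller ones.
    to_move = sorted(((vcpus, vm) for vm, (host, vcpus) in vms.items()
                      if host == maint_host), reverse=True)
    plan = {}
    for vcpus, vm in to_move:
        candidates = [h for h, c in free.items() if c >= vcpus]
        if not candidates:
            raise RuntimeError(f"no capacity to evacuate {vm}")
        # Prefer the host with the most free capacity; break ties by name.
        target = min(candidates, key=lambda h: (-free[h], h))
        free[target] -= vcpus
        plan[vm] = target
    return plan
```

Failing the whole plan when any VM cannot be placed lets the administrator postpone maintenance instead of silently stopping guests.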
2.2.4.3.3. Monasca
Monasca is an open-source monitoring-as-a-service (MONaaS) solution that
integrates with OpenStack. Even though it is still in its early days, the
community intends the platform to be multi-tenant, highly scalable,
performant and fault-tolerant. It provides a streaming alarm engine, a
notification engine, and a northbound REST API through which users interact
with Monasca. It can process hundreds of thousands of metrics per second
[MONA].
2.2.4.3.3.1. Anomaly detection
- Type: ‘missing (lack of functionality)’
- Description
- To-be
- Detect the failure and perform a root cause analysis to filter out other
alarms that may be triggered due to their cascading relation.
- As-is
- A mechanism to detect root causes of failures is not available.
- Gap
- Certain failures can trigger many alarms due to their dependency on the
underlying root cause of failure. Knowing the root cause can help filter
out unnecessary and overwhelming alarms.
- Status
- Monasca currently lacks this feature, although the community is aware of it
and working toward supporting it.
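The root-cause filtering described above can be sketched as a walk over a resource dependency graph: an alarm is kept only if none of the resources it depends on (directly or transitively) is itself alarming. The function `root_causes` and the graph encoding are hypothetical, and the sketch assumes an acyclic dependency graph; the `seen` set only prevents infinite recursion.

```python
def root_causes(active_alarms, depends_on):
    """Return the alarms not explained by an alarming (transitive) dependency.

    active_alarms: set of resource names currently alarming
    depends_on:    resource -> list of resources it depends on (assumed acyclic)
    """
    def has_alarming_ancestor(resource, seen):
        for parent in depends_on.get(resource, []):
            if parent in seen:
                continue
            seen.add(parent)
            if parent in active_alarms or has_alarming_ancestor(parent, seen):
                return True
        return False

    return {a for a in active_alarms if not has_alarming_ancestor(a, set())}
```

For example, if a physical host fails, the alarms raised by its hypervisor, VMs and applications are all filtered out and only the host alarm remains.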
2.2.4.3.3.2. Sensor monitoring
- Type: ‘missing (lack of functionality)’
- Description
- To-be
- Monasca should support retrieving and monitoring sensor data, for instance
from IPMI.
- As-is
- Monasca does not monitor sensor data.
- Gap
- Sensor monitoring is missing, although it is very important: it gives
operators the status of the physical infrastructure (e.g. temperature,
fans).
- Addressed by
- Monasca can be configured to use third-party monitoring solutions (e.g.
Nagios, Cacti) for retrieving additional data.
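Sensor monitoring essentially means turning raw IPMI readings into metrics. The sketch below parses pipe-separated lines in the layout assumed for `ipmitool sensor` output (name, reading, unit, status, then thresholds) and emits dicts shaped loosely after a Monasca metric (`name`, `dimensions`, `value`, `timestamp` in milliseconds). It does not invoke ipmitool itself; the `ipmi.` name prefix and the choice of dimensions are hypothetical.

```python
import time


def parse_ipmi_sensor_lines(text, hostname, now_ms=None):
    """Convert pipe-separated IPMI sensor lines into metric dicts (sketch)."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    metrics = []
    for line in text.strip().splitlines():
        cols = [c.strip() for c in line.split("|")]
        if len(cols) < 4 or cols[1] in ("na", ""):
            continue  # skip malformed lines and sensors without a reading
        name, reading, unit, status = cols[:4]
        try:
            value = float(reading)
        except ValueError:
            continue  # discrete sensors report hex states such as 0x1, not numbers
        metrics.append({
            "name": "ipmi." + name.lower().replace(" ", "_"),
            "dimensions": {"hostname": hostname, "unit": unit, "status": status},
            "value": value,
            "timestamp": now_ms,
        })
    return metrics
```

Only numeric (threshold) sensors such as temperatures and fan speeds are kept; discrete sensors would need a separate mapping.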