Torrus Working Draft: Task launcher collector extension

Status: pending implementation. Last revised: Oct 31 2010

Introduction

The Torrus collector is very efficient in asynchronous data collection, such as SNMP querying. Sometimes it is neccessary to collect synchronous data too. For example, some data may be retrieved via HTTP protocol, or by executing some external program (such as Nagios plugin). Currently such data can be handled by additional Collector modules, but these synchronous tasks interfere with the SNMP collection schedule and may cause unpredictable delays.

This draft proposes a collector extension which would allow to collect synchronous data at large scale without imposing delays to SNMP collection.

Launcher overview

The task launcher is a separate group of processes running in parallel to Torrus collectors.

The Launcher's master process watches its worker child processes, starts new or stops unneeded workers, and controls the execution.

The worker processes pick up new tasks from the task queue. The queue is implemented as a Berkeley DB Queue database. Worker processes can lock and wait for new tasks to be added to the queue, thus minimizing the CPU load when idle.

Once the worker process obtains a task, it executes one of its modules, depending on the task type. This can be an execution of an external utility, or retrieving some external information directly from the worker process.

The worker then saves the numerical results and their time stamps in the results database.

The new collector type 'launcher' is implemented in the Collector process. The collecting jobs are executed in accordance with the standard Collector scheduler. For each token with collector-type set to "launcher", the collector first picks up the previous numerical value from the results database and schedules the RRD update. Then it initiates a new collection task by adding a new record in the task queue. The next available worker process then picks up the task and executes it as soon as possible.

The Launcher master process examines the task queue every 30 seconds and decides on the number of worker processes needed:

There is a fixed minimum and maximum number of worker processes, configured by the local administrator. The Launcher would start the minimum number of workers in the beginning, and then dynamically adapt it to the work load.

The workers only remove the task ID from the task queue, and later store the results. It is the master process which periodically cleans the task data from the database.

There is only one Launcher instance for the whole Torrus installation, serving requests for all the trees.

Data structures

Each launcher task has a unique 32-character identifier which is an MD5 sum of the task details and the time stamp.

launcher_task_ts.db is a Btree database containing timestamps for each task.

launcher_task_data.db is a Btree database containing each individual task's information. Keys are 32-character task IDs, and values are JSON objects with all the parameter values required the task execution. Thus the worker would not need a direct access to the ConfigTree databases.

launcher_queue.db is a Queue database containing 32-character task IDs of active tasks.

launcher_results.db is a Btree database containing the numerical results retrieved by workers. Keys are unique 32-character identifyers, and values are string representations of numeric results.

The Collector process first injects the new task data into the Timestamps and the Data databases, and then inserts a new record into the task queue.

The Task Data structure defines one or more unique Result IDs which the worker would store upon the task completion.

During the task execution, the worker would close all database handles, in order to prevent DB corruption in case of abnormal process abortion.

Usage

Most of the task processing modules will be published as Torrus plugins. The following plugins are planned in the near future:

VmWare server statistics

The VmWare server API allows to collect various performance statistics via an HTTP-based protocol. The retrieved data structures are quite complex, and each query takes several seconds. This is an ideal candidate for the launcher extension.

Nagios plugins

Most of standard Nagios plugins return performance data, such as execution time. This can be used, for example, to monitor a Web application performance. The launcher extension would allow to monitor hundreds or thousands of such performance metrics in an efficient manner.

Distributed Torrus collectors

It is quite typical in large corporate WANs that remote sites are reachable with limited bandwidth and relatively high delay. In this scenario, spoke Torrus servers collect data from local SNMP devices and share the collected data through an HTTP interface. Then the launcher workers on the central hub site retrieve the data for centralized consolidation and reports.


Author

Copyright (c) 2010 Stanislav Sinyagin <ssinyagin@k-open.com>