RRFW Working Draft: Monitor escalation levels

Status: pending implementation. Date: Nov 5 2003. Last revised: Nov 10 2003

Introduction

The initial idea comes from Francois Mikus of the Cricket development team. His proposal was to raise the alarm only after the monitor condition has been true several consecutive times.

The idea has developed into the concept of escalation levels.

Monitor events

The current implementation supports four types of monitor events: set, repeat, clear, and forget. A new event type, escalate(X), will be added. X designates the symbolic name of an escalation level. Each level is associated with an escalation time interval.

Given Te as the escalation interval, Ta as the monitor condition age, and P as the period, the escalation event will occur simultaneously with one of the repeat events, as soon as the following condition is true:

  Te <= Ta

The new event types clear(X) and forget(X) will occur at the same time as clear and forget respectively, once for each level that has escalated.
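The event rules above can be sketched in a few lines of Python. This is only an illustration, not the planned implementation: the function names and the in-memory set of fired levels are hypothetical, and the sketch assumes that a level escalates once, at the first repeat event where the condition age Ta has reached its interval Te.

```python
def escalation_events(age, fired, levels):
    """Return the escalate(X) events to emit together with a repeat event.

    age    -- Ta, seconds the monitor condition has been true
    fired  -- set of level names that have already escalated (updated in place)
    levels -- dict mapping a level name to Te, its interval in seconds
    """
    events = []
    # Visit levels by increasing interval so the events come out in order.
    for name, interval in sorted(levels.items(), key=lambda kv: kv[1]):
        if name not in fired and age >= interval:
            events.append("escalate(%s)" % name)
            fired.add(name)
    return events


def closing_events(kind, fired, levels):
    """Return clear or forget, plus kind(X) for every level that escalated."""
    ordered = sorted(levels.items(), key=lambda kv: kv[1])
    return [kind] + ["%s(%s)" % (kind, n) for n, _ in ordered if n in fired]


levels = {"Medium": 1800, "High": 7200, "Critical": 14400}
fired = set()
print(escalation_events(300, fired, levels))   # []
print(escalation_events(1800, fired, levels))  # ['escalate(Medium)']
print(escalation_events(7500, fired, levels))  # ['escalate(High)']
print(closing_events("clear", fired, levels))  # ['clear', 'clear(Medium)', 'clear(High)']
```

Keeping the fired levels in a set is what prevents a level from escalating again on every later repeat event.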

Monitor parameters

A new parameter will be introduced: escalation. Its value will be a comma-separated list of name=interval pairs, where name designates the escalation level, and interval specifies the escalation interval in seconds.

Example:

  <monitor name="rate-limits">
    <param name="escalation" value="Medium=1800, High=7200, Critical=14400" />
    ...
  </monitor>

Another example would be Cisco TAC style priorities: P3, P2, P1.
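As an illustration of the value format, the following Python sketch splits an escalation value into (name, seconds) pairs. The function name parse_escalation is hypothetical, and tolerating whitespace around the commas is an assumption based on the example above.

```python
def parse_escalation(value):
    """Parse 'name=interval, name=interval, ...' into (name, seconds) pairs."""
    levels = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing comma
        name, sep, interval = part.partition("=")
        if not sep:
            raise ValueError("expected name=interval, got %r" % part)
        # int() raises ValueError if the interval is not a whole number
        levels.append((name.strip(), int(interval)))
    return levels


print(parse_escalation("Medium=1800, High=7200, Critical=14400"))
# [('Medium', 1800), ('High', 7200), ('Critical', 14400)]
```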

Action parameters

The launch-when parameter will be valid not only for exec actions, but also for tset actions. The new valid values will be escalate(X), clear(X), and forget(X).

The XML configuration validator will not verify whether the escalation levels in an action definition match those in the datasource configuration.

A new optional action parameter will be introduced: allowed-time. It contains an RPN expression which must evaluate to true at the moment of execution for the action to be allowed. Two new RPN functions may be used here: TOD and DOW.

TOD returns the current time of day as an integer: HH*100+MM. For example, 830 means 8:30 AM, and 1945 means 7:45 PM.

DOW returns the current day of the week as an integer from 0 to 6 inclusive, with 0 corresponding to Sunday, 1 to Monday, and 6 to Saturday.

In this example, the action is allowed between 8 AM and 6 PM from Monday to Friday:

  <param name="allowed-time">
    TOD,800,GE, TOD,1800,LE, AND,
    DOW,1,GE, AND,
    DOW,5,LE, AND
  </param>
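To make the semantics of TOD and DOW concrete, here is a minimal Python sketch that evaluates the expression above for a given moment. It is not the RRFW RPN engine: the function names are hypothetical, and only the operators used in the example (GE, LE, AND) are implemented.

```python
from datetime import datetime


def tod(now):
    """Time of day as HH*100+MM, e.g. 830 for 8:30 AM."""
    return now.hour * 100 + now.minute


def dow(now):
    """Day of the week, 0 for Sunday through 6 for Saturday."""
    # isoweekday() gives Monday=1 .. Sunday=7, so Sunday wraps to 0
    return now.isoweekday() % 7


def allowed(expr, now):
    """Evaluate a comma-separated RPN expression against the time 'now'."""
    ops = {
        "GE": lambda a, b: a >= b,
        "LE": lambda a, b: a <= b,
        "AND": lambda a, b: bool(a) and bool(b),
    }
    stack = []
    for token in (t.strip() for t in expr.split(",")):
        if not token:
            continue
        if token == "TOD":
            stack.append(tod(now))
        elif token == "DOW":
            stack.append(dow(now))
        elif token in ops:
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[token](a, b))
        else:
            stack.append(int(token))
    return bool(stack.pop())


EXPR = """
    TOD,800,GE, TOD,1800,LE, AND,
    DOW,1,GE, AND,
    DOW,5,LE, AND
"""
print(allowed(EXPR, datetime(2003, 11, 5, 8, 30)))   # Wednesday 8:30 AM: True
print(allowed(EXPR, datetime(2003, 11, 5, 19, 45)))  # Wednesday 7:45 PM: False
print(allowed(EXPR, datetime(2003, 11, 9, 12, 0)))   # Sunday noon: False
```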

Implementation

The monitor_alarms.db database format will change: the values will consist of five colon-separated fields. The first four fields will be the same as before, and the fifth one will be a comma-separated list of the escalation level names that have already fired.
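The new value layout could be read and written roughly as follows. This Python sketch treats the first four fields as opaque strings, since their meaning is not described here; the function names, the placeholder field values f1 to f4, and the acceptance of old four-field values are all assumptions made for illustration.

```python
def parse_alarm_value(value):
    """Split a database value into (first four fields, escalated level names)."""
    fields = value.split(":")
    if len(fields) == 4:
        # Assumption: a value in the old format means nothing has escalated yet.
        fields.append("")
    if len(fields) != 5:
        raise ValueError("expected five colon-separated fields: %r" % value)
    escalated = [name for name in fields[4].split(",") if name]
    return fields[:4], escalated


def format_alarm_value(base_fields, escalated):
    """Join the four original fields and the escalated level names."""
    return ":".join(list(base_fields) + [",".join(escalated)])


base, escalated = parse_alarm_value("f1:f2:f3:f4:Medium,High")
print(escalated)                                            # ['Medium', 'High']
print(format_alarm_value(base, escalated + ["Critical"]))   # f1:f2:f3:f4:Medium,High,Critical
```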

This feature is best implemented after the planned redesign of the monitor daemon. The new monitor design would support an individual schedule for each datasource leaf, analogous to collector schedules.

In turn, the monitor daemon redesign is best done after the collector daemon redesign. That order would allow both daemons to keep a similar design and architecture where possible.


Author

Copyright (c) 2003 Stanislav Sinyagin <ssinyagin@yahoo.com>