BMC TrueSight Operations Management Review

BPPM has the potential to be a market-beating product; however, the investment required is significant.


This article is a review of BMC ProactiveNet Performance Manager (BPPM) version 8.6 and its key sub-components.

The key sub-components include:

> ProactiveNet Analytics

> ProactiveNet Event Management (formerly Mastercell)

> ProactiveNet Performance Manager (i.e. PATROL)

Versions Reviewed

| Component | Version |
|---|---|
| BPPM Event Manager | 8.6 |
| BPPM Analytics | 8.6 |
| PATROL Central | 7.8.10 |
| PATROL Central Operator – Web Edition | 7.8.10 |
| PATROL Agent | 3.9.00.1i |
| PATROL for UNIX Servers | 9.10.00.02 |

Key Capabilities

Event Management

BPPM Event Management (previously known as Mastercell or BEM) is the component that replaces PATROL Enterprise Manager or PEM (previously known as CommandPost).

BPPM introduces a programming language called MRL. MRL is not as flexible as PERL or REXX, both of which can be used in PEM, but MRL does include many in-built features, such as policies, that make the design of rules slightly easier.

PEM performed event management using up to 5 transformers, or scripts, written in PERL. PEM was effectively a toolbox in which all the intelligence was provided by the PERL scripts, which enriched the events using a number of lookup files.

Which product is better, PEM or BPPM? BPPM is arguably the better event management platform. Although MRL is frustrating to work with, the in-built capabilities mean that you don’t have to develop everything from scratch. Overall, BPPM is a good event management platform.

Threshold Management

PATROL Configuration Manager (PCM) is one of the best threshold management tools in the industry. The threshold management capabilities on BPPM (aka ProactiveNet) are poor in comparison. BMC state that they will include PCM functionality on the next release of BPPM.

The limitations of Threshold management in BPPM are numerous:

  • BPPM has no local thresholds that can be applied across multiple servers.
  • Local thresholds can only be defined via the GUI.
  • Local thresholds can’t be migrated from one environment to another.
  • Migration of global thresholds can be performed using an export/import utility – but it is not simple.
  • The GUI for managing thresholds is cumbersome and not intuitive.

On the plus side, the different types of thresholds in BPPM are very powerful. BPPM has Absolute, Intelligent, Signature and Predictive thresholds. These thresholds are statistically based and will generate events when a statistical anomaly is detected. The product will automatically calculate trends using linear regression and variations based upon hourly, daily or weekly patterns. However, the statistics will not eliminate threshold management as BMC have sometimes claimed. Many thresholds are Boolean in nature, either good or bad, and are therefore not appropriate for statistical analysis. Statistical analysis is only appropriate for about 20% to 30% of thresholds, and the analysis consumes a lot of CPU cycles.
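To make the distinction concrete, here is a minimal sketch of a statistical threshold next to a Boolean one. It is not BPPM's algorithm: the hourly mean and standard deviation baseline, the function names and the `k=3.0` cut-off are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def hourly_baseline(samples):
    """Group (hour_of_day, value) samples and return {hour: (mean, stdev)}."""
    by_hour = {}
    for hour, value in samples:
        by_hour.setdefault(hour, []).append(value)
    # stdev needs at least two points, so hours with less history are skipped
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items() if len(v) >= 2}

def is_anomaly(baseline, hour, value, k=3.0):
    """True if value is more than k standard deviations from the hourly mean."""
    if hour not in baseline:
        return False  # no baseline yet, so no statistical judgement
    mu, sigma = baseline[hour]
    return abs(value - mu) > k * sigma

def is_down(status):
    """A Boolean check: the state is either good or bad, statistics add nothing."""
    return status != "UP"

# Hypothetical CPU samples: busy at 09:00, quiet at 03:00
history = [(9, 40), (9, 42), (9, 38), (9, 41), (3, 5), (3, 6), (3, 4)]
baseline = hourly_baseline(history)
```

With this baseline, a value of 40 is normal at 09:00 but an anomaly at 03:00, which is the kind of pattern a fixed absolute threshold cannot express. A check such as `is_down` gains nothing from a baseline.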

Ease of Implementation

BPPM is undeniably a complex product. Far too complex in my opinion. There are many other much simpler solutions such as HP SiteScope or CA Nimsoft which can be implemented much faster. In addition, the BMC Product Set has gradually got more and more complex over the years. The solution is really three products bundled together:

  • MasterCell which BMC purchased about 7 years ago.
  • ProactiveNet which BMC purchased about 4 years ago.
  • PATROL which BMC purchased about 20 years ago.

MasterCell is a great event management product. ProactiveNet has perhaps been oversold by BMC, and its value is overstated. The autonomous thresholds can only be applied to 20% to 30% of parameters anyway. PATROL was originally a great product, but has become bloated and complex after years of poor product management.

As an illustration of how complex the BPPM solution has become, consider the following table:

| Component / Feature | Old Solution with PEM | New BPPM Solution (version 8.6) |
|---|---|---|
| Number of Servers | 3 (DEV, DR and PROD) | 11 (3 DEV, 3 TEST, 5 PROD) |
| Number of Connections to the Agents | 2 (PEM and RT Server) | 3 (BIIP3, BPPM Adaptor, RT Server) |
| Number of Adaptors | 1 (RT Server) | 3 (RT Server, BPPM Adaptor, BIIP3) |
| Dynamic Policy Files (for Rules) | 5 Rule Files | 12 Rule Files |
| Forms for Threshold Management | 1 (PCM) | 2 (TEST and PROD BPPM Servers) |

Extensibility

The PATROL agent has always been very extensible. There is a rich API and many different ways to write an interface. PATROL Central has no API and therefore cannot be extended. Both BPPM and PEM are very extensible and can be extended through a variety of scripting languages such as PhP or PERL.

Blackout

BMC has never provided a web form that allows staff in the Operations Bridge to blackout servers or services for upcoming outages due to planned maintenance. This customer (mentioned in this review) had to write its own Web GUI for Blackout. This is an Apache and PhP solution that allows the shift operators to configure blackouts. It required 25 days of development to alter the blackout web form and migrate this functionality from PEM to BPPM.

Administration

Routine Daily Admin Tasks

For an environment of 500 Agents, BPPM requires from 0.5 to 1 FTE to keep the lights on - depending on the experience of the person. Typical daily tasks include the following:

  • Restarting Agents. For an environment of 500 Agents, you can expect that 1 agent will crash per day. The most common cause is probably history file corruption. History files can grow to beyond 4 GB if not managed.
  • Checking the Consoles. Most environments will end up with a hierarchy of BPPM Event cells. The Administrator needs to log into each Console to verify that events are being:
    • De-duplicated properly;
    • Propagated correctly from one cell to the next;
    • Raised as incidents correctly, if Automatic Incident Generation (AIG) is configured.
  • Managing Thresholds. The Administrator will get on average one request per day to change a threshold or verify that a threshold is in place. For example, an ORACLE DBA may say that there was a SEV2 incident last night related to table locking. "Could you please check that instance DW_PROD is monitored for locking?" It can take from 30 minutes to 2 hours to investigate each request and write an email suggesting and agreeing the new threshold. Perhaps longer if a meeting is required.
  • Managing Rules. Changes to the BPPM Rules occur about once per month and need to be performed using change control. Rule changes require a code change to the MRL and the cells will need to be bounced.
  • Commissioning and Decommissioning Agents. Agent commissioning usually occurs every few months and may involve up to 20 virtual hosts associated with one physical machine. The commissioning process is fairly involved (in fact all the Admin steps are complex). See below.
  • Deploying KMs. When the support teams deploy new infrastructure software such as Websphere or ORACLE, the associated PATROL Knowledge Module (KM) will also need to be deployed. Each deployment may take 1-3 hours and will require change control. Input will be required from the SME. For example, the ORACLE DBA may be required to type in the system password for ORACLE during the KM Configuration process.

PATROL Agent Commissioning

The Agent commissioning process for configuring monitoring for a new server consists of the steps shown below:

| Step Number | Step | Description |
|---|---|---|
| 1 | Ping Host | Ping the host to verify that the hostname is correct. |
| 2 | Install Agent | Install the Agent using the Solaris package. |
| 3 | Update Event Rules | Edit the BPPM enrichment file abc_host.csv. |
| 4 | Apply to PROD Cell | Import abc_host.csv into the PROD cell. |
| 5 | Apply to TEST Cell | Import abc_host.csv into the TEST cell. |
| 6 | Update PING Test (primary) | Update the PING Test configuration on the Primary Server to ensure the host is up. |
| 7 | Update PING Test (secondary) | Update the PING Test configuration on the Secondary Server to ensure the host is up. |
| 8 | Configure UNIX km | Use PCM to give the Agent the standard configuration for the UNIX km. |
| 9 | Update BIIP3 | Update the BIIP3 configuration so that the Agent can talk to the Event Management Cell. |
| 10 | Agent Restart | Restart the Agent to ensure that the Agent configuration takes effect. |
| 11 | Update PCO Web Console | Update the PCO Web Console so that the Agent appears in the PATROL console. |
| 12 | Update Work request | Update the Work request to indicate the job is complete. |

If additional Monitoring is required for ORACLE or WEBLOGIC or some other Application, then there are additional configuration steps that are required.

Programming Languages

There are two languages to learn with BPPM:

  • MRL or Mastercell Rule Language. This is a fairly unique programming language.
  • PSL or PATROL Script Language. This language is similar to PERL. The complexity lies in the functions that need to be learned.

Summary of Administration

Administration of BPPM is overly complex. The product has evolved over the course of the last 20 years. As each new component has been added via acquisition, the product has become increasingly complex and time consuming to administer.

Architectural Considerations

Any Solution Design for BPPM should consider the following key questions:

Question

Details

How does the design allow for rule tracing?

Using the trace log is not practical due to the volume of events. A good solution is to assign a Unique ID to each rule and then configure each rule to add an entry to a new slot called “matching_rules”.

How does the design specify rule execution order?

It is often difficult to design rules because of confusion about rule execution order. It is good practice to split all mrl files into mrl files for new rules and mrl files for refine rules. So you get: new_mcxp.mrl and refine_mcxp.mrl. The files then should be grouped in the .load file by stage, so you have refine rules followed by new rules … etc.

Does the DEV environment have the same number of cells as the TEST and OAT environments?

Don’t be tempted to have fewer cells in the DEV environment. It is tempting to have fewer cells in order to limit the number of zones (servers) required. This is a mistake. Rule execution order is greatly affected by the propagation (or not) of slots between cells and the configuration of mcell.propagate.

Does the design specify the configuration of mcell.propagate?

The design should specify the configuration of all mcell config files – including mcell.propagate, mcell.dir etc.

Is BIIP3 included in the Design?

BIIP3 is essential in order to forward PATROL events to the cells for any events that are not event class 11 and 39. These events are explicitly generated by the PSL event_trigger() function. It is impossible for BPPM Analytics (ProactiveNet) to collect these events because they have no associated metric.

Threshold Management

If thresholds are being migrated from PCM to BPPM, how will the thresholds be migrated from one BPPM server to another? Has the export / import process been thoroughly tested? (It has serious issues.)

I would advise migrating the thresholds to BPPM as a Phase II activity or wait for BPPM v9.

Export Thresholds from PCM

Does the design specify using a tool for extracting all the thresholds from PCM into a spreadsheet? (I have a PERL tool to do this).

Testing

Does the Design provide for at least a month of end-to-end testing once the rules have been completed?

Monitoring the Monitoring

Does the Design incorporate monitoring of the monitoring? Will an event be generated if the BIIP3 Adapter fails?

Event Storm

If the BIIP3 Adaptor loses its connection to multiple agents every half an hour and then regains the connection 30 seconds later, this will create 200 new AGENT_DOWN events (mc_adapter_control). The de-dup rule will not work because the AGENT_UP event closes the AGENT_DOWN event. What rule is going to prevent this event storm?

Time-out Policies

Does the Design specify timeout policies for all the main top level event classes such as MC_CELL.. and EVENT. Does the cell start reasonably quickly with 2000 events? What about 20,000 events?

DDE Enrichment

Does the Design fully specify the Enrichment files that will be used?

DDE Synchronization

Are the DDE config files pulled or pushed into the cells? How are the DDE cfg files synchronized between cells?

Blackout

Has a Web site been included in the Design for Blackout by the Operations Bridge? BPPM does have a “Schedule downtime” facility – but this is entirely inappropriate for operators and does not account for BIIP3 events.

Blackout Dev

If a blackout GUI is a requirement, has a month of Development been allocated (using something like Apache and PhP)?

BPPM Analytics

Does the Design discuss the possibility of implementing BPPM Analytics as a second phase?

Reporting

Does the design include Event Reporting to drive Continuous Improvement? Key reports are total events grouped by:

  • ·Day, Week, Month
  • ·Object Class
  • ·Application
  • ·Service
  • ·Support Group

Reporting DEV

If reporting is a requirement, does the Design include time to implement the BMC reporting tool or 2 weeks of development using PhP and mquery.

AIG

Does the Design include Automatic Incident Generation (AIG)? Semi-automatic incident generation is an option, whereby an operator creates a ticket by right clicking on an event. Is this option considered and discussed in the design?

Failover

Is failover considered? How is the configuration replicated? Replicated DISK?

Training

Does the project plan include time for training the staff in the Operations Bridge? What about 2nd level support?

Go-live

Is the Go-Live big bang or Phased? Phased is preferred for risk mitigation but will require operators to run two consoles in parallel.

Audible Alarm

Is an Audible alarm a requirement? If so, then this will require a few days of development to configure a web page that uses a sound file and “mquery -s COUNT”.

BPPM Classes

BPPM Has a number of event classes as shown below which all inherit from the CORE_EVENT class.

CORE_EVENT

  • EVENT
    • MC_CELL_EVENT
    • MC_UPDATE_EVENT
    • MC_SMC_ROOT
    • MC_MCCS
    • MC_CLIENT_BASE
      • MC_CLIENT_CONTROL
      • MC_CLIENT_ERROR
    • MC_ADAPTER_BASE
      • MC_ADAPTER_CONTROL
      • WIN_EVENTLOG
      • LOGFILE_BASE
      • SNMP_TRAP
    • PEM_EV
    • PATROL_EV
    • PPM_EV
      • ALARM
  • MC_CELL_CONTROL
    • MC_CELL_START
    • MC_CELL_STOP
    • MC_CELL_TICK
    • MC_CELL_STATBLD_START
    • MC_CELL_STATBLD_STOP
    • MC_CELL_DB_CLEANUP
    • MC_CELL_CONNECT
    • MC_CELL_CLIENT
    • MC_CELL_DESTINATION_UNREACHABLE
    • MC_CELL_HEARTBEAT_EVT
    • MC_CELL_RESOURCES
    • MC_CELL_ACTION_RESULT
    • MC_CELL_PUBLISH_RESULT
  • IAS_EVENT
    • IAS_START
    • IAS_STOP
    • IAS_SYNCH_EVENT
    • IAS_REINIT
    • IAS_LOGIN
    • IAS_ERROR

Mastercell Rule Language (MRL)

Mastercell Rule Language (or MRL) is the language used to develop event management rules within BPPM. The administrator can develop 11 different types of rules, as shown in the table in section "Rule Phases" below. The language is simple and relatively easy to learn in terms of both the syntax and the in-built functions. The most difficult concept to grasp is the execution order, as explained below. One of the most common problems is to misunderstand the execution order and find that the rules are not executing in the desired sequence. The other cause of frustration is the lack of common statements such as looping structures (do, while, for, until), which one takes for granted in other languages. It is possible to iterate over a list structure using the listwalk() function call, and the New rule phase has a limited capability to loop over events using the Updates clause. Fortunately, the need to loop is fairly rare.

The biggest problem with MRL is the slow cycle time when debugging code. Compared to PhP or PERL, it takes at least ten times as long to stop, compile and restart, so debugging cycles are ten times as long and productivity is similarly affected. True, it is not necessary to write pages and pages of code - but typically one will write about 8-15 pages of MRL for each project. 8 pages of PhP (tested and debugged) takes 1 to 2 days. 8 pages of MRL (tested and debugged) takes 2-4 weeks. In addition, one should allow an additional month of End-to-End testing before production go-live to test the rules with real events - and to allow for all possible scenarios to play out and for all the bugs to emerge. These rules of thumb apply to companies of 5,000 to 10,000 employees. For larger organizations, you should allow more time.

Execution Order

  • Rules are processed in order according to their rule phase as shown below.
  • Rules are executed in the order in which they appear in the .load file.
  • Rules are executed in the order in which they appear in the mrl file.
  • Policies are executed in order of the specified “execution order”.
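The first three ordering rules can be sketched as a single sort key. This is an illustration only, not how the cell is implemented: it assumes that rule phase takes precedence, then the position of the mrl file in the .load file, then the position of the rule within its file. The `Rule` class and the rule names are hypothetical; the phase order and the file names follow this review.

```python
from dataclasses import dataclass

# Phase order as listed in the Rule Phases section of this review
PHASES = ["refine", "filter", "regulate", "new", "abstract", "correlate",
          "execute", "threshold", "propagate", "timer", "delete"]

@dataclass
class Rule:
    name: str       # hypothetical rule name
    phase: str      # one of PHASES
    mrl_file: str   # file the rule lives in
    position: int   # index of the rule within its mrl file

def execution_order(rules, load_order):
    """Sort by phase, then by the file's place in the .load file, then by position in the file."""
    return sorted(rules, key=lambda r: (PHASES.index(r.phase),
                                        load_order.index(r.mrl_file),
                                        r.position))

rules = [
    Rule("set_severity", "new", "new_mcxp", 0),
    Rule("check_host", "refine", "refine_mcxp", 1),
    Rule("validate", "refine", "refine_mcxp", 0),
]
order = execution_order(rules, ["new_mcxp", "refine_mcxp"])
names = [r.name for r in order]
```

Under these assumptions the refine rules still run first even though new_mcxp is listed first in the .load file, which is the kind of surprise that makes execution order hard to reason about.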

Rule Phases

Rules are executed in the order shown below.

Execution Order

Rule Phase

Description

1

Refine

A Refine rule verifies the validity of incoming events and collects additional data for an event before it is sent through the remaining rule phases where further processing takes place.

2

Filter

Filter rules limit the number of incoming events by discarding those events that need no additional processing or analysis. Filter rules compare incoming events to the event condition formulas (ECFs) contained in the rule to determine if an event is discarded or proceeds to further processing. An incoming event is processed through each Filter rule until a Filter rule discards the event, or all Filter rules are exhausted. An event must match all the Filter rules to be accepted.

3

Regulate

Use regulate rules to handle time frequency accumulations of events or repetitive occurrences of events. An event is considered a repetition of another if the event has the same values for all the slots that are defined with the dup_detect=yes facet in the BAROC definition of its event class.

4

New

Use New rules to execute an action when a new event is received, for example increasing the severity level for an event or updating an existing event with new event data. New rules determine if an event becomes permanent and is placed in the repository.

5

Abstract

Abstract rules create high-level, or abstract, events based on low-level events. A new event starts at the new rules phase, skipping the filter and regulate rules phases. With Abstract rules, you can keep low-level events with cells in the lower-level of the cell hierarchy, abstract the data from low-level events into high-level events, and propagate them to a higher-level cell. A high-level cell in the hierarchy can consolidate abstract events from several low-level cells and prevent a large number of abstracted technical events for which no consolidating rules apply.

6

Correlate

Correlate rules build an effect-to-cause relationship between an event that occurs as a result of another event. Correlate rules execute whenever a cause or an effect event is received. The relationship between correlated events can be broken.

7

Execute

The Execute rule performs a specified action when a slot value has changed in the repository. The specified action, which is either internal to the cell or running an external executable, is based on the characteristics of one or more events.

8

Threshold

The Threshold rule counts the number of events that match the criteria you specify. If the number of these events exceeds the amount allowed within a time frame, the Threshold rule executes.

An event is considered a repetition of another if the event has the same values for all the slots that are defined with the dup_detect=yes facet in the BAROC definition of its event class.

9

Propagate

A cell uses Propagate rules to forward events or messages to one or more destination cells or gateways. For example, a Propagate rule can escalate an event from a lower level cell to a higher-level cell in an environment.

10

Timer

Use Timer rules to create timed triggers to call a rule. Timer rules are evaluated when a timer expires.

11

Delete

The purpose of Delete rules is to perform actions before an event is discarded from the repository, such as a rule that suppresses data that has no meaning without an event instance. Delete rules are evaluated whenever an event is deleted from the repository or when events are deleted using the Delete flag in the mposter command.

PATROL Configuration Manager (PCM)

PATROL Configuration Manager (PCM) is a configuration tool used for PATROL agents. The tool is mainly used for configuring Thresholds and is very effective at this task.

Operation

PCM is similar in concept to the Windows registry editor. The Main Form consists of two TreeView panes as shown below. The left TreeView is used to configure hosts, which are arranged in groups such as ORACLE (shown below). The right hand TreeView is used to manage the rules, which can also be arranged into groups. The RuleSets are linked to the Hosts by dragging RuleSets from right to left. The RuleSets are dragged and dropped onto the leaves marked "LinkedRuleSets". The user then invokes a command called "Apply RuleSets". The RuleSets are applied to each Agent in the same order as they appear in the hierarchy on the left. RuleSets linked to lower level nodes take precedence and "override" higher level group RuleSets.

PCM

Typical Use Case

The use of PCM typically follows a four step process. Administrators must perform the following:

  1. Select an Agent as a master and configure this Agent using the PATROL Central Operator (PCO) Console.
  2. Copy the configuration into PCM.
  3. Apply the configuration to other similar Agents using PCM.
  4. Restart the Agents in order for the configuration to take effect.

Weakness

The key weaknesses of this configuration process are the following:

  1. PCM and PCO are separate tools. Ideally, the configuration tool (PCO) and the configuration distribution tool (PCM) should be the same product. This would eliminate step 2 above.
  2. Step 4 should not be necessary. Restarting the Agents can be easily performed using PCM - but the problem is that all active events are regenerated. This means that all agents must be blacked out for up to an hour before any restart - otherwise staff in the Operations Bridge will see hundreds of duplicate events that they have already handled over the last few hours.

Desired State Management

The key benefit of PCM is that it can be used to manage a Desired State for each Agent. If you apply the configuration once or a thousand times, the result is exactly the same. The hierarchy allows one to set global or default configuration using the higher nodes in the left TreeView and then to override the configuration with local (host specific) configuration using the lower nodes. This hierarchy works extremely well.
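The override and idempotency behaviour can be sketched in a few lines. This is a simplified model and not PCM's implementation: the node names, the variable names such as `FSCapacity.alarm` and the dictionary representation are all assumptions made for illustration.

```python
def desired_state(path, rulesets):
    """Merge RuleSets from the top of the hierarchy down to the host; lower nodes override."""
    config = {}
    for node in path:  # e.g. ["ALL", "ORACLE", "dbhost01"]
        config.update(rulesets.get(node, {}))
    return config

def apply(agent_config, desired):
    """Idempotent apply: set every desired variable, however often it is called."""
    agent_config.update(desired)
    return agent_config

# Hypothetical RuleSets linked at three levels of the hierarchy
rulesets = {
    "ALL":      {"FSCapacity.alarm": 95},
    "ORACLE":   {"FSCapacity.alarm": 90, "Locks.alarm": 10},
    "dbhost01": {"FSCapacity.alarm": 85},
}
desired = desired_state(["ALL", "ORACLE", "dbhost01"], rulesets)
once = apply({}, desired)
twice = apply(dict(once), desired)
```

Here the host level value of 85 wins over the group and global defaults, and applying the same desired state a second time changes nothing.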

Policies

The Policies feature within BPPM Event Management is generally a well executed feature within the product and has sufficient flexibility to meet most customers' needs. The Dynamic Data Enrichment (DDE) policies allow the user to manage the rules externally using Comma Separated Value (CSV) files.

The key thing to keep in mind is that the DDE policies match based on Best Fit and not First Match. For example, if the CSV file contains the hostname pattern "fred*" (the star is a wild card) and an entry for "frederick", then "frederick" will match before "fred*" even if "fred*" appears first in the CSV file. The rules are loaded into a hash memory structure within the product. The benefit of "Best Fit" is that the execution time for finding a match is predictable, irrespective of the number of lines in the CSV file (and there could be thousands). The disadvantage of "Best Fit" is that the matching can be out of sequence and counter-intuitive. Best practice in this case is to keep the CSV files simple. Each enrichment file should also have only one purpose. For example, the customer used in this review originally started with 5 enrichment files in their old PATROL Enterprise Manager (PEM) environment. After implementing BPPM, the customer ended up with 11 DDE enrichment files. The total number of lines was lower, but the number of files was higher.

When migrating from PEM to BPPM, the enrichment files should be "Normalized" - by minimizing the number of lookup columns in order to reduce the probability of out-of-order rule matching.
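The difference between the two matching strategies can be illustrated with a short sketch. It does not reproduce the product's internal algorithm: "best fit" is approximated here as the matching pattern with the most literal (non-wildcard) characters, which is an assumption, and the rule values are hypothetical.

```python
from fnmatch import fnmatchcase

# (pattern, value) pairs in the order they appear in a hypothetical CSV file
RULES = [("fred*", "generic"), ("frederick", "specific")]

def first_match(host, rules):
    """Return the value of the first pattern in file order that matches."""
    for pattern, value in rules:
        if fnmatchcase(host, pattern):
            return value
    return None

def best_fit(host, rules):
    """Return the value of the most specific matching pattern, ignoring file order."""
    # Assumption: specificity = number of literal characters in the pattern
    matches = [(len(p.replace("*", "")), v) for p, v in rules if fnmatchcase(host, p)]
    return max(matches, key=lambda m: m[0])[1] if matches else None

first = first_match("frederick", RULES)
best = best_fit("frederick", RULES)
```

In this sketch `first_match` returns the wildcard rule because it comes first in the file, while `best_fit` returns the exact entry. Keeping each file to a single purpose with few lookup columns reduces the number of places where the two strategies can disagree.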

BMC Standard Policies

Policy

Description

Closure

A closure policy closes a specified event when a separate specified event is received.

Blackout Policy

A blackout policy might be used during a maintenance window or holiday period.

Component Based Enrichment

enriches the definition of an event associated with a component by assigning selected component slot definitions to the event slots

Enrichment

enriches the definition of an event associated with a component by assigning selected component slot definitions to the event slots

Correlation

Correlation relates one or more cause events to an effect event, and can close the effect event. The cell maintains the association between these cause-and-effect events.

Escalation

Escalation raises or lowers the priority level of an event after a specified period of time. A specified number of event recurrences can also trigger escalation of an event. For example, if the abnormally high temperature of a storage device goes unchecked for 10 minutes or if a cell receives more than five high-temperature warning events in 25 minutes, an escalation event management policy might increase the priority level of the event to critical.

Notification

Notification sends a request to an external service to notify a user or group of users of the event. A notification event management policy might notify a system administrator by means of a pager about the imminent unavailability of a mission-critical piece of storage hardware.

Propagation

Propagation forwards events to other cells or to integrations to other products.

Recurrence

Recurrence combines duplicate events into one event that maintains a counter of the number of duplicates.

Remote

Remote action automatically calls a specified action rule provided the incoming event satisfies the remote execution policy’s event criteria.

Suppression

Suppression specifies which events the receiving cell should delete. Unlike a blackout event management policy, the suppression event management policy maintains no record of the deleted event.

Threshold

Threshold specifies a minimum number of duplicate events that must occur within a specific period of time before the cell accepts the event. For events allowed to pass through to the cell, the event severity can be escalated or de-escalated a relative number of levels or set to a specific level. If the event occurrence rate falls below a specified level, the cell can take action against the event, such as changing the event to closed or acknowledged status.

Timeout

Timeout changes an event status to closed after a specified period of time elapses.

Component Based Blackout

Specifies which events the receiving cell should classify as unimportant and therefore not process. The events are logged for reporting purposes. A Component Based Blackout event management policy might specify that the cell ignore events generated from a component or device based on component selection criteria for this policy.

Typical DDE Enrichment Files

| CSV File Name | Description | Lookup Columns | Data Columns |
|---|---|---|---|
| Host.csv | Assign Location and HostType (DEV, TEST or PROD) based on host name | HostName | Location, PhysicalServer, HostType |
| HostSuppress.csv | Filter out events based on hostname (e.g. when a new Agent is installed) | HostName | HostSuppress (YES,NO) |
| Application.csv | Assign an application name to each event | ApplicationClass, Parameter | Application |
| ObjectSuppress.csv | Filter out troublesome parameters based on Event class | ApplicationClass, Parameter, EventClass | ObjectSuppress (YES,NO) |
| ApplicationSuppress.csv | Filter out events based on application | Application | ApplicationSuppress (YES,NO) |
| HostBlackout.csv | Blackout hosts for planned outages based on timeframe | HostName, PhysicalServer, Location | TimeFrame |
| Service.csv | Assign Service Name to all events | Host, Instance, HostType | Service, SupportGroup |
| ServiceSuppress.csv | Filter out events based on service | Service | ServiceSuppress (YES,NO) |
| ServiceBlackout.csv | Blackout services for planned outages during a particular time frame | Service | TimeFrame |
| ServiceDowngrade.csv | Downgrade severity for particular services | Service | SeverityCode (e.g. 12333) |
| TextMessage | Change message text for certain parameters | ApplicationName, Parameter, EventClass | NewMessage |

Note: A SeverityCode of 12333 downgrades MAJOR (4) and CRITICAL (5) to MINOR (3).
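One way to read the note above is that the code is a positional lookup: the digit in position n is the new severity for an event whose original severity is n. That reading is an assumption inferred from the single example given, and the function below is a hypothetical sketch rather than anything shipped with the product.

```python
def remap_severity(severity, code):
    """Return the new severity: digit number `severity` (1-based) of `code`."""
    digits = str(code)
    if not 1 <= severity <= len(digits):
        return severity  # outside the range covered by the code: leave unchanged
    return int(digits[severity - 1])

# With code 12333, severities 1 to 3 are unchanged and 4 and 5 become 3
remapped = [remap_severity(s, 12333) for s in (1, 2, 3, 4, 5)]
```

Under this reading a code of 12345 would leave every severity untouched.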

Issues

PATROL Agent Restart

If the PATROL agent’s configuration is changed, then the agent usually requires a restart. Unfortunately, the PATROL Agent regenerates all active events (any parameter that exceeds a threshold) when the agent is restarted. This means that an agent must be blacked out whenever it is restarted.

PATROL Agent History Corruption

The Agent History file will always get corrupted if it exceeds 4 GB. There is a 4 GB file size limit on Solaris. The history file will frequently exceed this limit on busy servers running messaging services such as Tuxedo or MQ (simply because there is a lot to monitor). The history file may also get corrupted for other reasons. When the history file gets corrupted, the Agent will generate an event for every attempt to store a parameter value. This problem can generate hundreds of events every few minutes from just one host. This number of events can easily overload a cell and a BIIP3 Adaptor (see BIIP3 Cache File Corruption below).

With 500 UNIX Agents, you should expect one agent to get corrupt history about every 2 weeks.
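Since the 4 GB limit is known in advance, a simple size check can flag a history file before it reaches the limit. The following is a minimal sketch and not a BMC tool: the 80% warning margin and the function names are assumptions, and the path of the history file has to be supplied by the caller.

```python
import os

LIMIT_BYTES = 4 * 1024**3   # the 4 GB limit described above
WARN_FRACTION = 0.8         # hypothetical margin: warn at 80% of the limit

def classify_size(size, limit=LIMIT_BYTES, warn_fraction=WARN_FRACTION):
    """Classify a file size as OK, WARN or OVER_LIMIT relative to the limit."""
    if size >= limit:
        return "OVER_LIMIT"
    if size >= warn_fraction * limit:
        return "WARN"
    return "OK"

def history_status(path):
    """Check the size of one history file on disk."""
    return classify_size(os.path.getsize(path))
```

Run from a scheduler against each agent's history file, a check like this would give the Administrator time to trim the file before corruption occurs.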

BIIP3 Cache File Corruption

If the BIIP3 cache file is corrupted, the BIIP3 can get stuck on one event and keep generating the event. I have seen 4 million repeated events in a cell due to this problem.

BIIP3 Cache file corruption may be caused by overload (see PATROL Agent History Corruption above).

I have seen this problem occur twice within 3 months.

The workaround is to clear the cache file and restart the BIIP3 Adaptor.

BIIP3 Agent Connection Drops

In certain situations, the BIIP3 Adaptor may lose its connection with all the agents every half an hour. The Agent will then regain the connection almost immediately. This causes a flapping AGENT_DOWN and AGENT_UP condition that is not de-duplicated, because the AGENT_UP clears the AGENT_DOWN event. This issue can generate thousands of events and thousands of new Incidents (assuming Automatic Incident Generation is implemented).

The best workaround is to create a new rule for MC_ADAPTER_CONTROL (AGENT_DOWN) events that initially sets them to severity INFO. If the Agent is truly down, the rule should set the severity back to WARNING or ALARM when the second agent down event arrives (which occurs 3 minutes later).

The problem is also solved by restarting the BIIP3 Adapter. I therefore suggest that all customers schedule a restart of the BIIP3 adaptors once per day. No events are lost because the BIIP3 adapter (and the PATROL Agent) caches all events.

I have seen this problem about once per month with a population of 500 agents.
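The logic of the severity workaround described above can be sketched outside MRL as follows. This is an illustration of the idea and not rule code for the cell: the class name, the event representation and the 300 second window (chosen so that it covers the 3 minute repeat) are assumptions.

```python
ESCALATE_WINDOW = 300  # seconds; hypothetical, long enough to cover the 3-minute repeat

class AgentDownDamper:
    def __init__(self, window=ESCALATE_WINDOW):
        self.window = window
        self.first_down = {}  # host -> time of the first unconfirmed AGENT_DOWN

    def on_event(self, host, kind, ts):
        """Return the severity to assign, or None when the event only clears state."""
        if kind == "AGENT_UP":
            self.first_down.pop(host, None)  # connection restored: forget the pending down
            return None
        started = self.first_down.get(host)
        if started is not None and ts - started <= self.window:
            return "WARNING"                 # a repeated down with no up in between
        self.first_down[host] = ts
        return "INFO"                        # first down: may only be a flap

damper = AgentDownDamper()
flap = [damper.on_event("host1", "AGENT_DOWN", 0),
        damper.on_event("host1", "AGENT_UP", 30)]
real = [damper.on_event("host1", "AGENT_DOWN", 1800),
        damper.on_event("host1", "AGENT_DOWN", 1980)]
```

A flap produces only an INFO event that is then cleared, whereas a second AGENT_DOWN three minutes after the first is escalated to WARNING.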

BPPM Threshold Migration

The migration of both global and local thresholds from one BPPM Analytics instance to another must be performed by hand. There is an export / import mechanism for global thresholds, but as of July 2012, this mechanism is unreliable. There is no import / export mechanism for local (host specific) thresholds.

BPPM Local Instance thresholds

BPPM Analytics does not support instance specific thresholds. In other words, you cannot set a default threshold for FSCapacity across all file systems, then set an instance specific threshold that applies only to the root file system, and then apply this instance specific threshold to all hosts. The instance specific threshold must be individually defined on every host. If there are 500 hosts, this becomes unfeasible. There is no script or API that can be used to automate this task.
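
To make the missing capability concrete, the sketch below models the lookup order described above: a host specific value first, then an instance specific default shared by all hosts, then the global default. This is a Python illustration of the desired behaviour, not something BPPM provides. The parameter name FSCapacity and the root file system come from the text; the host name and all numeric values are invented.

```python
# Hypothetical threshold data (values invented for illustration).
GLOBAL_DEFAULTS = {"FSCapacity": 90}                     # all instances, all hosts
INSTANCE_DEFAULTS = {("FSCapacity", "/"): 80}            # root file system, all hosts
HOST_OVERRIDES = {("dbhost01", "FSCapacity", "/"): 70}   # one instance on one host


def resolve_threshold(host, parameter, instance):
    """Return the most specific threshold that applies to this instance."""
    host_key = (host, parameter, instance)
    if host_key in HOST_OVERRIDES:
        return HOST_OVERRIDES[host_key]
    instance_key = (parameter, instance)
    if instance_key in INSTANCE_DEFAULTS:
        return INSTANCE_DEFAULTS[instance_key]
    return GLOBAL_DEFAULTS[parameter]
```

The middle layer is the one that is missing: without it, the single `INSTANCE_DEFAULTS` entry would have to be repeated as a host specific entry for each of the 500 hosts.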

BPPM – Missing Hosts

With this release of BPPM, the PATROL Agents are connected to BPPM Analytics using the BPPM Adaptor. When you use the Graphing facility to graph parameters in BPPM, some of the hosts do not appear, even though they are connected via the Adapter. At the time of writing, this case is open with BMC and is unresolved.

BPPM does not support Custom Event Catalogues

PATROL Events that are triggered using the event_trigger() PSL function are not supported by BPPM Analytics (ProactiveNet). This forces all customers (who use PATROL agents) to implement both the BIIP3 Adapter (for event_trigger() events) and the BPPM Adapter for all standard PATROL metrics (that have an underlying parameter).

This means that the adapter layer with a BPPM implementation is quite complex. There are three Adapters attached to every agent on three separate ports. The Adapters are the RTServer, the BIIP3 Adapter, and the BPPM Adapter.

This complexity means that the implementation becomes fragile, complex to administer and fundamentally unreliable.

LOG monitoring

It is difficult to define catch-all rules using the standard BMC Log monitoring KM. For example, it is possible to create a catch-all rule that triggers on the search string "ALARM". You then give this definition a custom origin, which might be something like "LOG.BANKING_app_log.alarm". You then create a custom event message that inserts the line from the log file into the text of the message. This can be done with the syntax "%1-". The problem occurs at the event management layer. All events that match this rule will get rolled up into one event as duplicates, despite the fact that each event represents a different line from the log file and a different problem.

The workaround is to change the de-duplication rules at the event management layer. Be careful: if the rules are improperly defined, you can make the product vulnerable to an event storm, which may only manifest itself a month or two later.
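
The effect of the de-duplication key can be shown with a small Python sketch. It is not MRL and does not reproduce the cell's actual de-duplication rules. The field names, the host name and the sample messages are assumptions; only the origin string is taken from the example above.

```python
def roll_up(events, key_fields):
    """Group events by the given fields and return {key: number_of_events}."""
    counts = {}
    for event in events:
        key = tuple(event[field] for field in key_fields)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Three hypothetical events raised by the same catch-all rule.
EVENTS = [
    {"host": "bank01", "origin": "LOG.BANKING_app_log.alarm",
     "msg": "ALARM: ledger queue full"},
    {"host": "bank01", "origin": "LOG.BANKING_app_log.alarm",
     "msg": "ALARM: payment gateway timeout"},
    {"host": "bank01", "origin": "LOG.BANKING_app_log.alarm",
     "msg": "ALARM: ledger queue full"},
]

# Keyed on host and origin only: every line collapses into a single event.
by_origin = roll_up(EVENTS, ("host", "origin"))

# Keyed on host, origin and message text: different problems stay separate.
by_message = roll_up(EVENTS, ("host", "origin", "msg"))
```

Adding the message text to the key separates the two distinct problems, but it also shows where the storm risk comes from: if every log line differs slightly (for example, because it embeds a timestamp), nothing is ever de-duplicated.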

Monitoring of the monitoring is insufficient.

Typical Project

Project Background

The review was conducted after an upgrade project in which every component within an old PATROL environment was upgraded. The project was driven by the customer’s internal audit organization, which reviewed the company’s products and determined that PATROL Enterprise Manager (PEM) was no longer supported and that the whole environment should therefore be upgraded.

Project Phases

The project consisted of a number of separate projects which could have been undertaken individually. The customer chose to perform all three projects simultaneously, which increased the risk, complexity and length of the overall project.

Phase | Description
Phase 1 | Solution Design
Phase 2 | Upgrade of the PATROL Agents and Knowledge Modules
Phase 3 | Replacement of PEM with BPPM Event Manager
Phase 4 | Introduction of BPPM Analytics

Project Timescales

The Solution Design phase was conducted in late 2011 and the implementation was started immediately after the New Year in 2012. Phase 3 of the solution was finally put into production on Thursday 28th June 2012.

Phase 4 of the project has not yet been completed. Phase 4 was removed from the project scope when the customer fell behind on delivery. Currently, there are no plans to complete this phase of the project.

The customer contracted several months of consultancy from BMC Software. BMC performed the initial Solution Design and much of the initial configuration of the event management rules.

Resources

The resources assigned to the project consisted of the following:

Resource | Time Allocation
BMC Consultant | ~ 3 Months
Customer SME | 7 Months full time
Independent Consultant | 4 Months
Customer UNIX Engineers (2 Engineers) | 4 Months
Customer Infrastructure Architect | 1 Month
Customer Project Manager | 2 Months
Customer Delivery Manager | 2 Months
Management Involvement (Project Sponsor + Resource Manager) | 1 Month
Total | 24 Months
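
As a quick check on the Total row, the allocations can be added up. The short Python sketch below assumes that "~ 3 months" is counted as 3 and that the 4 months for the UNIX engineers is the combined figure for both engineers; the total of 24 only works out under those assumptions.

```python
# Time allocations in months, as listed in the Resources table.
ALLOCATION_MONTHS = {
    "BMC Consultant": 3,                 # "~ 3 months", taken as 3
    "Customer SME": 7,
    "Independent Consultant": 4,
    "Customer UNIX Engineers (2)": 4,    # assumed combined, not per engineer
    "Customer Infrastructure Architect": 1,
    "Customer Project Manager": 2,
    "Customer Delivery Manager": 2,
    "Management Involvement": 1,
}

total_months = sum(ALLOCATION_MONTHS.values())
```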

Lessons Learned

The project overran initial estimates in terms of both schedule and cost. The following issues were encountered:

Issue | Description
Solution Design | The Event Management rules had to be completely redesigned, which delayed the project by about a month. The customer’s old rules used First Match, whereas BPPM only supports Best Fit. The complexity of the customer’s rules was not properly analysed or understood during the design phase.
Documentation | The design of the event management rules was not properly documented. When it became evident that the design had to be changed, the lack of documentation slowed understanding and meant that some thinking had to be repeated and the design documented properly.
Thresholds | The customer spent over a month trying to migrate their thresholds from PATROL to BPPM. This task was complex due to the different format of the thresholds. The customer also experienced many issues with the migration tools, which did not work properly. Managing thresholds in BPPM is not as easy as managing thresholds in PATROL (using PATROL Configuration Manager). In the end the customer abandoned the attempt to introduce BPPM Analytics. The Autonomous alerts only covered 20% of the thresholds anyway, so the benefit of BPPM Analytics was not compelling.
Testing | The customer underestimated the time required for comprehensive testing. Testing should have been planned earlier, started earlier and resourced appropriately. At least a full month of end-to-end testing was required.
Technical Lead | Technical leadership was lacking through some parts of the project. Initially, the BMC Consultant was the technical lead. Towards the end, an independent consultant was the technical lead. There were issues of continuity.
Project Phases | The project consisted of 4 project phases. Phase 2 and Phase 4 were optional and were not required in order for the customer to meet its audit deadline. In the end, Phase 4 was abandoned.

Summary and Conclusion

Component Rating (1-5 Stars)

BMC ProactiveNet Performance Manager (BPPM) is really 3 products bundled into one suite. It still makes sense to rate each component individually.

Product | Summary | Score (1-5)
BMC BPPM v8.6 Analytics (formerly ProactiveNet) | The product appears to have reasonably good quality control. The graphing is good. The threshold management features are poor, but BMC says this is being fixed in the next release. I am not convinced by the whole concept of using statistics. Statistical analysis uses a lot of CPU, which makes scalability an issue. Only about 30% of monitored metrics are appropriate for statistical analysis. BMC's claim that this product removes the need for threshold management is an exaggeration; 70% of thresholds will still need to be managed using absolute value (i.e. standard) thresholds. | 3
BMC BPPM v8.6 Event Mgmt (formerly Mastercell) | This product is one of the strongest event management products around. There are challenges with using the MRL rule language, but generally this product works well. I question BMC's bundling of this product with ProactiveNet and would like to see the product available as a stand-alone component. Developing and debugging rules is time consuming and difficult. Only time will tell if this product continues to be a good event management platform. | 3
BMC PATROL 7.8.10 | Twenty years ago, PATROL was the best monitoring solution of its type. Since then the product has become bloated and overly complex. PCM was a great addition and makes the management of thresholds relatively easy and repeatable. The product has not changed much in about 8 years. Four years ago, BMC were going to retire the product. Today PATROL is an integral part of BMC's BPPM strategy. The KMs and the breadth of monitoring save this product from a lower rating. | 3

Rating according to Capabilities (Score 1-10)

Component/Capability | Previous Version (with PEM) | Latest Version (BPPM v8.6)
Event Management | 3 | 4
Threshold Management | 5 | 2
Analytics / Graphs | 3 | 5
Ease of Implementation | 3 | 2
Extensibility / interfaces | 4 | 4
Operator Form for Blackout | 1 | 1
Average Score | (3.2) | (3)
Components | PATROL and associated KMs; PATROL Central Operator; PATROL Enterprise Manager (PEM) | PATROL and associated KMs; PATROL Central Operator; BPPM Event Management; BPPM Analytics (ProactiveNet)
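
The Average Score row can be reproduced from the six capability scores above. The Python sketch below assumes a simple unweighted mean, which is what the published figures are consistent with; the variable names are illustrative.

```python
# (previous version with PEM, latest version BPPM v8.6) for each capability.
SCORES = {
    "Event Management": (3, 4),
    "Threshold Management": (5, 2),
    "Analytics / Graphs": (3, 5),
    "Ease of Implementation": (3, 2),
    "Extensibility / interfaces": (4, 4),
    "Operator Form for Blackout": (1, 1),
}

previous_scores = [previous for previous, _ in SCORES.values()]
latest_scores = [latest for _, latest in SCORES.values()]

# Unweighted means: 19 / 6 for the previous version, 18 / 6 for the latest.
previous_avg = sum(previous_scores) / len(previous_scores)
latest_avg = sum(latest_scores) / len(latest_scores)
```

Rounded to one decimal place, the previous version averages 3.2 and the latest version 3.0. The gains in event management and graphing are cancelled out by the losses in threshold management and ease of implementation.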

Conclusion

The score for BPPM has not improved with this revision. The product is more complex, more difficult to implement and thresholds are more difficult to administer. The improvement in capability associated with anomaly detection is not convincing and not proven to this customer and is only relevant for 30% of parameters. BMC must work hard to improve administration and ease of implementation.

The combination of BPPM Analytics (ProactiveNet), BPPM Event Management (Mastercell) and PATROL has the potential to be a market beating product. However, the investment required is significant. Time will tell if BMC delivers on this vision.

Disclosure: I am a real user, and this review is based on my own experience and opinions.
1 Comment
Consultant

I would like to concur with the statement "I question BMC's bundling of this product with ProactiveNet and would like to see the product available as a stand-alone component." Also, regarding MRL tracing, I have had some success using the relatively new tracewrite() function.

20 November 13