I’m in the later stages of implementing Oracle Enterprise Manager 11g for a customer. Right now there are rather too many metric collection errors for either me or the customer to be truly happy. There is remarkably little written on how to deal with these, other than this post by Oracle’s Werner.de.Guyter. Unfortunately, whilst Werner’s post is definitely the place to start, in practice life isn’t always quite so simple.
As the Metric Collection error page in EM suggests, metric collection errors usually occur when there is a misconfiguration on the target. Strictly, however, an Oracle Enterprise Manager metric collection error represents a failure by a management agent to collect a defined metric correctly. The error remains until the metric is successfully collected, and in the meantime no data for that metric is available to Enterprise Manager. The way to “clear” a metric collection error is therefore to resolve the issue causing it and then recollect the metric successfully.
Metric collection errors are generally down to one of the following:
- Configuration errors, as described above
- Temporary collection errors
- Oracle bugs
Details about metric collection errors are available in the Enterprise Manager repository table MGMT_CURRENT_METRIC_ERRORS, which holds the detailed error message and other information useful for resolving the error.
UPDATE: As pointed out in the comments (thanks Rich), the table contains the raw data on the error. The view MGMT$METRIC_ERROR_CURRENT is a better starting point. In fact, the report I usually use is:
SELECT target_type, target_name, metric_name, coll_name, collection_timestamp, error_message
FROM   MGMT$METRIC_ERROR_CURRENT
ORDER  BY collection_timestamp DESC;
In general, configuration errors (for example, missing or incorrect passwords) can be resolved by selecting the “Monitoring Configuration” link at the foot of the home page for the target being monitored. In some cases the error message shown in the report will indicate a misconfiguration in the target itself, for example an empty opmn.xml file in an HTTP Server Oracle Home. In such cases the fix is to correct the product misconfiguration.
In some cases, most notably with the SOA management pack, metric collections are configured to collect data whilst the target is down. Such metrics will often error precisely because the target is unavailable. A potential fix is to set the CollectWhenDown property of the metric to false in the $AGENT_HOME/sysman/admin/metadata/<target_type>.xml configuration file. Doing so is unsupported by Oracle, so it should be considered only where metric errors are obscuring the overall health status of the Oracle infrastructure being monitored. A supported fix is to disable the metric collection entirely.
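Before editing anything, it helps to see which metrics in the file actually carry the setting. The following is a minimal sketch that simply searches a metadata file for the CollectWhenDown string; the function name list_collect_when_down is my own, and the exact form the property takes in the XML may differ between target types, so treat the output as a pointer rather than a definitive list.

```shell
# Sketch: show the numbered lines of a metadata file that mention CollectWhenDown.
# list_collect_when_down is a hypothetical helper name, not part of emctl.
list_collect_when_down() {
    # $1 = path to the target-type metadata XML file
    if [ ! -r "$1" ]; then
        echo "cannot read $1" >&2
        return 1
    fi
    # -n prints line numbers; -i guards against differences in capitalisation
    grep -n -i 'CollectWhenDown' "$1"
}
```

It would be called as list_collect_when_down "$AGENT_HOME/sysman/admin/metadata/<target_type>.xml", with the real target type substituted. Taking a copy of the file before changing it is sensible, given that the edit is unsupported.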
At the moment I have come across two candidates for Oracle bugs, which I am working through with Oracle Support:
1) When collecting LDAP information about the LDAP database, Oracle returns two identical rows, one for each instance. This causes the metric collection to fail with a repeating key error. I have worked around this by modifying the SQL that collects the metric so that it returns unique records.
2) An SSO target running on a clustered database has been discovered without a database, which leads to a number of metric collection errors. Unfortunately it is not possible to manually enter SSO database details via the Enterprise Manager interface for this target type.
Forcing a metric collection and upload
Once the condition has been resolved, the error should clear at the next metric collection. In some cases this may be as much as 24 hours away. To force a metric collection, and to check that the fix has been effective, follow the procedure below:
1) Determine Target Name
2) Determine Target Type
3) Determine Collection Name
These can be determined by logging on to the server and running
$AGENT_HOME/bin/emctl status agent scheduler | grep <part of target name>
or from the Target Name, Target Type and Coll Name columns in the MGMT_CURRENT_METRIC_ERRORS table.
The collection can then be run manually, and the result uploaded, by issuing:
$AGENT_HOME/bin/emctl control agent runCollection target_name:target_type collection_name
$AGENT_HOME/bin/emctl upload agent
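If this needs doing for more than a handful of targets, the two commands above can be wrapped in a small shell function. This is only a sketch: the function name force_collection and the EMCTL and DRY_RUN variables are my own inventions, and it assumes emctl returns a non-zero exit status when the collection fails, which I have not verified for every agent version.

```shell
# Sketch: run a collection for one target and, if it succeeds, upload the result.
# EMCTL and DRY_RUN are hypothetical convenience variables, not emctl features.
EMCTL="${EMCTL:-${AGENT_HOME:-}/bin/emctl}"

force_collection() {
    # $1 = target name, $2 = target type, $3 = collection name
    if [ "$#" -ne 3 ]; then
        echo "usage: force_collection target_name target_type collection_name" >&2
        return 2
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        # Dry run: print the commands instead of executing them
        echo "$EMCTL control agent runCollection $1:$2 $3"
        echo "$EMCTL upload agent"
        return 0
    fi
    # Assumption: emctl exits non-zero on failure, so the upload is skipped then
    "$EMCTL" control agent runCollection "$1:$2" "$3" && "$EMCTL" upload agent
}
```

Setting DRY_RUN=1 first prints the exact commands without touching the agent, which is a cheap way to check the target name, target type and collection name before running it for real.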