Archive for August, 2010
Metric Collection Error
I’m in the later stages of implementing Oracle Enterprise Manager 11g for a customer. Right now there are rather too many metric collection errors for either the customer or me to be truly happy with. There is remarkably little guidance on how to deal with these other than this post by Oracle’s Werner.de.Guyter. Unfortunately, whilst Werner’s post is definitely the place to start, in practice life isn’t always quite so simple.
As the Metric Collection error page in EM suggests, metric collection errors usually occur when there is a misconfiguration on the target. Strictly, however, an Oracle Enterprise Manager metric collection error represents a failure by a management agent to collect a defined metric correctly. The error will remain until the metric is successfully collected, and in the meantime no metric information will be available for Enterprise Manager to use. It follows that the way to “clear” a metric collection error is to resolve the issue causing the error and then recollect the metric.
Metric Collection errors are generally down to the following reasons:
- Configuration Error as described above
- Temporary Collection Errors
- Oracle Bugs
Details about metric collection errors are available in the Enterprise Manager repository table MGMT_CURRENT_METRIC_ERRORS which reports the detailed error message and other useful information to help resolve the error.
UPDATE: As pointed out in the comments (thanks Rich), the table contains the raw data on the error. The view MGMT$METRIC_ERROR_CURRENT is a better starting point. In fact the report I usually use is:
select target_type, target_name, metric_name, coll_name, collection_timestamp, error_message
from MGMT$METRIC_ERROR_CURRENT
order by collection_timestamp desc;
In general, configuration errors (for example missing or incorrect passwords) can be resolved by selecting the “Monitoring Configuration” link at the foot of the home page for the target being monitored. In some cases the error message shown in the report will indicate a misconfiguration in the target itself, for example an empty opmn.xml file in an HTTP Server Oracle Home. In such cases the fix is to correct the product misconfiguration.
In some cases, most notably with the SOA management pack, metric collections are configured to collect data whilst the target is down. Such metrics will often error precisely because the target is unavailable. A potential fix is to set the CollectWhenDown property of the metric to false in the $AGENT_HOME/sysman/admin/metadata/<target_type>.xml configuration file. Doing so is unsupported by Oracle, so it should be considered only where metric errors are obscuring the overall health status of the Oracle infrastructure being monitored. A supported fix is to disable the metric collection entirely.
At the moment I have come across two candidate Oracle bugs, which I am working on with Oracle Support:
1) When collecting LDAP information about the LDAP database, Oracle returns 2 identical rows, one for each instance. This causes the metric collection to fail with a repeating key error. This has been worked around by modifying the SQL that collects the metric to return unique records.
2) An SSO target running on a clustered database has been discovered without a database, which leads to a number of metric collection errors. Unfortunately it is not possible to manually enter SSO Database details via the Enterprise Manager interface for this target type.
Forcing a metric collection and upload
Once the underlying condition has been fixed, the error should clear at the next metric collection. In some cases this may be as many as 24 hours away. To force a metric collection, and to check that the fix has been effective, follow the procedure below:
1) Determine Target Name
2) Determine Target Type
3) Determine Collection Name
These can be determined by logging onto the server and running the following (optionally filtering the output with grep):
$AGENT_HOME/bin/emctl status agent scheduler | grep <part of target name>
or from the target_name, target_type and coll_name columns of the MGMT$METRIC_ERROR_CURRENT view.
The collection can be manually run by issuing
$AGENT_HOME/bin/emctl control agent runCollection target_name:target_type collection_name
$AGENT_HOME/bin/emctl upload agent
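The two commands above can be wrapped in a small shell function so that the collection and the upload are always run as a pair. This is a minimal sketch rather than anything supplied by Oracle: the function name force_collection and the DRY_RUN variable are hypothetical names of my own, and it assumes AGENT_HOME is set in the environment. It uses only the two emctl invocations already shown.

```shell
#!/bin/sh
# Sketch only: force_collection and DRY_RUN are hypothetical names, not part of emctl.
# Assumes AGENT_HOME points at the management agent home.

force_collection() {
    # $1 = target name, $2 = target type, $3 = collection name
    if [ "$#" -ne 3 ]; then
        echo "usage: force_collection target_name target_type collection_name" >&2
        return 2
    fi
    emctl="${AGENT_HOME:?AGENT_HOME is not set}/bin/emctl"

    if [ -n "${DRY_RUN:-}" ]; then
        # Print the commands instead of running them.
        echo "$emctl control agent runCollection $1:$2 $3"
        echo "$emctl upload agent"
        return 0
    fi

    # Run the collection, then upload only if the collection succeeded.
    "$emctl" control agent runCollection "$1:$2" "$3" && "$emctl" upload agent
}
```

For example, with illustrative values, DRY_RUN=1 force_collection orcl oracle_database Response prints the two commands without touching the agent, which is a convenient way to check the target name, target type and collection name before running them for real.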
A Study in Tweeting
I follow @oracledatabase on Twitter for obvious reasons. They tweeted a “case study” last week on the use of Advanced Compression to save money. You can find the case study here. The end customer migrated from MSSQL to Oracle for a data warehouse in the low terabytes. Unfortunately we don’t get details of the old hardware or setup, but we do discover that the new hardware consists of a 16 processor AIX system and that 1.5 TB of the available 2.75 TB of disk space is used (and that a 2:1 compression ratio is achieved, so the current saving in disk space is approximately 1.5 TB).
The tweet chooses to major on “Customer migrates from #ms_sql_server and gains cost savings with #Oracle Advanced Compression.” Cost savings are indeed mentioned in the white paper. It is difficult, though, to see how a 2:1 compression ratio would significantly outperform NTFS compression, which can of course be used transparently with the old technology. In fact there are strong indications that the driver for the migration was strategic rather than cost sensitive.
I don’t particularly have any beef with the case study, though it isn’t the strongest I’ve ever seen. I do have a beef with the cost savings argument. The Advanced Compression option costs $156,000 for 16 processors for the first year, which works out at approximately $100k per terabyte saved. That sort of money will buy you an extraordinary amount of storage. In addition you’ll be paying $34k each year, to be offset against the reduced storage administration time needed each year. I’d suggest that if you are spending $34k per year on storage management time for a 1.5 TB database then you’ve got something badly wrong.
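For anyone who wants to check the sums, here is the arithmetic as a few lines of shell. It is only a sketch using the figures quoted above; the variable names are mine, and the 1.5 TB saving is held in tenths of a terabyte so that plain integer arithmetic works.

```shell
#!/bin/sh
# Figures are the ones quoted in the post; variable names are illustrative only.
first_year_cost=156000   # Advanced Compression, 16 processors, first year (USD)
annual_cost=34000        # recurring yearly cost quoted in the post (USD)
saved_tb_tenths=15       # 1.5 TB saved, expressed in tenths of a TB

# Multiply by 10 first so the division by tenths yields dollars per whole TB.
cost_per_tb=$(( first_year_cost * 10 / saved_tb_tenths ))
annual_per_tb=$(( annual_cost * 10 / saved_tb_tenths ))

echo "first-year cost per TB saved: \$$cost_per_tb"
echo "recurring cost per TB saved per year: \$$annual_per_tb"
```

The first figure comes out at 104000, which is where the “approximately $100k per terabyte” comes from. The second is truncated to whole dollars by the integer division.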