Distinctly Odd
I recently got involved with a performance investigation for an Oracle 9.2 database. The process of investigation threw up some interesting information for me regarding the accuracy of statistics collection in Oracle. It also highlights how different defaults in different versions of Oracle can lead to remarkably different statistics and hence execution plans. Finally, it might give you pause when trying to determine what value to use for estimate_percent when gathering object statistics.
Problem
The users of the database in question had observed significantly higher than usual CPU usage following some data maintenance. This was causing significant slowdown for the end users. As this is a 9.2 database the performance tools available to us were statspack reports, the statspack schema and of course the dynamic performance views and our monitoring scripts. Had this been a 10.1 or higher database then the AWR data would enable us to drill directly into the high users of CPU resource to pinpoint the problem SQL. In 9.2 the statspack report does not explicitly list SQL by CPU usage. In many cases, including this one however, the section of the report headed SQL Ordered by Gets is a somewhat useful proxy for high CPU usage statements, especially if the load on the database is generated by conventional SQL (select, insert etc) and not by, for example, calls to Java Stored procedures and so on. This is because a logical i/o is a CPU intensive operation. Sure enough the statspack reports for the period after the maintenance listed some statements in this section that had not previously appeared. One in particular was of the form
Select <column_list> From parent, child Where parent.pk = child.parent_pk And parent.key_column in (:b1,:b2,:b3,:b4);
The bind variables in this case occurred because the parameter CURSOR_SHARING has been set to the value of ‘SIMILAR’. In production this query had the following explain plan
------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 476K| 48M| | 108K (1)| 00:21:40 | |* 1 | HASH JOIN | | 476K| 48M| 4328K| 108K (1)| 00:21:40 | |* 2 | TABLE ACCESS FULL| PARENT | 52750| 3708K| | 6258 (1)| 00:01:16 | | 3 | TABLE ACCESS FULL| CHILD | 13M| 443M| | 72218 (1)| 00:14:27 | ------------------------------------------------------------------------------------------------
Whereas in development, which had not had the data maintenance carried out the plan looked like this
--------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 287 | 33292 | 100 (0)| 00:00:02 | | 1 | NESTED LOOPS | | | | | | | 2 | NESTED LOOPS | | 287 | 33292 | 100 (0)| 00:00:02 | | 3 | INLIST ITERATOR | | | | | | | 4 | TABLE ACCESS BY INDEX ROWID| PARENT | 32 | 2592 | 11 (0)| 00:00:01 | |* 5 | INDEX RANGE SCAN | IDX_PARENT_KEY | 32 | | 6 (0)| 00:00:01 | |* 6 | INDEX RANGE SCAN | IDX_CHILD_FK | 9 | | 2 (0)| 00:00:01 | | 7 | TABLE ACCESS BY INDEX ROWID | CHILD | 9 | 315 | 3 (0)| 00:00:01 | ---------------------------------------------------------------------------------------------------------------
Clearly the statistics for the two environments are considerably different, not in particular that the production database expects 52,000 rows to match the in list whereas the development database only expects to retrieve 32. This was even more suprising given that the nature of the maintenance was that an archival process had taken place in production leaving it with 1,868,541 rows, just 87% the size of the development table. With 2,138,954 rows
Object Statistics
Oracle describes object statistics like this (I’ve omitted the description of system statistics for this article)
Optimizer statistics describe details about the database and the objects in the database. The query optimizer uses these statistics to choose the best execution plan for each SQL statement.
Optimizer statistics include the following:
- Table statistics
- Number of rows
- Number of blocks
- Average row length
- Column statistics
- Number of distinct values (NDV) in column
- Number of nulls in column
- Data distribution (histogram)
- Extended statistics
- Index statistics
- Number of leaf blocks
- Levels
- Clustering factor
This article is only concerned about the table and column statistics, since it is these that drive the estimates of cardinality in our plans above.
Esimate_percent
There’s a lot of information around in the Oracle community about statistics gathering – most of it based on experience and un-evidenced. So for example at the otn forums here we can read that you never need 100% stats gathering and that 20% is a good number. The oracle applications product suggests (Note 156968.1 – MOS login required) that both sample size and collection frequency should be based on the number of rows in the table as shown below.
| NUM_ROWS | Sample Size | Frequency (weeks) |
| NULL | 30 | Once |
| 10^6 | 100 | 3 |
| 10^7 | 30 | 4 |
| 10^8 | 10 | 5 |
| 10^9 | 3 | 6 |
| >10^9 | 1 | 7 |
This is somewhat more sophisticated but somewhat begs the question “How many Oracle Applications tables have more than a billion rows, and why aren’t they partitioned anyway?” What is commonly agreed is that your statistics don’t have to be up to date, but they do have to represent the data accurately. Most commonly the degree to which the statistics are considered representative focuses upon the NUM_ROWS stat. That is people will often remark along the lines of Dave Ensor of Oracle :
I don’t know much about databases, but I do think I know the following: small tables tend to stay small, large tables tend to stay large, unique indexes have a tendency to remain unique, non-unique indexes tend to stay non-unique
In other words if the CBO knows the size of the table (and the upper and lower bounds of values in it) then the stats are “representative”.
Cardinality Estimates
When it comes to estimating the cardinality of a filter with a single bind variable the formula is
Card = ROWCOUNT/NDV
Where NDV is the number of distinct values for the column in the table. That is the optimizer assume the values are evenly distributed. For an inlist this is multiplied up by the number of members in the inlist. In our case the statement that was performing poorly had 4 members in its in list.
I quickly determined that the rowcount figures for the tables in question were in good agreement with the actual number of rows in the table. That is, by common definition the statistics were representative. This meant that the difference in must lie in the CBOs estimate of the number of distinct values in the column. (there being no histograms here)
Sure enough running a 10053 trace showed that in production Oracle was using a value for NDV for my key column of 145, for development the equivalent figure was 195292. Upon doing some further investigation with the client we found that, since they had purged a significant proportion of the table they had collected new stats using their existing statistics collection method. This it turned out was to run the venerable command
ANALYZE TABLE PARENT ESTIMATE STATISTICS
This had not however resolved the issue and a colleague had then run:
DBMS_STATS.GATHER_TABLE_STATS(owner,table,estimate_percent => dbms_stats.auto_sample_size);
In other words he had followed the recommended statistics gathering policy from Oracle. I proceeded to run the command without the estimate_percent parameter (in 9.2 this forces a compute, in 10.1 and higher the default for this parameter changes). Once a compute had been done the NDV was correctly calculated and the plan returned to normal.
Accuracy of NDV calculations
To further investigate this issue I obtained an export dump of the data. I then ran various common statistics gathering commands (plus the venerable analyse) against the data. In common with Maria Colgan at the Oracle Optimizer group I define the accuracy of NDV as
1 – (|estimated_ndv – actual_ndv|)/actual_ndv
Though as we will see this accuracy measure doesn’t range as stated there from (0-1) but from -? to 1. I have not listed the column names as this is data from a real production database but the data below was gathered using the script shown below (changing the estimate_percent value as shown by the tables below).
drop table col_stats; create table col_stats( col_id number , column_name varchar2(30) , actual_ndv number , dbms_stats_ndv number , analyze_ndv number); insert into col_stats(col_id,column_name) select column_id,column_name from user_tab_columns where table_name='PARENT'; /* analyze */ analyze table PARENT ESTIMATE STATISTICS; update col_stats s set analyze_ndv = (select num_distinct from user_tab_columns c where c.column_id = s.col_id and c.column_name = s.column_name and c.table_name = 'PARENT'); exec dbms_stats.gather_table_stats(user,'PARENT',estimate_percent => DBMS_STATS.AUTO_SAMPLE_SIZE); update col_stats s set dbms_stats_ndv = (select num_distinct from user_tab_columns c where c.column_id = s.col_id and c.column_name = s.column_name and c.table_name = 'PARENT'); declare l_cnt number; begin for c in (select column_id,column_name from user_tab_columns where table_name = 'PARENT') loop execute immediate 'select count(distinct '||c.column_name||') from PARENT' into l_cnt; update col_stats s set actual_ndv = l_cnt where s.col_id = c.column_id and s.column_name = c.column_name; end loop; end; / commit; select col_id , column_name , actual_ndv , dbms_stats_ndv , analyze_ndv , round(100*(1 - (ABS(dbms_stats_ndv-actual_ndv)/actual_ndv)),2) dbms_stats_accuracy , round(100*(1 - (ABS(analyze_ndv-actual_ndv)/actual_ndv)),2) analyze_accuracy from col_stats where actual_ndv != 0 order by col_id;
Column 16 doesn’t appear since it is entirely null so NDV = 0.
On our 9.2 system the results are:
| COL# | ACTUAL
NDV |
ANALYZE ACCURACY |
AUTO ACCURACY |
10% ACCURACY | 25% ACCURACY |
| 1 | 1868541 | 99.97 | 98.37 | 99.79 | 100.00 |
| 2 | 2 | 100.00 | 100.00 | 100.00 | 100.00 |
| 3 | 238878 | 0.06 | 0.36 | 10.23 | 25.28 |
| 4 | 26900 | 27.65 | 28.72 | 50.47 | 61.03 |
| 5 | 3 | 100.00 | 100.00 | 100.00 | 100.00 |
| 6 | 473637 | 0.09 | 0.41 | 10.27 | 25.32 |
| 7 | 300746 | 0.10 | 0.42 | 10.15 | 25.07 |
| 8 | 26900 | 27.65 | 28.72 | 50.47 | 61.03 |
| 9 | 452 | 8.41 | 24.78 | 81.19 | 94.03 |
| 10 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 11 | 73 | 43.84 | 52.05 | 78.08 | 84.93 |
| 12 | 52511 | 9.94 | 46.64 | 32.34 | 46.66 |
| 13 | 1051205 | 9.08 | 100.00 | 48.64 | 63.67 |
| 14 | 2 | 100.00 | 100.00 | 100.00 | 100.00 |
| 15 | 789 | 0.63 | 13.31 | 27.63 | 45.63 |
| 17 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 18 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 19 | 406 | 19.95 | 33.00 | 68.97 | 83.25 |
We can see that for several of the columns (3,6,7) the NDV calculation is only as accurate as the sample size. For others a relatively high samplesize (more than the 20% commonly quoted) is required to get reasonable accuracy.
I then repeated the test in 10.2
| COL# | ACTUAL
NDV |
ANALYZE
ACCURACY |
AUTO
ACCURACY |
10% Accuracy | 25% Accuracy |
| 1 | 1868541 | 99.87 | 100.00 | 100.00 | 99.82 |
| 2 | 2 | 100.00 | 100.00 | 100.00 | 100.00 |
| 3 | 238878 | 0.06 | 0.39 | 10.17 | 25.44 |
| 4 | 26900 | 31.11 | 29.19 | 50.43 | 60.87 |
| 5 | 3 | 100.00 | 100.00 | 100.00 | 100.00 |
| 6 | 473637 | 0.10 | 0.43 | 10.22 | 25.48 |
| 7 | 300746 | 0.10 | 0.44 | 10.13 | 25.19 |
| 8 | 26900 | 31.11 | 29.19 | 50.43 | 60.87 |
| 9 | 452 | 7.96 | 25.22 | 81.64 | 92.26 |
| 10 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 11 | 73 | 41.10 | 50.68 | 75.34 | 82.19 |
| 12 | 52511 | -135.58 | 46.71 | 32.19 | 46.76 |
| 13 | 1051205 | 25.59 | 100.00 | 48.62 | 63.68 |
| 14 | 2 | 100.00 | 100.00 | 100.00 | 100.00 |
| 15 | 789 | 0.76 | 14.58 | 24.46 | 48.67 |
| 17 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 18 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 19 | 406 | 19.21 | 33.25 | 69.46 | 86.45 |
There are 2 points to note here. First is that the DBMS_STATS accuracy is no better in 10.2.0.4 than it was in 9.2 for the columns that were already problematic. The second is that odd negative value for accuracy for column 12 when ANALYZE … ESTIMATE was run. Here what happened was that the ANALYZE command for some reason estimates over twice as many distinct values for this column as there actually are.
Finally I ran the same test on the same data in 11.2.0.2
| COL# | ACTUAL
NDV |
ANALYZE
ACCURACY |
AUTO
ACCURACY |
10% Accuracy | 25% Accuracy |
| 1 | 1868541 | 97.66 | 100.00 | 99.29 | 99.86 |
| 2 | 2 | 100.00 | 100.00 | 100.00 | 100.00 |
| 3 | 238878 | 0.06 | 98.42 | 10.27 | 25.33 |
| 4 | 26900 | 27.80 | 99.70 | 50.26 | 60.93 |
| 5 | 3 | 100.00 | 100.00 | 100.00 | 100.00 |
| 6 | 473637 | 0.09 | 99.95 | 10.37 | 25.37 |
| 7 | 300746 | 0.10 | 98.39 | 10.25 | 25.09 |
| 8 | 26900 | 27.80 | 99.70 | 50.26 | 60.93 |
| 9 | 452 | 9.07 | 100.00 | 82.74 | 91.37 |
| 10 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 11 | 73 | 34.25 | 100.00 | 79.45 | 83.56 |
| 12 | 52502 | -150.62 | 99.80 | 32.47 | 46.66 |
| 13 | 1050971 | 22.68 | 99.18 | 49.18 | 63.65 |
| 14 | 2 | 100.00 | 100.00 | 100.00 | 100.00 |
| 15 | 789 | 0.89 | 100.00 | 24.33 | 45.88 |
| 17 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 18 | 1 | 100.00 | 100.00 | 100.00 | 100.00 |
| 19 | 406 | 19.70 | 100.00 | 72.17 | 83.00 |
As described in the earlier link to the optimizer team’s blog 11.2 does a very, very good job of getting NDV correct. This is likely to have beneficial effects for a significant number of queries in the wild – and may go some way to reducing the need for histograms (since these historically have a side effect of making the NDV calculation much more accurate as well).
References:
http://www.dbspecialists.com/files/presentations/dbms_stats.html
This paper by Terry Sutton covers this issue for Oracle 9.2 and the then new 10.1.
This Optimizer Team paper covers 11g.
This series by Alberto D’ellera covers cardinality estimates in the presence of histograms.