Archive for May, 2011
Monotonically Increasing Indexes
Richard Foote has put up another excellent rebuttal of the myth that you should rebuild indexes based on the value of del_lf_rows/lf_rows from an appropriately valideated index structure. Greg asked a question similar to my comment on the OTN forums in this thread. Namely
Is Oracle able to reuse index space when indexed column is like current date (monolitically increasing). So if we delete some old rows (past dates) and inserted new rows are with higer (more recent) dates . Does Oracle reuse space from deleted rows ?
The short answer to Greg is that the Radiohead example Richard uses shows that index space is reused for montonically increasing indexes where the old data is completely deleted. What happens is that the block gets put back on the freelist and is eligible for reuse as a new leaf block. This pattern of insert, process, delete old is pretty common. Sometimes however not *all* the old data gets deleted. In this case Oracle won’t re-use the free space. I have modified Richard’s example so that I delete all the non-prime values below a threshold. Due to the pattern of distribution of the primes this should leave a relatively large number of almost empty index blocks. In this case we see that the old space is not reused until the laggard’s have been finally deleted. NB if you are tempted to use my prime number test, feel free but be aware it degrades as the value of n grows. In other words I only warrant accuracy for positive numbers, not performance. In my case a significant percentage of the available space is left held by these ‘old’ entries. It’s worth remembering however that
- the ratio can’t tell you if you have a pattern that matches this usage and
- in most cases you’d want to investigate why these values weren’t being deleted anyway.
NIALL @ NIALL1 >@index_deletes NIALL @ NIALL1 >drop table rush purge; Table dropped. Elapsed: 00:00:03.83 NIALL @ NIALL1 > NIALL @ NIALL1 >create table rush( 2 id number 3 , code number 4 , name varchar2(30) 5 ); Table created. Elapsed: 00:00:00.02 NIALL @ NIALL1 > NIALL @ NIALL1 >insert into rush 2 select rownum,mod(rownum,100),'MOVING PICTURES' 3 FROM dual 4 connect by level <= 1000000; 1000000 rows created. Elapsed: 00:00:03.18 NIALL @ NIALL1 >commit; Commit complete. Elapsed: 00:00:00.03 NIALL @ NIALL1 > NIALL @ NIALL1 >create unique index pk_rush on rush(id); Index created. Elapsed: 00:00:05.64 NIALL @ NIALL1 > NIALL @ NIALL1 >alter table rush 2 add constraint pk_rush 3 primary key (id) using index; Table altered. Elapsed: 00:00:00.14 NIALL @ NIALL1 > NIALL @ NIALL1 > NIALL @ NIALL1 >create or replace function is_prime(p in number) return number 2 is 3 l_limit number; 4 i number := 3; 5 retval number := 1; -- returns 0 if false, 1 if true 6 begin 7 l_limit := sqrt(p); 8 if p = 1 then 9 retval:=1; 10 end if; 11 if p = 2 then 12 retval :=1; 13 end if; 14 15 if mod(p,2) = 0 then 16 retval := 0; 17 end if; 18 19 while (i <= l_limit) loop 20 if mod(p,i) = 0 then 21 retval := 0; 22 end if; 23 i := i + 2; -- (no need to test even numbers) 24 end loop; 25 return retval; 26 end; 27 / Function created. Elapsed: 00:00:00.00 NIALL @ NIALL1 > NIALL @ NIALL1 >show errors No errors. NIALL @ NIALL1 > NIALL @ NIALL1 >delete from rush 2 where id < 400000 3 and is_prime(id) = 0; 366139 rows deleted. Elapsed: 00:01:15.54 NIALL @ NIALL1 > NIALL @ NIALL1 >commit; Commit complete. Elapsed: 00:00:00.01 NIALL @ NIALL1 > NIALL @ NIALL1 >analyze index pk_rush validate structure; Index analyzed. Elapsed: 00:00:00.97 NIALL @ NIALL1 > NIALL @ NIALL1 >select lf_rows,del_lf_rows,round(del_lf_rows/lf_rows,2) del_pct 2 from index_stats; LF_ROWS DEL_LF_ROWS DEL_PCT ---------- ----------- ---------- 1000000 366139 .37 1 row selected. Elapsed: 00:00:00.03 NIALL @ NIALL1 > NIALL @ NIALL1 >begin 2 for i in 1000001..1400000 loop 3 insert into rush 4 values(i,mod(i,100),'GRACE UNDER PRESSURE'); 5 end loop; 6 commit; 7 end; 8 / PL/SQL procedure successfully completed. Elapsed: 00:00:47.54 NIALL @ NIALL1 > NIALL @ NIALL1 >analyze index pk_rush validate structure; Index analyzed. Elapsed: 00:00:00.49 NIALL @ NIALL1 > NIALL @ NIALL1 >select lf_rows,del_lf_rows,round(del_lf_rows/lf_rows,2) del_pct 2 from index_stats; LF_ROWS DEL_LF_ROWS DEL_PCT ---------- ----------- ---------- 1400000 366139 .26 1 row selected. Elapsed: 00:00:00.02 NIALL @ NIALL1 > NIALL @ NIALL1 > NIALL @ NIALL1 > NIALL @ NIALL1 > NIALL @ NIALL1 >begin 2 for i in 1400001..2000000 loop 3 insert into rush 4 values(i,mod(i,100),'2112'); 5 end loop; 6 commit; 7 end; 8 / PL/SQL procedure successfully completed. Elapsed: 00:01:08.89 NIALL @ NIALL1 > NIALL @ NIALL1 >analyze index pk_rush validate structure; Index analyzed. Elapsed: 00:00:00.63 NIALL @ NIALL1 > NIALL @ NIALL1 >select lf_rows,del_lf_rows,round(del_lf_rows/lf_rows,2) del_pct 2 from index_stats; LF_ROWS DEL_LF_ROWS DEL_PCT ---------- ----------- ---------- 2000000 366139 .18 1 row selected. Elapsed: 00:00:00.02 NIALL @ NIALL1 >spoo off
Possibly Related Posts:
When is a Health Check not a Health Check
2. Controlfiles
3. Redolog files
4. Archiving
5. Datafiles
5.1 Autoextend
5.2 Location
6. Tablespaces
6.1 SYSTEM Tablespace
6.2 SYSAUX Tablespace (10g Release and above)
6.3 Locally vs Dictionary Managed Tablespaces
6.4 Temporary Tablespace
6.5 Tablespace Fragmentation
7. Objects
7.1 Number of Extents
7.2 Next extent
7.3 Indexes
8. AUTO vs MANUAL undo
8.1 AUTO UNDO
8.2 MANUAL UNDO
9. Memory Management
9.1 Pre Oracle 9i
9.2 Oracle 9i
9.3 Oracle 10g
9.4 Oracle 11g
10. Logging & Tracing
10.1 Alert File
10.2 Max_dump_file_size
10.3 User and core dump size parameters
10.4 Audit files
It is highly recommended to have at least two copies of the controlfile. This can be done by mirroring the controlfile, strongly recommended on different physical disks. If a controlfile is lost, due to a disk crash for example, then you can use the mirrored file to startup the database. In this way fast and easy recovery from controlfile loss is obtained.connect as sysdba SQL> select status, name from v$controlfile; STATUS NAME ------- --------------------------------- /u01/oradata/L102/control01.ctl /u02/oradata/L102/control02.ctlThe location and the number of controlfiles can be controlled by the
‘control_files’ initialization parameter.
I’m rather a pedant so would rather the document be described as a “Best Practices” or “Guidelines “document than a how to. A How to document would reword the above as
Ensure that the database has at least 2 controlfiles located on different disks by reviewing the ‘control_files’ initialisation parameter.
I.E it told you How to do something, hence the name. But my beef isn’t really that this is yet another Best Practices document, but rather with what it leaves out. This is perhaps best illustrated by referring you to this forum thread. A health check that doesn’t ensure recoverability is no health check.
Possibly Related Posts:
blogs.oracle.com.
As some of you will have noticed blogs.oracle.com has recently moved platform from Movable Type to Roller. This appears to be essentially the old blogs.sun.com platform. I do have some issues with it though – most notably the fact that the blog search now requires an oracle.com SSO account and doesn’t return results that it ought to. I’ve therefore knocked up a quick custom search engine for blogs.oracle.com that you can find at http://orawin.info/blog/search-oracle-blogs or on the local pages link on the right. Unlike Oracle I don’t intend this link to move – and if it does for some as yet unknown reason I’ll put a redirect in place.
Possibly Related Posts:
Random Thoughts
I’ve been considering carefully recently the generation of randomized, representative test data for a test case I’m working on. I will blog the worked test case as well, but in the mean time I thought it worth jotting down a few ideas about generating reasonably representative volumes of test data for Oracle. My problem is that I want to model a particular growth pattern of data, namely that from a social network, and particularly I want to attempt to see how a sensible database design would scale for this network. To that end I want to model the network with a few hundred users – as one might see in a corporation, a few tens of thousands of users – as one might see when the network starts to take off, and finally with hundreds of thousands of users (I don’t have access to a suitable test infrastructure for millions of users). I then want to look at how some of the underlying queries scale. This blog offers some ideas on how to generate reasonable test data at arbitrary scales.
Modelling
A social network is defined (by Wikipedia 20 April 2011) as
A social network service essentially consists of a representation of each user (often a profile), his/her social links, and a variety of additional services
For our model, loosely based on www.digg.com as described here ), we will model 4 entities. These are described below:
Users
These are people that join the service. I have modelled these using a “users” table
Name Null? Type ----------------------------------------- -------- ----------------------- USERID NOT NULL NUMBER(11) USERNAME VARCHAR2(15) REGISTERED DATE PADDING CHAR(50)
User id is a sequential number, username I can keep short and generate artificially , registered is the date the user account was created and padding is used to keep row lengths longer (to model other data being required in a real design)
Friends
These are the people that users choose to link to, the social links described in the wikipedia definition. This is modelled as shown below
Name Null? Type ----------------------------------------- -------- ----------------------- ID NOT NULL NUMBER(10) USERID NUMBER(10) USERNAME VARCHAR2(15) FRIENDID NUMBER(10) FRIENDNAME VARCHAR2(15) MUTUAL NUMBER(1) DATE_CREATED DATE
ID is again a meaningless key. Userid is the user creating the link, friendid is the target of the link, mutual we don’t use, and date created the date the link was created. This table is drawn from the same link as above. It’s interesting to note that the datatypes for the id columns differ in length but for our purposes in Oracle that will not matter.
Items
This is a table whose existence I have surmised and is intended to contain all items added to our social network. For this work it represents interesting items added by users. It is defined as
Name Null? Type ----------------------------------------- -------- ------------------- ITEMID NOT NULL NUMBER(11) USERID NUMBER(11) DUGDATE DATE PADDING CHAR(50)
Again item id is a meaningless primary key, userid is the id of the user who added it. Dugdate a date on which the item was ‘dug’ – we don’t use this – and padding to plump out any extra columns.
Diggs (sic)
This is a table described by digg in the blog article above and which has the following definition
Name Null? Type ----------------------------------------- -------- ----------------------- ID NOT NULL NUMBER(11) ITEMID NUMBER(11) USERID NUMBER(11) DIGDATE DATE
This table again represents a user interaction, just as friends does, namely the act of “digging” an item added to the site. Again the datatypes have that odd length difference. It’s clear from the blog entry that the id columns are primary key values generated in a meaningless way.
Problem
The problem that I have is that I want to model reasonably representative generated data. For a social network however randomly generated data – using dbms_random.value or dbms_random.normal won’t be representative. We can make the following assumptions about our putative site.
- It grows in popularity exponentially ( see for example http://www.benphoster.com/facebook-user-growth-chart-2004-2010/ )
- Users have different numbers of links but these are likely to be distributed according to a power law (http://en.wikipedia.org/wiki/Scale-free_network)
- The number of diggs per user seems likely to follow a poisson distribution. We expect a large number of users with a small but non zero number of actions and a small number with a large number of actions.
This means that I want to be able to generate my registration dates so that they are distributed exponentially, my friends per user using a power law and the number of diggs per user such that it fits a poisson distribution. In general when considering how to model your systems it’s worth bearing in mind that data often does not follow normal or uniform distribution patterns. Changing the distribution may well cause behavioural changes and oddities – particularly if you are operating on a system prior to 11g that may not estimate cardinality well
Possible Solutions
Ideally of course I’d like to generate all the data in oracle at different scales , I’m going to choose 500, 5000, 50000 and 500000 users. For all but one of the problems described above there is a ready Oracle solution, for the distribution of diggs amongst users a bit of lateral thinking is required. First the easy part, if we assume that the number of friends follows a power law. We can produce a distribution of links for testing using the following (inefficient) code extract.
DECLARE
l_usercount constant NUMBER := 500000;
l_itemcount constant NUMBER := 5000000;
l_username users.username%type;
l_friendcount NUMBER;
type friendlist IS TABLE OF pls_integer;
l_friendlist friendlist;
l_itemid NUMBER;
l_userid NUMBER;
BEGIN
-- Setup friend relationships. Power law.
FOR i IN 1..l_usercount
LOOP
-- initialize array
l_friendlist := friendlist();
-- get number of links for this users (1-100 distributed)
l_friendcount := POWER(100, dbms_random.value) ;
SELECT
username
INTO
l_username
FROM
users
WHERE
userid = i;
-- add friends
FOR j IN 1..l_friendcount LOOP
INSERT INTO friends
(userid , username , friendid , friendname , date_created )
SELECT
i , l_username , userid , username , sysdate - dbms_random.normal*dbms_random.value(1,365)
FROM
users
WHERE
userid = l_friendlist(j);
EXCEPTION
WHEN dup_val_on_index THEN
NULL; -- occasionally dbms_random will get the same friend
END;
END LOOP;
COMMIT;
l_friendlist := NULL;
END LOOP;
END;
/
For the number of diggs per user and the registration dates I found by far and away the easiest solution was to pre-generate a table of values that followed these distributions using the Excel Analysis Tool-Pak http://office.microsoft.com/en-us/excel-help/about-statistical-analysis-tools-HP005203873.aspx This tool allows you to generate numerical values that follow the described distributions, and excel’s save as csv feature allows you to easily create a file that can be loaded into the database via an external table. In my test setup I called the resultant table random_values and can populate my data using the following pseudo-code. I’ll look to publish the results of my investigations in the next blog entry but 1000 words is enough for a blog, and the principle that your test data should reflect reality however that looks is independent of the application that you are testing.
BEGIN
FOR i IN 1..l_usercount
LOOP
-- dig n items our table assumes 2.8 per user
SELECT
poisson
INTO
l_digs
FROM
random_values
WHERE
userid = i;
FOR n IN 1..l_digs
LOOP
l_itemid := floor(dbms_random.value(1,l_itemcount)); -- what do I dig
INSERT
INTO
diggs
(
itemid,
userid,
digdate
)
VALUES
(
l_itemid,
i,
sysdate - (dbms_random.normal*dbms_random.value(1,365))
);
END LOOP;
if mod(i,10000) = 0 then commit; end if
END LOOP;
commit;
END ;