Random Thoughts
I’ve been considering carefully recently the generation of randomized, representative test data for a test case I’m working on. I will blog the worked test case as well, but in the mean time I thought it worth jotting down a few ideas about generating reasonably representative volumes of test data for Oracle. My problem is that I want to model a particular growth pattern of data, namely that from a social network, and particularly I want to attempt to see how a sensible database design would scale for this network. To that end I want to model the network with a few hundred users – as one might see in a corporation, a few tens of thousands of users – as one might see when the network starts to take off, and finally with hundreds of thousands of users (I don’t have access to a suitable test infrastructure for millions of users). I then want to look at how some of the underlying queries scale. This blog offers some ideas on how to generate reasonable test data at arbitrary scales.
Modelling
A social network is defined (by Wikipedia 20 April 2011) as
A social network service essentially consists of a representation of each user (often a profile), his/her social links, and a variety of additional services
For our model, loosely based on www.digg.com as described here ), we will model 4 entities. These are described below:
Users
These are people that join the service. I have modelled these using a “users” table
Name Null? Type ----------------------------------------- -------- ----------------------- USERID NOT NULL NUMBER(11) USERNAME VARCHAR2(15) REGISTERED DATE PADDING CHAR(50)
User id is a sequential number, username I can keep short and generate artificially , registered is the date the user account was created and padding is used to keep row lengths longer (to model other data being required in a real design)
Friends
These are the people that users choose to link to, the social links described in the wikipedia definition. This is modelled as shown below
Name Null? Type ----------------------------------------- -------- ----------------------- ID NOT NULL NUMBER(10) USERID NUMBER(10) USERNAME VARCHAR2(15) FRIENDID NUMBER(10) FRIENDNAME VARCHAR2(15) MUTUAL NUMBER(1) DATE_CREATED DATE
ID is again a meaningless key. Userid is the user creating the link, friendid is the target of the link, mutual we don’t use, and date created the date the link was created. This table is drawn from the same link as above. It’s interesting to note that the datatypes for the id columns differ in length but for our purposes in Oracle that will not matter.
Items
This is a table whose existence I have surmised and is intended to contain all items added to our social network. For this work it represents interesting items added by users. It is defined as
Name Null? Type ----------------------------------------- -------- ------------------- ITEMID NOT NULL NUMBER(11) USERID NUMBER(11) DUGDATE DATE PADDING CHAR(50)
Again item id is a meaningless primary key, userid is the id of the user who added it. Dugdate a date on which the item was ‘dug’ – we don’t use this – and padding to plump out any extra columns.
Diggs (sic)
This is a table described by digg in the blog article above and which has the following definition
Name Null? Type ----------------------------------------- -------- ----------------------- ID NOT NULL NUMBER(11) ITEMID NUMBER(11) USERID NUMBER(11) DIGDATE DATE
This table again represents a user interaction, just as friends does, namely the act of “digging” an item added to the site. Again the datatypes have that odd length difference. It’s clear from the blog entry that the id columns are primary key values generated in a meaningless way.
Problem
The problem that I have is that I want to model reasonably representative generated data. For a social network however randomly generated data – using dbms_random.value or dbms_random.normal won’t be representative. We can make the following assumptions about our putative site.
- It grows in popularity exponentially ( see for example http://www.benphoster.com/facebook-user-growth-chart-2004-2010/ )
- Users have different numbers of links but these are likely to be distributed according to a power law (http://en.wikipedia.org/wiki/Scale-free_network)
- The number of diggs per user seems likely to follow a poisson distribution. We expect a large number of users with a small but non zero number of actions and a small number with a large number of actions.
This means that I want to be able to generate my registration dates so that they are distributed exponentially, my friends per user using a power law and the number of diggs per user such that it fits a poisson distribution. In general when considering how to model your systems it’s worth bearing in mind that data often does not follow normal or uniform distribution patterns. Changing the distribution may well cause behavioural changes and oddities – particularly if you are operating on a system prior to 11g that may not estimate cardinality well
Possible Solutions
Ideally of course I’d like to generate all the data in oracle at different scales , I’m going to choose 500, 5000, 50000 and 500000 users. For all but one of the problems described above there is a ready Oracle solution, for the distribution of diggs amongst users a bit of lateral thinking is required. First the easy part, if we assume that the number of friends follows a power law. We can produce a distribution of links for testing using the following (inefficient) code extract.
DECLARE
l_usercount constant NUMBER := 500000;
l_itemcount constant NUMBER := 5000000;
l_username users.username%type;
l_friendcount NUMBER;
type friendlist IS TABLE OF pls_integer;
l_friendlist friendlist;
l_itemid NUMBER;
l_userid NUMBER;
BEGIN
-- Setup friend relationships. Power law.
FOR i IN 1..l_usercount
LOOP
-- initialize array
l_friendlist := friendlist();
-- get number of links for this users (1-100 distributed)
l_friendcount := POWER(100, dbms_random.value) ;
SELECT
username
INTO
l_username
FROM
users
WHERE
userid = i;
-- add friends
FOR j IN 1..l_friendcount LOOP
INSERT INTO friends
(userid , username , friendid , friendname , date_created )
SELECT
i , l_username , userid , username , sysdate - dbms_random.normal*dbms_random.value(1,365)
FROM
users
WHERE
userid = l_friendlist(j);
EXCEPTION
WHEN dup_val_on_index THEN
NULL; -- occasionally dbms_random will get the same friend
END;
END LOOP;
COMMIT;
l_friendlist := NULL;
END LOOP;
END;
/
For the number of diggs per user and the registration dates I found by far and away the easiest solution was to pre-generate a table of values that followed these distributions using the Excel Analysis Tool-Pak http://office.microsoft.com/en-us/excel-help/about-statistical-analysis-tools-HP005203873.aspx This tool allows you to generate numerical values that follow the described distributions, and excel’s save as csv feature allows you to easily create a file that can be loaded into the database via an external table. In my test setup I called the resultant table random_values and can populate my data using the following pseudo-code. I’ll look to publish the results of my investigations in the next blog entry but 1000 words is enough for a blog, and the principle that your test data should reflect reality however that looks is independent of the application that you are testing.
BEGIN
FOR i IN 1..l_usercount
LOOP
-- dig n items our table assumes 2.8 per user
SELECT
poisson
INTO
l_digs
FROM
random_values
WHERE
userid = i;
FOR n IN 1..l_digs
LOOP
l_itemid := floor(dbms_random.value(1,l_itemcount)); -- what do I dig
INSERT
INTO
diggs
(
itemid,
userid,
digdate
)
VALUES
(
l_itemid,
i,
sysdate - (dbms_random.normal*dbms_random.value(1,365))
);
END LOOP;
if mod(i,10000) = 0 then commit; end if
END LOOP;
commit;
END ;
I haven’t gotten through the entire post yet, but have you tried SwingBench to generate data?
I haven’t, was just curious if you have used that tool at all.
chet
chet
3 May 11 at 4:00 pm