orawin.info

Niall's Oracle Pages – Oracle Opinion since 2004

Random Thoughts

with one comment

I’ve been considering carefully recently the generation of randomized, representative test data for a test case I’m working on. I will blog the worked test case as well, but in the mean time I thought it worth jotting down a few ideas about generating reasonably representative volumes of test data for Oracle. My problem is that I want to model a particular growth pattern of data, namely that from a social network, and particularly I want to attempt to see how a sensible database design would scale for this network. To that end I want to model the network with a few hundred users – as one might see in a corporation, a few tens of thousands of users – as one might see when the network starts to take off, and finally with hundreds of thousands of users (I don’t have access to a suitable test infrastructure for millions of users). I then want to look at how some of the underlying queries scale. This blog offers some ideas on how to generate reasonable test data at arbitrary scales.

Modelling

A social network is defined (by Wikipedia 20 April 2011) as

A social network service essentially consists of a representation of each user (often a profile), his/her social links, and a variety of additional services

For our model, loosely based on www.digg.com as described here ), we will model 4 entities. These are described below:

Users

These are people that join the service. I have modelled these using a “users” table

Name                                      Null?    Type
----------------------------------------- -------- -----------------------
USERID                                    NOT NULL NUMBER(11)
USERNAME                                           VARCHAR2(15)
REGISTERED                                         DATE
PADDING                                            CHAR(50)

User id is a sequential number, username I can keep short and generate artificially , registered is the date the user account was created and padding is used to keep row lengths longer (to model other data being required in a real design)

Friends

These are the people that users choose to link to, the social links described in the wikipedia definition. This is modelled as shown below

Name                                      Null?    Type
----------------------------------------- -------- -----------------------
ID                                        NOT NULL NUMBER(10)
USERID                                             NUMBER(10)
USERNAME                                           VARCHAR2(15)
FRIENDID                                           NUMBER(10)
FRIENDNAME                                         VARCHAR2(15)
MUTUAL                                             NUMBER(1)
DATE_CREATED                                       DATE

ID is again a meaningless key. Userid is the user creating the link, friendid is the target of the link, mutual we don’t use, and date created the date the link was created. This table is drawn from the same link as above. It’s interesting to note that the datatypes for the id columns differ in length but for our purposes in Oracle that will not matter.

Items

This is a table whose existence I have surmised and is intended to contain all items added to our social network. For this work it represents interesting items added by users. It is defined as

Name                                      Null?    Type
----------------------------------------- -------- -------------------
ITEMID                                    NOT NULL NUMBER(11)
USERID                                             NUMBER(11)
DUGDATE                                            DATE
PADDING                                            CHAR(50)

Again item id is a meaningless primary key, userid is the id of the user who added it. Dugdate a date on which the item was ‘dug’ – we don’t use this – and padding to plump out any extra columns.

Diggs (sic)

This is a table described by digg in the blog article above and which has the following definition

Name                                      Null?    Type
----------------------------------------- -------- -----------------------
ID                                        NOT NULL NUMBER(11)
ITEMID                                             NUMBER(11)
USERID                                             NUMBER(11)
DIGDATE                                            DATE

This table again represents a user interaction, just as friends does, namely the act of “digging” an item added to the site. Again the datatypes have that odd length difference. It’s clear from the blog entry that the id columns are primary key values generated in a meaningless way.

Problem

The problem that I have is that I want to model reasonably representative generated data. For a social network however randomly generated data – using dbms_random.value or dbms_random.normal won’t be representative. We can make the following assumptions about our putative site.

This means that I want to be able to generate my registration dates so that they are distributed exponentially, my friends per user using a power law and the number of diggs per user such that it fits a poisson distribution. In general when considering how to model your systems it’s worth bearing in mind that data often does not follow normal or uniform distribution patterns. Changing the distribution may well cause behavioural changes and oddities – particularly if you are operating on a system prior to 11g that may not estimate cardinality well

Possible Solutions

Ideally of course I’d like to generate all the data in oracle at different scales , I’m going to choose 500, 5000, 50000 and 500000 users. For all but one of the problems described above there is a ready Oracle solution, for the distribution of diggs amongst users a bit of lateral thinking is required. First the easy part, if we assume that the number of friends follows a power law. We can produce a distribution of links for testing using the following (inefficient) code extract.

DECLARE
  l_usercount constant NUMBER := 500000;
  l_itemcount constant NUMBER := 5000000;
  l_username users.username%type;
  l_friendcount NUMBER;
  type friendlist IS TABLE OF pls_integer;
  l_friendlist friendlist;
  l_itemid NUMBER;
  l_userid NUMBER;
BEGIN
    -- Setup friend relationships. Power law.
    FOR i IN 1..l_usercount
    LOOP
      -- initialize array
      l_friendlist := friendlist();
      -- get number of links for this users (1-100 distributed)
      l_friendcount := POWER(100, dbms_random.value) ;

      SELECT
        username
      INTO
        l_username
      FROM
        users
      WHERE
        userid = i;
      -- add friends
      FOR j IN 1..l_friendcount LOOP
          INSERT INTO friends
(userid , username , friendid , friendname , date_created )
          SELECT
            i , l_username , userid , username , sysdate - dbms_random.normal*dbms_random.value(1,365)
          FROM
            users
          WHERE
            userid = l_friendlist(j);
EXCEPTION
        WHEN dup_val_on_index THEN
          NULL;  -- occasionally dbms_random will get the same friend
        END;
      END LOOP;
      COMMIT;
      l_friendlist := NULL;
    END LOOP;
END;
/

For the number of diggs per user and the registration dates I found by far and away the easiest solution was to pre-generate a table of values that followed these distributions using the Excel Analysis Tool-Pak http://office.microsoft.com/en-us/excel-help/about-statistical-analysis-tools-HP005203873.aspx This tool allows you to generate numerical values that follow the described distributions, and excel’s save as csv feature allows you to easily create a file that can be loaded into the database via an external table. In my test setup I called the resultant table random_values and can populate my data using the following pseudo-code. I’ll look to publish the results of my investigations in the next blog entry but 1000 words is enough for a blog, and the principle that your test data should reflect reality however that looks is independent of the application that you are testing.

BEGIN
      FOR i IN 1..l_usercount
      LOOP
        -- dig n items our table assumes 2.8 per user
        SELECT
          poisson
        INTO
          l_digs
        FROM
          random_values
        WHERE
          userid = i;
        FOR n   IN 1..l_digs
        LOOP
          l_itemid := floor(dbms_random.value(1,l_itemcount)); -- what do I dig
          INSERT
          INTO
            diggs
            (
              itemid,
              userid,
              digdate
            )
            VALUES
            (
              l_itemid,
              i,
              sysdate - (dbms_random.normal*dbms_random.value(1,365))
            );
        END LOOP;
      if mod(i,10000) = 0 then commit; end if
      END LOOP;
   commit;
    END ;

Possibly Related Posts:

  • No Related Posts found

Written by Niall Litchfield

May 3rd, 2011 at 2:33 pm

One Response to 'Random Thoughts'

Subscribe to comments with RSS or TrackBack to 'Random Thoughts'.

  1. I haven’t gotten through the entire post yet, but have you tried SwingBench to generate data?

    I haven’t, was just curious if you have used that tool at all.

    chet

    chet

    3 May 11 at 4:00 pm

Leave a Reply