A VERY SCHOLARLY DISCUSSION OF THE EFFECT OF SYNONYMS ON PERFORMANCE

by Eugene Volokh, VESOFT

Presented at 1982 HPIUG Conference, Copenhagen, DENMARK
Published in "Thoughts & Discourses on HP3000 Software", 1st ed.

ABSTRACT

This paper will discuss the effect of synonyms on IMAGE/3000 performance, in particular the average number of I/Os necessary to perform a DBGET mode 7 (also known as a hashed, calculated, or keyed read).

THE PROBLEM

As is well known, IMAGE/3000 master datasets use an access method called HASHING to implement keyed I/O against them. The essence of this method is that when a DBPUT is performed against a master dataset, the key is translated through a special HASHING FUNCTION (which is explained in Robelle Consulting's SMUG II micro-proceedings) to a record number within the dataset. Then the entry to be added is inserted into the dataset at the resulting record number. Now, when a DBGET mode 7 is performed with the same key, the key is translated through the same hashing function to the same record number; then, the entry we are looking for is just the record with that number. Thus, we should ideally be able to add or retrieve any entry in just one I/O (as opposed to considerably more I/Os needed for other access methods, such as KSAM).

However, a problem arises: what if two keys map to the same record number (this condition is known as a COLLISION)? We can't very well put both entries into the same record number, so we will have to put one of these entries (known as the SECONDARY) somewhere else, and put a pointer to it into the other entry (known as the PRIMARY). Now, if we want to retrieve the secondary, we will have to perform two I/Os: one to retrieve the primary, and then when we see that the primary is not the entry we want, a second to retrieve the secondary. A similar situation arises when a key maps to a record number that already has a primary and a secondary attached to it; then the entry is added to the so-called SYNONYM CHAIN for that record number (the primary and all its secondaries), and we now need three I/Os to get to it. Similar things occur when a key maps to a record number that has two synonyms, three synonyms, etc.
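
To make these mechanics concrete, here is a minimal Python sketch of the process. It is an illustrative model only: the uniform random slot choice stands in for IMAGE's actual hashing function, and the physical placement of secondaries is not modeled -- we track only which synonym chain each entry joins.

  import random

  # Illustrative model only: a uniform random slot stands in for
  # IMAGE's hashing function; chains[r] is the synonym chain whose
  # primary hashed to record number r.
  def load_master(capacity, n_entries, seed=1982):
      rng = random.Random(seed)
      chains = [[] for _ in range(capacity)]
      for key in range(n_entries):
          r = rng.randrange(capacity)  # record number the key hashes to
          chains[r].append(key)        # first arrival is the primary;
                                       # later arrivals are secondaries
      return chains

  chains = load_master(capacity=1000, n_entries=800)
  primaries = sum(1 for c in chains if c)
  secondaries = sum(len(c) - 1 for c in chains if c)
  print(primaries, secondaries)        # roughly 550 and 250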

In short, access to primaries requires only one I/O; however, access to secondaries requires two or more I/Os (unless the blocking factor is greater than one).

It is also well known that the fuller a dataset is, the more secondaries there are in that dataset; and, the more secondaries there are, the more I/Os are needed (on the average) to retrieve an entry. However, nobody knows exactly how many more I/Os are needed.

This brings up an interesting question: what is the relationship between the average number of I/Os needed to retrieve an entry and how full the dataset is (the FILL FACTOR, measured from 0=empty to 1=full)?

THE SOLUTION

In this paper, I will attempt to solve the above question theoretically, i.e. derive a general formula that, given the fill factor, will return the average number of I/Os. In order to do this, however, some simplifying assumptions must be made:

First of all, I will assume that our hashing function is "good", i.e. regardless of the data fed to it, it will return each record number with roughly equal frequency. Thus, for instance, a hashing function that returns record number 17 half of the time, or returns odd record numbers more often than even record numbers, is not "good". Fortunately, IMAGE's hashing function is a relatively good one. (Note: If the dataset's key is numeric, e.g. I2, a not-so-good hashing function is used by IMAGE; thus, these results may not be correct for datasets with numeric keys.)

Secondly, I will assume that the master dataset in question has blocking factor 1. This will permit me to ignore the possibility of a synonym being in the same block as its primary, thus requiring no additional I/Os when it is retrieved. I must confess that this may be a quite fallacious assumption in that synonyms very often are in the same blocks as their primaries; however, although this fact may decrease the average number of I/Os for a given fill factor, it should not influence the overall nature of the function we are seeking.

Now, we can get down to business. Let us define:

  N = the number of entries in the dataset;
  C = the capacity of the dataset;
  F = the fill factor = N/C;
  An X-ary = an entry that is the Xth in a synonym chain, and thus
    takes X I/Os to retrieve. Thus, a 1-ary is a primary, and 2-arys,
    3-arys, 4-arys, etc. are all secondaries;
  E(X,I) = the expected number of X-aries in a dataset after I entries
    have been added.

Let us thus ask the following question:

How many primaries can we expect in the dataset after I entries have been added to it, i.e. what is E(1,I)?

Well, we can expect as many primaries as are expected after I-1 entries have been added (i.e. E(1,I-1)) plus either 0 or 1 more primaries (depending on whether the Ith entry becomes a primary or not), i.e.

  E(1,I) = E(1,I-1)+                                              [1]
           1*(probability that the Ith entry becomes a primary).

What is the probability that the Ith entry becomes a primary? An entry becomes a primary if it hashes to any slot in the dataset that is NOT OCCUPIED BY ANOTHER PRIMARY, i.e. that is empty or is occupied by a secondary! The probability that the Ith entry becomes a primary is thus equal to

  (the number of slots that are not occupied by another primary)
  ---------------------------------------------------------------
   (the total number of slots that it could possibly hash to)

The total number of slots that it could possibly hash to is, of course, C. The number of slots that are not occupied by another primary is C - (the number of slots that are occupied by a primary). When the Ith entry is being added, the expected number of slots that are occupied by a primary is by definition E(1,I-1).

Thus, the probability that the Ith entry will become a primary is

  (C-E(1,I-1))/C = 1-E(1,I-1)/C                                   [2]

Therefore, substituting [2] into [1],

  E(1,I) = E(1,I-1)+1-E(1,I-1)/C =                                [3]
         = E(1,I-1)*(1-1/C)+1

Moreover, we know that E(1,0) = expected number of primaries when the dataset has 0 entries, i.e. when the dataset is empty = 0. Thus, we have a recursive formula for E(1,I). From it we can, for instance, determine that E(1,1) = E(1,0)*(1-1/C)+1=0*(1-1/C)+1=1, that E(1,2) = E(1,1)*(1-1/C)+1=1*(1-1/C)+1=2-1/C, etc.

But, we want a non-recursive, straightforward formula. We can obtain it by further substituting [3] into itself:

  E(1,I) = E(1,I-1)*(1-1/C)+1 =                                   [4]
         = (E(1,I-2)*(1-1/C)+1)*(1-1/C)+1 =
         = E(1,I-2)*(1-1/C)^2+(1-1/C)+1 =
           ...
         = (1-1/C)^(I-1)+(1-1/C)^(I-2)+ ... +(1-1/C)^2+(1-1/C)+1

The above is what is known in mathematical circles as a geometric progression, i.e. a sum of consecutive powers of (1-1/C), in which adjacent terms differ by a constant factor. It is also known that the sum of the above progression is

  (1-1/C)^I-1   (1-1/C)^I-1
  ----------- = ----------- = -C*((1-1/C)^I-1) = C*(1-(1-1/C)^I)  [5]
   (1-1/C)-1       -1/C

Thus, E(1,I) = C*(1-(1-1/C)^I). If what we are looking for is the percentage of the dataset's entries that are primaries, we should consider E(1,I)/I = (C/I)*(1-(1-1/C)^I).
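
As a sanity check, a short Python loop (illustrative only) confirms that iterating recursion [3] from E(1,0)=0 reproduces this closed form:

  C, I = 500, 400
  E1 = 0.0                            # E(1,0) = 0: empty dataset
  for _ in range(I):
      E1 = E1 * (1 - 1/C) + 1         # recursion [3]
  closed = C * (1 - (1 - 1/C)**I)     # closed form [5]
  print(E1, closed)                   # both ~275.5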

Now, let us take a closer look at (1-1/C)^I. Clearly,

  (1-1/C)^I = (1-1/C)^(C*I/C) = ((1-1/C)^C)^(I/C)                 [6]

Also, we know that when C is sufficiently large (say, in excess of 100), (1-1/C)^C is close to (and in fact gets closer as C increases) the constant 1/e = 1/2.71828... = 0.367879....
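
This convergence is easy to see numerically; for instance, this quick illustrative check:

  import math

  # (1 - 1/C)^C approaches 1/e as C grows:
  for C in (10, 100, 1000, 10000):
      print(C, (1 - 1/C)**C)
  print("1/e =", 1/math.e)            # 0.36787944...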

Thus, for large C, we have (substituting [6] into [5])

  E(1,I)/I = (C/I)*(1-(1-1/C)^I) = (C/I)*(1-(1/e)^(I/C)) =        [7]
           = (C/I)*(1-e^(-I/C)) = (1-e^(-I/C))/(I/C).

So, the expected fraction of primaries in our dataset, which contains N entries, is E(1,N)/N = (1-e^(-N/C))/(N/C). But, N/C = F! Therefore,

               -F
  E (1,N)   1-e
  ------- = -----                                                 [8]
     N        F

Ugh.

At last, we have a solid result -- we have determined the expected number of primaries in a dataset given its fill factor. An interesting corollary to the above is that when the dataset is full, i.e. N=C or F=1, E(1,N)/N = (1-e^(-1))/1 = .632121.... Thus, even when a dataset is full, 63% of all entries in it should be primaries.
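
Readers who distrust the calculus can check this corollary by simulation. The following sketch (again with a uniform random stand-in for the hashing function) loads a dataset to capacity and counts the slots that received a primary:

  import random

  rng = random.Random(7)
  C = 100_000
  hit = set()
  for _ in range(C):                  # add N = C entries (F = 1)
      hit.add(rng.randrange(C))       # slot whose chain gets this entry
  print(len(hit) / C)                 # ~0.632, i.e. 1 - 1/e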

However, the battle is far from over yet. What we are trying to do is to determine the average number of I/Os needed to retrieve a given entry. Before doing the actual derivations, let me explain the method that we are going to use:

The average number of I/Os needed to retrieve a given entry is equal to the total number of I/Os needed to retrieve all entries in the dataset divided by the total number of entries.

The total number of I/Os needed to retrieve all entries in the dataset is the total number of I/Os needed to retrieve all primaries + the total number of I/Os needed to retrieve all 2-aries + .... But, the total number of I/Os needed to retrieve all X-aries = the expected number of X-aries * the number of I/Os needed to retrieve an X-ary. Clearly, to retrieve an X-ary we would need X I/Os; thus, the total number of I/Os needed to retrieve all X-aries = X*E(X,N).

Therefore, the total number of I/Os needed to retrieve all entries in the dataset is 1*E(1,N)+2*E(2,N)+3*E(3,N)+.... And, the average number of I/Os needed to retrieve a given entry = the number of I/Os needed to retrieve all entries / number of entries =

  1*E(1,N)+2*E(2,N)+3*E(3,N)+...
  ------------------------------                                  [9]
                N

We have already determined E(1,N). Now, if we can only determine E(2,N), E(3,N), ..., we will be almost done. So, this is what we will attempt to do below.

How are we going to determine E(X,I) for X>1?

Well, reasoning similar to that which we used for E(1,I) indicates that

  E(X,I) = E(X,I-1)+                                             [10]
           (probability that the Ith entry becomes an X-ary)

Now, what is the probability that the Ith entry becomes an X-ary? To answer this, we must ask: under what conditions would an entry become an X-ary? The answer to this is: if and only if it hashes to a slot that is currently occupied by a primary which has a synonym chain of length exactly X-1. How many entries like this are there? Well, the distinguishing feature of this kind of entry is that it has in its synonym chain an (X-1)-ary but no X-aries. Thus, the number of such primary entries = number of (X-1)-aries - number of X-aries = E(X-1,I-1) - E(X,I-1). Thus, the probability that the Ith entry becomes an X-ary is

  E(X-1,I-1)-E(X,I-1)
  -------------------                                            [11]
           C

and thus

  E(X,I) = E(X,I-1)+(E(X-1,I-1)-E(X,I-1))/C =                    [12]
         = E(X,I-1)*(1-1/C)+E(X-1,I-1)/C

To obtain a non-recursive formula, we expand the above into

              I
             ---
             \   E(X-1,J-1)
  E(X,I) =    -  ---------- * (1-1/C)^(I-J)                      [13]
             /       C
             ---
             J=1

In particular, let us try to determine E(2,I) = the expected number of 2-aries after I entries have been added.

  E(2,I) = sum(J=1:I | E(1,J-1)/C * (1-1/C)^(I-J)) ~             [14]
         ~ sum(J=1:I | E(1,J)/C * (1-1/C)^(I-J)) ~
         ~ sum(J=1:I | C*(1-e^(-J/C))/C * (1-1/C)^(I-J)) =
         = sum(J=1:I | (1-1/C)^(I-J)) -
           sum(J=1:I | e^(-J/C) * (1-1/C)^(I-J)) =
         = sum(J=0:I-1 | (1-1/C)^J) -
           sum(J=1:I | e^(-J/C) * (1-1/C)^(I-J))
Now, we can observe that sum(J=0:I-1 | (1-1/C)^J) is just E(1,I); moreover, (1-1/C)^(I-J) = ((1-1/C)^C)^((I-J)/C) ~ e^(-(I-J)/C). Thus,
  E(2,I) = E(1,I) - sum(J=1:I | e^(-J/C) * e^(-(I-J)/C)) =       [15]
         = E(1,I) - sum(J=1:I | e^(-I/C)) =
         = E(1,I) - I*e^(-I/C).

So, E(2,N) = E(1,N)-N*e^(-N/C); furthermore,

  E (2,N)   E (1,N)    -N/C   E (1,N)    -F
  ------- = ------- - e     = ------- - e                        [16]
     N         N                 N
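
Before pressing on, it is worth verifying [16] against the exact recursions. The sketch below (illustrative parameters) iterates [3] and [12] entry by entry and compares the resulting fraction of 2-aries with the approximation:

  import math

  C, N = 1000, 800
  E1, E2 = 0.0, 0.0                   # E(1,0) = E(2,0) = 0
  for I in range(1, N + 1):
      E1, E2 = (E1 * (1 - 1/C) + 1,         # [3]
                E2 * (1 - 1/C) + E1 / C)    # [12] with X = 2
  F = N / C
  approx = (1 - math.exp(-F)) / F - math.exp(-F)   # [16]
  print(E2 / N, approx)               # both ~0.24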

Now, we can proceed to the general question: what is E(X,I) (or, what is E(X,I)/I)?

I claim that for X>=2,

  E (X,I)   E (X-1,I)   (I/C)^(X-2)    -I/C
  ------- = --------- - ----------- * e                          [17]
     I          I          (X-1)!

where X! = X factorial = 1*2*3*...*X.

I will prove this claim via a device known as mathematical induction; i.e. I will demonstrate that the claim is valid for X=2, and if it is valid for X=A, then it is also valid for X=A+1. Clearly, this will demonstrate that the claim is valid for X=2, and therefore for X=2+1=3, and therefore also for X=3+1=4, etc.

First of all, I must prove that the claim is valid for X=2. In the case of X=2, the claim says that

  E (2,I)   E (1,I)   (I/C)^0    -I/C   E (1,I)    -I/C
  ------- = ------- - ------- * e     = ------- - e              [18]
     I         I         0!                I

But, this was proven above in the derivation of E(2,I). Thus, we know that our claim is valid for X=2.

Now let us assume that the claim is valid for X=A, i.e.

  E (A,I)   E (A-1,I)   (I/C)^(A-2)    -I/C
  ------- = --------- - ----------- * e                          [19]
     I          I          (A-1)!

We must show that it is also valid for X=A+1.

  E(A+1,I) = SUM(J=1:I | E(A,J-1)/C * (1-1/C)^(I-J)) =           [20]
           = SUM(J=1:I | (E(A-1,J-1)-J*(J/C)^(A-2)/(A-1)!*e^(-J/C))*
                         (1-1/C)^(I-J)/C) =
           = SUM(J=1:I | E(A-1,J-1)*(1-1/C)^(I-J)/C) -
             SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-J/C)*
                         (1-1/C)^(I-J)/C) =
           = E(A,I) -
             SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-J/C)*e^(-(I-J)/C)) =
           = E(A,I) -
             SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-I/C)) =
           = E(A,I) - e^(-I/C)/(A-1)!*SUM(J=1:I | J^(A-1))/C^(A-1) =
We know that SUM(J=1:I | J^(A-1)) = I^A/A + smaller powers of I, and is thus approximately equal to I^A/A; so,
  E(A+1,I) = E(A,I) - e^(-I/C)/(A-1)!*                           [21]
             SUM(J=1:I | J^(A-1))/C^(A-1) =
           = E(A,I) - e^(-I/C)/(A-1)!*(I^A/A)/C^(A-1) =
           = E(A,I) - e^(-I/C)/A!*I*(I/C)^(A-1)

Thus,

  E (A+1,I)   E (A,I)   (I/C)^(A-1)    -I/C
  --------- = ------- - ----------- * e                          [22]
      I          I          A!

This is indeed the result predicted by our claim for X=A+1. Thus, we know that if the claim is valid for X=A, it is also valid for X=A+1.

Thus, we are justified in saying that our claim is true. Finally, we have obtained a general (albeit recursive) formula for E (X,I).
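
A numeric spot-check of the claim is reassuring. The sketch below (illustrative parameters) iterates the exact recursion [12] for X up to 5, then rebuilds E(X,N)/N from [8] and the claim [17], and the two track closely:

  import math

  C, N = 2000, 1600
  E = [0.0] * 6                       # E[X] holds E(X,I); E[0] unused
  for I in range(1, N + 1):
      prev = E[:]                     # values at I-1
      E[1] = prev[1] * (1 - 1/C) + 1                  # [3]
      for X in range(2, 6):
          E[X] = prev[X] * (1 - 1/C) + prev[X - 1] / C   # [12]
  F = N / C
  claim = (1 - math.exp(-F)) / F      # E(1,N)/N by [8]
  for X in range(2, 6):
      claim -= F**(X - 2) / math.factorial(X - 1) * math.exp(-F)  # [17]
      print(X, E[X] / N, claim)       # exact vs claim agree closely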

Now comes the final phase of our proof -- the derivation of the average number of I/Os needed to retrieve a record. For the remainder of this discussion, let us for convenience use the fill factor F = N/C.

The average number of I/Os needed to retrieve a record is equal to the total number of I/Os needed to retrieve all records divided by the number of records.

The total number of I/Os needed to retrieve all records is equal to (the expected number of 1-arys)*(I/Os needed to retrieve a 1-ary)+ (the expected number of 2-arys)*(I/Os needed to retrieve a 2-ary)+...

But, we know that the expected number of X-arys is simply E(X,N); the number of I/Os needed to retrieve an X-ary is simply X; and, the number of records is N.

Thus, the average number of I/Os needed to retrieve a record is

  1*E(1,N)+2*E(2,N)+...     E (1,N)     E (2,N)
  --------------------- = 1*------- + 2*------- + ...            [23]
            N                  N           N

We have found that

  E (X,N)   E (X-1,N)   F^(X-2)    -F
  ------- = --------- - ------- * e                              [24]
     N          N        (X-1)!

Thus,

  E (X,N)   E (1,N)   F^0  -F   F^1  -F         F^(X-2)  -F
  ------- = ------- - ---*e   - ---*e   - ... - -------*e        [25]
     N         N       1!        2!              (X-1)!

Now,

  E(1,N)/N = (1-e^(-F))/F =                                      [26]
           = e^(-F)*(e^F-1)/F =
           = e^(-F)*(1+(F^1)/1!+(F^2)/2!+(F^3)/3!+...-1)/F =
           = e^(-F)*((F^1)/1!+(F^2)/2!+(F^3)/3!+...)/F =
           = e^(-F)*((F^0)/1!+(F^1)/2!+(F^2)/3!+...)
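
(This expansion is just the Taylor series for e^F; for the skeptical, a two-line numeric check of the identity:

  import math

  F = 0.7
  lhs = (1 - math.exp(-F)) / F
  rhs = math.exp(-F) * sum(F**Y / math.factorial(Y + 1) for Y in range(30))
  print(lhs, rhs)                     # equal to double precision

with F = 0.7 chosen arbitrarily.)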

So,

  E(X,N)/N = E(1,N)/N -                                          [27]
             ((F^0)/1!+(F^1)/2!+...+(F^(X-2))/(X-1)!)*e^(-F) =
           = ((F^0)/1!+(F^1)/2!+...)*e^(-F) -
             ((F^0)/1!+(F^1)/2!+...+(F^(X-2))/(X-1)!)*e^(-F) =
           = (F^(X-1)/X!+F^X/(X+1)!+...)*e^(-F) =
           = sum(Y=X-1:infinity | F^Y/(Y+1)!)*e^(-F)

We are trying to calculate

    E (1,N)     E (2,N)     E (3,N)
  1*------- + 2*------- + 3*------- + ... =                      [28]
       N           N           N
           = 1*(F^0/1!+F^1/2!+F^2/3!+...)*e^(-F) +
             2*       (F^1/2!+F^2/3!+...)*e^(-F) +
             3*              (F^2/3!+...)*e^(-F) + ...

How many times will F^(Y-1)/Y! be included in the above sum for any given Y? Well, it will be included once in the first row, with a factor of 1; once in the second row, with a factor of 2; and so on until the Yth row, in which it will occur with a factor of Y. In the (Y+1)st row, it will no longer occur, because the (Y+1)st row contains only terms starting with (F^Y)/(Y+1)! Thus, in total, F^(Y-1)/Y! will occur in the resulting sum with a factor of (1+2+3+...+Y) = (Y+1)*Y/2. Thus, the average number of I/Os will be

  = sum(Y=1:infinity | F^(Y-1)/Y!*(Y+1)*Y/2) * e^(-F) =          [29]
  = sum(Y=1:infinity | F^(Y-1)/(Y-1)!*(Y+1)/2) * e^(-F) =
  = sum(Y=1:infinity | F^(Y-1)/(Y-1)!*(Y-1)/2+
                       F^(Y-1)/(Y-1)!*2/2) * e^(-F) =
  = sum(Y=0:infinity | F^Y/Y!*Y/2) * e^(-F) +
    sum(Y=0:infinity | F^Y/Y!) * e^(-F) =
  = sum(Y=1:infinity | F^Y/(Y-1)!)/2 * e^(-F) +
    sum(Y=0:infinity | F^Y/Y!) * e^(-F) =
  = (F/2+1) * sum(Y=0:infinity | F^Y/Y!) * e^(-F)
Here we can again observe that sum(Y=0:infinity | F^Y/Y!) = e^F. Thus, the average number of I/Os is

  = (F/2+1) * sum(Y=0:infinity | F^Y/Y!) * e^(-F) =              [30]
  = (F/2+1) * e^F * e^(-F) = 1+F/2

Eureka! So,

THE AVERAGE NUMBER OF I/OS REQUIRED FOR A DBGET MODE 7 AGAINST AN IMAGE/3000 MASTER DATASET WITH BLOCKING FACTOR 1 AND FILL FACTOR F IS 1+F/2.

At last.
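
Before moving on, the main result can also be checked by brute force. This sketch (uniform random stand-in for the hashing function, blocking factor 1, as assumed throughout) loads datasets at several fill factors and tallies the I/Os needed to read every entry back:

  import random

  rng = random.Random(42)
  C = 50_000
  for F in (0.2, 0.5, 0.8, 1.0):
      chain_len = [0] * C
      ios = 0
      for _ in range(int(F * C)):
          r = rng.randrange(C)        # slot this entry's key hashes to
          chain_len[r] += 1
          ios += chain_len[r]         # the Xth entry on a chain costs X I/Os
      print(F, ios / int(F * C), 1 + F / 2)   # measured vs 1+F/2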

THE APPLICATIONS

Now that all of you have been dazzled by the above display of Incredibly Incomprehensible Mathematical Formulae, an important question comes up:

WHY BOTHER?

As an amateur mathematician (i.e. someone who is crazy enough to actually LIKE the above mumbo-jumbo), I can answer the above with "Why not?" or "Because it's there" or "But aren't you just overwhelmed by the incredible beauty of the result?".

However, as a computer programmer who works on what is essentially a business computer, I feel that I ought to show some applications of the above mess. Well, here they come.

For one, note that the formula generated depends only on how full the dataset is as a percentage, not on the actual number of entries in the dataset. Thus, if we compare the efficiency of IMAGE mode 7 DBGETs and KSAM FREADBYKEYs, we find that IMAGE should theoretically require only 1 to 1.5 I/Os, whereas KSAM requires on the order of log N I/Os.

Another point that this proves is that increasing the capacity of a dataset will not necessarily greatly improve the efficiency of master dataset access (as is commonly believed). For instance, if the capacity of a dataset that is 80% full is doubled, the average number of I/Os required for a DBGET mode 7 against this dataset will decrease from 1.4 to 1.2, which may not be significant enough to offset the two-fold increase in disc space usage.

On the other hand, if you see that the fraction of secondaries is much larger than 1-(1-e^(-F))/F and/or the average number of I/Os is much greater than 1+F/2, you may have come across a case in which IMAGE's hashing algorithm behaves poorly. In this case, changing the capacity very slightly may cut the number of secondaries and the average number of I/Os dramatically. In one case which I ran across, changing the capacity by 1 cut the average number of I/Os in half! The actual number of secondaries and the actual average number of I/Os may be obtained from various database utilities such as DBLOADNG, DBSTAT2, or Robelle Consulting's HowMessy.

CONCLUSION

The two major results that were derived in this paper are:

If a dataset has fill factor (number of entries/capacity) F,

the expected fraction of primaries in that dataset is:

          -F
       1-e
  P(F)=-----                                                      [8]
         F

and the expected average number of I/Os needed to retrieve an entry (if the dataset has blocking factor 1) is:

  A(F)=1+F/2                                                     [30]

EXPERIMENTAL EVIDENCE

The following is a table of the results that were expected and the results that were obtained for several test datasets:

                   --------Actual--------  -------Expected-------
--C-- --N-- --F--  %Primaries Avg num IOs  %Primaries Avg num IOs
  650   525 80.8%     67.2%      1.411        68.6%      1.404
  331   193 58.3%     75.1%      1.337        75.8%      1.291
  331   193 58.3%     75.6%      1.275        75.8%      1.291
15000 10582 70.5%     71.0%      1.387        71.8%      1.352
  331   118 35.6%     80.5%      1.203        84.1%      1.178
  250   206 82.4%     66.0%      1.456        68.1%      1.412
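
(The "Expected" columns above follow directly from [8] and [30]; a few lines of Python reproduce them:

  import math

  for C, N in ((650, 525), (331, 193), (15000, 10582),
               (331, 118), (250, 206)):
      F = N / C
      print(f"C={C:6} N={N:6} F={F:6.1%} "
            f"primaries={(1 - math.exp(-F)) / F:6.1%} "
            f"avg IOs={1 + F / 2:6.3f}")

The "Actual" columns, of course, come from the real datasets.)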

Of the following, some datasets (the ones marked +) have primary percentages that are significantly less than expected and average numbers of I/Os that are significantly more than expected. The datasets marked - are exact copies of those datasets with slightly changed capacities -- their primary percentages and average numbers of I/Os are quite close to those expected:

                   --------Actual--------  -------Expected-------
--C-- --N-- --F--  %Primaries Avg num IOs  %Primaries Avg num IOs
+ 128   123 96.1%      7.3%     13.008        64.3%      1.480
+ 127   123 96.8%     52.0%      1.667        64.0%      1.484
- 126   123 97.6%     70.7%      1.407        63.9%      1.488
+ 320   284 88.9%     37.3%      2.810        66.3%      1.444
- 319   284 89.0%     65.5%      1.444        66.2%      1.445
+ 400   158 39.5%     69.6%      1.456        82.6%      1.198
- 399   158 39.6%     86.1%      1.158        82.6%      1.198

These datasets are good examples of instances in which slightly altering the dataset capacity (even decreasing it!) can significantly improve performance.
