A VERY SCHOLARLY DISCUSSION OF THE EFFECT OF SYNONYMS ON PERFORMANCE
by Eugene Volokh, VESOFT
Presented at 1982 HPIUG Conference, Copenhagen, DANMARK
Published in "Thoughts & Discourses on HP3000 Software", 1st ed.
ABSTRACT
This paper will discuss the effect of synonyms on IMAGE/3000
performance, in particular the average number of I/Os necessary to
perform a DBGET mode 7 (also known as a hashed, calculated, or keyed
read).
THE PROBLEM
As is well known, IMAGE/3000 master datasets use an access method
called HASHING to implement keyed I/O against them. The essence of
this method is that when a DBPUT is performed against a master
dataset, the key is translated through a special HASHING FUNCTION
(which is explained in Robelle Consulting's SMUG II
micro-proceedings) to a record number within the dataset. Then the
entry to be added is inserted into the dataset at the resulting
record number. Now, when a DBGET mode 7 is performed with the same
key, the key is translated through the same hashing function to the
same record number; then, the entry we are looking for is just the
record with that number. Thus, we should ideally be able to add or
retrieve any entry in just one I/O (as opposed to considerably more
I/Os needed for other access methods, such as KSAM).
However, a problem arises: what if two keys map to the same record
number (this condition is known as a COLLISION)? We can't very well
put both entries into the same record number, so we will have to put
one of these entries (known as the SECONDARY) somewhere else, and put
a pointer to it into the other entry (known as the PRIMARY). Now, if
we want to retrieve the secondary, we will have to perform two I/Os:
one to retrieve the primary, and then when we see that the primary is
not the entry we want, a second to retrieve the secondary. A similar
situation arises when a key maps to a record number that already has
a primary and a secondary attached to it; then the entry is added to
the so-called SYNONYM CHAIN for that record number (the primary and
all its secondaries), and we now need three I/Os to get to it.
Similar things occur when a key maps to a record number that has two
synonyms, three synonyms, etc.
In short, access to primaries requires only one I/O; however, access
to secondaries requires two or more I/Os (unless the blocking factor
is greater than one).
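To make the mechanism concrete, here is a minimal sketch (in Python, which
is of course a modern illustration and not IMAGE itself) of a
blocking-factor-1 master dataset; the toy_hash function is merely a
stand-in for IMAGE's hashing function, and each link of a synonym chain
costs one I/O:

    # Toy model of an IMAGE-style master dataset with blocking factor 1.
    # toy_hash is only a stand-in for IMAGE's real hashing function.
    def toy_hash(key, capacity):
        return hash(key) % capacity

    class ToyMasterDataset:
        def __init__(self, capacity):
            self.capacity = capacity
            self.chains = {}          # primary record number -> synonym chain

        def dbput(self, key):
            slot = toy_hash(key, self.capacity)
            self.chains.setdefault(slot, []).append(key)

        def dbget_ios(self, key):
            # The primary is position 1 in its chain (1 I/O);
            # the entry in position k of the chain needs k I/Os to reach.
            chain = self.chains[toy_hash(key, self.capacity)]
            return chain.index(key) + 1

If two keys happen to hash to the same record number, the second one put
becomes a secondary and dbget_ios reports 2 for it -- exactly the
situation described above.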
It is also well known that the more full a dataset is, the more
secondaries there are in that dataset; and, the more secondaries
there are, the more I/Os are needed (on the average) to retrieve an
entry. However, how many more I/Os are needed, nobody knows.
This brings up an interesting question: what is the relationship of
the average number of I/Os needed to retrieve an entry to how full
the dataset is (THE FILL FACTOR), measured from 0 (empty) to 1 (full)?
THE SOLUTION
In this paper, I will attempt to solve the above question
theoretically, i.e. derive a general formula that, given the fill
factor, will return the average number of I/Os. In order to do this,
however, some simplifying assumptions must be made:
First of all, I will assume that our hashing function is "good", i.e.
regardless of the data fed to it, it will return each record number
with roughly equal frequency. Thus, for instance, a hashing function
that returns record number 17 half of the time, or returns odd record
numbers more often than even record numbers is not "good".
Fortunately, IMAGE's hashing function is a relatively good one.
(Note: If the dataset's key is numeric, e.g. I2, a not-so-good
hashing function is used by IMAGE; thus, these results may not be
correct for datasets with numeric keys).
Secondly, I will assume that the master dataset in question has
blocking factor 1. This will permit me to ignore the possibility of
a synonym being in the same block as the primary, thus requiring no
additional I/Os when it is retrieved. I must confess that this may
be a quite fallacious assumption in that synonyms very often are in
the same blocks as their primaries; however, although this fact may
decrease the average number of I/Os for a given fill factor, it
should not influence the overall nature of the function we are
seeking.
Now, we can get down to business. Let us define
N = number of entries in the dataset,
C = capacity of the dataset,
F = fill factor = N/C,
An X-ary = an entry that is the Xth in a synonym chain, and thus
takes X I/Os to retrieve; a 1-ary is therefore a primary,
and 2-aries, 3-aries, 4-aries, etc. are all secondaries,
E (X, I) = the expected number of X-aries in a dataset after I
entries have been added.
Let us thus ask the following question:
How many primaries can we expect in the dataset after I entries
have been added to it, i.e. what is E(1,I)?
Well, we can expect as many primaries as are expected after I-1
entries have been added (i.e. E(1,I-1)) plus either 0 or 1 more
primaries (depending on whether the Ith entry becomes a primary or
not), i.e.
    E(1,I) = E(1,I-1) +                                               [1]
             1*(probability that the Ith entry becomes a primary).
What is the probability that the Ith entry becomes a primary? An
entry becomes a primary if it hashes to any slot in the dataset that
is NOT OCCUPIED BY ANOTHER PRIMARY, i.e. that is empty or is occupied
by a secondary! The probability that the Ith entry becomes a primary
is thus equal to
(the number of slots that are not occupied by another primary)
--------------------------------------------------------------
(the total number of slots that it could possibly hash to)
The total number of slots that it could possibly hash to is, of
course, C. The number of slots that are not occupied by another
primary is C - (the number of slots that are occupied by a primary).
When the Ith entry is being added, the expected number of slots that
are occupied by a primary is by definition E(1,I-1).
Thus, the probability that the Ith entry will become a primary is
(C-E(1,I-1))/C = 1-E(1,I-1)/C [2]
Therefore, substituting [2] into [1],
E(1,I) = E(1,I-1)+1-E(1,I-1)/C = [3]
= E(1,I-1)*(1-1/C)+1
Moreover, we know that E(1,0) = expected number of primaries when the
dataset has 0 entries, i.e. when the dataset is empty = 0. Thus, we
have a recursive formula for E(1,I). From it we can, for instance,
determine that E(1,1) = E(1,0)*(1-1/C)+1=0*(1-1/C)+1=1, that E(1,2) =
E(1,1)*(1-1/C)+1=1*(1-1/C)+1=2-1/C, etc.
But, we want a non-recursive, straightforward formula. We can obtain
it by further substituting [3] into itself:
E(1,I) = E(1,I-1)*(1-1/C)+1 = [4]
= (E(1,I-2)*(1-1/C)+1)*(1-1/C)+1 =
= E(1,I-2)*(1-1/C)^2+(1-1/C)+1 =
...
= (1-1/C)^(I-1)+(1-1/C)^(I-2)+ ... +(1-1/C)^2+(1-1/C)+1
The above is what is known in mathematical circles as a geometric
progression: a sum of successive powers of a common ratio, here
(1-1/C). It is also known that the sum of the above progression is
    ((1-1/C)^I - 1)/((1-1/C) - 1) = ((1-1/C)^I - 1)/(-1/C) =          [5]
                                  = -C*((1-1/C)^I - 1) = C*(1-(1-1/C)^I)
Thus, E(1,I) = C*(1-(1-1/C)^I). If what we are looking for is the
percentage of the dataset's entries that are primaries, we should
consider E(1,I)/I = (C/I)*(1-(1-1/C)^I).
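As a quick numerical sanity check (illustrative only; the values of C and
I below are arbitrary), the recurrence [3] and the closed form just
derived do give the same value:

    # E(1,I) two ways: the recurrence [3] and the closed form [5].
    def primaries_by_recurrence(I, C):
        e = 0.0                            # E(1,0) = 0
        for _ in range(I):
            e = e * (1 - 1.0/C) + 1        # E(1,I) = E(1,I-1)*(1-1/C) + 1
        return e

    def primaries_closed_form(I, C):
        return C * (1 - (1 - 1.0/C)**I)    # E(1,I) = C*(1-(1-1/C)^I)

    C, I = 1000, 800
    print(primaries_by_recurrence(I, C))   # about 550.9
    print(primaries_closed_form(I, C))     # the same value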
Now, let us take a closer look at (1-1/C)^I. Clearly,
(1-1/C)^I = (1-1/C)^(C*I/C) = ((1-1/C)^C)^(I/C) [6]
Also, we know that when C is sufficiently large (say, in excess of
100), (1-1/C)^C is close to (and in fact gets closer as C increases)
the constant 1/e = 1/2.71828... = 0.367879....
Thus, for large C, we have (substituting [6] into [5])
E(1,I)/I = (C/I)*(1-(1-1/C)^I) = (C/I)*(1-(1/e)^(I/C)) = [7]
= (C/I)*(1-e^(-I/C)) = (1-e^(-I/C))/(I/C).
So, the expected fraction of primaries in our dataset, which contains
N entries, is E(1,N)/N = (1-e^(-N/C))/(N/C). But, N/C = F!
Therefore,
    E(1,N)/N = (1-e^(-F))/F                                           [8]
Ugh.
At last, we have a solid result -- we have determined the expected
number of primaries in a dataset given its fill factor. An
interesting corollary to the above is that when the dataset is full,
i.e. N=C or F=1, E(1,N)/N = (1-e^(-1))/1 = .632121.... Thus, even
when a dataset is full, 63% of all entries in it should be primaries.
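A few values of formula [8] (a quick, purely illustrative computation)
show how gently the expected fraction of primaries degrades as the
dataset fills up:

    # Expected fraction of primaries, formula [8]: P(F) = (1 - e^(-F))/F.
    import math

    def primary_fraction(F):
        return (1 - math.exp(-F)) / F

    for F in (0.2, 0.5, 0.8, 1.0):
        print(F, round(primary_fraction(F), 3))
    # prints roughly:  0.2 0.906   0.5 0.787   0.8 0.688   1.0 0.632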
However, the battle is far from over yet. What we are trying to do
is to determine the average number of I/Os needed to retrieve a given
entry. Before doing the actual derivations, let me explain the
method that we are going to use:
The average number of I/Os needed to retrieve a given entry is equal
to the total number of I/Os needed to retrieve all entries in the
dataset divided by the total number of entries.
The total number of I/Os needed to retrieve all entries in the
dataset is the total number of I/Os needed to retrieve all primaries
+ the total number of I/Os needed to retrieve all 2-aries + ....
But, the total number of I/Os needed to retrieve all X-aries = the
expected number of X-aries * the number of I/Os needed to retrieve an
X-ary. Clearly, to retrieve an X-ary we would need X I/Os; thus, the
total number of I/Os needed to retrieve all X-aries = X*E(X,N).
Therefore, the total number of I/Os needed to retrieve all entries in
the dataset is 1*E(1,N)+2*E(2,N)+3*E(3,N)+.... And, the average
number of I/Os needed to retrieve a given entry = the number of I/Os
needed to retrieve all entries / number of entries =
    ( 1*E(1,N) + 2*E(2,N) + 3*E(3,N) + ... ) / N                      [9]
We have already determined E(1,N). Now, if we can only determine
E(2,N), E(3,N), ..., we will be almost done. So, this is what we
will attempt to do below.
How are we going to determine E(X,I) for X>1?
Well, reasoning similar to the one we used for E(1,I) indicates that
    E(X,I) = E(X,I-1) +                                               [10]
             (probability that the Ith entry becomes an X-ary)
Now, what is the probability that the Ith entry becomes an X-ary? To
answer this, we must ask: under what conditions would an entry become
an X-ary? The answer to this is: if and only if it hashes to a slot
that is currently occupied by a primary which has a synonym chain of
length exactly X-1. How many entries like this are there? Well, the
distinguishing feature of this kind of entry is that it has in its
synonym chain an (X-1)-ary but no X-aries. Thus, the number of such
primary entries = number of (X-1)-aries - number of X-aries =
E(X-1,I-1) - E(X,I-1). Thus, the probability that the Ith entry
becomes an X-ary is
    ( E(X-1,I-1) - E(X,I-1) ) / C                                     [11]
and thus
    E(X,I) = E(X,I-1) + (E(X-1,I-1) - E(X,I-1))/C =                   [12]
           = E(X,I-1)*(1-1/C) + E(X-1,I-1)/C
To obtain a non-recursive formula, we expand the above into
    E(X,I) = sum(J=1:I | E(X-1,J-1)/C * (1-1/C)^(I-J))                [13]
In particular, let us try to determine E(2,I) = the expected number
of 2-aries after I entries have been added.
    E(2,I) = sum(J=1:I | E(1,J-1)/C * (1-1/C)^(I-J)) ~                [14]
           ~ sum(J=1:I | E(1,J)/C * (1-1/C)^(I-J)) ~
           ~ sum(J=1:I | C*(1-e^(-J/C))/C * (1-1/C)^(I-J)) =
           = sum(J=1:I | (1-1/C)^(I-J)) -
             sum(J=1:I | e^(-J/C) * (1-1/C)^(I-J)) =
           = sum(J=0:I-1 | (1-1/C)^J) -
             sum(J=1:I | e^(-J/C) * (1-1/C)^(I-J))
Now, we can observe that sum(J=0:I-1 | (1-1/C)^J) is just E(1,I);
moreover, (1-1/C)^(I-J) = ((1-1/C)^C)^((I-J)/C) ~ e^(-(I-J)/C).
Thus,
E(2,I) = E(1,I) - sum(J=1:I | e^(-J/C) * e^(-(I-J)/C)) = [15]
= E(1,I) - sum(J=1:I | e^(-I/C)) =
= E(1,I) - I*e^(-I/C).
So, E(2,N) = E(1,N)-N*e^(-N/C); furthermore,
    E(2,N)/N = E(1,N)/N - e^(-N/C) = E(1,N)/N - e^(-F)                [16]
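Evaluating [16] for a few fill factors (again purely illustrative) gives
the expected fraction of 2-aries; even in a completely full dataset only
about a quarter of the entries should be 2-aries:

    # Expected fraction of 2-aries, formula [16]:
    # E(2,N)/N = (1 - e^(-F))/F - e^(-F).
    import math

    def twoary_fraction(F):
        return (1 - math.exp(-F)) / F - math.exp(-F)

    for F in (0.5, 0.8, 1.0):
        print(F, round(twoary_fraction(F), 3))
    # prints roughly:  0.5 0.18   0.8 0.239   1.0 0.264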
Now, we can proceed to the general question: what is E(X,I) (or, what
is E(X,I)/I)?
I claim that for X>=2,
    E(X,I)/I = E(X-1,I)/I - (I/C)^(X-2)/(X-1)! * e^(-I/C)             [17]
where X! = X factorial = 1*2*3*...*X.
I will prove this claim via a device known as mathematical induction;
i.e. I will demonstrate that the claim is valid for X=2, and if it is
valid for X=A, then it is also valid for X=A+1. Clearly, this will
demonstrate that the claim is valid for X=2, and therefore for
X=2+1=3, and therefore also for X=3+1=4, etc.
First of all, I must prove that the claim is valid for X=2. In the
case of X=2, the claim says that
    E(2,I)/I = E(1,I)/I - (I/C)^0/1! * e^(-I/C) = E(1,I)/I - e^(-I/C) [18]
But, this was proven above in the derivation of E(2,I). Thus, we
know that our claim is valid for X=2.
Now let us assume that the claim is valid for X=A, i.e.
    E(A,I)/I = E(A-1,I)/I - (I/C)^(A-2)/(A-1)! * e^(-I/C)             [19]
We must show that it is also valid for X=A+1.
    E(A+1,I) = SUM(J=1:I | E(A,J-1)/C * (1-1/C)^(I-J)) ~              [20]
             ~ SUM(J=1:I | (E(A-1,J-1) - J*(J/C)^(A-2)/(A-1)!*e^(-J/C))
                           * (1-1/C)^(I-J)/C) =
             = SUM(J=1:I | E(A-1,J-1)/C * (1-1/C)^(I-J)) -
               SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-J/C) * (1-1/C)^(I-J)) ~
             ~ E(A,I) -
               SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-J/C) * e^(-(I-J)/C)) =
             = E(A,I) -
               SUM(J=1:I | (J/C)^(A-1)/(A-1)!*e^(-I/C)) =
             = E(A,I) - e^(-I/C)/(A-1)! * SUM(J=1:I | J^(A-1))/C^(A-1)
We know that SUM(J=1:I | J^(A-1)) = I^A/A + smaller powers of I, and
is thus approximately equal to I^A/A; so,
    E(A+1,I) = E(A,I) - e^(-I/C)/(A-1)! * SUM(J=1:I | J^(A-1))/C^(A-1) ~  [21]
             ~ E(A,I) - e^(-I/C)/(A-1)! * (I^A/A)/C^(A-1) =
             = E(A,I) - e^(-I/C)/A! * I * (I/C)^(A-1)
Thus,
    E(A+1,I)/I = E(A,I)/I - (I/C)^(A-1)/A! * e^(-I/C)                 [22]
This is indeed the result predicted by our claim for X=A+1. Thus,
we know that if the claim is valid for X=A, it is valid for X=A+1.
Thus, we are justified in saying that our claim is true. Finally, we
have obtained a general (albeit recursive) formula for E (X,I).
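The recursive formula is easy to evaluate numerically. The sketch below
(illustrative only) computes E(X,N)/N for X = 1, 2, 3, ... from [8] and
[17] and confirms that the fractions of 1-aries, 2-aries, 3-aries, etc.
account for every entry, i.e. that they add up to 1:

    # Fractions of X-aries from claim [17], with F = N/C.
    import math

    def xary_fractions(F, max_x=25):
        fractions = [(1 - math.exp(-F)) / F]          # X = 1, formula [8]
        for X in range(2, max_x + 1):
            drop = F**(X - 2) / math.factorial(X - 1) * math.exp(-F)
            fractions.append(fractions[-1] - drop)    # formula [17]
        return fractions

    fracs = xary_fractions(0.8)
    print([round(f, 3) for f in fracs[:4]])
    # prints roughly [0.688, 0.239, 0.059, 0.011]
    print(round(sum(fracs), 3))                       # roughly 1.0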
Now comes the final phase of our proof -- the derivation of the
average number of I/Os needed to retrieve a record. For the
remainder of this discussion, let us for convenience use the fill
factor F = N/C.
The average number of I/Os needed to retrieve a record is equal to
the total number of I/Os needed to retrieve all records divided by
the number of records.
The total number of I/Os needed to retrieve all records is equal to
(the expected number of 1-aries)*(I/Os needed to retrieve a 1-ary) +
(the expected number of 2-aries)*(I/Os needed to retrieve a 2-ary) + ...
But, we know that the expected number of X-aries is simply E(X,N); the
number of I/Os needed to retrieve an X-ary is simply X; and, the
number of records is N.
Thus, the average number of I/Os needed to retrieve a record is
    (1*E(1,N) + 2*E(2,N) + ...)/N = 1*E(1,N)/N + 2*E(2,N)/N + ...     [23]
We have found that
    E(X,N)/N = E(X-1,N)/N - F^(X-2)/(X-1)! * e^(-F)                   [24]
Thus,
    E(X,N)/N = E(1,N)/N - (F^0/1!)*e^(-F) - (F^1/2!)*e^(-F) - ...
                        - (F^(X-2)/(X-1)!)*e^(-F)                     [25]
Now,
E(1,N)/N = (1-e^(-F))/F = [26]
= e^(-F)*(e^F-1)/F =
= e^(-F)*(1+(F^1)/1!+(F^2)/2!+(F^3)/3!+...-1)/F =
= e^(-F)*((F^1)/1!+(F^2)/2!+(F^3)/3!+...)/F =
= e^(-F)*((F^0)/1!+(F^1)/2!+(F^2)/3!+...)
So,
E(X,N)/N = E(1,N)/N - [27]
((F^0)/1!+(F^1)/2!+...+(F^(X-2))/(X-1)!)*e^(-F) =
= ((F^0)/1!+(F^1)/2!+...)*e^(-F) -
((F^0)/1!+(F^1)/2!+...+(F^(X-2))/(X-1)!)*e^(-F) =
= (F^(X-1)/X!+F^X/(X+1)!+...)*e^(-F) =
= sum(Y=X-1:infinity | F^Y/(Y+1)!)*e^(-F)
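Formula [27] can be checked numerically against the recursion [24] it was
telescoped from (an illustrative sketch; the infinite series is simply
truncated after enough terms):

    # E(X,N)/N two ways: the telescoped recursion [24] and the series [27].
    import math

    def fraction_by_recursion(X, F):
        frac = (1 - math.exp(-F)) / F                  # X = 1, formula [8]
        for x in range(2, X + 1):
            frac -= F**(x - 2) / math.factorial(x - 1) * math.exp(-F)
        return frac

    def fraction_by_series(X, F, terms=60):
        return sum(F**Y / math.factorial(Y + 1)
                   for Y in range(X - 1, X - 1 + terms)) * math.exp(-F)

    for X in (1, 2, 3, 5):
        print(X, round(fraction_by_recursion(X, 0.8), 6),
                 round(fraction_by_series(X, 0.8), 6))
    # the two columns agree for every X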
We are trying to calculate
    1*E(1,N)/N + 2*E(2,N)/N + 3*E(3,N)/N + ... =                      [28]
      = 1*(F^0/1! + F^1/2! + F^2/3! + ...)*e^(-F) +
        2*(F^1/2! + F^2/3! + ...)*e^(-F) +
        3*(F^2/3! + ...)*e^(-F) + ...
How many times will F^(Y-1)/Y! be included in the above sum for any
given Y? Well, it will be included once in the first row, with a
factor of 1; once in the second row, with a factor of 2; and so on
until the Yth row, in which it will occur with a factor of Y. In the
(Y+1)st row, it will no longer occur, because the (Y+1)st row
contains only terms starting with (F^Y)/(Y+1)! Thus, in total,
F^(Y-1)/Y! will occur in the resulting sum with a factor of
(1+2+3+...+Y) = (Y+1)*Y/2. Thus, the average number of I/Os will be
= sum(Y=1:infinity | F^(Y-1)/Y!*(Y+1)*Y/2) * e^(-F) = [29]
= sum(Y=1:infinity | F^(Y-1)/(Y-1)!*(Y+1)/2) * e^(-F) =
= sum(Y=1:infinity | F^(Y-1)/(Y-1)!*(Y-1)/2+
F^(Y-1)/(Y-1)!*2/2) * e^(-F) =
= sum(Y=0:infinity | F^Y/Y!*Y/2) * e^(-F) +
sum(Y=0:infinity | F^Y/Y!) * e^(-F) =
= sum(Y=1:infinity | F^Y/(Y-1)!)/2 * e^(-F) +
sum(Y=0:infinity | F^Y/Y!) * e^(-F) =
= (F/2+1) * sum(Y=0:infinity | F^Y/Y!) * e^(-F)
Here we can again observe that sum(Y=0:infinity | F^Y/Y!) = e^F.
Thus, the average number of I/Os is
= (F/2+1) * sum(Y=0:infinity | F^Y/Y!) * e^(-F) = [30]
= (F/2+1) * e^F * e^(-F) = 1+F/2
Eureka! So,
THE AVERAGE NUMBER OF I/OS REQUIRED FOR A DBGET MODE 7 AGAINST AN
IMAGE/3000 MASTER DATASET WITH BLOCKING FACTOR 1 AND FILL FACTOR F
IS 1+F/2.
At last.
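The result is also easy to test by simulation. The sketch below
(illustrative only; a uniform random record number stands in for IMAGE's
hashing function) fills a dataset to a given fill factor many times over
and measures the actual average number of I/Os and the actual fraction of
primaries, which should come out close to 1+F/2 and (1-e^(-F))/F
respectively:

    # Monte Carlo check of the 1 + F/2 result (blocking factor 1).
    import math, random

    def simulate(capacity, fill_factor, trials=200):
        total_ios = total_entries = total_primaries = 0
        n = int(capacity * fill_factor)
        for _ in range(trials):
            chain_len = [0] * capacity        # chain length per record number
            for _ in range(n):
                chain_len[random.randrange(capacity)] += 1
            total_entries += n
            total_primaries += sum(1 for c in chain_len if c > 0)
            # a chain of length c costs 1+2+...+c I/Os to read in full
            total_ios += sum(c * (c + 1) // 2 for c in chain_len)
        return total_ios / total_entries, total_primaries / total_entries

    F = 0.8
    avg_ios, primary_frac = simulate(capacity=1000, fill_factor=F)
    print(avg_ios, 1 + F/2)                       # both near 1.4
    print(primary_frac, (1 - math.exp(-F)) / F)   # both near 0.688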
THE APPLICATIONS
Now that all of you have been dazzled by the above display of
Incredibly Incomprehensible Mathematical Formulae, an important
question comes up:
WHY BOTHER?
As an amateur mathematician (i.e. someone who is crazy enough to
actually LIKE the above mumbo-jumbo), I can answer the above with
"Why not?" or "Because it's there" or "But aren't you just
overwhelmed by the incredible beauty of the result?".
However, as a computer programmer who works on what is essentially a
business computer, I feel that I ought to show some applications of
the above mess. Well, here they come.
For one, note that the derived formula depends only on how full the
dataset is (its fill factor), not on the actual number of entries in
the dataset. Thus, if we compare the efficiency of IMAGE mode 7 DBGETs
and KSAM FREADBYKEYs, we find that IMAGE should theoretically require
only 1 to 1.5 I/Os, whereas KSAM requires on the order of log N I/Os.
Another point that this proves is that increasing the capacity of a
dataset will not necessarily greatly improve the efficiency of master
dataset access (as is commonly believed). For instance, if the
capacity of a dataset that is 80% full is doubled, the average number
of I/Os required for a DBGET mode 7 against this dataset will decrease
from 1.4 to 1.2, which may not be significant enough to offset the
two-fold increase in disc space usage.
On the other hand, if you see that the fraction of secondaries is
much larger than 1-(1-e^(-F))/F and/or the average number of I/Os is
much greater than 1+F/2, you may have come across a case in which
IMAGE's hashing algorithm behaves poorly. In this case, changing the
capacity very slightly may cut the number of secondaries and the
average number of I/Os dramatically. In one case which I ran across,
changing the capacity by 1 cut the average number of I/Os in half!
The actual number of secondaries and the actual average number of
I/Os may be obtained from various database utilities such as
DBLOADNG, DBSTAT2, or Robelle Consulting's HowMessy.
CONCLUSION
The two major results that were derived in this paper are:
If a dataset has fill factor (number of entries/capacity) F,
the expected fraction of primaries in that dataset is:
    P(F) = (1-e^(-F))/F                                               [8]
and the expected average number of I/Os needed to retrieve an entry
(if the dataset has blocking factor 1) is:
A(F)=1+F/2 [30]
EXPERIMENTAL EVIDENCE
The following is a table of the results that were expected and the
results that were obtained for several test datasets:
                        ---------Actual---------   --------Expected--------
--C--   --N--   --F--   %Primaries   Avg num IOs   %Primaries   Avg num IOs
  650     525   80.8%        67.2%         1.411        68.6%         1.404
  331     193   58.3%        75.1%         1.337        75.8%         1.291
  331     193   58.3%        75.6%         1.275        75.8%         1.291
15000   10582   70.5%        71.0%         1.387        71.8%         1.352
  331     118   35.6%        80.5%         1.203        84.1%         1.178
  250     206   82.4%        66.0%         1.456        68.1%         1.412
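The "Expected" columns in these tables follow directly from formulas [8]
and [30]; for instance, for the first row above (C=650, N=525), a quick
computation gives:

    # Expected %primaries and average I/Os from formulas [8] and [30].
    import math

    def expected(C, N):
        F = N / C
        pct_primaries = 100 * (1 - math.exp(-F)) / F   # formula [8]
        avg_ios = 1 + F / 2                            # formula [30]
        return round(pct_primaries, 1), round(avg_ios, 3)

    print(expected(650, 525))                          # (68.6, 1.404)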
Of the following, some datasets (those marked +) have primary
percentages that are significantly less than expected and average
numbers of I/Os that are significantly more than expected. The
datasets marked - are exact copies of those datasets with slightly
changed capacities -- their primary percentages and average numbers
of I/Os are quite close to those expected:
                          ---------Actual---------   --------Expected--------
  --C--   --N--   --F--   %Primaries   Avg num IOs   %Primaries   Avg num IOs
+   128     123   96.1%         7.3%        13.008        64.3%         1.480
+   127     123   96.8%        52.0%         1.667        64.0%         1.484
-   126     123   97.6%        70.7%         1.407        63.9%         1.488
+   320     284   88.9%        37.3%         2.810        66.3%         1.444
-   319     284   89.0%        65.5%         1.444        66.2%         1.445
+   400     158   39.5%        69.6%         1.456        82.6%         1.198
-   399     158   39.6%        86.1%         1.158        82.6%         1.198
These datasets are good examples of instances in which slightly
altering the dataset capacity (even decreasing it!) can significantly
improve performance.