HOW PROGRAMMING LANGUAGES DIFFER:
A CASE STUDY OF SPL, PASCAL, AND C
by Eugene Volokh, VESOFT
Presented at 1987 SCRUG Conference, Pasadena, CA
Presented at 1987 INTEREX Conference, Las Vegas, NV, USA
Published by The HP CHRONICLE, May 1987-May 1988.
ABSTRACT: The HP3000's wunderkind sets out to study Pascal, C and SPL
for the HP mini in a set of articles, using real-life examples and
plenty of tips on how to code for optimum efficiency in each language.
First in the series: ground rules for the comparison and a look at
control structures. (The HP CHRONICLE, May 1987)
INTRODUCTION
Programmers get passionate about programming languages. We spend
most of our time hacking code, exploiting the language's features,
being bitten by its silly restrictions. There are dozens of languages,
and each one has its fanatical adherents and its ardent detractors.
Some like APL, some like FORTH, LISP, C, PASCAL; some might even like
COBOL or FORTRAN, perish the thought.
In particular, a lot of fuss has recently arisen about SPL, PASCAL,
and C. All three of them are considered good "system programming"
(whatever that is) languages, and naturally people argue about which
one is the best.
HP's Spectrum project has come out in favor of PASCAL -- all new
MPE/XL code will be written in PASCAL, and HP won't even provide a
native mode SPL compiler. On the other hand, HP's also getting more
and more into UNIX, which is coded entirely in C. Especially between C
and PASCAL adherents there seems to be something like a "holy war"; it
becomes not just a matter of advantages and disadvantages, but of Good
and Evil, Right and Wrong. Strict type checking is Good, some say --
loose type checking is Evil; pointers are Wrong -- array indexing is
Right. The battle-lines are drawn and the knights are sharpening their
swords.
But, some ask -- what's the big deal? After all, it's an axiom of
computer science that all you need is an IF and a GOTO, and you can
code anything you like. Theoretically speaking, C, SPL, and PASCAL are
all equivalent; practically, is there that much of a difference?
In other words, is it just esthetics or prejudice that animates the
ardent fans of C, PASCAL, or SPL, or are there real, substantive
differences between the languages -- cases in which using one language
rather than another will make your life substantially easier? Are the
main differences between, say, C and PASCAL that PASCAL uses BEGIN and
END and C uses "{" and "}"? That C's assignment operator is "=" and
PASCAL's is ":="?
The goal of this paper is to answer just this question. I will try
to analyze each of the main areas where SPL, C, and PASCAL differ, and
point out those differences using actual programming examples. I'll
try not to emphasize vague, general statements, like "PASCAL does
strict type checking", or subjective opinions, like "C is too hard to
read"; rather, I want to use SPECIFIC EXAMPLES which can help make
clear the exact influence of strict or loose type checking on your
programming tasks.
RULES OF EVIDENCE
Saying that I'll "compare SPL, PASCAL, and C" isn't really saying a
whole lot. How will I compare them? What criteria will I use to
compare them? Will I compare how easy it is to read them or write
them? Will I compare what programming habits they instill in their
users? Which versions of these languages will I compare?
To do this, and to do this in as useful a fashion as possible, I
set myself some rules:
* I resolved to try to show the differences by use of examples,
preferably as real-life as possible. The emphasis here is on
CONCRETE SPECIFICS, not on general statements such as "C is less
readable" or "PASCAL is more restrictive".
* I decided not to go into questions of efficiency. Compiling a
certain construct using one implementation of a compiler may
generate fast code, whereas a different implementation may
generate slow code. Sure, the FOR loop in PASCAL/3000 may be less
efficient than in SPL or in CCS's C/3000, but who knows how fast
it'll be under PASCAL/XL?
For this reason, I don't wax too poetic about the efficiency
advantages of features such as C's "X++" (which increments X by
1) -- a modern optimizing compiler is quite likely to generate
equally fast code for "X:=X+1", automatically seeing that it's a
simple increment-by-one (even the 15-year-old SPL/3000 compiler
does this).
The only times when I'll mention efficiency is when some feature
is INHERENTLY more or less efficient than another (at least on a
conventional machine architecture); for instance, passing a large
array BY VALUE will almost certainly be slower than passing it BY
REFERENCE, since by-value passing would require copying all the
array data.
Even in these cases, I try to play down performance
considerations; if you're concerned about speed (as well you
should be), do your own performance measurements for the features
and compiler implementations that you know you care about.
* I resolved -- for space reasons if for no other -- not to be a
textbook for SPL, PASCAL, or C. Some of the things I say apply
equally well to almost all programming languages, and I hope that
they will be understandable even to people who've never seen SPL,
PASCAL, or C.
For other things, I rely on the relative readability of the
languages and their similarity to one another. I hope that if you
know any one of SPL, PASCAL, or C, you should be able to
understand the examples written in the other languages.
However, it may be wise for you to have manuals for these three
languages -- either their HP3000 implementations or general
standards -- at hand, in case some of the examples should prove
too arcane.
* As you can tell by the size of this paper, I also decided to be
as thorough as practical in my comparisons, and ESPECIALLY in the
evidence backing up my comparisons.
One of the main reasons I wrote this paper is that I hadn't seen
much OBJECTIVE discussion comparing C and PASCAL; I wanted not
just to present my conclusions -- which might as easily be based
on prejudice as on fact -- but also the reasons why I arrived at
them, so that you could decide for yourself.
So as not to burden you with having to read all 200-odd pages,
though, I've summarized my conclusions in the "SUMMARY" chapter.
You might want to have a look there first, and then perhaps go
back to the main body of the paper to see the supporting evidence
of the points I made.
WHAT ARE C AND PASCAL, ANYWAY?
If you think about it, SPL is a very unusual language indeed. To
the best of my knowledge, there is exactly one SPL compiler available
anywhere, on any computer (eventually, the independent SPLash! may be
available on Spectrum, but that is another story). I can say "SPL
supports this" or "SPL can't do that" and, excepting differences
between one chronological version of SPL and the next, be absolutely
precise and objectively verifiable. SPL can be said to "support"
something only because there is only one SPL compiler that we're
talking about.
To say "PASCAL can do X" is a chancy proposition indeed. ANSI
Standard PASCAL doesn't support variable-length strings, but most
modern PASCAL implementations, including HP PASCAL, have some sort of
string mechanism. What about HP's new PASCAL/XL, reputed to be even
more powerful still? Similarly, with C, there are the old "Kernighan &
Ritchie" C, the proposed new ANSI standard C, whatever it is that HP
uses on the Spectrum, AND whatever you use on the 3000, which might be
CCS's C compiler or Tymlabs' C.
On the one hand, I contemplated comparing standard C and standard
PASCAL. This is easier for me, and it also makes sense from a
portability point of view (if you want it to be portable, you're
better off using the standard, anyway).
On the other hand, portability is fine and dandy, but most people
aren't going to be porting their software any further than from an
MPE/XL machine to an MPE/V machine and back. As long as you stick to
HP3000s, you have the full power of so-called "HP PASCAL", an extended
superset of PASCAL that's supported on 3000s, 1000s, 9000s, and the
rest; it's hardly fair (or practical) to ignore this in a comparison.
Finally, what about PASCAL/XL? It'll have even more useful
features, but they may not be ported back to the MPE/V machines, at
least for a while. Should I then compare PASCAL/XL and C/XL -- a
representative contest for the XL machines, but not necessarily for
MPE V machines, and certainly not if you really want to port your
software onto other machines?
This is all, incidentally, aggravated by the fact that HP's
extensions to PASCAL are more substantial than its extensions to C;
thus, comparing the "standards" is likely to put PASCAL in a
relatively worse light than comparing "supersets" (not to say that
PASCAL is worse than C in either case).
Faced with all this, I've decided to compare everything with
everything else. There are actually 7 different language versions I
discuss at one time or another:
* SPL.
There's only one, thank God.
* Standard PASCAL.
This is the original ANSI Standard, on which all other PASCALs
are based. This is also very similar to Level 0 ISO Standard
PASCAL (see next item).
* Level 1 ISO Standard PASCAL.
This standard, put out in the early 1980's, supports so-called
CONFORMANT ARRAY parameters (see the DATA STRUCTURES chapter).
The same standard document defined "Level 0 ISO Standard PASCAL"
to be much like classic "Standard PASCAL", i.e. without
conformant arrays. Compiler writers were given the choice of
which one to implement, and it isn't obvious how popular Level 1
ISO Standard will be. When I say "Standard PASCAL", I mean the
original standard, which is almost identical to the ISO Level 0
Standard.
* PASCAL/3000.
This is HP's implementation of PASCAL on the pre-Spectrum HP3000.
Although the Spectrum machines will also be called 3000's, when I
say PASCAL/3000 I mean the pre-Spectrum version. PASCAL/3000 is
itself a superset of HP Pascal, which is also implemented by HP
on HP 1000s and HP 9000s. PASCAL/3000 is a superset of the
original Standard PASCAL, not the ISO Level 1 Standard.
* PASCAL/XL.
This is HP's implementation of PASCAL on the Spectrum. It's
essentially a superset of both PASCAL/3000 and the ISO Level 1
Standard.
* Kernighan & Ritchie (K&R) C.
This is the C described by Brian Kernighan and Dennis Ritchie in
their now-classic book "The C Programming Language" (which, in
fact, is usually called "Kernighan and Ritchie"). Although never
an official standard, it is quite representative of most modern
C's. In fact, for practical purposes, it can be said that a
program written in K & R C is portable to virtually any C
implementation (assuming you avoid those things that K&R itself
describes as implementation-dependent).
* Draft ANSI Standard C.
ANSI is now working on codifying a standard of C, which will have
some (but not very many) improvements over K&R. My reference for
this was Harbison & Steele's book "C: A Reference Manual", which
also discusses various other implementations of C. Although Draft
ANSI Standard C is Standard, it is also Draft. Some of the
features described in it are implemented virtually nowhere, and
it's not clear how much of them C/XL will include.
Matters are further complicated, of course, by the lack of an
HP-provided C compiler on the pre-Spectrum HP3000. The compiler I used
to research this paper is CCS Inc.'s C/3000 compiler, which is a
super-set of K&R C and a subset of Draft ANSI Standard C. The most
conspicuous Draft Standard feature that CCS C/3000 lacks is Function
Prototypes -- an understandable lack since virtually all other C
compilers don't have them, either.
Whenever any difference exists between any of the PASCAL or C
versions, I try to point it out. Which versions you compare is up to
you:
* You can compare Standard PASCAL and K&R C.
If it isn't in these general standards that everybody implements,
you're unlikely to get much portability.
* You can compare PASCAL/XL and Draft ANSI Standard C.
These are the compilers that will most likely be available on the
Spectrum.
* You can compare PASCAL/3000 and Draft ANSI Standard or K&R C.
Even though you might not usually care about porting to, say, an
IBM or a VAX, you may very seriously care about porting from the
pre-Spectrum to the Spectrum and vice versa. HP hasn't promised
to port PASCAL/XL back to the pre-Spectrums, so PASCAL/3000 is
probably the lowest common denominator.
SPL is nice. At least until SPLash!'s promised Native Mode SPL
compiler comes out, there's only one SPL compiler to compare with.
This makes me very happy.
ARE C, PASCAL, AND SPL
FUNDAMENTALLY DIFFERENT OR
FUNDAMENTALLY ALIKE?
In my opinion, they are definitely FUNDAMENTALLY ALIKE. In the rest
of the paper, I'll tell you all about their differences, but those are
EXCEPTIONS in their fundamental similarity.
Why do I think so? Well, virtually every important construct in
any of the three languages has an almost exact parallel in the
other two (the only exception being, perhaps, record structures, which
SPL doesn't have).
* All three languages emphasize writing your program as a set of
re-usable, parameterized procedures or functions (which, for
instance, COBOL 74 and most BASICs do not);
* All three languages share virtually the same rich set of control
structures (which neither FORTRAN/IV nor BASIC/3000 possesses).
* The languages may on the surface LOOK somewhat different (PASCAL
and C certainly do), but remember that the ESSENCE is virtually
identical -- PASCAL may say "BEGIN" and "END" where C says "{"
and "}", but that's hardly a SUBSTANTIVE difference.
Despite all the differences which I'll spend all these pages
describing -- and I think many of the differences are indeed very
important ones -- I still think that SPL, PASCAL, and C are about as
close to each other as languages get.
SO, WHICH IS BETTER -- C, PASCAL, OR SPL?
You think I'm going to answer that? With all my pretensions to
objectivity, and dozens of angry language fanatics ready to berate me
for choosing the "wrong one"?
The main purpose of this paper is to show you all the differences
and let you decide for yourselves; after all, there are so many
parameters (how portable do you want the code to be? how much do you
care about error checking?) that are involved in this sort of
decision.
The closest I come to actually saying which is better is in the
"SUMMARY" chapter (at the very end of the paper); there I explain what
I think the major drawbacks and advantages of each language are. Look
there, but remember -- only you can decide which language is best for
your purposes.
TECHNICAL NOTE ABOUT C EXAMPLES
In case you didn't know, C differentiates between upper- and
lower-case. The variables "file" and "FILE" are quite different, as
are "file", "File", and "fILE". (In SPL and PASCAL, of course, case
differences are irrelevant; all of the just-given names would refer to
the same variable.)
In fact, in C programs the majority of all objects -- reserved
words, procedure names, variables, etc. -- are lower-case. The
reserved words ("if", "while", "for", "int", etc.) are required to be
lower-case by the standard; theoretically, you can name all your
variables and procedures in upper-case, but most C programmers use
lower-case for them, too (although they can sometimes use upper-case
variable names as well, perhaps to indicate their own defined types or
#define macros).
This is why all the examples of C programs in this paper are
written in lower-case. The one exception to this is when I refer to a
C object -- a variable, a procedure, or a reserved word -- within the
text of a paragraph. Then, I'll often capitalize it to set it off from
the rest of the paper, to wit:
proc (i, j)
int i, j;
{
if (i == j)
...
}
The procedure PROC takes two parameters, I, and J.
The IF statement checks whether they're equal, ....
The fact that I refer to them in upper-case in the text doesn't
mean that you should actually use upper-case names. I just do it to
make the text more readable.
Another example of how a little lie can help reveal the greater
truth...
ACKNOWLEDGMENTS
I'd like to thank the following people for their great help in the
writing of this paper:
* CCS, Inc., authors of CCS C/3000, a C compiler for pre-Spectrum
HP3000s. All the research and testing of the C examples given in
this paper was done using their excellent compiler. In
particular, I'd also like to thank Tim Chase, who gave me a great
deal of help on some of the details of the C language.
* Steve Hoogheem of the HP Migration Center, who served as liaison
between me and the PASCAL/XL lab in answering my questions about
PASCAL/XL.
* Mr. Tom Plum (of Plum Hall, Cardiff, NJ), a recognized C expert
and member of the Draft ANSI Standard C committee, who was kind
enough to answer many of the questions that I had about the Draft
Standard.
* Dennis Mitrzyk, of Hewlett-Packard, who helped me obtain much of
my PASCAL/XL information, and who was also kind enough to review
this paper.
* Joseph Brothers, David Greer (of Robelle), Dave Lange and Roger
Morsch (of State Farm Insurance), and Mark Wallace (of Robinson,
Wallace, and Company), all of whom reviewed the paper and
provided a lot of useful input and corrections.
CONTROL STRUCTURES
GOTOs, some say, are Considered Harmful. Perhaps they are and
perhaps they are not. But the major reason for the control structures
that PASCAL and C provide (as opposed to, say, FORTRAN IV, which
doesn't) is not that they replace GOTOs, but rather that they replace
them with something more convenient. If given the choice between
saying
IF FNUM = 0 THEN
PRINTERROR
ELSE
BEGIN
READFILE;
FCLOSE (FNUM, 0, 0);
END;
and
IF FNUM <> 0 THEN GOTO 10;
PRINTERROR;
GOTO 20;
10:
READFILE;
FCLOSE (FNUM, 0, 0);
20:
then I would choose the former. IF-THEN-ELSE is a common construct in
all of the algorithms we write, and it's easier for both the writer
and the reader to have a language construct that directly corresponds
to it.
C and PASCAL share some of the fundamental control structures. Both
have
* IF-THEN-ELSEs. They look slightly different:
IF FNUM=0 THEN { PASCAL }
PRINTERROR
ELSE
BEGIN
READFILE;
FCLOSE (FNUM, 0, 0);
END;
and
if (fnum==0) /* C */
printerror; /* note the semicolon */
else
{
readfile;
fclose (fnum, 0, 0);
}
but I hardly think the difference very substantial. There'll be
some who forever curse C for using lower-case or PASCAL for using
such L-O-N-G reserved words, like "BEGIN" and "END"; I can live
with either.
* WHILE-DOs, although again there are some minor differences
WHILE GETREC (FNUM, RECORD) DO
PRINTREC (RECORD);
vs.
while (getrec (fnum, record))
printrec (record);
* DO-UNTILs:
REPEAT
GETREC (FNUM, RECORD);
PRINTREC (RECORD);
UNTIL
NOMORERECS (FNUM);
and
do
{
getrec (fnum, record);
printrec (record);
}
while
(!nomorerecs (fnum)); /* "!" means "NOT" */
Note that PASCAL has a DO-UNTIL and C has a DO-WHILE. Big
difference.
* And, finally, C's and PASCAL's procedure support is comparable,
as well.
The interesting things, of course, are the points at which C and
PASCAL differ. There are some, and for those of us who thought that
IF-THEN-ELSE and WHILE-DO are all the control structures we'll ever
need, the differences can be quite surprising.
THE "WHILE" LOOP AND ITS LIMITATIONS; THE "FOR" LOOP
It is, indeed, true that all iterative constructs can be emulated
with the WHILE-DO loop. On the other hand, why do the work if someone
else can do it for you?
The PASCAL FOR loop -- a child of FORTRAN's DO -- is actually not
that hard to emulate:
FOR I:=1 TO 9 DO
WRITELN (I);
is identical, of course, to
I:=1;
WHILE I<=9 DO
BEGIN
WRITELN (I);
I:=I+1;
END;
Not such a vast savings, but, still, the FOR loop definitely looks
nicer.
Unfortunately, for all the savings that the FOR loop gives you,
I've found that it's not as useful as one might, at first glance,
believe. This is because it ALWAYS loops through all the values from
the start to the limit. How often do you need to do that, rather than
loop until EITHER a limit is reached OR another condition is found?
String searching, for instance -- you want to loop until the index is
at the end of the string OR you've found what you're searching for.
Always looping until the end is wasteful and inconvenient.
Looking through my MPEX source code, incidentally, I find 53 WHILE
loops and 8 FOR loops. In my RL, the numbers are 170 WHILEs and 38
FORs (at least 6 of these FORs should have been WHILEs if I weren't so
lazy). (How's that for an argument -- I don't use it, ERGO it is
useless. I'm rather proud of it.) In any case, though, my experience
has been that
* THE PURE "FOR" LOOP -- A LOOP THAT ALWAYS GOES ON UNTIL THE LIMIT
HAS BEEN REACHED -- IS NOT AS COMMON AS ONE MIGHT THINK IN
BUSINESS AND SYSTEM PROGRAMS (although scientific and engineering
applications, which often handle matrices and such, use pure FOR
loops more often). MORE OFTEN YOU WANT TO ALSO SPECIFY AN "UNTIL"
CONDITION WHICH WILL ALSO TERMINATE THE LOOP.
What I wanted, then, was simple -- a loop that looked like
FOR I:=START TO END UNTIL CONDITION DO
For instance,
FOR I:=1 TO STRLEN(S) UNTIL S[I]=C DO;
or
FOR I:=1 TO STRLEN(S) WHILE S[I]=' ' DO;
What I got -- and I'm not sure if I'm sorry I asked or not -- is the C
FOR loop:
for (i=1; i<=strlen(s) && s[i]!=c; i=i+1)
;
The C FOR loop -- like most things in C, accomplished with a minimum
of letters and a maximum of special characters -- looks like this:
for (initcode; testcode; inccode)
statement;
It is functionally identical (assuming no CONTINUE in the body) to
initcode;
while (testcode)
{
statement;
inccode;
}
In other words, this is a sort of "build-your-own" FOR loop -- YOU
specify the initialization, the termination test, and the "STEP". This
is actually quite useful for loops that don't involve simple
incrementing, such as stepping through a linked list:
for (ptr=listhead; ptr!=NULL; ptr=ptr->next)
fondle (ptr);
The above loop, of course, fondles every element of the linked list,
something quite analogous to what an ordinary PASCAL FOR loop would
do, but with a different kind of "stepping" action.
The standard PASCAL loop, of course, can easily be emulated --
for (i=start; i<=limit; i=i+1)
statements;
I'm sure it would be fair to conclude that C's FOR loop is clearly
more powerful than PASCAL's. On the other hand, a WHILE loop is more
powerful than a FOR loop, too; and, a GOTO is the most powerful of
them all (heresy!). The reason a PASCAL FOR loop -- or for that
matter, a C FOR loop -- is good is because simply by looking at it,
you can clearly see that it is a WHILE loop of a particular kind, with
clearly evident starting, terminating, and stepping operations.
The major argument that may be made against C's for loop is simply
one of clarity. Possible reasons include:
* The loop variable has to be repeated four (or three, if you use
"i++" instead of "i=i+1") times.
* The semicolons, adequate to delimit the three clauses for the
compiler, may not sufficiently delimit them to a human reader --
it may not be instantly obvious where one clause starts and
another ends.
* Also, the very use of semicolons instead of control keywords
(like "TO") may be irritating; in a way, it's like having to
write
FOR I,1,100
instead of
FOR I:=1 TO 100
If you think the first version isn't any worse than the second,
you shouldn't mind C; some, however, find "FOR I,1,100" slightly
less clear than "FOR I:=1 TO 100".
Compare the two side by side:
for (i=1; i<=10; i=i+1) FOR I:=1 TO 10 DO
or, alternatively
for (i=1; i<=10; i++) FOR I:=1 TO 10 DO
Which do you prefer? Frankly, for me, the PASCAL version is somewhat
clearer, although I'm not prepared to say that the clarity is worth
the cost in power. On the other hand, many a C programmer doesn't see
any advantage in the PASCAL style, and perhaps there isn't any. Some
of the C/PASCAL differences, I'm afraid, boil down to simply this.
THE WHILE LOOP AND ITS LIMITATIONS -- AN INTERESTING PROBLEM
Consider the following simple task -- you want to read a file until
you get a record whose first character is a "*"; for each record you
read, you want to execute some statements. Your PASCAL program might
look like this:
READLN (F, REC);
WHILE REC[1]<>'*' DO
BEGIN
PROCESS_RECORD_A (REC);
PROCESS_RECORD_B (REC);
PROCESS_RECORD_C (REC);
READLN (F, REC);
END;
All well and good? But, wait a minute -- we had to repeat the READLN
statement a second time at the end of the WHILE loop. "Lazy bum," you
might reply. "Can't handle typing an extra line." Well, what if, in
order to get the record, we had to do more than just a READLN? We
might need to, say, call FCONTROL before doing the READLN, and perhaps
have a more complicated loop test. Our program might end up looking
like:
FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
FCONTROL (FNUM(F), SET_TIMOUT, TIMEOUT);
READLN (F, REC);
GETFIELD (REC, 3, FIELD3);
WHILE FIELD3<>'*' DO
BEGIN
PROCESS_RECORD_A (REC);
PROCESS_RECORD_B (REC);
PROCESS_RECORD_C (REC);
FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
FCONTROL (FNUM(F), SET_TIMOUT, TIMEOUT);
READLN (F, REC);
GETFIELD (REC, 3, FIELD3);
END;
This is not a happy-looking program. We had to duplicate a good chunk
of code, with all the resultant perils of such a duplication; the code
was harder to write, it's now harder to read, and when we maintain it,
we're liable to change one of the occurrences of the code and not the
other.
Workarounds, of course, exist. We can say
REPEAT
FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
FCONTROL (FNUM(F), SET_TIMOUT, TIMEOUT);
READLN (F, REC);
GETFIELD (REC, 3, FIELD3);
IF FIELD3 <> '*' THEN
BEGIN
PROCESS_RECORD_A (REC);
PROCESS_RECORD_B (REC);
PROCESS_RECORD_C (REC);
END;
UNTIL
FIELD3 = '*';
although this is also rather messy -- we've had to repeat the loop
termination condition, and the resulting code is really a WHILE-DO
loop masquerading as a REPEAT-UNTIL.
Some might reply that what we ought to do is to move the FCONTROLs,
READLN, and GETFIELD into a separate function that returns just the
value of FIELD3, or perhaps even the loop test (FIELD3 <> '*'). Then,
the loop would look like:
WHILE FCONTROLS_READLN_AND_GETFIELD_CHECK_STAR (FNUM, REC) DO
BEGIN
PROCESS_RECORD_A (REC);
PROCESS_RECORD_B (REC);
PROCESS_RECORD_C (REC);
END;
This, indeed, does look nice -- but are we to be expected to create a
new procedure every time a control structure doesn't work like we want
it to? I like procedures just as much as the next man; in fact, I'm a
lot more prone to pull code out into procedures than others are (I
like my procedures to be twenty lines or shorter). On the other hand,
what if someone said that you couldn't use BEGIN .. END in
IF/THEN/ELSE statements -- if you want to do more than one thing in
the THEN clause, you have to write a procedure?
C has some advantage here. With C's "comma" operator, you can
combine any number of statements (with some restrictions) into a
single expression, whose result is the last value. Thus, what you can
do is something like this:
while ((fcontrol (fnum(f), extended_read, dummy),
fcontrol (fnum(f), set_timeout, timeout),
gets (f, rec, 80),
getfield (rec, 3, field3),
field3 != '*'))
{
process_record_a (rec);
process_record_b (rec);
process_record_c (rec);
};
Whether this is better or not, you decide. The "comma" construct can
be very confusing. In "while ((...)) do", the outside parentheses are
the WHILE's; the inner pair is the comma constructs'; and all others
belong to internal expressions and function calls. Additionally, you
have to keep track of which commas belong to the function calls and
which delimit the comma constructs' elements.
What is that slithering underfoot? Could it be the serpent? He
proposes this:
WHILE TRUE DO
BEGIN
FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
FCONTROL (FNUM(F), SET_TIMOUT, TIMEOUT);
READLN (F, REC);
GETFIELD (REC, 3, FIELD3);
IF FIELD3='*' THEN GOTO 99;
PROCESS_RECORD_A (REC);
PROCESS_RECORD_B (REC);
PROCESS_RECORD_C (REC);
END;
99:
"Sssimple and ssstraightforward, madam. Won't you have a bite?" Shame
on you! Still, it's not obvious that the old faithful "GOTO" isn't,
relatively speaking, a reasonable solution. C has its own variant,
that lets us get away without using the "forbidden word":
while (TRUE)
{
fcontrol (fnum(f), extended_read, dummy);
fcontrol (fnum(f), set_timeout, timeout);
gets (f, rec, 80);
getfield (rec, 3, field3);
if (field3 == '*') break;
process_record_a (rec);
process_record_b (rec);
process_record_c (rec);
};
C's "BREAK" construct gets you out of the construct that you're
immediately in, be it a WHILE loop (as in this case), a SWITCH
statement (in which it is vital), a FOR, or a DO. If you believe in
the evil of GOTOs, you probably won't much like BREAKs; again, though,
I ask -- is the above example any less muddled than the other ones I
showed?
Incidentally, the best approach that I've seen so far comes from a
certain awful, barbarian language called FORTH (OK, all you FORTHies
-- meet me in the alley after the talk and we can have it out).
Translated into civilized terms, the loop looked something like this:
DO
FCONTROL (FNUM(F), EXTENDED_READ, DUMMY);
FCONTROL (FNUM(F), SET_TIMOUT, TIMEOUT);
READLN (F, REC);
GETFIELD (REC, 3, FIELD3);
WHILE FIELD3<>'*'
PROCESS_RECORD_A (REC);
PROCESS_RECORD_B (REC);
PROCESS_RECORD_C (REC);
ENDDO;
This so-called "loop-and-a-half" solves what I think is the key
problem, present in so many WHILE loops -- that the condition often
takes more than a single expression to calculate. Well, in any case,
neither SPL, PASCAL, nor C has such a construct, so that's that.
BREAK, CONTINUE, AND RETURN -- PERFECTION OR PERVERSION?
As I mentioned briefly in the last section, C has three control
structures that PASCAL does not, and some say should not. These
structures, Comrade, are Ideologically Suspect. A Dangerous Heresy.
Still, they're there, and ought to be briefly discussed.
* BREAK -- exits the immediately enclosing loop (WHILE, DO, or FOR)
or a SWITCH statement. Essentially a GOTO to the statement
immediately following the loop.
* CONTINUE -- goes to the "next iteration" of the immediately
enclosing loop (WHILE, DO, or FOR).
* RETURN -- exits the current procedure. "RETURN <expression>"
exits the current procedure, returning the value of <expression>
as the procedure's result.
* Of course, GOTO, the old faithful.
Now, as you may or may not recall, a while ago there was much
argument made against GOTOs. Instead of GOTOs, it was said, you ought
to use only IF-THEN-ELSEs and WHILE-DOs. CASEs, FORs, and
REPEAT-UNTILs, being just variants of the other control structures,
were all right; but GOTOs were condemned, on several very good
grounds:
* First of all, with GOTOs, the "shape" of a procedure stops being
evident. If you don't use GOTOs, each procedure and block of code
will have only one ENTRY and only one EXIT. This means that you
can always assume that control will always flow from the
beginning to the end, with iterations and departures that are
always clearly defined and the conditions for which are always
evident.
* If you avoid GOTOs, then for any statement, you can tell under
what conditions it will be executed just by looking at the
control structures within which it is enclosed.
These concerns, I would say, may apply equally well to BREAKs,
CONTINUEs, and RETURNs.
Personally, I must confess, I don't use GOTOs. I don't know if it
is the appeal of reason, the lesson of experience, or fear for my
immortal soul. About five years ago I resolved to stop using them;
except for "long jumps" (which I'll talk about more later), I use
GOTOs in 1 procedure of MPEX's 40 procedures, and in 2 procedures of
my RL's 350 (both of the uses of "GOTO" are as "RETURN" statements).
However, I must say that in many cases the temptation does seem great.
Consider, for a moment, the following case. We need to write a
procedure that opens a file, reads some records, writes some records,
and closes the file. In case any of the file operations fails, we
should immediately close the file and not do anything else. The
"GOTO-less" solution:
munch_file (f)
char f[40];
{
int fnum;
fnum = fopen (f, 1);
if (error == 0) /* let's say ERROR is an error code */
{
freaddir (fnum, buffer, 128, rec_a);
if (error == 0)
{
munch_record_one_way (buffer);
fwritedir (fnum, buffer, 128, rec_a);
if (error == 0)
{
freaddir (fnum, buffer, 128, rec_b);
if (error == 0)
{
munch_record_another_way (buffer);
fwritedir (fnum, buffer, 128, rec_b);
if (error == 0)
some_more_stuff;
}
}
}
}
fclose (fnum, 0, 0);
}
Or, using "GOTO":
munch_file (f)
char f[40];
{
int fnum;
#define check_error if (error != 0) goto done
fnum = fopen (f, 1);
if (error == 0)
{
freaddir (fnum, buffer, 128, rec_a);
check_error;
munch_record_one_way (buffer);
fwritedir (fnum, buffer, 128, rec_a);
check_error;
freaddir (fnum, buffer, 128, rec_b);
check_error;
munch_record_another_way (buffer);
fwritedir (fnum, buffer, 128, rec_b);
check_error;
some_more_stuff;
}
done:
fclose (fnum, 0, 0);
}
Is the latter way really worse? I'm not so sure. Nor can I see any way
to rewrite this example without GOTOs that isn't at least as
cumbersome as the first version.
Similar examples can be found for BREAK and RETURN. If, for
instance, I wasn't required to close the file, I'd just do a RETURN
instead of doing the "GOTO DONE"; if I had to loop through the file,
my code might look something like:
framastatify (f)
char f[40];
{
int fnum;
fnum = fopen (f, 1);
if (error == 0)
{
while (TRUE)
{
fread (fnum, rec1, 128);
if (error != 0) break;
if (frob_a (rec1) == failed)
break;
fupdate (fnum, rec1, 128);
if (error != 0) break;
freadlabel (fnum, rec1, 128, 0);
if (error != 0) break;
if (twiddle_label (rec1) == failed)
break;
fwritelabel (fnum, rec1, 128, 0);
if (error != 0) break;
fspace (fnum, 20);
if (error != 0) break;
}
fclose (fnum, 0, 0);
}
}
Just IMAGINE all those IFs you'd need to nest if you avoided BREAK!
CONTINUE, on the other hand, is a vile heresy. Everybody who uses
CONTINUE should be burned at the stake.
To summarize, "C Notes, A Guide to the C Programming Language" by
C.T. Zahn (Yourdon 1979) says:
"In practice, BREAK is needed rarely, CONTINUE never, and GOTO even
less often than that... It also is good style to minimize the
number of RETURN statements; exactly one at the end of the
function is best of all for readability."
On the other hand, I say
"If this be treason, make the most of it!"
Especially if your procedures are short enough and otherwise
well-written enough, I think that you can well make the judgment that
even with the introduction of GOTOs, the control flow will still be
clear enough.
Just don't tell anyone I told you to do it.
LONG JUMPS -- PROBLEM AND SOLUTION
Modern structured programming encourages FACTORING. Your algorithm,
it says, should be broken up into small procedures, small enough that
each one can be easily understood and digested by anybody reading it.
I'm quite fond of factoring myself, and you'll find most of my
procedures to be about 20-odd lines long or shorter. I try to make
each procedure a "black box", with a well-defined, atomic function and
no unobvious side effects. Naturally, with procedures this small, I
often end up going several levels of procedure calls deep to do a
relatively simple task.
For instance, I might have a procedure called ALTFILE that takes a
file name and a string of keywords indicating how the file is to be
altered:
* ALTFILE calls PARSE_KEYWORDS to parse the keyword string;
* PARSE_KEYWORDS separates the string into individual keywords,
calling PROCESS_KEYWORD for each one;
* PROCESS_KEYWORD figures out what keyword is being referenced, and
calls a parsing routine -- PARSE_INTEGER, PARSE_DATE,
PARSE_INT_ARRAY, etc. -- depending on the type of the value the
user specified;
* PARSE_INT_ARRAY takes a list of integer values delimited by, say
":"s, and calls PARSE_INTEGER for each one.
* PARSE_INTEGER converts the text string containing an integer
value into a number and returns the numeric value.
Not a far-fetched example, you must agree; in fact, many of my
programs (e.g. MPEX's %ALTFILE parser) nest even deeper. Now, the
question arises -- what if PARSE_INTEGER realizes that the value the
user specified isn't a valid number after all?
The solution seems clear -- PARSE_INTEGER, in addition to returning
the integer's value, also returns a true/false flag indicating whether
or not the value was actually valid. PARSE_INTEGER returns this to
PARSE_INT_ARRAY; now, PARSE_INT_ARRAY realizes that its parameter
isn't a valid integer array -- it must also return a success/failure
flag to PROCESS_KEYWORD; PROCESS_KEYWORD must pass it back up to
PARSE_KEYWORDS; PARSE_KEYWORDS should return it to ALTFILE; finally,
ALTFILE informs its caller that the operation failed.
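In C, this "bucket brigade" of success/failure flags might look like
the following minimal sketch (written in ANSI-style C for brevity; the
one-digit-at-a-time parser and these exact signatures are hypothetical,
purely for illustration): each level returns 1 or 0, and each caller
must dutifully test and relay the verdict.

```c
#include <string.h>

/* A hypothetical sketch of the flag "bucket brigade": every level
   returns 1 for success, 0 for failure, and every caller must test
   the flag and relay it upward. */
static int parse_integer(const char *str, int *result)
{
    int value = 0;
    if (*str == '\0')
        return 0;                      /* empty string: not a number */
    for (; *str != '\0'; str++) {
        if (*str < '0' || *str > '9')
            return 0;                  /* bad digit: report failure */
        value = value * 10 + (*str - '0');
    }
    *result = value;
    return 1;
}

static int parse_int_array(char *list, int *values, int *count)
{
    char *item = strtok(list, ":");    /* ":"-delimited, as in the text */
    *count = 0;
    while (item != NULL) {
        if (!parse_integer(item, &values[*count]))
            return 0;                  /* pass the failure up a level */
        (*count)++;
        item = strtok(NULL, ":");
    }
    return 1;
}
```

PROCESS_KEYWORD and everything above it would have to repeat the same
"if it failed, return failure" dance at every call, which is exactly
the tedium a long jump eliminates.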
Let's look at a particular specimen of one of these procedures;
say, the portion that handles the keyword FOOBAR, which the user
should specify in conjunction with an integer array, a string, and two
dates:
...
IF KEYWORD="FOOBAR" THEN
BEGIN
GET_SUBPARM (0, PARM_STRING);
IF PARSE_INT_ARRAY (PARM_STRING, SP0_VALUE) = FALSE THEN
PARSE_KEYWORD:=FALSE
ELSE
BEGIN
GET_SUBPARM (1, PARM_STRING);
IF PARSE_STRING (PARM_STRING, SP1_VALUE) = FALSE THEN
PARSE_KEYWORD:=FALSE
ELSE
BEGIN
GET_SUBPARM (2, PARM_STRING);
IF PARSE_DATE (PARM_STRING, SP2_VALUE) = FALSE THEN
PARSE_KEYWORD:=FALSE
ELSE
BEGIN
GET_SUBPARM (3, PARM_STRING);
PARSE_KEYWORD:=
PARSE_DATE (PARM_STRING, SP3_VALUE);
END;
END;
END;
END;
...
Of course, the same sort of thing has to be repeated in every
procedure in the calling sequence; the moment an error return is
detected from one of the called procedures, the other calls have to be
skipped, and the error condition should be passed back up to the
caller.
Error handling, of course, is important business, and it would
hardly be appropriate to crash and burn just because the user inputs a
bad value (users input bad values all the time). Still, all this work
just to catch the error condition?
What we really want to do in this case is to
* HAVE WHOEVER DETECTS THE ERROR CONDITION AUTOMATICALLY RETURN ALL
THE WAY TO THE TOP OF THE CALLING SEQUENCE.
In other words, the error finder might have code that looks like:
NUM:=BINARY (STR, LEN);
IF CCODE<>CCE THEN { an error detected? }
SIGNAL_ERROR; { return to the top! }
The procedure we want to return to would indicate its desire to catch
these errors by saying something like:
ON ERROR DO
{ the code to be activated when the error is detected };
RESULT:=ALTFILE (FILE, KEYWORDS);
Finally, the intermediate procedures can now be the soul of
simplicity:
...
IF KEYWORD="FOOBAR" THEN
BEGIN
GET_SUBPARM (0, PARM_STRING);
PARSE_INT_ARRAY (PARM_STRING, SP0_VALUE);
GET_SUBPARM (1, PARM_STRING);
PARSE_STRING (PARM_STRING, SP1_VALUE);
GET_SUBPARM (2, PARM_STRING);
PARSE_DATE (PARM_STRING, SP2_VALUE);
GET_SUBPARM (3, PARM_STRING);
PARSE_DATE (PARM_STRING, SP3_VALUE);
END;
...
Thus, the three components of this scheme:
* The code that finds the error -- it "SIGNALS THE ERROR";
* The code that should be branched to in case of error is somehow
indicated, at compile time or run time (but before the error is
actually signaled).
* Finally, the intermediate code knows nothing about the possible
error condition. It's automatically exited by the error signaling
mechanism.
For want of a better name, I'll call this concept a "Long Jump". It's
also been called a "non-local GOTO", a "throw", a "signal raise", and
other unsavory things, but "Long Jump" -- which happens to be the C
name for it -- sounds more romantic.
LONG JUMPS, CONTINUED -- SOLUTIONS AND PROBLEMS
I've indicated the need -- or at least, I think it's a need -- and
a possible prototype solution. There are several implementations of
this already extant, each with its own little quirks and problems.
PASCAL -- STANDARD AND /3000
The only mechanism Standard PASCAL and PASCAL/3000 give you to
solve our problem is the GOTO. In PASCAL, you're allowed to GOTO out
of a procedure or function; however, you can only branch INTO the
main body of the program or from a nested procedure into the procedure
that contains it. In other words, if you have
PROCEDURE P;
PROCEDURE INSIDE_P; { nested in P }
BEGIN
...
END;
BEGIN
...
END;
PROCEDURE Q;
BEGIN
...
P;
...
END;
then you can branch from INSIDE_P into P, but you can't branch from P
into Q, even though Q calls P.
Even if this restriction weren't present, the GOTO to a fixed label
still wouldn't be the right answer -- what if our PARSE_KEYWORDS
procedure is called from two places? Surely we wouldn't want an error
condition to cause a branch to the same location in both cases!
Besides, if we want to compile PARSE_KEYWORDS separately from its
caller, we'd have to allow "global label variables". In reality,
PASCAL can't do these "long jumps".
SPL
SPL has a different and rather better facility. In SPL, you can't
branch from one procedure into another; however, you CAN pass a label
as a parameter to a procedure. Thus, you could write:
PROCEDURE PARSE'INT'ARRAY (PARM, RESULT, ERR'LABEL);
BYTE ARRAY PARM;
INTEGER ARRAY RESULT;
LABEL ERR'LABEL;
BEGIN
...
IF << test for error condition >> THEN
GOTO ERR'LABEL;
...
END;
Then, you might call this from within PROCESS'KEYWORD by saying
PROCEDURE PROCESS'KEYWORD (KEYWORD'AND'PARM, ERR'LABEL);
BYTE ARRAY KEYWORD'AND'PARM;
LABEL ERR'LABEL;
BEGIN
...
IF KEYWORD="FOOBAR" THEN
BEGIN
GET'SUBPARM (0, PARM'STRING);
PARSE'INT'ARRAY (PARM'STRING, SP0'VALUE, ERR'LABEL);
...
END;
...
END;
When you call PARSE'INT'ARRAY, you pass it the label to which it
should return in case of error -- in this case, also called ERR'LABEL,
which was also passed to this procedure. Finally, at the top of the
calling sequence, ALTFILE's caller might say:
RESULT:=ALTFILE (FILENAME, KEYWORDS, GOT'ERROR);
...
GOT'ERROR:
<< report to the user that an error occurred >>
The key point here is that each procedure doesn't really return
directly to the top; rather, it returns to the error label that it was
passed by its caller. Since that may well be the label passed by the
caller's caller, and so on, you get a sort of "daisy chain" effect by
which you can easily exit ten levels of procedures in one GOTO
statement.
At this point, I think it's quite important to mention a SEVERE
PROBLEM of these "long jumps" that I think any implementation
mechanism has to be able to address:
* THE VERY ESSENCE OF A LONG JUMP IS THAT IT BYPASSES SEVERAL OF
THE PROCEDURES IN THE CALLING SEQUENCE. A PROCEDURE (say, our
PROCESS_KEYWORD) CALLS ANOTHER PROCEDURE, EXPECTING THE CALLEE TO
RETURN, BUT THE CALLEE NEVER DOES!
Imagine for a moment that PROCESS_KEYWORD opened a file, intending
to close it at the end of the operation; after the long jump branches
out of it, the file will remain open. Any other kind of cleanup --
resetting temporarily changed global variables, releasing acquired
resources -- that a procedure expects to do at the end might remain
undone because the procedure will be branched out of.
Similarly, what if a procedure EXPECTS another procedure that it
calls to detect an error condition? What is a fatal error under some
circumstances may be quite normal under others; for instance, say you
have a procedure that reads data from a file and signals an error if
the file couldn't be opened -- in some cases, you may expect the file
to be unopenable, and have a set of defaults you want to use instead.
By using the convenience of long jumps, you lose the certainty that
every procedure has complete control over its execution, and can be
sure that any procedure it calls will always return.
The advantage of SPL's approach is that you could call a procedure
passing to it any error label you want to. For instance,
PROCESS'KEYWORD might look like:
PROCEDURE PROCESS'KEYWORD (KEYWORD'AND'PARM, ERR'LABEL);
BYTE ARRAY KEYWORD'AND'PARM;
LABEL ERR'LABEL;
BEGIN
INTEGER FNUM;
FNUM:=FOPEN (KEY'INFO'FILE, 1);
...
IF KEYWORD="FOOBAR" THEN
BEGIN
GET'SUBPARM (0, PARM'STRING);
PARSE'INT'ARRAY (PARM'STRING, SP0'VALUE, CLOSE'FILE);
...
END;
...
RETURN; << if we finished normally, just return >>
CLOSE'FILE: << branch here in case of error >>
FCLOSE (FNUM, 0, 0);
GOTO ERR'LABEL;
END;
Because you have complete control over each branch, you don't HAVE to
pass the procedure you call the same error label that you were passed;
if you want to do some cleanup, you can just pass the label that does
the cleanup, and THEN returns to your own error label.
Thus, with SPL's label parameter system, you get the best of both
worlds:
* If you pass an "error label" to a procedure, the procedure may
choose to return normally or to return to the error label.
* Since you can pass the same label to a procedure you call as the
one that you yourself were passed, a single GOTO to that label
can conceivably exit any number of levels of procedures.
* On the other hand, if you want to do some cleanup in case of an
error, you can just pass a different label, one that points to
the cleanup code.
* Finally -- if you want to -- you can actually pass several labels
to a procedure, allowing it to return to a different one
depending on what error condition it finds. A bit extravagant for
my blood, but maybe I'm just too stodgy.
The only problems that this system has are:
* You have to pass the label to any procedure that might
conceivably want to participate in a long jump -- either the
procedure that initially detects the error or any one that wants
to pass it on. This may often mean that virtually every one of
your procedures will have to have this error label parameter. Not
a very unpleasant problem, but a bit of a bother nonetheless.
* Similarly, there are some procedures whose parameters you can't
dictate; for instance, control-Y trap procedures (ones in which a
long jump to the control-Y handling code may often be just the
thing you want to do). Other trap procedures (arithmetic,
library, and system) are just like this, too, as are those which
are themselves passed as "procedure parameters" to other
procedures and whose parameters are dictated by those other
procedures (got that?).
Besides these minor problems, though, SPL's long jump support is quite
reasonably done.
PROPOSED ANSI STANDARD C
C's "GOTO" doesn't allow any branch from one function to another;
neither does C provide label parameters like SPL does. Long jumps in C
are accomplished with a different mechanism, involving the SETJMP and
LONGJMP built-in procedures.
SETJMP is a procedure to which you pass a record structure (of the
predefined type "jmp_buf"). When you first call it, it saves all the
vital statistics of the program -- the current instruction pointer,
the current top-of-stack address, etc. -- in this record structure.
Then, when the same record structure is passed to LONGJMP, LONGJMP
uses this information to restore the instruction pointer and stack
pointer to be exactly what they were at SETJMP time. Thus, control is
passed back to the SETJMP location, wherever it may be.
A typical application of this might be:
jmp_buf error_trapper;
proc()
{
...
if (setjmp(error_trapper) != 0)
/* do error processing */;
else
{
result = altfile (filename, keywords);
...
}
...
}
...
int parse_integer (str)
char str[];
{
...
if (bad_value)
longjmp (error_trapper, 1);
...
}
One thing I didn't explain at first, as you may have noticed, is the
"IF (SETJMP(ERROR_TRAPPER) != 0)". Since the LONGJMP jumps DIRECTLY
to the instruction following the SETJMP, we have to have some way of
distinguishing the first time it is executed (after a legitimate
SETJMP) and the next time (after the LONGJMP which transferred control
back to it). The initial SETJMP, you see, returns a 0; a LONGJMP takes
its second parameter (in this case, a 1), and returns it as the
"result" of SETJMP.
Thus, when the IF statement is first executed, the value of the
"(SETJMP ... != 0)" will be FALSE, and the ALTFILE will be done; when
the IF is executed a second time, the value will be TRUE, the error
processing will be performed.
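This 0-versus-nonzero protocol is easy to verify with a few
self-contained lines (the function names here are invented for the
demonstration; only the SETJMP/LONGJMP behavior is the point):

```c
#include <setjmp.h>

static jmp_buf trap;
static int log_vals[2];   /* record what setjmp "returned" each time */
static int log_n = 0;

static void fail(void)
{
    longjmp(trap, 7);     /* "return" 7 from the original setjmp */
}

static int demo(void)
{
    int rc = setjmp(trap);     /* 0 the first time through */
    log_vals[log_n++] = rc;
    if (rc == 0)
        fail();                /* control lands back at setjmp; rc == 7 */
    return rc;
}
```

The value handed to LONGJMP (here, 7) is precisely what SETJMP appears
to return the second time around, which is what makes the `!= 0` test
work.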
Note the distinctive features of the SETJMP/LONGJMP construct:
* The "jump buffer" -- set by SETJMP and used by LONGJMP -- need
not be passed as a parameter to each procedure that needs it
(although it could be). Typically, it's stored as a global
variable (which the SPL error label parameter couldn't be).
* You still have control over procedures you call; if you want to
trap their jump yourself (either to do some cleanup or treat it
as a normal condition), you can just do your own SETJMP using the
same buffer that they'll LONGJMP with.
* On the other hand, if you want to do some cleanup and then continue
the LONGJMP process -- propagate it back up to the original error
trapper, in this case PROC -- you have to do more work. You must
save the original jump buffer in a temporary variable before
doing the SETJMP, and restore it before continuing the LONGJMP
(or simply returning from the procedure). For instance,
PROCESS_KEYWORD might look like this:
process_keyword (keyword_and_parm)
char keyword_and_parm[];
{
jmp_buf save_error_trapper; /* our temporary save buffer */
int fnum;
fnum = fopen (key_info_file, 1);
memcpy (save_error_trapper, error_trapper, sizeof (jmp_buf));
if (setjmp (error_trapper) != 0)
/* Must be an error condition */
{
fclose (fnum, 0, 0);
memcpy (error_trapper, save_error_trapper, sizeof (jmp_buf));
longjmp (error_trapper, 1);
}
...
if (strcmp (keyword, "foobar") == 0)
{
get_subparm (0, parm_string);
parse_int_array (parm_string, sp0_value);
...
}
...
fclose (fnum, 0, 0);
memcpy (error_trapper, save_error_trapper, sizeof (jmp_buf)); /* restore */
}
Frankly speaking, if you ask me -- and even if you don't -- this
doesn't look very clean. I'd like to see some way of automatically
"stacking" SETJMPs so that the system would do the saving of the old
jump buffer for you; also, I'd prefer not to have to type that ugly
"IF (SETJMP ... != 0)" kludge. On the other hand, this can be made
quite palatable-looking with a few macros, and it's better than
nothing (or is it?).
PASCAL/XL AND THE TRY..RECOVER
The authors of PASCAL/XL -- perhaps because they were faced with
the non-trivial task of building a language that MPE/XL could be
profitably written in -- must have given this subject a great deal of
thought. And, fortunately, they've come up with what I think to be a
very powerful construct.
TRY
statement1;
statement2;
...
statementN;
RECOVER
recoverycode;
The behavior here is
* EXECUTE statement1 THROUGH statementN. IF ANY PASCAL ERROR (e.g.
giving a bad numeric value to a READLN) OR A CALL TO THE BUILT-IN
"ESCAPE" PROCEDURE OCCURS WITHIN THESE STATEMENTS, CONTROL IS
TRANSFERRED TO recoverycode, AND AFTER THAT TO THE STATEMENT
FOLLOWING TRY..RECOVER.
This, as you see, allows you to put a TRY..RECOVER into the
top-level procedure (in our case, PROC or ALTFILE) and an ESCAPE call
in any of the called procedures (e.g. PARSE_INTEGER) that detects a
fatal error.
The best part, though, is that any procedure that wants to
establish some sort of "cleanup" code can do this trivially! For
instance, our PROCESS_KEYWORD might say:
PROCEDURE PROCESS_KEYWORD (VAR KEYWORD_AND_PARM: STRING);
VAR FNUM: INTEGER;
SAVE_ESCAPECODE: INTEGER;
BEGIN
FNUM:=FOPEN (KEY_INFO_FILE, 1);
TRY
...
IF KEYWORD="FOOBAR" THEN
BEGIN
GET_SUBPARM (0, PARM_STRING);
PARSE_INT_ARRAY (PARM_STRING, SP0_VALUE);
END;
...
FCLOSE (FNUM, 0, 0);
RECOVER
BEGIN
SAVE_ESCAPECODE:=ESCAPECODE;
FCLOSE (FNUM, 0, 0);
ESCAPE (SAVE_ESCAPECODE);
END;
END;
If any error occurs in the code between TRY and RECOVER, the BEGIN/END
in the RECOVER part is triggered. This is now free to close the file,
or do whatever else it needs to, and then "pass the error down" by
calling ESCAPE again.
This ESCAPE -- since it's no longer between this TRY and RECOVER --
will activate the previously defined TRY/RECOVER block (say, in the
PARSE_KEYWORDS procedure) which might do more cleanup and then call
ESCAPE again. Eventually, the error will percolate to the top-most
TRY/RECOVER, which will just do some work and not call ESCAPE any
more, continuing with the rest of the program.
In other words, "TRY .. RECOVER"s can be nested. In the following
piece of code
TRY
A;
TRY
B;
TRY
C;
RECOVER
R1;
D;
RECOVER
R2;
E;
RECOVER
R3;
* An error or ESCAPE in C will cause a branch to R1.
* An error/ESCAPE in B or D will, of course, branch to R2 (since B
and D are outside the innermost TRY .. RECOVER R1). However, an
error/ESCAPE in R1 will also cause a branch to R2! That's because
R1 is also out of the area of effect of the innermost TRY ..
RECOVER.
In other words, the "recovery handler" R1 is only "established"
between the innermost TRY and the innermost RECOVER; when it's
actually "triggered", it's disestablished, and the recovery
handler that was previously in effect is re-established.
* By this token, an error/ESCAPE in A, E, or R2 will branch to R3.
* And, finally, an error in R3 -- or anywhere else outside of the
TRY .. RECOVER -- will actually abort the program with an error
message.
As you see, then, all is for the best in this best of all possible
worlds. We can do long jumps "up the stack" to the RECOVER code, but
each intervening procedure can also easily set up "cleanup code" that
needs to be executed before the long jump can continue.
Several notes:
* First of all, remember that the RECOVER statement is executed
ONLY in case of an error or an ESCAPE. If the statements between
TRY and RECOVER finish normally, any "cleanup" code you may have
inside the RECOVER will NOT be executed. That's why our sample
program has two FCLOSEs -- one for the normal case and one for
the cleanup case.
* Note also that the ESCAPE call can take a parameter (just like
C's LONGJMP). This parameter is then available as the variable
ESCAPECODE in the RECOVER handler, and is used to indicate what
kind of error or ESCAPE happened.
A RECOVER handler might, for instance, be used to avoid an abort
caused by an expected error condition (e.g. file I/O error);
however, if it sees that ESCAPECODE indicates some other,
unexpected, error condition, it might terminate or call ESCAPE
again, hoping that some "higher-level" RECOVER block can handle
the error.
* Finally, if a RECOVER block wants to continue the long jump after
doing its cleanup work, it often needs to pass the ESCAPECODE up
as well (unless, of course, the higher-level RECOVER handler
won't use the ESCAPECODE). Unfortunately, the PASCAL/XL manual
explicitly tells us:
- "It is wise to assign the result of the ESCAPECODE function
to a local variable immediately upon entering the RECOVER
part of a TRY-RECOVER construct, because the system can
change that value later in the RECOVER part."
This is too bad; it would have been nice to have TRY .. RECOVER
do this saving for you automatically, saving you the burden of
having to declare and set an extra local variable. Still, we
oughtn't look a gift horse in the mouth.
Note, incidentally, how C's #define macro facility can come to our
aid if we want to implement this same construct in C. All we need is
three #defines:
int escapecode;
int jump_stack_ptr = -1;
jmp_buf jump_stack[100]; /* the stack used to do nesting */
#define TRY if (setjmp(jump_stack[++jump_stack_ptr])==0) {
#define RECOVER jump_stack_ptr--; } else
#define ESCAPE(parm) \
{ \
escapecode = parm; \
longjmp(jump_stack[jump_stack_ptr--], 1); \
}
This would allow us to say:
TRY
code;
RECOVER
errorhandler;
and
ESCAPE(value);
just like we could in PASCAL/XL! Note how we've added this entirely
new control structure without any changes to the compiler -- nothing
more complicated than a few #defines! (Many thanks to Tim Chase of CCS
for showing me how to do this!)
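To convince yourself that the macros really work, here is a small,
self-contained harness (the deep/middle/run names are invented for the
test): an ESCAPE three calls down lands in the RECOVER code, skipping
everything in between.

```c
#include <setjmp.h>

int escapecode;
int jump_stack_ptr = -1;
jmp_buf jump_stack[100];           /* the stack used to do nesting */

#define TRY if (setjmp(jump_stack[++jump_stack_ptr]) == 0) {
#define RECOVER jump_stack_ptr--; } else
#define ESCAPE(parm) \
{ \
    escapecode = parm; \
    longjmp(jump_stack[jump_stack_ptr--], 1); \
}

static int trace[8];
static int trace_n = 0;

static void deep(void)             /* three calls down, no flags passed */
{
    ESCAPE(42);
}

static void middle(void) { deep(); }

static int run(void)
{
    TRY
        trace[trace_n++] = 1;      /* executed normally */
        middle();
        trace[trace_n++] = 2;      /* skipped: ESCAPE bypasses this */
    RECOVER
        trace[trace_n++] = 3;      /* the recovery handler fires */
    return escapecode;
}
```

After run(), the trace records 1 then 3 but never 2: the statements
after the ESCAPEing call were bypassed, exactly as with PASCAL/XL's
TRY..RECOVER.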
NESTED PROCEDURES
An interesting feature of PASCAL is its ability to have procedures
nested within other procedures. In other words, I could say:
PROCEDURE PARSE_THING (VAR THING: STRING);
VAR CURR_PTR, CURR_DELIMS: INTEGER;
QUOTED: BOOLEAN;
...
PROCEDURE PARSE_CURR_ELEMENT (...);
BEGIN
...
END;
BEGIN
...
PARSE_CURR_ELEMENT (...);
...
END;
PARSE_CURR_ELEMENT here is just like a local variable of PARSE_THING
-- it's a local procedure. It's callable only from within PARSE_THING
and not from any other procedure in the program. More importantly,
* THE NESTED PROCEDURE (PARSE_CURR_ELEMENT) CAN ACCESS ALL OF
PARSE_THING'S LOCAL VARIABLES.
This is a significant consideration. If PARSE_CURR_ELEMENT didn't need
to access PARSE_THING's local variables, not only could it be a
different (non-nested) procedure, but it probably should be. When a
procedure is entirely self-contained, it's usually a good idea to make
it accessible to as many callers as possible.
On the other hand, what if PARSE_CURR_ELEMENT needs to interrogate
CURR_PTR to find out where we are in parsing the thing; or look at or
modify CURR_DELIMS or QUOTED or whatever other local variables are
relevant to the operation?
We don't want to have to pass all these values as parameters --
there could be dozens of them.
We don't want to make them global variables, since they're really
only relevant to PARSE_THING -- why make them accessible by other
procedures that have no business messing with them? (Incidentally,
making the variables global will also prevent PARSE_THING from calling
itself recursively.)
But, on the other hand, we certainly DO want to have
PARSE_CURR_ELEMENT be a procedure -- after all, we might need to call
it many times from within PARSE_THING; surely we don't want to repeat
the code every time!
Thus, the main advantage of nested procedures is not just that,
like local variables, they can only be accessed by the "nester".
Rather, the advantage is the fact that they can share the nester's
local variables, which are often quite relevant to what the nested
procedure is supposed to do.
Another substantial benefit comes when you pass procedures as
parameters to other procedures. A good example of this might be a
report writer procedure:
TYPE LINE_TYPE = PACKED ARRAY [1..256] OF CHAR;
PROCEDURE PRINT_LINE (VAR LINE: LINE_TYPE;
LINE_LEN: INTEGER;
PROCEDURE PAGE_HEADER (PAGENUM: INTEGER);
PROCEDURE PAGE_FOOTER (PAGENUM: INTEGER));
This procedure takes the line to be output and its length, but also
takes two procedures -- one that will be called in case a page header
should be printed and one in case a page footer should be printed. The
utility of this is obvious -- it gives the user the power to define
his own header and footer format.
Now, let's say we have the following procedure:
PROCEDURE PRINT_CUST_REPORT (VAR CATEGORY: INTEGER);
VAR CURRENT_COUNTRY: PACKED ARRAY [1..40] OF CHAR;
...
BEGIN
...
PRINT_LINE (OUT_LINE, OUT_LINE_LEN,
MY_PAGE_HEAD_PROC, MY_PAGE_FOOT_PROC);
...
END;
PRINT_LINE will output OUT_LINE and, in some cases, call
MY_PAGE_HEAD_PROC or MY_PAGE_FOOT_PROC. Now, it makes sense for you to
want these procedures to print, say, the current value of CATEGORY
and, perhaps, CURRENT_COUNTRY.
In C and SPL, which have no nested procedures, both
MY_PAGE_HEAD_PROC and MY_PAGE_FOOT_PROC would have to be separate
procedures which have no access to PRINT_CUST_REPORT's local
variables.
The variables would either have to be global (which is quite
undesirable) or would somehow have to be passed to PRINT_LINE, which
in turn would pass them to the MY_PAGE_xxx_PROC procedures.
This would be quite cumbersome, since in PRINT_CUST_REPORT the
header and footer procedures need to be passed an integer and a PACKED
ARRAY OF CHAR, whereas in some other application of PRINT_LINE they
would have to be passed, say, three floats and a record structure.
In PASCAL, on the other hand, both MY_PAGE_HEAD_PROC and
MY_PAGE_FOOT_PROC can be nested within PRINT_CUST_REPORT and thus have
access to CATEGORY and CURRENT_COUNTRY (and all the other local
variables of the PRINT_CUST_REPORT procedure). Another useful
application for nested procedures.
C, as I mentioned, has no nested procedure support at all. On the
other hand, it does have #DEFINEs, which allow you to define text
substitutions that can often do the job (see the section on DEFINES)
of a nested procedure, especially if it's a small one. For instance,
you can say:
#define foo(x,y) \
{ \
int a, b; /* variables local to THIS DEFINE */ \
a = x + parm1; /* access a variable local to the */ \
b = y * parm2; /* procedure (the nesting procedure) */ \
x = a + b; \
y = a * b; \
}
As you can see, C's support for "block-local" variables -- local
variables that are local not just to the procedure, but rather to the
"{"/"}" block in which they're defined -- allows you to have #DEFINEs
that are almost as powerful as real procedures.
SPL allows you to have "SUBROUTINE"s nested within procedures, but
subject to some rather stringent restrictions:
* The subroutines can have no local variables of their own. This is
a pretty severe problem, since it means that all your local
variables have to be declared in the nesting procedure, which
increases the likelihood of errors and also prohibits you from
calling the subroutine recursively (which you would otherwise be
able to do).
* The subroutines can not be passed as procedure parameters to
other procedures (only procedures can be -- try parsing that!).
* Furthermore, this nesting capability goes to only one level; you
can nest SUBROUTINEs in PROCEDUREs, but you can't nest anything
within SUBROUTINEs. In PASCAL, procedures can be nested within
each other to an arbitrary number of levels. Frankly speaking,
I'm hard put to think of an application for triply-nested
procedures.
Practically, you'll have to decide for yourself whether PASCAL's
nested procedure support -- and C's lack of it -- is important to you.
I brought this issue up to a C partisan, and she replied that she's
simply never run into a case where nested procedures were all that
important. Upon thinking about this, I found myself forced to agree,
at least partially:
* #DEFINEs can do much of the job that nested procedures are needed
for;
* Most procedures should NOT be nested, but rather be made
self-contained and available to the world at large (rather
than just to a particular procedure).
* If the reason you don't want to declare your variables as global
is that you want to "hide" them from other procedures, you can do
this in C by making them "static". This will make them available
only to the procedures in the file in which they're defined. This
allows you to share data between procedures (which you might
otherwise have wanted to nest within each other) without making
the data readable and modifiable by everybody.
* On the other hand, there's no denying that there are cases in
which PASCAL's nested procedures are quite a bit superior to any
C or SPL alternative. For instance, a recursive procedure might
well not be able to use the "static global variable" approach I
just mentioned.
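A minimal sketch of that "static" approach (the names and the trivial
cursor logic are hypothetical): two functions share a parse cursor
that the rest of the program cannot see or disturb.

```c
#include <string.h>

/* File-scope statics: shared by the functions in THIS source file,
   invisible to every other file in the program. */
static const char *curr_ptr;       /* shared parse cursor */
static const char *curr_end;

static void parse_begin(const char *thing)
{
    curr_ptr = thing;
    curr_end = thing + strlen(thing);
}

/* A helper that would otherwise need curr_ptr passed as a parameter
   (or, in PASCAL, be nested): it reads and advances the shared cursor. */
static int parse_next_char(void)
{
    if (curr_ptr == curr_end)
        return -1;                 /* end of input */
    return *curr_ptr++;
}
```

The data is shared between the cooperating procedures without being
global to the whole program, though, as noted above, this trick fails
for recursive procedures, where each invocation needs its own copy.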
DATA TYPES
The difference most often cited between PASCAL and C is the way
that they treat data types. PASCAL is often considered a "strict type
checking" language and C a "loose type checking" language, and that's
true enough. However, the effects of this philosophical difference are
subtler and more pervasive than they appear at first glance.
What are data types? Data types can be seen in the earliest of
languages, from FORTRAN and COBOL onwards. When you declare a variable
to be a certain data type, you give certain information to the
compiler -- information that the compiler must have to produce correct
code. Historically, this information has included:
* What the various operators of the language MEAN when applied to
the variable. "+", for instance, isn't just "addition" -- when
you add two integers, it's integer addition, and when you add two
reals, it's real addition. Two entirely different operations,
with entirely different machine language opcodes and (possibly)
different effects on the system state. Similarly, a COBOL
"DISPLAY X" means:
- If X is a string, print it verbatim;
- If X is an integer, print its ASCII representation;
- If X is a real, print its ASCII representation, but in a
different format and with a different conversion mechanism.
* How much SPACE is to be allocated for the variable. "Array of 20
integers" is a type, too, one from which the compiler can exactly
deduce how much memory (20 words) needs to be allocated to fit
this data.
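Even in C, the first of these functions is easy to see with "/": the
same operator means integer division or floating division depending
entirely on the declared types (a trivial, self-contained
illustration):

```c
/* The same "/" symbol selects two different machine operations,
   depending only on the declared types of the operands. */
static int idiv_result(void)
{
    int a = 7, b = 2;
    return a / b;          /* integer division: truncates to 3 */
}

static double fdiv_result(void)
{
    double a = 7.0, b = 2.0;
    return a / b;          /* floating division: 3.5 */
}
```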
If you look at SPL (and, incidentally, FORTRAN and other languages),
you'll find that all of its type declarations essentially aim at
serving these two functions. However, in recent times, a few other
functions have been ascribed to type declarations:
* Using type declarations, the compiler can DETECT ERRORS that you
may make. The compiler can't, of course, figure out if your
program does "the right thing" since it doesn't know what the
right thing is; however, it can see if there are any internal
inconsistencies in your program.
For instance, if you're multiplying two strings, the compiler can
tag that as an obvious error; similarly, if you pass a string
parameter to a procedure that expects an integer (or vice versa),
a good compiler will find this and save you a lot of run-time
debugging. The more elaborate and precise the type specifications
you give, the more error checking the compiler can do.
Error checking can also be provided at run time, where code that
knows what size arrays are, for instance, can make sure that you
don't inadvertently index outside them. PASCAL's "subrange types"
do this sort of thing, too, allowing you to declare what values
(e.g. "0 to 100") a variable may take and triggering an error
when you try assigning it an invalid value.
* Furthermore, with a type declaration, the compiler can SAVE WORK
for you by automatically defining special tools for the given
type.
The classic example of this is the record structure -- by
declaring the structure, you're automatically defining a set of
"operators" (one for each field of the structure) that allow you
to easily access the structure. Similarly, enumerated types can
save you the burden of having to manually allocate distinct
values for each of the elements in the enumeration (admittedly,
not a very large burden).
Some fancy compilers can even automatically define "print"
operations for each record structure, so that you can easily dump
it in a legible format to the terminal without having to print
each element individually.
* Good type handling provisions can INSULATE YOUR PROGRAMS FROM
CHANGES IN YOUR DATA'S INTERNAL REPRESENTATION. For instance, if
the compiler allows you to refer to a field of a record as, say,
"CUSTREC.NAME" instead of "CUSTREC(20)", then you can easily
reformat the insides of the record (adding new fields, changing
field sizes, etc.) without having to change all places that
reference this record.
Similarly, if your language allows functions to return records
and arrays as well as scalars, you can easily change the type of
your, say, disc addresses from a 2-word double integer to a
10-word array of integers. In SPL, for instance, such a change
would require rewriting all procedures that want to return such
objects or to take them as "by-value" parameters. Even changing a
value from an "integer" to a "double integer" in SPL will require
you to change a great deal of code.
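The "insulation" point above can be made concrete in C (at least in
ANSI C; early K&R compilers could not return structures by value).
This is a sketch of my own, not code from any real system -- the
type and function names are invented for illustration:

```c
/* Hypothetical disc-address type: widen it from 2 words to 10 and
   the callers of next_address() need not change at all. */
typedef struct { int word[2]; } disc_addr;

/* Advance a disc address by one sector.  The whole record is taken
   and returned by value -- exactly the kind of thing that, in SPL,
   would force you to rewrite every procedure involved. */
disc_addr next_address(disc_addr a)
{
    a.word[1] += 1;
    return a;
}
```

Only the typedef and the insides of next_address know the record's
real width; that's the insulation.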
The reason I've given this list is that SPL, PASCAL, and C place
different weights on each of these points, and this makes for rather
substantial differences in the way you use these languages.
Now, away from the generalities and on to concrete examples.
RECORD STRUCTURES
Consider for a moment an entry in your "employee" data set. It
could be a file label; it could be a Process Control Block entry; it
could be any chunk of memory that contains various fields of various
data types.
A typical layout of this employee entry (or employee "record")
might be:
Words 0 through 14 - The employee name (a 30-character string);
Words 15-19 - Social security number (10-character string);
Words 20-21 - Birthday (a double integer, just to be interesting);
Words 22-23 - Monthly salary (a real number).
A simple record. It's 24 words long, but it's not really an "array of
24 words"; logically speaking, to you and me, it's a collection of
four objects, each of a different type, each starting at a different
(but constant) offset within the record.
How do we declare a variable to hold this record? In FORTRAN and
SPL, it's easy:
INTEGER ARRAY EMPREC(0:23);
or
INTEGER EMPREC(24)
Short and sweet. The compiler's happy -- it knows that it's an array
of integers, which means you can extract an arbitrary element from it,
and pass it to a procedure (like DBGET), which will receive its
address as an integer pointer. This defines to the compiler the
MEANING of the "indexing" and "pass to procedure" operations that can
be done on EMPREC. Also, the compiler knows that 24 words must be
allocated for this array, as a global or local variable.
The compiler is happy, but are you? First of all, how are you going
to access the various elements of this record structure? Are you going
to say
EMPREC(20)
when you mean the employee's birthday (actually, since it's a double
integer, you couldn't even do that)?
What about error checking? Since all the compiler knows about this
is that it's an integer array, it'll be happy as punch to allow you to
put it anywhere an integer array can go. Would you like to pass it as
the database name to DBGET instead of as the buffer variable? Fine.
Would you like to view it as a 4 by 5 matrix and multiply it by, say,
the department record? The computer will gladly oblige.
Finally, consider the burden this places on you whenever you want
to change the layout of EMPREC -- say, to increase the name from 30
characters to 40. You'll have to change all your "EMPREC(20)"s to
"EMPREC(25)", all your "INTEGER ARRAY EMPREC(0:23)" to "INTEGER ARRAY
EMPREC (0:28)". And, of course, if you forget one or the other -- why,
the compiler will be happy to extract the first word of the new social
security number and treat it as the employee's birthday!
Of course, you're not going to do this. You will certainly not
refer to all the elements of the record structure by their numeric
array indices (although it so happens that most of HP's MPE code does
exactly this). Rather, you'll say (of course, in SPL, you can also do
the same thing with DEFINEs):
EQUATE SIZE'EMPREC = 24;
BYTE ARRAY EMP'NAME (*) = EMPREC(0);
BYTE ARRAY EMP'SSN (*) = EMPREC(15);
DOUBLE ARRAY EMP'BIRTHDATE (*) = EMPREC(20);
REAL ARRAY EMP'SALARY (*) = EMPREC(22);
[Note: The fact that we define, say, EMP'BIRTHDATE and
EMP'SALARY as arrays isn't a problem. If we say EMP'SALARY
with no subscript, it'll refer to the 0th element of this
"array", which is exactly what we want it to do.]
FORTRAN is similar (you'd use an EQUIVALENCE); COBOL is a bit
simpler, allowing you to say (remembering that COBOL doesn't have
REALs):
01 EMPREC.
05 NAME PIC X(30).
05 SSN PIC X(10).
05 BIRTHDATE PIC S9(9) COMP.
05 SALARY PIC S9(5)V9(2) COMP-3.
As you see, COBOL at least has the advantage that it automatically
calculates the indexes of each subfield for you. This is nice,
especially when you change the structure, reshuffling, inserting,
deleting, or resizing fields. On the other hand, I wouldn't call this
a very substantial feature, especially since sometimes you WANT to
manually specify the field offsets (whenever the record structure is
not under your control, like, say, an MPE file label).
To summarize, this "EQUIVALENCE"ing approach that's available in
SPL, FORTRAN, and COBOL saves you from the very substantial bother of
having to hardcode the offsets of all the subfields into your program.
This is certainly a good thing; however, PASCAL and C go substantially
beyond this.
The most serious problem with what I'll call the "EQUIVALENCE"ing
approach is a rather subtle one, one that I didn't realize until I'd
used it for some time.
The definitions we saw above -- in SPL, FORTRAN, or COBOL --
defined several variables as subfields of another variable. EMP'NAME
and EMP'SSN are subfields of EMPREC. What if we need to declare this
EMPREC twice -- say, in two different procedures?
Clearly we don't want to have to repeat the EQUIVALENCEs in each
procedure. Yet what choice do we have? We might, for instance, set up
each of the subfields as a DEFINE instead of an equivalence, making
the DEFINEs available in all the procedures that reference EMPREC:
DEFINE EMP'NAME = EMPREC(0) #;
DEFINE EMP'SSN = EMPREC(15) #;
DEFINE EMP'BIRTHDATE = EMPREC(20) #;
DEFINE EMP'SALARY = EMPREC(22) #;
but then, since DEFINEs are merely text substitutions and EMPREC is an
integer array, each EMP'xxx will also be an integer array. We'd have
to say
BYTE ARRAY EMPREC'B(*)=EMPREC;
DOUBLE ARRAY EMPREC'D(*)=EMPREC;
REAL ARRAY EMPREC'R(*)=EMPREC;
in each procedure that defines an EMPREC array, and a
DEFINE EMP'NAME = EMPREC'B(0) #;
DEFINE EMP'SSN = EMPREC'B(15) #;
DEFINE EMP'BIRTHDATE = EMPREC'D(20) #;
DEFINE EMP'SALARY = EMPREC'R(22) #;
at the beginning of the program. Still, the BYTE ARRAY, DOUBLE ARRAY,
and REAL ARRAY declarations would have to be repeated once for each
declaration of EMPREC; and what if we want to call the record
something else, like have two records called EMPREC1 and EMPREC2?
* THE PROBLEM WITH DEFINING SUBFIELDS OF A RECORD STRUCTURE USING
THE "EQUIVALENCING" APPROACH IS THAT IT DEFINES THE SUBFIELDS OF
ONLY ONE RECORD STRUCTURE VARIABLE.
WHAT WE WANT IS TO DEFINE A GENERALIZED "TEMPLATE" ONCE AND THEN
APPLY THIS TEMPLATE TO EACH RECORD STRUCTURE VARIABLE WE USE.
In other words, we want to be able to say
DEFINE'TYPE EMPLOYEE'REC (SIZE 24)
BEGIN
BYTE ARRAY NAME (*) = RECORD(0);
BYTE ARRAY SSN (*) = RECORD(15);
DOUBLE ARRAY BIRTHDATE (*) = RECORD(20);
REAL ARRAY SALARY (*) = RECORD(22);
END;
and then declare any particular employee record buffer by saying:
EMPLOYEE'REC EMPREC1;
EMPLOYEE'REC EMPREC2;
Then, we could extract each individual subfield of the record like
this:
NEW'SALARY := EMPREC1.SALARY * 1.1;
The point here is that
* IN ADDITION TO NOT HAVING TO EXPLICITLY SPECIFY THE OFFSET OF THE
SUBFIELD OF THE RECORD (like having to say RECORD(22), an awful
thing to do), WE CAN NOW DEFINE THE LAYOUT OF THE RECORD
STRUCTURE ONCE, REGARDLESS OF HOW MANY VARIABLES WITH THAT
STRUCTURE WE WANT TO DECLARE.
Do you see how nicely this dovetails with the "INSULATING YOUR PROGRAM
FROM CHANGING INTERNAL REPRESENTATION" principle we gave above? The
record structure layout is defined in EXACTLY ONE PLACE in the program
file. We can have a hundred different variables of this type -- none
of them will have to specify the physical size of the buffer or the
offsets of the subfields. Each one will merely refer back to the type
declaration.
Also, we've now announced EMPREC1 to the compiler as being of the
special "EMPLOYEE'REC" type. It's no longer a simple INTEGER ARRAY,
just like any other integer array. Conceivably, if we declare a
procedure to be
PROCEDURE PUT'EMPLOYEE (DBNAME, EMPREC, FRAMASTAT);
INTEGER ARRAY DBNAME;
EMPLOYEE'REC EMPREC;
INTEGER FRAMASTAT;
...
the compiler can warn us that
EMPLOYEE'REC EMPREC;
INTEGER ARRAY DBNAME;
INTEGER FOOBAR;
...
PUT'EMPLOYEE (EMPREC, DBNAME, FOOBAR);
is an invalid call -- it sees that an object of type EMPLOYEE'REC is
being passed in place of an INTEGER ARRAY, and an INTEGER ARRAY is
being passed in place of an EMPLOYEE'REC. Without this error checking,
you'd have to find this problem yourself at run-time, a distinctly
more difficult task.
RECORD STRUCTURES IN PASCAL AND C
What I just gave is the rationale for record structures, mostly for
the benefit of SPL programmers who haven't used PASCAL and C before.
Of course, the only reason I gave it is that PASCAL and C do have
record structure support, remarkably similar support at that. Here's
the way you declare a structure data type in PASCAL:
{ "PACKED ARRAY OF CHAR"s are PASCAL strings }
TYPE EMP_RECORD = RECORD
NAME: PACKED ARRAY [1..30] OF CHAR;
SSN: PACKED ARRAY [1..10] OF CHAR;
BIRTHDATE: INTEGER; { really a double integer }
SALARY: REAL;
END;
...
VAR
EMPREC: EMP_RECORD; { declare a variable called "EMPREC" }
And in C:
typedef
struct {char name[30];
char ssn[10];
long int birthdate;
float salary;
}
emp_record;
...
emp_record emprec; /* declare a variable called "emprec" */
You can see the minor differences -- the type names are different
("float" instead of "REAL", "long int" to mean double integer); the
type name comes at the end of the "typedef"; the newly defined type is
used as a "statement" all its own rather than as part of a VAR
statement;
and, of course, everything's written in those CUTE lower-case
characters. In essence, of course, the constructs are absolutely
identical.
The use is identical, as well:
NEW_SALARY := EMPREC.SALARY * 1.1;
new_salary = emprec.salary * 1.1;
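To see the field-by-name principle at work in a complete, compilable
fragment, here is a small sketch of my own around the emp_record
structure (the give_raise function is invented for illustration;
offsetof is standard C, from <stddef.h>):

```c
#include <stddef.h>

/* The emp_record structure from the text. */
typedef struct {
    char name[30];
    char ssn[10];
    long int birthdate;
    float salary;
} emp_record;

/* Give an employee a raise.  Both this function and its callers
   refer to the salary field BY NAME; nobody hard-codes "word 22"
   the way the SPL EQUATEs did, so the record can be reshuffled
   freely. */
void give_raise(emp_record *e, float factor)
{
    e->salary = e->salary * factor;
}
```

The compiler computes every field's constant offset itself
(offsetof(emp_record, salary) will tell you what it chose); add ten
characters to the name and nothing but the typedef changes.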
Incidentally, if we didn't want to define a new type, but rather just
wanted to define one variable of a given structure, we could have
said:
VAR EMPREC: RECORD
NAME: PACKED ARRAY [1..30] OF CHAR;
SSN: PACKED ARRAY [1..10] OF CHAR;
BIRTHDATE: INTEGER; { really a double integer }
SALARY: REAL;
END;
struct {char name[30];
char ssn[10];
long int birthdate;
float salary;
}
emprec;
Note how the type declaration is very much like the original variable
declaration.
So, declaring and using record structures is identical in PASCAL
and C. However, there's a VERY BIG DIFFERENCE between PASCAL and C.
* In PASCAL, strict type checking is more than just a good idea,
it's the LAW.
If a function parameter is declared as type EMPLOYEE_REC, any
function call to it must pass an object of that type. Even if it
passes a record structure that's defined with exactly the same
fields but with a different type name (admittedly a rare
occurrence), the compiler will cough.
Any structure parameter must be of EXACTLY THE RIGHT TYPE.
* Many C programmers view strict type checking much as you or I
might view, say, the Gestapo or the KGB. Kernighan & Ritchie C
compilers DO NOT do type checking.
In fact, in Kernighan & Ritchie C, you can pass a string where a
real number is expected, and the compiler won't say a word! (On
the other hand, your program is unlikely to work right.)
I could fault C for this, treating C's lack of type checking much as I
do, say, SPL's lack of an easy I/O facility. The trouble is that C
programmers don't think that lack of type checking is a bug; they
think it's a feature. The problem is philosophical -- what are the
benefits of type checking and do they outweigh the drawbacks?
TYPE CHECKING -- ORIGINAL STANDARD PASCAL AND PASCAL/3000
Earlier in the paper I brought up a certain point. Compilers that
know the type of variables can, I said, check your code to make sure
that you're not using types inconsistently.
For instance, if you use a character when you should be using a
real number, that's an "obvious error" and the compiler can do you a
favor by complaining at compile-time. Similarly, if you pass an
employee record to a procedure that expects a database name, that's
also an error, and should also be reported.
Now, this principle is in many ways at the heart of the PASCAL
language. And, certainly, everyone will agree that it would be good
for the compiler to find errors in your program rather than making you
do it yourself. The question is --
IS A COMPILER WISE ENOUGH TO DETERMINE WHAT IS AN ERROR AND WHAT IS
NOT?
For instance, say you write
VAR CH: INTEGER;
IF ('a'<=CH) AND (CH<='z') THEN
CH:=CH-'a'+'A';
Utterly awful! We have what -- to PASCAL, at least -- are no fewer
than four type inconsistencies; we're comparing an integer against a
character twice, and then we're adding and subtracting characters and
integers! Obviously an error.
Actually, of course, this code takes CH, which it assumes is a
character's ASCII code, and upshifts it. If it finds that CH is a
lower case character, it shifts it into the upper case character set
by subtracting 'a' and adding 'A'.
Some might complain that this code is not portable (it won't, for
instance, work on EBCDIC machines), but that's not relevant. The
programmer has a perfect right to assume that the code will run on an
ASCII machine; you mustn't ram portability down his throat. Sometimes,
it's very useful to be able to, say, treat characters as integers and
vice versa.
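For contrast, the same upshift in C needs no conversion functions at
all; char values are just small integers. A minimal sketch (the
function name is my own; like the original code, it assumes an ASCII
character set):

```c
/* Upshift one character.  The comparisons and the arithmetic on
   character constants are legal C exactly as written -- no ORD or
   CHR required. */
int upshift(int ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = ch - 'a' + 'A';
    return ch;
}
```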
Now, before anybody accuses me of slandering PASCAL, I must point
out that the solution is readily available. Pascal can convert a
character to an integer using the "ORD" function, and an integer to a
character using "CHR"; our code could easily be re-written:
VAR CH: INTEGER;
IF (ORD('a')<=CH) AND (CH<=ORD('z')) THEN
CH:=CH-ORD('a')+ORD('A');
The important point here is not whether or not you can upshift
characters; the important fact is that:
* SOMETIMES A PROGRAMMER MAY CONSCIOUSLY WANT TO DO THINGS THAT
MIGHT USUALLY BE VIEWED AS TYPE INCOMPATIBILITIES.
Consider, for a moment, the following application:
* You want to write a procedure that adds a record to the database.
Unlike DBPUT, this one should just take the database name, the
dataset name, and the buffer, and do all the error checking
itself.
Sounds simple, no? You write:
TYPE TDATABASE = PACKED ARRAY [1..30] OF CHAR;
TDATASET = PACKED ARRAY [1..16] OF CHAR;
TRECORD = ???;
...
PROCEDURE PUT_REC (VAR DB: TDATABASE;
S: TDATASET;
VAR REC: TRECORD);
BEGIN
...
END;
BUT HOW DO YOU DEFINE "TRECORD"?
Remember why I said that type checking is such a wonderful thing.
After all, if a procedure expects a "customer record" and you pass it
an "employee record", you want the compiler to complain.
But what if the procedure expects ANY kind of record? What if it'll
be perfectly HAPPY to take an employee record, a sales record, a
database name, or a 10 x 10 real matrix? How should the compiler react
then?
Unfortunately, PASCAL, with all its sophisticated type checking,
falls flat on its face (this is true of both Standard PASCAL and
PASCAL/3000).
At this point, in the interest of fairness (and for the practical
use of those who HAVE to do this sort of thing in PASCAL), I must
point out that PASCAL does have a mechanism for supporting record
structures of different types. The trick is to use a degenerate
variation of the record structure called the "tagless variant" or
"union" structure. It's quite similar to EQUIVALENCE in FORTRAN, but
even uglier.
To put it briefly, you have to say the following:
TYPE TANY_RECORD =
RECORD
CASE 1..5 OF
1: (EMP_CASE: TEMPLOYEE_RECORD);
2: (CUST_CASE: TCUSTOMER_RECORD);
3: (VENDOR_CASE: TVENDOR_RECORD);
4: (INV_CASE: TINVOICE_RECORD);
5: (DEPT_CASE: TDEPARTMENT_RECORD);
END;
This defines the type "TANY_RECORD" to be a record structure which can
be looked at in one of FIVE different ways:
* As having one field called "EMP_CASE" which is of type
"TEMPLOYEE_RECORD".
* As having one field called "CUST_CASE" which is of type
"TCUSTOMER_RECORD".
* Or, as having one field called "VENDOR_CASE", "INV_CASE", or
"DEPT_CASE", which is of type "TVENDOR_RECORD",
"TINVOICE_RECORD", or "TDEPARTMENT_RECORD", respectively. You get
the idea.
If you declare a variable of type "TANY_RECORD", it'll be allocated
with enough room for the largest of the component datatypes. Then, you
can make the variable "look" like any one of these records by using
the appropriate subfield:
VAR R: TANY_RECORD;
...
WRITELN (R.EMP_CASE.NAME); { views R as an employee record }
WRITELN (R.DEPT_CASE.DEPTHEAD); { views R as a dept record }
WRITELN (R.INV_CASE.AMOUNT); { views R as an invoice record }
In other words, an object of type TANY_RECORD is actually five
different record structures "equivalenced" together; which one you get
depends on which ".xxx_CASE" subfield you use.
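C makes exactly this equivalencing explicit with the "union" keyword.
Here is a sketch with simplified stand-ins for two of the record
types above (the field layouts are invented for illustration):

```c
/* Simplified stand-ins for TEMPLOYEE_RECORD and TDEPARTMENT_RECORD. */
typedef struct { char name[30]; float salary; } employee_record;
typedef struct { char dept_head[30]; int budget; } department_record;

/* The union is allocated with room for its largest member; which
   "view" of the storage you get depends on which member you name,
   just like the .xxx_CASE subfields of TANY_RECORD. */
typedef union {
    employee_record   emp_case;
    department_record dept_case;
} any_record;
```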
Got all that? Now, here's how you define and call the PUT_REC
procedure:
PROCEDURE PUT_REC (VAR DB: TDATABASE;
S: TDATASET;
VAR REC: TANY_RECORD);
BEGIN
...
END;
...
{ now, all dataset records you need to pass must be declared to }
{ be of type TANY_RECORD. }
READLN (R.EMP_CASE.NAME, R.EMP_CASE.SSN);
R.EMP_CASE.BIRTHDATE := 022968;
R.EMP_CASE.SALARY := MINIMUM_WAGE - 1.00;
PUT_REC (MY_DB, EMP_DATASET, R);
You must declare ALL YOUR DATASET RECORDS to be of type TANY_RECORD
(wasting space if, say, TDEPARTMENT_RECORD is 10 bytes long and
TINVOICE_RECORD is 200 bytes long); you must refer to them with the
appropriate ".xxx_CASE" subfield; then, you must pass the TANY_RECORD
to PUT_REC. (Alternately, you may have one "working area" record of
type TANY_RECORD and move the record you want into the appropriate
subfield of this "working area" record before calling PUT_REC.)
As you may have guessed, I think this is a very poor workaround
indeed:
* You need to specify in the TANY_RECORD declaration every possible
type that you'll ever want to pass to PUT_REC;
* You have to declare any record you want to pass to PUT_REC to be
of type TANY_RECORD, even if it wastes space.
* If you don't want to use a "working area" record, you have to
refer to all your records as "R.EMP_CASE" or "R.DEPT_CASE" rather
than just defining R as the appropriate type and referring to it
just as "R".
* If you do use a "working area" record, to wit:
VAR WORK_RECORD: TANY_RECORD;
EMP_REC: TEMPLOYEE_RECORD;
...
READLN (EMP_REC.NAME, EMP_REC.SSN);
EMP_REC.BIRTHDATE := 022968;
EMP_REC.SALARY := MINIMUM_WAGE - 1.00;
WORK_RECORD.EMP_CASE := EMP_REC;
PUT_REC (MY_DB, EMP_DATASET, WORK_RECORD);
then you have to move your data into it before every PUT_REC
call, which is both ugly and inefficient.
And why? All because PASCAL isn't flexible enough to allow you to
declare a parameter to be of "any type".
A couple more examples of cases where strict type checking is
utterly lethal may be in order:
* Say that you want to write a procedure that compares two PACKED
ARRAY OF CHARs (in Standard PASCAL, these are the only way of
representing strings). You must define the types of your
parameters, INCLUDING THE PARAMETER LENGTHS! In other words,
TYPE TPAC = PACKED ARRAY [1..256] OF CHAR;
VAR P1: PACKED ARRAY [1..80] OF CHAR;
P2: PACKED ARRAY [1..80] OF CHAR;
...
FUNCTION STRCOMPARE (VAR X1: TPAC; VAR X2: TPAC): BOOLEAN;
BEGIN
...
END;
...
IF STRCOMPARE (P1, P2) THEN ...
is ILLEGAL. P1, you see, is an 80-character string, which is not
compatible with the function parameter, which is a 256-character
string.
* Say that you want to write a procedure like WRITELN, which will
format data of various types. WRITELN may not be sufficient for
your needs -- you might need to be able to output numbers
zerofilled or in octal, you might want to provide for page breaks
and line wraparound, etc. Surely you should be allowed to do
this!
Well, first of all, you can't have a variable number of
parameters. But, even if you're willing to have a maximum of,
say, 10 parameters and pad the list with 0s, your parameters must
all be of fixed types!
Thus, even if your design calls for some kind of "format string"
that'll tell your WRITELN-replacement what the actual type of
each parameter is, you can't do anything. You must either have a
procedure for each possible type combination (one to output two
integers and a string, one to output a real, an integer, and
three strings, etc.), or have the procedure only output one
entity at a time. This way, you'll have to write:
PRINTS ('THE RESULT WAS ');
PRINTI (ACTUAL);
PRINTS (' OUT OF A MAXIMUM ');
PRINTI (MAXIMUM);
PRINTS (', WHICH WAS ');
PRINTR (ACTUAL/MAXIMUM*100);
PRINTS ('%');
PRINTLN;
instead of
PRINTF ('THE RESULT WAS %d OUT OF A MAXIMUM %d, WHICH WAS %f',
ACTUAL, MAXIMUM, ACTUAL/MAXIMUM*100);
* Finally -- although it should be obvious by now -- you can't
write, say, a matrix inversion function that takes any kind of
matrix. You could write a 2x2 inverter, a 3x3 inverter, a 4x4
inverter, and so on. You could also write a matrix multiplier
that multiplies 2x2s by 2x2s, another that does 2x2s by 2x3s,
another 2x2s by 2x4s, another 3x2s by 2x2s, .... Just think of
the job security you'll have!
For fairness's sake, I must admit that this problem is SLIGHTLY
mitigated in PASCAL/3000.
PASCAL/3000 has a "STRING" data type, which is a variable-length
string (as opposed to PACKED ARRAY OF CHAR, which is a fixed-length
string). In other words, PASCAL/3000 STRINGs are essentially
(internally) record structures, containing an integer -- the current
string length -- and a PACKED ARRAY OF CHAR -- the string data.
When HP implemented this, they were good enough to make all STRINGs
-- regardless of their maximum sizes -- "assignment-compatible" with
each other. This means that you can say:
VAR STR1: STRING[80];
STR2: STRING[256];
...
STR1:=STR2;
and also
TYPE TSTR256 = STRING[256];
VAR S: STRING[80];
...
FUNCTION FIRST_NON_BLANK (PARM: TSTR256): INTEGER;
BEGIN
...
END;
...
I := FIRST_NON_BLANK (S);
Since STRING[80]s (strings with maximum length 80) and STRING[256]s
(strings with maximum length 256) are assignment-compatible, you may
both directly assign them (STR1:=STR2) and pass one by value in place
of another (PROC(S)).
Although "assignment compatibility" allows by-value passing, a
variable passed by reference still has to be of exactly the same type
as the formal parameter specified in the procedure's header. Thus,
TYPE TSTR256 = STRING[256];
VAR S: STRING[80];
...
FUNCTION FIRST_NON_BLANK (VAR PARM: TSTR256): INTEGER;
BEGIN
...
END;
...
I := FIRST_NON_BLANK (S);
is still illegal, since STRING[80]s can't be passed to by-reference
(VAR) parameters of type STRING[256]. Fortunately, PASCAL/3000 also
lets you say:
FUNCTION FIRST_NON_BLANK (VAR PARM: STRING): INTEGER;
Specifying a type of "STRING" rather than "STRING[maxlength]" allows
you to pass any string in place of the parameter.
This only works for STRING parameters. It doesn't work for PACKED
ARRAYs OF CHAR; it doesn't work for other array structures; it isn't
supported by Standard PASCAL. However, for the specific case of string
manipulation, you can get around some of PASCAL's onerous parameter
type checking restrictions.
Remember also that this is strictly an HP feature (PASCAL/3000 and
PASCAL/XL), and cannot be relied on in any other PASCAL compiler.
TYPE CHECKING -- KERNIGHAN & RITCHIE C
Where PASCAL insists on checking all parameters for an exact type
match, original -- Kernighan & Ritchie -- C takes the diametrically
opposite view.
Classic C checks NOTHING. It does not check parameter types; it
does not even check the number of parameters. All data in C is passed
"by value", which means that the value of the expression you specify
is pushed onto the parameter stack for the called procedure to use; if
you want to pass a variable "by reference" -- pushing its pointer onto
the stack -- you have to use the "&" operator to get the variable's
address, to wit:
myproc (&result, parm1, parm2);
If you omit the "&", or specify it when you shouldn't -- well, C
doesn't check for this, either.
Much can be said about the philosophical reasons that C is this
way; many labels, from "flexibility" to "cussedness" can be attached
to it. The fact of the matter, though, is that K&R C -- which means
many, if not most, of today's C compilers -- doesn't do any type
checking.
The effects of this, of course, are the opposite of the effects of
PASCAL's strong type checking:
* You have almost complete flexibility in what types you pass to a
procedure. In two different calls, the same parameter can be one
of two entirely different record structures; one of two character
or integer arrays of entirely different lengths (C doesn't do
run-time bounds checking, anyway); a real in one call, an integer
in another, and a pointer in a third.
Practically, virtually all of the examples I showed in the PASCAL
chapter can thus be implemented in C. For instance,
int strcompare(s1,s2,len)
char *s1, *s2;
int len;
{
int i;
i = 0;
while ((i < len) && (s1[i] == s2[i]))
i = i+1;
return (i == len); /* true only if all LEN characters matched */
}
will merrily compare two character arrays, no questions asked.
You can pass arrays of any size, and it'll do the job. You can
pass integers, reals, integer arrays, whatever; of course, the
code isn't likely to work, but, hey, it's a free country --
nobody'll stop you.
* In most implementations of K&R C, you're even allowed to pass a
different number of parameters than the function was declared
with. Though this is not guaranteed portable, most C compilers
make sure that if, say, your procedure's formal parameters are
"a", "b", and "c" (all integers) and you actually pass the values
"1" and "2", then A will be set to 1, B to 2, and C will contain
garbage (that's "C" the variable, not "C" the language).
This is good because it allows you to write procedures that take
a variable number of parameters; as long as you have a way of
finding out how many parameters were actually passed (e.g. the
PRINTF format string), your procedure can handle them
accordingly.
* On the other hand, say you make a mistake in a procedure call --
you pass a real instead of an integer, a value instead of a
pointer, or perhaps even omit a parameter. The compiler won't
check this; the only way you'll find the error is by running the
program, and even then the erroneous results may first appear far
away from the real error.
Some C compilers (especially on UNIX) come with a program called
LINT that can check for this error and others, but that's often
not enough. First of all, your programmers have to run LINT as
well as C for each program, which slows down the compilation
pass; more importantly, since LINT is in no way part of standard
C, many C compilers don't have it.
VAX/VMS C, for instance, doesn't come with LINT; neither does the
CCS C that's available on the HP3000.
* Similarly, even things that seem like they ought to work --
passing an integer in place of a real and expecting it to be
reasonably converted -- will fail badly. Thus,
sqrt(100)
won't work if SQRT expects a real; C won't realize that an
integer-to-real conversion is required, and will thus pass 100 as
an integer, which is a different thing altogether.
A similar problem occurs on computers (like the HP3000) that
represent byte pointers (type "char *") and integer pointers
(type "int *" and other pointer types) differently. Since C
doesn't know which type of pointer a procedure expects, it'll
never do conversions. If you call a procedure like FGETINFO that
expects byte pointers and pass it an integer pointer, you'll be
in trouble (unless you manually cast the pointer yourself).
Incidentally, for ease of using real numbers, C will
automatically convert all "single-precision real" (called "float"
in C) arguments to "double-precision real" ("double") in function
calls. This makes sure that if SQRT expects a "double", passing
it a "float" won't confuse it.
* On the other hand (how many hands am I up to now?), C's
conversion woes -- requirements of passing "float"s instead of
"int"s, "char *"s instead of "int *"s, etc. -- are easier to
solve than in PASCAL. Since C allows you to easily convert a
value from one datatype to another (using the so-called "type
casts"), you could say
my_proc ((float) 100, (char *) &int_value);
and thus pass a "float" and a "char *" to "my_proc". In PASCAL
you couldn't do things this easily. The compiler might
automatically translate an integer to a float for you; but, if it
expects a character value and all you've got is an integer,
there's no easy way for you to tell it "just pass this integer as
a byte address, I know what I'm doing."
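A minimal sketch of the cast mechanism at work (the function is my
own illustration, and the result is machine-dependent, as any C
programmer reaching for a cast knows): view the bytes of an integer
through a "char *" without any special compiler dispensation.

```c
/* Count the zero bytes of an int by casting its address from
   "int *" to "char *" and walking it byte by byte.  The cast is
   the programmer saying "I know what I'm doing." */
int zero_bytes(int value)
{
    char *p = (char *) &value;
    int n = 0;
    unsigned i;

    for (i = 0; i < sizeof(int); i++)
        if (p[i] == 0)
            n++;
    return n;
}
```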
Thus, K&R C is flexible enough to do all that Standard PASCAL can
not. If this is necessary to you -- and I can easily understand why it
would be; Standard PASCAL's restrictions are very substantial -- then
you'll have to live with C's lack of error checking. On the other
hand, if flexibility is of less than critical value, you have to ask
yourself whether or not you want the extra level of compiler error
checking that PASCAL can provide you.
My personal experience, incidentally, has been that compiler error
checking of parameters is very nice, but not absolutely necessary. I'd
love to have the compiler find my bugs for me, but I can muddle
through without it. PASCAL's restrictions, though, are substantially
more grave. More than inconveniences, they can make certain problems
almost impossible to solve.
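For what it's worth, later (ANSI-era) C gives variable-parameter
procedures like PRINTF a portable footing via <stdarg.h>; K&R
compilers relied on machine-dependent stack tricks instead. This
sketch of my own sums a count-prefixed list of integers, the leading
count playing the role of PRINTF's format string:

```c
#include <stdarg.h>

/* Sum 'count' integer arguments.  The count tells the function how
   many parameters were actually passed, much as printf's format
   string does. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;
    int i;

    va_start(ap, count);
    for (i = 0; i < count; i++)
        total += va_arg(ap, int);
    va_end(ap);
    return total;
}
```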
DRAFT ANSI STANDARD C
Time, it is said, heals all wounds; perhaps it can also heal
wounded computer languages. God knows, FORTRAN 77 isn't the greatest,
but it sure is better than FORTRAN IV.
The framers of the new Draft ANSI Standard C have apparently
thought about some of the problems that C has, especially the ones
with function call parameter checking and conversion. The solution
seems to be quite good, letting you impose either strict or loose type
checking -- whichever you prefer -- for each procedure or procedure
parameter. Remember, though, the standard is still only a Draft, so any
given C compiler you might want to use may well not support these
features yet.
In Draft Standard C, you can do one of two things:
* You can call a procedure the same old way that you'd do in K & R
C. No type checking, no automatic conversion, no nothin'. You
might declare its result type, to wit:
extern float sqrt();
(Remember, you'd have to do that anyway in K&R C; otherwise, the
compiler will treat SQRT's result as an integer.) But no other
declarations are required, and no checking will be done.
* Alternatively, you can declare a FUNCTION PROTOTYPE. This can be
done either for an external function or for one you're defining
-- the prototype is very much like PASCAL's procedure header
declaration. A sample might be:
extern int ASCII (int val, int base, char *buffer);
or simply
extern int ASCII (int, int, char *);
[Note that the parameter NAMES, as opposed to TYPES, are not
necessary in a prototype for an EXTERNAL function. For a function
that you're actually defining, the names are necessary; the
declarations in the prototype are used in place of the type
declarations that you'd normally specify for the function
parameters.]
This function prototype tells the compiler enough about the
function parameters for it to be able to do appropriate type
checking and conversion. One of the reasons K&R C couldn't do
that is precisely because of the lack of this information.
Consider the cases where this would come in handy. We might declare
SQRT as
extern float sqrt (float);
and then a call like
sqrt (100)
would automatically be taken to mean "sqrt ((float) 100)", i.e. "sqrt
(100.0)". Similarly,
sqrt (100, 200)
or
sqrt ()
would cause a compiler error or warning, since now the compiler KNOWS
that SQRT takes exactly one parameter.
In general, say that you have a function declared as
extern int f(formaltype); /* or non-extern, for that matter */
This simply means that "f" is a function that returns an "int" and
takes one parameter of type "formaltype". Now, say that your code
looks like:
actualtype x;
...
i = f(x);
Is this kind of call valid or not? Of course, it depends on what
"formaltype" and "actualtype" are:
* If both FORMALTYPE and ACTUALTYPE are numbers -- integers or
floats, short, long, or whatever -- then X is converted to
FORMALTYPE before the call. This is what lets us say
sqrt(100)
when "sqrt" is declared to take a parameter of type "real".
(The same goes the other way -- if "mod" is declared to take two
"int"s, then "mod(10.5,3.2)" would be converted to "mod(10,3)",
although the compiler might print a warning message to caution
you that a truncation is taking place.)
* If FORMALTYPE is a pointer -- which is the case for all
"by-reference" parameters, since that's how we pass things by
reference in C -- then ACTUALTYPE must be EXACTLY the same type.
In other words, if we say:
int copystring (char *src, char *dest)
then in the call
char x;
int y;
...
copystring (x, &y);
BOTH parameters will cause an error message. The first parameter
will be a "CHAR" passed where a "CHAR *" is expected, which is
illegal -- a good way of checking for attempts to pass parameters
by value where by-reference was expected. The second parameter
will be an "INT *" passed where a "CHAR *" is expected, which is
also illegal, since although both are pointers, they don't point
to the same type of thing.
* If ACTUALTYPE is a pointer, then FORMALTYPE must also be a
pointer of EXACTLY the same type. Again, this is useful for
catching attempts to pass "by-reference" calls to procedures that
expect "by-value" parameters, and also attempts to pass a pointer
to the wrong type of object.
* If either ACTUALTYPE or FORMALTYPE is a pointer of the special
type "void *", then the other one may be any type of pointer.
This is very useful when we want a parameter to be a BY-REFERENCE
parameter of some arbitrary type (similar to PASCAL/XL's ANYVAR,
for which see below). Thus, if we want to write our "put_rec"
procedure that'll put any type of record structure into a
database, we'd say:
put_rec (char *dbname, char *dbset, void *rec)
Then, we could say:
typedef struct {...} sales_rec_type;
typedef struct {...} emp_rec_type;
...
sales_rec_type srec;
emp_rec_type erec;
...
put_rec (mydb, sales_set, &srec);
...
put_rec (mydb, emp_set, &erec);
Both of the PUT_REC calls are valid since both "&srec" and
"&erec" (and, for that matter, any other pointer) can be passed
in place of a "void *" parameter. If we'd declared "put_rec" as:
put_rec (char *dbname, char *dbset, sales_rec_type *rec)
then the "put_rec (mydb, emp_set, &erec)" call would NOT be
legal, since "&erec" is NOT compatible with "sales_rec_type *".
Note that on some machines -- including the HP3000 -- integer
pointers and character pointers are NOT represented the same way.
However, it's always safe to pass either a "char *" or an "int *"
in place of a parameter that's declared as a "void *". The C
compiler will always do the appropriate conversion; thus, if we
declare the ASCII intrinsic as
extern int ASCII (int, int, void *);
then both of the calls below:
char *cptr;
int *iptr;
...
i = ASCII (num, 10, cptr);
...
i = ASCII (num, 10, iptr);
will be valid (assuming that a "void *" is actually represented
as a byte pointer, which is what the ASCII intrinsic wants). You
can thus think of "void *" as the "most general type"; any
pointer can be successfully passed to a "void *".
* Note that although you CAN'T pass, say, a "char *" to a parameter
of type "int *", C will ignore the SIZE of the array the pointer
to which is being passed. In other words, a function such as
extern int strlen (char *s);
may be passed a pointer to a string of any size -- both of the
following calls:
char s1[80], s2[256];
...
i = strlen (s1);
i = strlen (s2);
are valid. Remember that C makes no distinction between a
"pointer to an 80-byte array" and a "pointer to a 256-byte
array"; similarly, it makes no distinction between an array like
"s1" and a "pointer to a character" (see below).
* An interesting exception to the above rules is that the integer
constant 0 can be passed to ANY pointer parameter. This is
because a pointer with value 0 is conventionally used to mean a
"null pointer".
This is quite useful in some applications, but can often prevent
the compiler from detecting some errors. If I say:
extern int PRINT (int *buffer, int len, int cctl);
...
PRINT (0, -10, current_cctl);
this won't, of course, print a "0"; rather, it'll pass PRINT the
integer pointer "0", which will point to God knows what in your
stack. Not a very serious problem, but something you ought to
keep in mind.
* Unlike Standard PASCAL, not only can you entirely waive parameter
checking for a procedure (just omit the prototype!), but you can
also explicitly CAST an actual parameter whenever you want it to
match the type of a formal parameter. In other words, say that
you declare two structure types:
typedef struct {...} rec_a;
typedef struct {...} rec_b;
rec_a ra; /* declare a variable of type "rec_a" */
rec_b rb; /* declare a variable of type "rec_b" */
and then write a function
process_record_a (int x, int y, rec_a *r)
{
...
}
If you then say
process_record_a (10, 20, &rb);
then the compiler will (quite properly) print an error message,
since you were trying to pass a "pointer to rec_b" instead of a
"pointer to rec_a". If you really want to do this, though, all
you need to do is say:
process_record_a (10, 20, (rec_a *) &rb);
manually CASTING the pointer "&rb" to be of type "rec_a *", and
the compiler won't mind.
* Finally, let me also point out that, like everywhere in C, an
"array of T" and a "pointer to T" are mutually interchangeable.
In other words, if you say:
extern int string_compare (char *s1, char *s2);
and then call it as:
char str1[80], str2[256];
...
if (string_compare (str1, str2)) ...
the compiler won't mind. To it a "char *" and a "char []" are
really one and the same type.
Somewhat (but not exactly) similarly -- perhaps I should say,
similarly but differently -- the NAME OF A FUNCTION can be passed
to a parameter that is expecting a POINTER TO A FUNCTION. In
other words, if you write a procedure
int do_function_on_array_elems (int (*f)(int), int *a, int len);
(which takes a pointer to a function, a pointer to an integer,
and an integer), and then call it as:
do_function_on_array_elems (myfunc, xarray, num_xs);
the compiler won't complain (assuming, of course, that MYFUNC is
really a function and not, say, an integer or a pointer).
To summarize, then, Draft Proposed ANSI Standard C lets you check
function parameters almost as precisely as Standard PASCAL. The
differences are:
* You can ENTIRELY INHIBIT PARAMETER CHECKING for all function
parameters by just omitting the function prototype.
* You can declare a parameter to BE A BY-REFERENCE PARAMETER OF AN
ARBITRARY TYPE by declaring it to be of type "void *". You can do
this while still enforcing tight type checking for all the other
parameters.
* In addition to overriding type checking on a PROCEDURE BASIS or
PROCEDURE PARAMETER basis, you can also override type checking on
a particular call by simply casting the actual parameter to the
formal parameter's datatype.
* Unlike PASCAL, C will never check the SIZE of an array parameter;
only its TYPE.
STANDARD "LEVEL 1" PASCAL TYPE CHECKING -- CONFORMANT ARRAYS
If you recall, one of the PASCAL features I most complained about
was the inability to pass arrays of different sizes to different
procedures. This essentially prevents you from writing any sort of
general array handling routine, including:
* For PACKED ARRAYs OF CHAR -- the way that Standard PASCAL
represents strings -- you can't write things like blank trimming
routines, string searches, or anything that's intended to take
PACKED ARRAYs OF CHAR of different sizes.
* For other arrays, the problem is exactly the same -- you can't
write matrix handling routines that work with arbitrary sizes of
arrays, e.g. matrix addition, multiplication, division, etc.
This wasn't the only type checking problem (others included the
inability to pass various record types to database I/O routines,
etc.), but it was a major one.
The ISO Pascal Standard, released in the early 80's, addresses this
problem. A new feature called "conformant arrays" has been defined;
PASCAL compilers are encouraged, but not required, to implement it. A
compiler is said to
* "Comply at level 0" if it does not support conformant arrays;
* "Comply at level 1" if it does support them.
You see the problem -- who knows just how many new PASCAL compilers
will include this feature? It is a fact that most compilers written
before the ISO Standard do NOT include it.
PASCAL/3000, for instance, does not have it; PASCAL/XL, on the
other hand, does.
What are "conformant arrays"? To put it simply, they are
* FUNCTION PARAMETERS that are defined to be ARRAYS OF ELEMENTS OF
A GIVEN TYPE, but whose bounds are NOT defined. Instead, the
compiler makes sure that the ACTUAL BOUNDS of whatever array
parameter is ACTUALLY passed are made known to the procedure.
An example:
FUNCTION FIRST_NON_BLANK
(VAR STR: PACKED ARRAY [LOWBOUND..HIGHBOUND: INTEGER]
OF CHAR): INTEGER;
VAR I: INTEGER;
BEGIN
I:=LOWBOUND;
WHILE (I < HIGHBOUND) AND (STR[I] = ' ') DO
I:=I+1;
FIRST_NON_BLANK:=I;
END;
This procedure is intended to find the index of the first non-blank
character of STR. Note how it declares STR: Instead of specifying a
constant lower and upper bound in the PACKED ARRAY [x..y] OF CHAR
declaration, it specifies TWO VARIABLES.
When the procedure is entered, the variable LOWBOUND is
automatically set to the lower bound of whatever array the user
actually passed, and HIGHBOUND is set to the upper bound of the array.
In other words, if we say:
VAR MYSTR: PACKED ARRAY [1..80] OF CHAR;
...
I:=FIRST_NON_BLANK (MYSTR);
then, in FIRST_NON_BLANK, the variable LOWBOUND will be set to 1 and
HIGHBOUND will be set to 80. Instead of just passing the MYSTR
parameter, PASCAL actually passes "behind your back" 1 and 80 as well.
The way I see it, this is a very good solution, even better in some
ways than C's (in which you can always pass an array of any arbitrary
size):
* You're no longer restricted (like you are in Standard PASCAL) to
a fixed size for your array parameters.
* When you pass an array to a conformant array parameter, you don't
have to manually specify the size of the array; the array bounds
are automatically passed for you. If I were to write the same
procedure in C, I'd have to say
int first_non_blank (str, maxlen)
char str[];
int maxlen;
...
and then manually pass it both the string and the size that it
was allocated with; otherwise, the procedure won't know when to
stop searching (assuming you don't use the convention that a
string is terminated by a null or some such terminator).
* Since the compiler itself knows what the conformant array
parameter's bounds are (it doesn't know the actual values, but it
does know what variables contain the values), it can emit
appropriate run-time bounds checking code. This can automatically
catch some errors at run-time, which is good if you like heavy
compiler-generated error checking.
* Conformant arrays are even better for two-dimensional arrays. To
index into a two-dimensional array the compiler must, of course,
know the number of columns in the array (assuming it's stored in
row-major order, as C and PASCAL 2-D arrays are). In C, you must
either declare the number of columns as a constant, e.g.
matrix_invert (m)
float m[][100];
or declare the parameter as a 1-D array, pass the number of
columns as a parameter, and then do your own 2-D indexing, to
wit:
matrix_invert (m, numcols)
float m[];
int numcols;
...
element = m[row*numcols+col]; /* instead of M[ROW,COL] */
...
In ISO Level 1 PASCAL, you just declare the procedure as:
PROCEDURE MATRIX_INVERT (M: ARRAY [MINROW..MAXROW,
MINCOL..MAXCOL] OF REAL);
Then you automatically know the bounds of the array AND can also
do normal array indexing (M[ROW,COL]), since the compiler knows
the number of columns, too.
This, it seems, is how original Standard PASCAL should have worked,
and I'm glad that the standards people have established it. The only
problems are:
* This is, of course, somewhat less efficient than not passing the
bounds or just passing, say, the upper bound (like you would in
C).
* Remember that this only fixes the case where we want to pass
differently sized arrays to a procedure. If we want to pass
different TYPES (like in our PUT_REC procedure that should
accept one of several database record types), conformant arrays
won't help us.
* Most importantly, MANY PASCAL COMPILERS MIGHT NOT SUPPORT THIS
WONDERFUL FEATURE. In particular, PASCAL/3000 DOES NOT SUPPORT
CONFORMANT ARRAYS.
PASCAL/XL TYPE CHECKING
PASCAL/XL obeys all of PASCAL's type checking rules, but gives you
a number of ways to work around them:
* PASCAL/XL supports the CONFORMANT ARRAYS that I just talked
about.
* PASCAL/XL allows you to specify a variable as "ANYVAR", e.g.
PROCEDURE PUT_REC (VAR DB: TDATABASE;
S: TDATASET;
ANYVAR REC: TDBRECORD);
What this means to PASCAL is that, when PUT_REC is called, the
third parameter (REC) will NOT be checked. Inside PUT_REC, you'll
be able to refer to this parameter as REC, and to PUT_REC it'll
have the type TDBRECORD; however, the CALLER need not declare it
as TDBRECORD. For instance,
VAR SALES_REC: TSALES_REC;
EMP_REC: TEMP_REC;
...
PUT_REC (MY_DB, SALES_DATASET, SALES_REC);
...
PUT_REC (MY_DB, EMP_DATASET, EMP_REC);
will do EXACTLY what we want it to -- it'll pass SALES_REC and
EMP_REC to our PUT_REC procedure without complaining about their
data types.
As I said, PUT_REC itself will view the REC parameter as an
object of type TDBRECORD. However, PUT_REC can say
SIZEOF(REC)
and determine the TRUE size of the actual parameter that was
passed in place of REC. This can be very useful if PUT_REC needs
to do an FWRITE or some such operation that needs to know the
size of the thing being manipulated.
The way this is done, of course, is by PASCAL/XL's passing the
size of the actual parameter as well as the parameter's address.
Incidentally, you can turn this off for efficiency's sake if
you're not going to use this SIZEOF construct.
* PASCAL/XL allows you to do TYPE COERCION -- you can take an
object of an arbitrary type and view it as any other type. For
instance, you can take a generic "ARRAY OF INTEGER" and view it
as a record type, or take an INTEGER parameter and view it as a
FLOAT. A possible application might be:
TYPE COMPLEX = RECORD REAL_PART, IMAG_PART: REAL; END;
INT_ARRAY = ARRAY [1..2] OF INTEGER;
...
PROCEDURE WRITE_VALUE (T: INTEGER; ANYVAR V: INT_ARRAY);
BEGIN
IF T=1 THEN WRITELN (V[1])
ELSE IF T=2 THEN WRITELN (FLOAT(V))
ELSE IF T=3 THEN WRITELN (BOOLEAN(V))
ELSE IF T=4 THEN WRITELN (COMPLEX(V).REAL_PART,
COMPLEX(V).IMAG_PART);
END;
As you see, this procedure takes a type indicator (T) and a
variable of any type V. Then, depending on the value of T, it
VIEWS V as an integer, a float, a boolean, or a record structure
of type COMPLEX. All we need to do is say
typename(value)
and it returns an object with EXACTLY THE SAME DATA as "value",
but viewed by the compiler as being of type "typename". Note that
this means that "REAL(10)" won't return 10.0 (which is what a C
"(float) 10" type cast would do); rather, it'll return the
floating point number the MACHINE REPRESENTATION of which is 10.
Some other example applications for this very useful construct
are:
- You can now have a pointer variable that can be set to point
to an object of an arbitrary type; this allows you to write
things like generic linked list handling procedures that work
regardless of what type of object the linked list contains.
More about this on ANYPTR below.
- You may write a generic bit extract procedure that can be
used for extracting bits from characters, integers, reals,
etc. You'd declare it as:
FUNCTION GETBITS (VAL, STARTBIT, LEN: INTEGER): INTEGER;
...
and call it using
I:=GETBITS (INTEGER(3.0*EXP(X)), 10, 6);
or
I:=GETBITS (INTEGER(MYSTRING[I]), 5, 1);
or whatever. Note that you couldn't do this with ANYVAR
parameters since ANYVAR parameters are by-reference, and thus
can't be passed constants or expressions.
* PASCAL/XL -- just like PASCAL/3000 -- makes STRING parameters of
any size compatible with each other. Thus, you can pass a
STRING[20] to a procedure that's defined to take a STRING[256];
or, if you're passing the string by REFERENCE, you can just
declare the formal parameter as "STRING", which will be
compatible with any string type.
* PASCAL/XL has a new type called "ANYPTR"; declaring a variable to
be an ANYPTR makes it "assignment-compatible" with any other
pointer type, which means that that variable can be easily made
to point to objects of different types. This, coupled with the
"type coercion" operation mentioned above, makes manipulating
say, linked lists of different data structures much easier.
Needless to say, use of any of these constructs can get you into
trouble precisely because of the additional freedom they give you.
Converting a chunk of data from one record data type to another only
makes sense if you know exactly what you're doing; if you don't,
you're likely to end up with garbage.
However, often there are cases where you NEED this additional
freedom, and in those cases, PASCAL/XL really comes through. As a
rule, its type checking is as stringent and thorough as Standard
PASCAL's, but it allows you to relatively easily waive the type
checking whenever you need to.
ENUMERATED DATA TYPES
If you recall, before I started talking about type checking, I was
describing RECORD STRUCTURES, a new data type that PASCAL and C
support. My mind, you see, works like a stack -- sometimes I'll
interrupt what I'm doing and go off on a digression (sometimes
relevant, sometimes not); then, I'll just POP the stack, and I'm back
where I was before.
So, I'm popping the stack and continuing with the discussion of
"new" data types -- data types that C and/or PASCAL support, but SPL
does not.
Say you want to call the FCLOSE intrinsic. You pass to it the file
number of the file to be closed and you also pass the file's
DISPOSITION. This disposition is a numeric constant, indicating what
the system is to do with the file being closed:
FCLOSE (FNUM, 0, 0); means just close the file;
FCLOSE (FNUM, 1, 0); means SAVE the file as a permanent file;
FCLOSE (FNUM, 2, 0); means save the file as a TEMPORARY file;
FCLOSE (FNUM, 3, 0); means save the file as a temporary file,
but if it's a tape file, DON'T REWIND;
FCLOSE (FNUM, 4, 0); means DELETE the file being closed;
[we'll ignore for now the "squeeze" disposition and the fairly useless
third parameter.] Now, naturally, today's enlightened programmer
doesn't want to specify the disposition as a numeric constant -- how
many people will understand what's going on if they see a
FCLOSE (FNUM, 4, 0);
in the middle of the program? Instead, we'd define some constants --
EQUATE DISP'NONE = 0,
DISP'SAVE = 1,
DISP'TEMP = 2,
DISP'NOREWIND = 3,
DISP'PURGE = 4;
Now, we can say
FCLOSE (FNUM, DISP'PURGE, 0);
Don't you like this better? I knew you would.
As you see, in this case, an integer is being used not as a
QUANTITATIVE measure (how large a file is, how many seconds an
operation took, etc.), but rather as a sort of FLAG. This flag has no
mnemonic relationship to its numeric value; the numeric value is just
a way of encoding the operation we're talking about (save, purge,
etc.).
This sort of application actually occurs very frequently. Some
examples might include:
* FFILEINFO item codes, which indicate what information is to be
retrieved (4 = record size, 8 = filecode, 18 = creator id, etc.).
* CREATEPROCESS item numbers, which indicate what parameter is
being passed (6 = maxdata, 8 = $STDIN, 11 = INFO=, etc.).
* FOPEN foptions bits -- 1 = old permanent, 2 = old temporary, 4 =
ASCII file, 64 = variable record length file, 256 = CCTL file,
etc.; same for aoptions bits.
* And many other cases; each system table you look at, for
instance, will contain at least two or three of these sorts of
encodings.
As I mentioned, SPL's solution to this sort of problem is just
declaring constants (using EQUATE). Similarly, in PASCAL you could
easily say:
CONST DISP_NONE = 0;
DISP_SAVE = 1;
DISP_TEMP = 2;
DISP_NOREWIND = 3;
DISP_PURGE = 4;
and in C, you could code:
#define disp_none 0
#define disp_save 1
#define disp_temp 2
#define disp_norewind 3
#define disp_purge 4
Nice and readable; the constant declaration creates the link between
the symbolic name and the real numeric value -- after this, you can
use the symbolic name wherever you need to.
Enumerated data types are just like this, only different. In
PASCAL, you could say
TYPE FCLOSE_DISP_TYPE = (DISP_NONE, DISP_SAVE, DISP_TEMP,
DISP_NOREWIND, DISP_PURGE);
Instead of just defining five constants with values 0, 1, 2, 3, and 4,
this declaration defines a new DATA TYPE called FCLOSE_DISP_TYPE and
five OBJECTS of this type -- DISP_NONE, DISP_SAVE, etc. If you declare
the FCLOSE intrinsic as
PROCEDURE FCLOSE (FNUM: INTEGER; DISP: FCLOSE_DISP_TYPE;
SEC: INTEGER);
EXTERNAL;
you'll now be able to say
FCLOSE (FNUM, DISP_PURGE, 0);
The key difference between an "ENUMERATED TYPE" declaration and
the ordinary constant definitions is that the objects of the data type
can't be used as integers. For instance, saying this:
VAR DISP: FCLOSE_DISP_TYPE;
...
DISP:=1;
is an error, and you certainly can't say:
DISP:=DISP_SAVE*DISP_PURGE;
In fact, if you've declared FCLOSE as was shown above, then PASCAL
will even check the DISP parameter to make sure you're really passing
a DISP_xxx; if you accidentally pass something else, the compiler will
catch this and complain.
As you see, the advantage of enumerated types is type checking (a
field which PASCAL, in general, is rather compulsive about). The
disadvantage is this:
* How are you certain that when you declared the enumerated type,
DISP_SAVE actually corresponded to 1 and DISP_PURGE to 4? In
other words, when you pass a disposition to FCLOSE, PASCAL must
pass it as some integer value -- if you had declared it as a
constant, you'd KNOW the value; with an enumerated type, how are
you sure?
Well, although Standard PASCAL doesn't define what the "ACTUAL" value
of an enumerated type object is, most PASCALs -- including PASCAL/3000
and PASCAL/XL -- number the objects from 0 up, in the order given in
the enumerated type declaration. This is what lets our
FCLOSE_DISP_TYPE type work; the way that PASCAL allocates the numeric
values of the DISP_xxx objects is exactly the way we want it to.
On the other hand, say that I want to define file system error
numbers (which FCHECK might return). These are also special numeric
codes that we'd like to access using symbolic names, but they are NOT
sequentially ordered. For instance, you might want to declare
CONST FERR_EOF = 0;
FERR_NO_FILE = 52;
FERR_SECURITY = 93;
FERR_DUP_FILE = 100;
How can you declare this as an enumerated data type? Well, you can't,
unless you're willing to declare 51 "dummy items" between FERR_EOF and
FERR_NO_FILE so that FERR_NO_FILE will fall on 52. In general,
wherever there are "holes" in the sequence, enumerated types can't
really be used.
Now, this is not a complaint against enumerated types per se.
Enumerated types are great as long as YOU DON'T CARE WHAT THE VALUES
OF THE ENUMERATED TYPE OBJECTS ARE; if the type is used solely within
your programs, you won't have any problems. The trouble comes in when
you try to use enumerated types for objects whose values are dictated
externally.
To summarize,
* Enumerated types are very similar to constant declarations.
* Enumerated types' big advantages are:
- The compiler does type checking for them, making sure that you
don't accidentally use, say, an FOPEN foptions mode where you
ought to use an FCLOSE disposition.
- You don't have to manually assign a numeric value to each
enumerated type object.
* Enumerated types are great if they're defined and used solely
within your program, where you don't care what values the
compiler assigns to each object.
* If you're using enumerated types to represent objects whose
actual value is important -- say FCLOSE dispositions, FFILEINFO
item numbers, file system errors -- you may have troubles. If the
actual values are numbered sequentially starting with 0, you can
use an enumerated type to represent these values; if they don't
start with 0 or are not sequential, you can't really use an
enumerated type.
* In general, even if the values ARE numbered sequentially from 0
(like FCLOSE dispositions, FFILEINFO item numbers, CREATEPROCESS
item numbers, etc.) you might want to use constants instead of
enumerated types. This is because the numeric assignments aren't
easily visible in enumerated type declarations; if you
accidentally omit a possible value (e.g. declare the type as
(NONE, SAVE, TEMP, PURGE), omitting NOREWIND), it won't be at all
obvious that PURGE now has the wrong value.
ENUMERATED DATA TYPES IN C
Classic K&R C did not support enumerated types; as we saw, this
probably wasn't such a big disadvantage, since enumerated types are
just a fancy form of constant declarations.
Draft ANSI Standard C -- and, in fact, most modern Cs -- supports
enumerated types; you can say
typedef enum { disp_none, disp_save, disp_temp,
disp_norewind, disp_purge } fclose_disp_type;
which will define the type FCLOSE_DISP_TYPE just like PASCAL's
enumerated type declaration will. In fact, the numeric values of
DISP_NONE, DISP_SAVE, etc. will even be assigned the same way as they
would be with PASCAL.
The trouble is this: what was the major advantage of PASCAL
enumerated types over constants?
Well, once the PASCAL compiler knew that a variable was of an
enumerated type, it could do appropriate type checking. But C isn't a
strong type checking language! To C, any object of any enumerated type
is viewed exactly as an integer would be viewed.
Thus, the above declaration is EXACTLY the same as:
#define disp_none 0
#define disp_save 1
#define disp_temp 2
#define disp_norewind 3
#define disp_purge 4
If you say
fclose_disp_type disp;
(thus declaring DISP to be an object of FCLOSE_DISP_TYPE), you can now
code
disp = disp_save;
but you could also (if you wanted to) say
disp = 1;
or
disp = (i+123)/7;
One advantage that C has, though, is that (unlike PASCAL) it allows
you to explicitly specify the numeric values of each element in the
enumeration, to wit:
typedef enum { ferr_eof = 0, ferr_no_file = 52,
ferr_security = 93, ferr_dup_file = 100 }
file_error_type;
The DEFAULT sequence, you see, is from 0 counting up by 1; however,
you can override it with any initializations you want.
In other words, in C, enumerated type declarations are truly just
another way of defining integer constants. The above declaration is in
fact identical to
#define ferr_eof 0
#define ferr_no_file 52
#define ferr_security 93
#define ferr_dup_file 100
To summarize:
* Enumerated data types in PASCAL =
Just like ordinary constants +
Type checking.
* Enumerated data types in Draft ANSI Standard C =
Enumerated data types in PASCAL -
Type checking.
* Ergo,
Enumerated data types in Draft ANSI Standard C =
Just like ordinary constants!
See how easy things become if you use a little mathematics?
SUBRANGE TYPES IN PASCAL
Another new category of data type that PASCAL has is the so-called
subrange type. It is in some ways the quintessential PASCAL feature
because it really performs NO NEW FUNCTION except for allowing
additional compiler type checking.
In PASCAL, you can declare a variable thus:
VAR V: 1..100;
This means that V is defined to always be between 1 and 100. It is an
error for it to be outside of these bounds, and the PASCAL compiler may
generate code to check for this (PASCAL/3000 certainly does).
Now, fortunately, the type checking on this isn't quite as
stringent as on other types. In other words, if you declare:
TYPE RANGE1 = 1..10000;
SMALL_RANGE = 100..199;
VAR SM: SMALL_RANGE;
...
PROCEDURE P (NUM: RANGE1);
then you can still call
P (SM);
even though SMALL_RANGE and RANGE1 are not the same type.
On the other hand, if NUM is a BY-REFERENCE (VAR) parameter, i.e.
PROCEDURE P (VAR NUM: RANGE1);
then saying
P (SM);
will be an error! Any by-reference parameter MUST be an IDENTICAL type
(i.e. either the same type or one defined as identical, as in "TYPE
NEWTYPE = OLDTYPE"). Different subranges of the same type (even two
differently-named and separately-defined types whose definitions are
identical!) are FORBIDDEN.
If the full implications of this haven't sunk in yet, consider this
procedure:
TYPE TPAC256 = PACKED ARRAY [1..256] OF CHAR;
PROCEDURE COUNT_CHARS (VAR S: TPAC256;
VAR NUM_BLANKS: INTEGER;
VAR NUM_ALPHA: INTEGER;
VAR NUM_NUMERIC: INTEGER;
VAR NUM_SPECIALS: INTEGER);
This one's simple -- it goes through a string S and counts the number
of blanks, alphabetic characters, numeric characters, and other
"special" characters; all the counts are returned as integer VAR
parameters.
The variables that we pass as NUM_BLANKS, NUM_ALPHA, NUM_NUMERIC,
and NUM_SPECIALS can NOT be declared as subrange types! If we say:
VAR NBLANKS, NALPHA, NNUMERIC, NSPECIALS: 1..256;
...
COUNT_CHARS (S, NBLANKS, NALPHA, NNUMERIC, NSPECIALS);
the compiler WON'T just check the variables after the COUNT_CHARS call
to make sure that COUNT_CHARS didn't set them to the wrong values;
rather, THE COMPILER WILL PRINT AN ERROR MESSAGE!
If you still insist on using subrange types for this sort of thing,
you get into the ridiculous circumstance in which YOU NEED A SEPARATE
COUNT_CHARS PROCEDURE FOR EACH POSSIBLE TYPE COMBINATION OF THE
NUM_BLANKS, NUM_ALPHA, NUM_NUMERIC, AND NUM_SPECIALS PARAMETERS!
This is why I'm skeptical of the utility of subrange variables.
It's great for the compiler to be able to do run-time error checking
and warn me of any errors in my program; however, I can never really
pass "by reference" subrange variables to any general-purpose routine!
On the one hand, we are told that it's great to have lots of
general-purpose utility procedures that can be called by a number of
other procedures in a number of possible circumstances; on the other
hand, we're prevented from doing this by too-stringent type checking!
Thus, to summarize:
* Subrange types are theoretically useful as a way of giving the
compiler more information with which to do run-time checking.
* However, their utility is SERIOUSLY COMPROMISED by the fact that
you can't, for instance, pass a subrange type by reference to a
procedure that expects an INTEGER (or vice versa).
This is especially damaging if you like to (and you should like
to) write general-purpose procedures -- your only serious
alternative there is to declare any by-reference parameters'
types to be INTEGER and make sure that all the variables you'd
ever want to pass to such a procedure are type INTEGER too.
SPL and C, not being very strict type-checkers, don't support
subranges. In light of all I've said, this doesn't seem to be such an
awful lack.
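For comparison, here is a minimal sketch of such a routine in C (the
names are mine, not from any library): since C has no subranges, all
four counts are plain ints, and one version of the routine serves
every caller.

```c
#include <ctype.h>

/* Count the blanks, letters, digits, and "special" characters in the
   NUL-terminated string s.  All the counts are plain ints, so there
   is only one version of the routine, whatever the caller declares. */
void count_chars(const char *s, int *num_blanks, int *num_alpha,
                 int *num_numeric, int *num_specials)
{
    *num_blanks = *num_alpha = *num_numeric = *num_specials = 0;
    for (; *s != '\0'; s++) {
        if (*s == ' ')
            (*num_blanks)++;
        else if (isalpha((unsigned char)*s))
            (*num_alpha)++;
        else if (isdigit((unsigned char)*s))
            (*num_numeric)++;
        else
            (*num_specials)++;
    }
}
```

Of course, this buys flexibility at the price of the very checking
subranges were supposed to provide: nothing stops a caller from
passing counters that overflow a 16-bit word.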
Finally, one more important comment about subrange types. As I
mention in the "BIT OPERATORS" section of this paper, subrange types
(in PACKED RECORDs and PACKED ARRAYs) are PASCAL/3000's and
PASCAL/XL's mechanism for accessing bit fields.
This is NOT endorsed or supported by the PASCAL Standard, but it
turns out to be one of the most useful applications of subrange types
in PASCAL/3000 and PASCAL/XL.
DATA ABSTRACTION
When I was converting our MPEX/3000 product to run on both the
pre-Spectrum and Spectrum machines, I had to overcome several
problems.
One was, of course, that some (although not all) of the privileged
procedures and operations I used had to be done somewhat differently
on MPE/XL.
My conversion here was helped by the concept of "code isolation" --
rather than putting various calls to, say, DIRECSCAN or FLABIO all
over my program, I isolated them in individual procedures. Then, all I
had to do was replace those "wrapping" procedures, and all of the
programs that called them didn't have to be changed.
Another problem was that some of the tables (like the file label,
directory entries, ODD, JIT, etc.), though similar in principle and
containing much the same fields, had different offsets for those
fields.
Here I was helped by the fact that I never explicitly referenced,
say, the filecode field of the file label by saying "FLAB(26)".
Instead, I had an $INCLUDE file that DEFINEd the token "FLAB'FCODE" to
be "FLAB(26)" -- all I had to do was change the $INCLUDE file and
again the rest of my programs didn't need to be changed.
One area, though, that gave me more trouble than I would have
expected was the changing size of some fields.
Not the changing meaning -- the file label still contained a record
size field and a block size field, and the directory still contained a
file label address -- but rather the changing SIZE. The record size
and the block size were now 2 words rather than 1; the file label
address was 10 words instead of 2.
Consider a few of my procedures:
DOUBLE PROCEDURE ADDRESS'FNUM (FNUM);
VALUE FNUM;
INTEGER FNUM;
<< Given a file number, returns its file label's disc address. >>
DOUBLE PROCEDURE ADDRESS'NAME (FILENAME);
BYTE ARRAY FILENAME;
<< Given a filename, returns its file label's disc address. >>
PROCEDURE FLABREAD (ADDR, FLABEL);
VALUE ADDR;
DOUBLE ADDR;
ARRAY FLABEL;
<< Given a file label's disc address, reads the file label. >>
The plan here is that FLABREAD is the master file label read
procedure, to which we pass a disc address. We can either say
FLABREAD (ODD'FLAB'ADDR, FLABEL); << if we have the address >>
or
FLABREAD (ADDRESS'FNUM(IN'FNUM), FLABEL);
or
FLABREAD (ADDRESS'NAME(PROG'FILENAME), FLABEL);
Convenient, readable, general. What's wrong with it?
This mechanism was quite acceptable for MPE/III, MPE/IV, and MPE/V
because then the disc address was a double integer. In MPE/XL it
changed to a 10-word array. Any places that explicitly refer to it as
a DOUBLE must be changed to call it an INTEGER ARRAY.
"Data abstraction" refers to exactly this concern. Don't call a
disc address "a double integer". Rather call it "an object of type
DISC_ADDRESS". In PASCAL terms, don't say:
PROCEDURE FLABREAD (ADDR: INTEGER; VAR F: FLABEL);
Say
TYPE DISC_ADDRESS = INTEGER; { double integer }
...
PROCEDURE FLABREAD (ADDR: DISC_ADDRESS; VAR F: FLABEL);
Then, when you need to change to MPE/XL, all you need to do is change
the TYPE declaration to
TYPE DISC_ADDRESS = ARRAY [1..10] OF SHORTINT; { 10 words }
and you're home free. Of course, you'll doubtless have to change the
IMPLEMENTATION of FLABREAD (if the disc address format has changed,
probably the way of accessing it has, too); however, you won't have to
touch any of the CALLERS of FLABREAD.
So that's the first component of data abstraction -- the
responsibility of the programmer for declaring objects not with the
type they happen to have -- say, INTEGER -- but rather with some
"abstract type" (DISC_ADDRESS) that is defined elsewhere as INTEGER.
The second component of data abstraction, though, is much less
obvious. Say that you said
{ in PASCAL }
TYPE DISC_ADDRESS = ARRAY [1..10] OF SHORTINT;
...
FUNCTION ADDRESS_FNUM (FNUM: INTEGER): DISC_ADDRESS;
{ in C }
typedef int disc_addr[10];
...
disc_addr address_fnum (fnum);
int fnum;
{ in SPL }
DEFINE DISC'ADDRESS = INTEGER ARRAY #;
DISC'ADDRESS ADDRESS'FNUM (FNUM);
VALUE FNUM;
INTEGER FNUM;
All of these declarations would make sense -- instead of returning an
integer, ADDRESS'FNUM is to return an object of type DISC_ADDRESS,
which happens to be an integer array.
The trouble here is that in neither Standard PASCAL nor C nor SPL
can a procedure return an integer array!
Thus, "hiding" the type of an object from most of the object's
users is very nice, but ONLY IF THE COMPILER PERMITS IT TO REMAIN
HIDDEN. In another example, in SPL, saying
FOR I:=1 UNTIL RECSIZE DO
is only legal if RECSIZE is an integer. If RECSIZE is a double
integer, all the data abstraction in the world will do us no good
because the SPL compiler itself will reject the above FOR loop.
To be truly able to have "data abstraction" -- to be able to not
care about an object's underlying representation type -- the compiler
must treat all the possible types as equally as possible.
Considering again the case of the disc address, there's no way we
can have an SPL procedure return anything that can represent a 10-word
value. We'd have to write ADDRESS'FNUM as
PROCEDURE ADDRESS'FNUM (FNUM, RETURN'VALUE);
VALUE FNUM;
INTEGER FNUM;
INTEGER ARRAY RETURN'VALUE;
and then call it as:
INTEGER ARRAY TEMP'DISC'ADDR(0:9);
...
ADDRESS'FNUM (FNUM, TEMP'DISC'ADDR);
FLABREAD (TEMP'DISC'ADDR, FLAB);
instead of simply
FLABREAD (ADDRESS'FNUM(FNUM), FLAB);
This is, of course, less convenient, which is why I kept the address
as a double integer instead of an integer array -- and got stuck when
I converted to Spectrum.
In Standard PASCAL, as I said, I couldn't have a function returning
an integer array. PASCAL/3000, though, lifts this restriction -- you
can now say:
TYPE DISC_ADDRESS = ARRAY [1..10] OF SHORTINT;
...
FUNCTION ADDRESS_FNUM (FNUM: INTEGER): DISC_ADDRESS;
...
FLABREAD (ADDRESS_FNUM(FNUM), FLAB);
Since the function is allowed to return an integer array, we can keep
the same interface regardless of whether DISC_ADDRESS is a double
integer or an array. Of course, the efficiency of the code won't be
quite the same; similarly, the internals of ADDRESS_FNUM would
doubtless be somewhat different. However, the callers of ADDRESS_FNUM
wouldn't have to change a whit despite the change in the underlying
definition of the DISC_ADDRESS type.
In C, functions can't return arrays, either. However, Draft Standard
C (and many later K&R-era compilers, as an extension) lets them
return structures, and a structure might well contain only one
element -- an array. Thus, we could say
typedef struct {int x[10];} disc_address;
...
disc_address address_fnum(fnum)
int fnum;
...
flabread (address_fnum(fnum), flab);
Of course, it isn't quite as convenient to manipulate an object of
type DISC_ADDRESS as it would be if it were a simple array (instead of
"discaddr[3]=ldev", we have to say "discaddr.x[3]=ldev"), but this is
a reasonable alternative.
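Here is a compilable sketch of the struct-wrapping trick. The body of
address_fnum is a dummy stand-in of my own -- a real one would do a
privileged directory lookup -- and I've used "short" where the
HP3000's one-word "int" was meant.

```c
/* Wrap the 10-word array in a struct so a function can return it by
   value.  On the HP3000, "int" was one 16-bit word; short keeps the
   fields 16 bits wide on modern machines. */
typedef struct { short x[10]; } disc_address;

/* Dummy stand-in: a real version would look the address up in the
   system's directory.  Here we just fill in recognizable values. */
disc_address address_fnum(int fnum)
{
    disc_address a;
    int i;
    for (i = 0; i < 10; i++)
        a.x[i] = (short)(fnum + i);
    return a;   /* the whole struct is copied back to the caller */
}
```

Because the struct is returned by value, the caller can write
"flabread(address_fnum(fnum), flab)" just as with a scalar return.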
Again, note how we can easily switch the underlying representation
of DISC_ADDRESS to, say, a double integer, or a long float, or
whatever, without changing the fundamental structure of the procedures
that use DISC_ADDRESSes.
Similarly, compare the SPL treatment of INTEGERs and DOUBLEs
against the PASCAL treatment of SHORTINTs (1-word integers) vs.
INTEGERs (2-word integers) or the C treatment of "short int"s (1-word)
vs. "int"s (2-word). In SPL, INTEGERs and DOUBLEs are mutually
INCOMPATIBLE -- you can't say:
DOUBLE D;
INTEGER I;
D:=I+D;
In PASCAL, though,
TYPE SHORTINT = -32768..32767; { this is built into PASCAL/XL }
VAR S: SHORTINT;
I: INTEGER;
I:=I+S;
will work, as will
short int s;
long int i;
i = i + s;
in C.
A similar thing, incidentally, can be said about SPL, PASCAL, and
C's handling of real numbers. SPL's REAL and LONG (double precision)
types are incompatible; in PASCAL and C dialects where two
floating-point types are provided (remember, neither language is
OBLIGATED to provide more than one floating-point type), the
floating-point types are always mutually compatible.
What this means, of course, is that it's quite easy in PASCAL or C
to change the type of a variable from "short integer" or "short real"
to "long integer" or "long real", or vice versa; in SPL, it's quite
difficult, since we'll have to put in a lot of manual type conversions
to make sure everything stays consistent.
To summarize, then, the differences in the way the various
languages handle data types:
[Note: "STD PAS" refers to both Standard PASCAL and the ISO Level 1
Standard.]
                                            STD  PAS/ PAS/ K&R  STD
                                       SPL  PAS  3000 XL   C    C
CAN A FUNCTION RETURN ANY OBJECT?
   CAN IT RETURN AN ARRAY?             NO   NO   YES  YES  NO   NO
   CAN IT RETURN A RECORD?             NO   NO   YES  YES  NO   YES
CAN A FUNCTION OR A PROCEDURE HAVE
ANY OBJECT AS A "BY-VALUE" PARM?
   CAN IT HAVE A BY-VALUE ARRAY?       NO   YES  YES  YES  NO   NO
   CAN IT HAVE A BY-VALUE RECORD?      NO   YES  YES  YES  YES  YES
DOES AN ASSIGNMENT STATEMENT COPY
ANY TYPE OF OBJECT?
   CAN IT COPY AN ARRAY?               NO   YES  YES  YES  NO   NO
   CAN IT COPY A RECORD?               NO   YES  YES  YES  NO   YES
CAN YOU MIX, SAY, "INTEGER"S AND
"DOUBLE"S IN AN EXPRESSION?            NO   YES  YES  YES  YES  YES
CAN YOU MIX, SAY, "REAL"S AND
"LONGREAL"S IN AN EXPRESSION?          NO   YES  YES  YES  YES  YES
The more similar the treatment of various types, the easier it is
to achieve data abstraction -- and thus to insulate a program from the
underlying representation that a particular type might have.
I/O IN PASCAL AND C
You can't write a program without I/O -- that's obvious enough.
Even minimally sophisticated programs, especially system programs,
need to be able to do many I/O-related things. This doesn't just mean
reading and writing; it means direct I/O (by record number rather than
serial), building new files, opening old files, deleting files,
checking to see if files exist, and so on.
Of course, here we run into the classic problem of portability vs.
functionality. Nowhere do operating systems vary more than in their
file systems and the modes of I/O that they support; implementing I/O
in a portable programming language can be a nightmare for the language
designers.
PASCAL and C I/O are substantially different in many respects.
Standard PASCAL and PASCAL/3000 I/O are different too, and PASCAL/XL
adds a couple more interesting quirks. And, of course, Kernighan &
Ritchie C and Draft Standard C have their differences as well -- what
fun!
Before I go further, some ground rules have to be established.
There are two ways to talk about I/O (or any other feature of a
language):
* We can discuss the BUILT-IN I/O mechanisms; in PASCAL's case,
this includes WRITE, READ, WRITELN, READLN, GET, PUT, and the
like -- in C's it includes "fopen", "fclose", "getc", "putc",
"printf", "scanf", etc.
* We can discuss how EXTENSIBLE the I/O mechanism is. Since I/O
systems differ on all machines, no standard portable language can
include all the features that are available on all computers.
Thus, the question arises -- how easily can we use additional,
machine-related features, together with the standard I/O
facility?
In other words, do we have to choose between "all standard" and
"all native mode", or can we, say, open a file using our particular
computer's I/O system and then read it using the language's
facility?
This, I believe, is an important distinction. It's true that PASCAL
and C are "extensible" languages -- as long as a hook is available to
the machine-specific system procedures (e.g. INTRINSICs in
PASCAL/3000), we can use the host's I/O system (e.g. FOPEN, FREAD,
FWRITE, FCLOSE). But what's the point of re-inventing the wheel?
We'd like the built-in I/O system to satisfy most of our needs,
both for portability's sake and convenience's sake. On the other hand,
we know that some machine-dependent features won't be included in
either the standard or even the particular machine implementation.
How do you expect, for instance, to have PASCAL/3000 know about RIO
files? You have to have some means of accessing the native I/O
procedures (e.g. HP's FOPEN, FCLOSE, etc.), but more than that -- you
have to be able to use a maximum of the language's I/O mechanism
combined with the necessary minimum of the host's non-portable I/O
system.
In other words, you shouldn't be forced to choose between RESET,
READLN, and WRITELN on the one hand and FOPEN, FREAD, FWRITE, and
FCLOSE on the other. For instance, you ought to be able to call FOPEN
to open a file in a special mode but then use READLN and WRITELN
against it; or, conversely, open the file using RESET or REWRITE and
then be able to call intrinsics like FGETINFO or FREADLABEL against
it.
This will be both easier to write and more portable -- when you
port the program, you'll only have to change the small
system-dependent part.
STANDARD PASCAL
I have a theory about SPL. I believe that the main reason why
SPL/3000 isn't more popular in the HP3000 community is not that it
has, say, an ASSEMBLE statement or a TOS construct. Nobody HAS to use
ASSEMBLEs or TOSes.
Rather, the problem was that you CAN'T DO SIMPLE I/O IN SPL. You
want to write a program that adds two numbers? The addition statement
is simple:
INTEGER NUM1, NUM2, RESULT;
RESULT:=NUM1+NUM2;
Ah, but the I/O!
INTRINSIC ASCII, BINARY, PRINT, READX;
INTEGER ARRAY BUFFER'L(0:39);
BYTE ARRAY BUFFER(*)=BUFFER'L;
INTEGER LEN;
LEN:=READX (BUFFER'L, -80);
NUM1:=BINARY (BUFFER, LEN);
LEN:=READX (BUFFER'L, -80);
NUM2:=BINARY (BUFFER, LEN);
...
LEN:=ASCII (RESULT, 10, BUFFER);
PRINT (BUFFER'L, -LEN, 0);
And this is without prompting the user, or printing any string
constants at all! How can a beginner get anything DONE this way? For
that matter, think of the trouble that even an EXPERT has to go to to
do anything useful!
Note that in SPL, you have complete FLEXIBILITY -- you can call any
intrinsic, open a file in any mode, do I/O with any carriage control.
But, since you have no built-in I/O interface to make all these
features easy to use, you have to go through a lot of effort to do
what you need to do. Like life itself -- everything is possible but
nothing is easy.
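For contrast, here is the whole add-two-numbers program sketched in C,
with the standard library doing all the formatting (the routine name
and prompt text are mine; a real program would simply call it as
add_two(stdin, stdout) from main):

```c
#include <stdio.h>

/* Read two integers from "in" and print their sum to "out".
   Returns 1 on success, 0 if the input wasn't two numbers. */
int add_two(FILE *in, FILE *out)
{
    int num1, num2;
    fprintf(out, "Enter two numbers: ");
    if (fscanf(in, "%d %d", &num1, &num2) != 2)
        return 0;
    fprintf(out, "%d\n", num1 + num2);
    return 1;
}
```

Two library calls replace the whole READX/BINARY/ASCII/PRINT dance.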
PASCAL -- having originally been designed as a teaching language --
naturally placed a premium on quick "start-up" time. Terminal I/O, for
instance, of either strings or numbers, isn't difficult; READ, READLN,
WRITE, and WRITELN can do appropriate formatting. File I/O, however,
is quite a bit less flexible, and even terminal I/O lacks some rather
valuable features.
Consider the set of Standard PASCAL I/O operators:
* READ and READLN can be used to read data from a file.
* WRITE and WRITELN can be used to write data to a file.
* PAGE can be used to trigger a form feed.
* RESET and REWRITE "open" for reading or writing, respectively,
the system file that corresponds to a particular PASCAL file
variable.
* GET, PUT, and file buffer variables allow you to work with the
file in a slightly different way than READ and WRITE; we won't
discuss these much in this section, since for the most part
they're quite similar to READ and WRITE.
* EOLN allows you to detect an end-of-line condition in text input.
* EOF allows you to determine whether or not the NEXT read against
a file will get an end-of-file. This is very nice, since it
allows you to say:
WHILE NOT EOF(F) DO
BEGIN
READLN (F, X, Y, Z);
...
END;
as opposed to, say, the SPL solution, with which you have to
write the read twice:
FREAD (F, REC, 128);
WHILE = DO
BEGIN
...
FREAD (F, REC, 128);
END;
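C, for its part, folds the end-of-file test into the read call itself
-- fgets returns NULL at end of file -- so neither a separate EOF
function nor a repeated read is needed. A sketch (the routine name is
mine):

```c
#include <stdio.h>

/* Count the lines in a text stream.  The read and the end-of-file
   test are one and the same call.  (Assumes lines under 256 chars;
   a longer line would be counted once per 255-char piece.) */
long count_lines(FILE *f)
{
    char line[256];
    long n = 0;
    while (fgets(line, sizeof line, f) != NULL)
        n++;
    return n;
}
```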
This is what PASCAL has -- what doesn't it have?
* There is no standard way of telling PASCAL to output a "prompt"
-- a string not followed by a carriage return/line feed. A vital
operation, I'm sure you'll agree, and surely any computer can
support it -- why doesn't Standard PASCAL include it?
* There is no standard way of accessing a file using "direct-
access" -- reading or writing by record number instead of
serially (like FREADDIR and FWRITEDIR do). Even FORTRAN IV
supports this (READ (10@RECNUM))!
* There is no standard way of indicating exactly what file you want
to open. Most PASCALs associate some default system filename with
each file declared in the program (e.g. the PASCAL file
"EMPFILE" may be associated with the MPE filename "EMPFILE"); but
what if you don't know the filename at compile time?
Portability, incidentally, isn't a concern here. There are plenty
of very portable programs that require this feature -- say, a
simple file copier, a text editor, etc.
* Standard PASCAL allows you to open a file for read access or for
write access. You can't open a file for APPEND access or
INPUT/OUTPUT access, both very common requirements. Again, why
not? Almost every operating system supports these access modes!
* Of course, no provision is made for other, equally portable
and equally important features like closing a file, deleting
files, creating files, checking if a file exists, not to mention,
say, renaming a file.
* No standard mechanism exists for detecting errors in file
operations. If, say, a file open (RESET or REWRITE) fails, the
program is typically aborted by the PASCAL run-time library. What
about graceful recovery?
How would you like, say, a command-driven file-manipulation
program that aborted with a compiler library error message
whenever you gave it a bad filename?
* The lack of error handling is particularly grave in READs from
text files. It's great that PASCAL will parse the numeric input
for you, but what if the user enters an invalid number? Surely
you don't just want the entire program to abort!
* WRITELN and READLN are rather simple-minded. No provision exists
for left- vs. right-justification, octal or hex output of
numbers, mandatory sign specification (i.e. print a "+" if the
number is positive, rather than printing no sign at all), and a
number of other useful things.
I find this to be a rather substantial set of inadequacies,
especially if we want to use PASCAL as a system programming language.
Now, all those problems are a property of Standard PASCAL. I'll be
the first to admit that virtually all PASCAL implementations work
around at least some of these things (after all, if they didn't, the
language wouldn't be usable).
However, remember the advantages of STANDARDS. Some PASCAL
compilers might call the prompt function PROMPT and others might just
use WRITE; some might have an APPEND procedure to open a file for
append access and others might have this as an option to a general
OPEN procedure.
A general language like PASCAL is great for writing portable code,
and surely there's nothing non-portable about wanting to prompt the
user or append to a file! But, the more implementation-dependent
features we have to use, the more portability we'll lose.
PASCAL/3000
The designers of PASCAL/3000 knew about Standard PASCAL's I/O
deficiencies, and they introduced a number of features to correct
them:
* PROMPT has been added -- this is just like WRITELN, but prints
its stuff without a carriage return/line feed.
* READDIR, WRITEDIR, and SEEK do direct I/O; they are equivalent to
FREADDIR, FWRITEDIR, and FPOINT.
* RESET and REWRITE allow you to specify the filename of the file
to be opened, for input or output access, respectively.
* OPEN allows you to open a file for input/output access; APPEND
lets you open for appending.
* CLOSE lets you close a file; procedures like LINEPOS, MAXPOS, and
POSITION let you find out various information about an open file
(a very small subset of FGETINFO). CLOSE has an option that lets
you purge the open file or save it as temporary.
* FNUM allows you to get the file number of any open PASCAL file,
thus letting you call any file system intrinsic (like FGETINFO,
FRENAME, etc.) on a PASCAL file. This is a MAJOR and VITAL
flexibility feature, because otherwise you would have to do your
I/O on a particular file using either ONLY the PASCAL I/O system
or ONLY the MPE I/O system, but never both.
* Finally, a very intricate and hard-to-use mechanism (XLIBTRAP)
has been implemented to catch either I/O errors or
string-to-number conversion errors. To use it, you have to use
XLIBTRAP, the low-level WADDRESS procedure, and a global
variable; look at the example in the HP Pascal manual under
TRAPPING RUN-TIME ERRORS (pages 10-21 through 10-23 in the OCT 83
issue of the manual) to convince yourself that there's GOT to be
a better way.
This of course makes things a lot more bearable. Still, some things
remain unresolved:
* PASCAL/3000 allows you to open a file for input, output,
input/output, and append access. It also allows you to indicate
CCTL vs. NOCCTL and SHARED vs. EXCLUSIVE. This is very nice, but,
of course, MPE allows ten times this many options -- how about
opening temporary files, opening files for OUTKEEP access,
specifying record size, file limit, ASCII vs. BINARY, etc.?
You can use FNUM to go from a PASCAL file variable to an MPE file
number and thus use MPE intrinsics on a PASCAL-opened file. You
ought to be able to do the converse of this -- open a file using
FOPEN and then use PASCAL features (like READLN and WRITELN) on
this file. You can't -- if you need to open, say, a temporary
file, you'd have to FOPEN it and then use your own FREADs and
FWRITEs.
(Actually, you could use a :FILE equate issued using the COMMAND
intrinsic; this, however, is much more cumbersome, doesn't
support all the FOPEN intrinsic options, and prevents you from
allowing the user to issue his own file equations for the file.)
* Error trapping, as I said, is still very hard to do.
The first two, I think, are the most serious problems. Since so
much systems programming involves file handling, flexible and
resilient file system operations are, I believe, a MUST for any system
programming language. PASCAL/3000 is a lot better than Standard
PASCAL, but it still has some flaws.
PASCAL/XL
PASCAL/XL's I/O is even better than PASCAL/3000's. PASCAL/XL adds a
couple of features that make its I/O capability almost complete:
* The ASSOCIATE built-in procedure:
ASSOCIATE (pascalfile, fnum);
Very simple. Makes the given PASCAL FILE variable point to the
file indicated by FNUM. Thus, you can call FOPEN with whatever
options your heart desires, and then use all of PASCAL's I/O
facilities against that file. Such a deal!
* TRY .. RECOVER. This construct is described in more detail in the
"CONTROL STRUCTURES" chapter of this paper -- and a very
powerful construct it is -- but for file I/O it lets you do this:
REPEAT
ERROR:=FALSE;
PROMPT ('Enter filename: ');
READLN (FILENAME);
TRY
OPEN (FILENAME);
RECOVER
ERROR:=TRUE; { will branch here in case of error }
UNTIL
NOT ERROR;
You just wrap a "TRY" and a "RECOVER" around the file operation
that might get an error, and the statement after "RECOVER" will
get branched into in case of error (instead of having the program
abort). Similarly, you can say:
REPEAT
ERROR:=FALSE;
PROMPT ('Enter an integer: ');
TRY
READLN (I);
RECOVER
ERROR:=TRUE;
UNTIL
NOT ERROR;
Still not QUITE as easy as I'd like it to be, but a lot better
than before.
The only trouble is that -- at least for you and me and the rest of
the HP3000 user community -- PASCAL/XL is still a "future" language;
we can't really tell how good it is until we've hacked at it for some
time and have seen all the implications of the various new features.
Still, the PASCAL/XL I/O system seems to be an eminently reasonable
and usable creature.
KERNIGHAN & RITCHIE C
The original "Kernighan & Ritchie" book, which for practical
purposes was the original C "standard", has a chapter describing the C
I/O library. Its first sentence is "Input and output facilities are
not part of the C language", which, while technically true, proved
practically incorrect. By the very act of inclusion of the I/O chapter
in the K & R C book, this I/O library became as "standard" as the
rest of the C language described therein -- which is to say, not
entirely standard, but nonetheless surprisingly compatible on
virtually all modern machines.
Note that the same can not be said of the next chapter, "THE UNIX
SYSTEM INTERFACE", and I won't consider the features listed there as
part of standard C.
The list of standard C I/O features differs from standard PASCAL
I/O features:
* The C I/O facility emphasizes what are known in PASCAL as TEXT
files -- files that are viewed as streams of characters. In
PASCAL you can declare a "FILE OF RECORD" or "FILE OF ARRAY
[0..127] OF INTEGER" and read an entire record or 128-element
array at a time. In C you'd have to read this array
character-by-character.
Note that from a performance point of view, this may not be a
problem, since virtually all C implementations buffer their I/O in
rather big chunks -- 256 single-byte reads in C shouldn't be much
slower than a single 128-word read. Still, this kind of
character-read loop is more cumbersome than one would like.
* C provides the GETC and PUTC primitives to read and write a
character at a time.
C also provides an UNGETC primitive that "ungets" the last
character you've read, effectively moving the file pointer back
by one byte and assuring that the next character you'll read will
be the same one you've just read.
This is surprisingly useful, especially for parsing.
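A sketch of the parsing idiom (the routine name is mine): read digits
until something else comes along, then push the delimiter back so the
caller still sees it.

```c
#include <stdio.h>
#include <ctype.h>

/* Read an unsigned decimal number from f, stopping at the first
   non-digit.  The delimiter is "ungotten" so the caller's next
   getc returns it. */
int read_uint(FILE *f)
{
    int c, val = 0;
    while ((c = getc(f)) != EOF && isdigit(c))
        val = val * 10 + (c - '0');
    if (c != EOF)
        ungetc(c, f);   /* push the delimiter back onto the stream */
    return val;
}
```

Without UNGETC, the routine would have to either "eat" the delimiter
or hand it back through an extra parameter.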
* C provides FSCANF and FPRINTF to do formatted I/O. These are
rather more powerful than PASCAL's READLN and WRITELN -- see the
FORMATTED I/O: C vs. PASCAL section below.
* FGETS and FPUTS read an entire line at a time. Nothing much --
just like PASCAL's READLN and WRITELN of strings.
* FOPEN (not to be confused with the MPE intrinsic of the same
name!) lets you open an arbitrary file for read, write, or append
access. Unlike Standard PASCAL, C allows you to specify the name
of the file you want to open. FCLOSE closes an open file.
* End of file is indicated by a special return condition from GETC
and PUTC. In PASCAL, of course, the special EOF procedure returns
you the end of file indication, and all attempts to read at an
end of file cause a program abort. Each method has its
advantages.
* Records in a file are delimited not by a special "line delimiter"
as in PASCAL, but rather by the ordinary ASCII character
"NEWLINE". The exact ASCII value of this character is left up to
the compiler's discretion -- it's usually a LINEFEED (10), but
sometimes a CARRIAGE RETURN (13); however, this character is
always available in C as '\n', so you can say something like:
if (getc(stdin) == '\n')
/* do end of line processing */
* If you want to skip to the next line in a file (or on the
terminal), you have to output a newline character. Thus, instead
of
WRITELN ('HELLO, WORLD!')
you'd say
fprintf (stdout, "hi, wld!\n");
/* no C programmer would actually SPELL OUT "hello" */
/* or "world". */
This implies that just by omitting the "\n", you can prevent the
skip-to-the-next-line. Thus,
printf ("name? "); /* same as 'fprintf (stdout,"name? ");' */
scanf ("%s", name);
will presumably prompt the user with "name? " and request input
on the same line. In most PASCALs (including PASCAL/3000), if you
do a WRITE followed by a READ, the prompt won't actually come out
until a subsequent WRITELN -- the WRITE output will be "buffered"
until "flushed" by a WRITELN. As best I can tell, no C standard
would actually prevent this behavior in a C compiler; however,
most C compilers do the right thing and flush any pending
terminal output before doing terminal input.
* Error handling is different from PASCAL's:
- If FOPEN can't open a file, it returns a special value to the
program (unlike PASCAL, which aborts the program). The program
can then check for this condition and handle it appropriately.
I like this, even if there's no standard way to find out
exactly what kind of error occurred.
- FSCANF behaves differently from PASCAL READLN. If you use
FSCANF to read an integer, it won't print an error if it sees a
letter or some special character; rather, it'll just consider
that that character has delimited the read of the integer (if no
digits were seen at all, the integer is simply left unassigned),
and the file pointer points to the non-numeric character that
stopped the read.
Then, you can use GETC to make sure that the character was
really a newline or a blank or whatever it is that you
expected; or, you can check the result of FSCANF (which will be
set to the number of items actually read) to see if all the
items that you were asking for were really given. This is a lot
better than PASCAL's approach of just aborting and giving the
program no chance to recover gracefully.
- Error conditions for the other functions (except for end of
file on GETC) are not defined by K&R.
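The same return-count check works on a line already in memory via
SSCANF, FSCANF's string-based sibling; a minimal sketch (the routine
name is mine):

```c
#include <stdio.h>

/* Parse "x y" out of a line of text.  sscanf, like fscanf, returns
   the number of items successfully converted, so bad input can be
   detected and retried instead of aborting the program. */
int parse_point(const char *line, int *x, int *y)
{
    return sscanf(line, "%d %d", x, y) == 2;
}
```

A command-driven program can loop on this until the user types
something parseable -- exactly the graceful recovery PASCAL denies us.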
Seeing how I tore apart Standard PASCAL's I/O facility earlier, you
might expect the following complaints from me about C's I/O:
* As I mentioned, K&R C can't gracefully handle reads of, say,
records or large arrays from files. It emphasizes flexible-format
text files rather than fixed-format binary files.
* You can't read or write a record at a particular record number
(direct access); you can only access the file serially.
* You can't open a file for input/output access.
* No delete/create/check-if-file-exists support.
How sad!
DRAFT ANSI STANDARD C
Draft ANSI Standard C has expanded the standard C I/O library quite
dramatically. A number of useful (and often confusing) features now
exist:
* Input/output file opens are supported.
* I/O in units of more than one character is allowed; FREAD and
FWRITE (again, no relationship to the MPE intrinsics) let you
easily read or write structures and arrays from/to files.
* Direct I/O is provided using the FSEEK procedure, which positions
the file pointer in a file. Then you can use any of the
read/write mechanisms (GETC, PUTC, FSCANF, FPRINTF, FREAD,
FWRITE, etc.) to do the I/O at the new location.
* Error handling is more concrete. Presumably, none of the new
services are allowed to abort in case of error; the FERROR
procedure returns to you the error status (combination of a
has-error-occurred-flag and some kind of error number) of an
operation.
* REMOVE and RENAME, which delete and rename files, are provided.
* Various other new features of various utility and arcaneness have
been added.
These all look quite nice, and seem to satisfy me as thoroughly as --
or more so than -- PASCAL/XL's. Note, however, that the only two
languages that I'm happy with are ones that barely exist and in which
I've done virtually no serious programming.
This may say something about my character; it also says something
about the pitfalls of comparing "new-improved-we'll-have-them-for-
you-Real-Soon-Now" languages. Both PASCAL/XL and Draft Standard C
SEEM nice, but who knows how and whether they'll actually work?
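The FSEEK-based direct I/O mentioned above amounts to computing the
byte offset yourself. Here is a sketch of an FREADDIR-like helper
(the name and the fixed-record-size convention are mine):

```c
#include <stdio.h>

/* Read record number recnum (0-based) from a file of fixed-length
   records, FREADDIR-style: position with fseek, then fread one
   record.  Returns 1 on success, 0 on a seek or read failure. */
int read_record(FILE *f, long recnum, void *buf, size_t recsize)
{
    if (fseek(f, recnum * (long)recsize, SEEK_SET) != 0)
        return 0;
    return fread(buf, recsize, 1, f) == 1;
}
```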
One other thing that I ought to point out: as you recall, in the
discussion of PASCAL/3000 and PASCAL/XL I sang the praises of FNUM,
which returns the system file number of a PASCAL file variable, and
ASSOCIATE, which initializes a file variable to point to a given
system file number. The reason for this was to allow you to mix PASCAL
and native file system I/O.
Naturally, Draft Standard C, being a portable standard, doesn't
discuss these features; however, I wouldn't like to use any particular
implementation of C that doesn't support FNUM- and ASSOCIATE-like
operations. I hope that HP's C/XL provides them; I know that CCS's
C/3000 provides both.
FORMATTED I/O: C vs. PASCAL
In the previous discussion, we talked about the I/O operations that
C and PASCAL allow. Two of the most useful ones, of course, are
formatted write and formatted read -- this is what allows you to input
and output numbers (so hard to do using SPL).
Standard PASCAL lets you input (READ, READLN) and output (WRITE,
WRITELN) characters, strings, integers, reals, and booleans. A sample
call might look like:
WRITELN (STRING_VALUE, INT_VALUE:10, REAL_VALUE:10:2);
{ write a string, an integer right-justified in a
10-character field, and a real number in a 10-character
field with 2 digits after the decimal }
or
READLN (STR, INT, REALNUM);
These procedures allow you to
* Read entities delimited by blanks.
* Write values right-justified in a fixed-format field.
* Write values "free-format", i.e. in as many characters as they
need (this is done by setting the "width" parameter in WRITELN to
a size smaller than that needed to fit the entire number; e.g.,
WRITELN (I:0)).
* Write real numbers in either exponential format (comparable to
FORTRAN's Ew.d) or fraction format (Fw.d).
* Output booleans as the strings "TRUE" or "FALSE"; PASCAL/3000
expands this to allow WRITEing any variable belonging to an
enumerated type as its symbolic equivalent. Thus, if COLOR is of
type (MAUVE, PUCE, AQUA) and has value PUCE, it'll be output as
"PUCE", rather than, say, 1, which might happen to be PUCE's
integer representation.
This is quite a nice set of functions, but quite obviously there are
some important features missing:
* The ability to output data to a program string variable, rather
than a file. PASCAL/3000 has this feature (STRREAD and STRWRITE).
* Output in hex or octal, vital for a system programming language.
* Left-justified as well as right-justified output.
* Money format ("123,456,789").
* As mentioned before, some way of reading numbers without having
the program abort in case the number is invalid (as I said, this
is VERY IMPORTANT!).
Less important but desirable features include:
* Padding with zeroes (e.g. printing 123 as "00123"; especially
important in octal and hex).
* Always printing the sign, even if the number is positive.
The most important failing of PASCAL's READ, WRITE, et al. is, in
fact, one of the less obvious ones:
* If you're dissatisfied with the way READ and WRITE work, IT'S
VERY DIFFICULT FOR YOU TO WRITE YOUR OWN.
Think about it -- say you wanted to add a "money format" output
facility. You'd like to write a procedure called MYWRITELN that's just
like WRITELN, but allows its caller to somehow specify that a
particular type REAL parameter is to be output in money format. What
could you do?
Remember:
* In Standard PASCAL and PASCAL/3000 you can't have your functions
have a variable number of parameters.
* In Standard PASCAL and PASCAL/3000 you can't have your functions
take parameters of flexible data types.
* Almost incidentally to all this, READ, WRITE, READLN, and WRITELN
are the only "procedures" that allow you to specify auxiliary
parameters like field width and fraction digits using a ":".
You see, PASCAL documentation calls READ, READLN, WRITE, and WRITELN
"procedures", but they're not like the procedures that we mere mortals
can write. If we want to write our "money-format output" procedure, we
have to give it exactly one data parameter of type REAL, plus a couple
of parameters indicating the field width and fraction digits.
A typical call to this might look like:
WRITE ('COMMISSIONS ARE ');
WRITEMONEY (COMMISSIONS, 15, 2);
WRITE (' OUT OF A TOTAL OF ');
WRITEMONEY (TOTAL, 15, 2);
WRITELN;
Instead of being able to stick this all into one WRITELN, we have to
have a special procedure that takes exactly one value to be output,
making us write five lines rather than one.
[In PASCAL/3000, we can avoid this by having WRITEMONEY be a function
that returns a string instead of outputting it, and then write the
call as "WRITELN ('COMMISSIONS ARE ', FMTMONEY(COMMISSIONS, 15, 2),
...)"; however, this is both fairly inefficient and quite
non-portable, since Standard PASCAL doesn't allow functions to return
string results.]
Note that this all applies only to Standard PASCAL and PASCAL/3000.
PASCAL/XL's winning new features might very well extinguish this
particular problem.
So much for PASCAL. How about C?
Standard C's WRITELN is called PRINTF (or FPRINTF, if you want to
print to a file rather than the terminal); READLN is called SCANF (or
FSCANF, to read from a file). Examples might be:
printf ("%s %10d %10.2f", string_value, int_value, real_value);
or
scanf ("%s %d %f", &string_value, &int_value, &real_value);
/* The "&"s are needed to indicate that the address of
the variable is to be passed, not the actual value. */
Note how both PRINTF's and SCANF's first parameters are "control
strings" that indicate the format of the input or output.
Incidentally, they also tell PRINTF and SCANF how many parameters they
are to expect and what the type of each parameter will be. If you make
an error in the control strings, beware! You'll get VERY interesting
results.
In any event, PRINTF's and SCANF's features include:
* Output of integers, in decimal, octal, or hex.
* Output of reals, in exponential or fractional format. You can
also output a real number using so-called "general" format
(similar to FORTRAN's Gw.d), which uses exponential or fractional
format, whichever is shorter.
* Free-format output, left-justification, and right-justification.
In other words, 10 can be output as "10", " 10", or "10 ".
All three of these formats are useful in different applications.
* Zero-padding (e.g. outputting 10 as "00010").
* Bad input data (strings where numbers are expected, etc.) does
not generate an error; in fact, the only way of detecting it is to
read the next character after the SCANF is done (say, using
GETC) and see if it's the terminator you expected (e.g. a blank
or a newline) or some other character that might have been viewed
as a numeric terminator.
This is somewhat cumbersome, but in the long run more flexible;
it is certainly much better than having your entire program abort
whenever the user types bad data.
* Standard C allows you to use SSCANF to read from a string and
SPRINTF to output to a string.
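The features above are easy to see in miniature. The sketch below
uses SPRINTF (writing to a string rather than a file) so that the
results can be inspected, and returns 1 only if every directive
behaves as described; the function name is my own:

```c
#include <stdio.h>
#include <string.h>

/* A small demonstration of C's numeric formatting directives. */
int demo_formats(void)
{
    char buf[32];

    sprintf(buf, "%10d", 10);        /* right-justified in 10 cols */
    if (strcmp(buf, "        10") != 0) return 0;
    sprintf(buf, "%-10d", 10);       /* left-justified in 10 cols  */
    if (strcmp(buf, "10        ") != 0) return 0;
    sprintf(buf, "%d", 10);          /* free-format                */
    if (strcmp(buf, "10") != 0) return 0;
    sprintf(buf, "%05d", 10);        /* zero-padding               */
    if (strcmp(buf, "00010") != 0) return 0;
    sprintf(buf, "%o %x", 255, 255); /* octal and hex              */
    if (strcmp(buf, "377 ff") != 0) return 0;
    return 1;
}
```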
This is, overall, a richer feature set than PASCAL's. Note, however,
some problems:
* Still no monetary output facility.
* Still no "always print a sign character even if the number's
positive" feature (again, Draft Standard corrects this).
* Unlike PASCAL, printing a boolean value will just print a 0 or 1
(since C doesn't have a separate boolean type). What's more, even
an enumerated type value will just be printed as its numeric
equivalent (since C views variables of enumerated types as simple
integer constants).
In my opinion, these things are all pretty bearable; but the
important thing is that in C you CAN define your own PRINTF- and
SCANF-like procedures.
You can have parameters of varying types; even variable numbers of
parameters are supported by the Draft Standard (most non-Standard
compilers give you some such feature, too). Thus, you can write your
"myprintf" procedure, and call it using
myprintf ("comms are %15.2m of tot %15.2m", commissions, total);
/* assuming you've defined "%X.Ym" to be your
"money-format" format specifier. */
Of course, nobody says it'll be easy to write this MYPRINTF procedure,
especially if you'll want to emulate the standard PRINTF directives
(which can be done by just calling SPRINTF); the important thing is
that you CAN write a procedure like MYPRINTF, whereas in Standard
PASCAL and PASCAL/3000, you can't.
SUMMARY OF I/O FACILITIES
[Since SPL relies solely on the HP System Intrinsics to do I/O,
numeric formatting, etc., I don't include it in this comparison.
Believe me -- with SPL I/O, everything is possible but nothing is
easy.]
STD PAS/ PAS/ K&R STD
PAS 3000 XL C C
OPEN ARBITRARY FILE GIVEN NAME NO YES YES YES YES
OPEN FILE FOR APPEND ACCESS NO YES YES YES YES
OPEN FILE FOR INPUT/OUTPUT ACCESS NO YES YES NO YES
CLOSE A FILE NO YES YES YES YES
READ, WRITE FILES SERIALLY YES YES YES YES YES
READ, WRITE FILES BY RECORD NUMBER NO YES YES NO YES
DETECT AND HANDLE FILE ERRORS NO NO+ YES YES YES
FORMAT NUMBERS FOR OUTPUT YES YES YES YES+ YES+
INPUT NUMBERS YES YES YES YES YES
INPUT ERROR DETECTION? NO NO+ YES YES YES
OUTPUT A STRING WITH NO NEW-LINE NO YES YES YES YES
USE A PASCAL-OPENED FILE FOR NATIVE N/A YES YES N/A N/A
FILE OPERATIONS
USE A "NATIVELY" OPENED FILE FOR N/A NO YES N/A N/A
PASCAL FILE OPERATIONS
WRITE YOUR OWN WRITELN/PRINTF-LIKE NO NO YES- YES- YES
FUNCTION (WITH VARIOUS PARAMETER
TYPES AND NUMBERS OF PARAMETERS)
LEGEND: YES = Implemented.
YES+ = Implemented in a really nice and useful way.
YES- = Implemented, but there's some ugliness involved.
NO = Not implemented.
NO+ = I can't fairly say that it's simply "not implemented",
but believe me, it's soooo ugly...
N/A = Not applicable.
STRINGS IN STANDARD PASCAL AND SPL
Much, if not most, of the data we keep on computers is character
data -- filenames, user names, application data, text files. Of
course, it's imperative that any programming language we use can
represent and manipulate this sort of data.
Standard PASCAL's mechanism for storing strings is the "PACKED
ARRAY OF CHAR". PACKED here is simply a convention used to indicate
that there should be one character stored per byte, not one per word.
I've never seen anyone use an unpacked ARRAY OF CHAR.
If you think about it, SPL and C use PACKED ARRAY OF CHARs, too.
All a PACKED ARRAY [1..100] OF CHAR means is "an array of 100 bytes,
each of which is individually addressable". Practically speaking,
VAR X: PACKED ARRAY [1..256] OF CHAR; { PASCAL }
BYTE ARRAY X(0:255); << SPL >>
char x[256]; /* C */
are all identical -- and fairly reasonable -- ways of storing a string
that's between 0 and 256 characters long. Still, despite this identity
of representation, I claim that Standard PASCAL has severe problems
with string processing.
Support for strings involves much more than just having a way of
representing them. The important thing for strings -- as for any data
type -- is the OPERATORS THAT ARE DEFINED to manipulate them. What's
the use of having strings if you can't extract a substring?
Concatenate them? Find a character within a string? It's by the level
of this sort of support that a language's string facility is measured.
SPL, for instance, has several useful operators that help in string
manipulation:
* You can say
MOVE STR1(OFFSET1):=STR2(OFFSET2),(LENGTH);
to move one substring of a string into another. PASCAL can only
move the entire thing in one shot (STR1:=STR2), or examine/set
one character at a time (STR1[I]:=STR2[I]).
* You can say
IF STR1(OFFSET1)=STR2(OFFSET2),(LENGTH) THEN ...
to compare two substrings. You can also compare for <, >, <=, >=,
and <>, as well as compare against constants (e.g. IF
STR1(X)="FOO").
* You can say
MOVE STR1(OFFSET1):=STR2(OFFSET2) WHILE ANS;
that will copy substrings WHILE the character being copied is
Alphabetic or Numeric (upShifting in the process). You can copy
only WHILE AN (no upshifting), WHILE AS (while alphabetic,
upshifting), or WHILE N (while numeric). You can also find out
how many characters were so copied (i.e. at what point the copy
stopped).
* You can say
I:= SCAN STR(OFFSET) UNTIL "xy";
assigning to I the index of the first character in the substring
which is either equal to "x" or to "y"; you can also say
I:= SCAN STR(OFFSET) WHILE "b";
which will assign to I the index of the first character in the
substring that is NOT equal to "b" (for more details, see the SPL
manual). Note that you can NOT say "SCAN until you either find
this character OR you've gone through 80 characters", which is
very desirable if you know that the maximum length of your string
is, say, 80.
* You can say
P (STR(OFFSET));
calling the procedure P and passing to it all of STR starting
with offset OFFSET. In PASCAL, you can only pass the entire
string, not this sort of substring. Note, however, that in SPL
you can't pass BYTE ARRAYs by value -- only by reference.
* On the other hand, reading or writing strings is rather more
difficult than one would like. Since the PRINT and READX
intrinsics take "logical arrays" rather than "byte arrays", you
in general have to say:
LOGICAL ARRAY BUFFER'L(0:127);
BYTE ARRAY BUFFER(*)=BUFFER'L;
INTEGER IN'LEN;
...
IN'LEN:=READX (BUFFER'L, -256);
MOVE STR(OFFSET):=BUFFER,(IN'LEN);
...
MOVE BUFFER:=STR(OFFSET),(LEN);
PRINT (BUFFER'L, -LEN, 0);
These features are a part of the SPL language; you can use them
without writing any procedures of your own. Furthermore, if you want
to, say, write a procedure that finds the first occurrence of STR1 in
STR2, you can just say:
INTEGER PROCEDURE FIND'STRING (STR1, LEN1, STR2, LEN2);
VALUE LEN1, LEN2;
BYTE ARRAY STR1, STR2;
INTEGER LEN1, LEN2;
...
and implement it yourself. It may not be easy (actually, it is), but
it's certainly possible.
These are the string-handling features that SPL supports, and you
may consider them sufficient or not. PASCAL supports a different, and
somewhat smaller set of features:
* You can copy entire strings, or examine and set single
characters:
STR1:=STR2;
or
STR1[I]:=STR2[J];
You can't copy substrings without writing your own FOR loop, or
having a special temporary array and calling the little-known
PACK and UNPACK procedures.
* You can input and output a string using READLN and WRITELN. You
can output the first N characters of a string by saying
WRITELN (STR:N);
but you again can't output an arbitrary substring of STR.
* You can pass a string to a procedure; you can't pass a substring.
On the other hand, you can pass a string by value as well as by
reference.
As you see, this set of operators is in some respects richer than
SPL's (I/O) and in others poorer (substrings, comparisons, SCANs,
etc.). But the WORST problem with PASCAL's string handling is:
* YOU CAN'T WRITE YOUR OWN GENERAL STRING HANDLERS!
Strange for a language that emphasizes breaking things down into
little, general-purpose procedures, eh? But if you've read the "DATA
STRUCTURES, TYPE CHECKING" chapter of this paper, you'll know why:
* YOU CAN'T DECLARE A PROCEDURE TO TAKE A GENERAL STRING!
You can write
TYPE PAC256 = PACKED ARRAY [1..256] OF CHAR;
...
FUNCTION STRCOMPARE (S1, S2: PAC256): INTEGER;
but THE ONLY STRINGS YOU CAN PASS TO THIS PROCEDURE ARE THOSE OF TYPE
PAC256! What if you have a string that's at most 8 bytes long, a
PACKED ARRAY [1..8] OF CHAR? No dice! You have to either declare it
with a maximum length of 256 bytes (thus wasting 97% of the space!) OR
write one STRCOMPARE procedure for every possible combination of S1
and S2 maximum lengths.
This means that not only do you start with a somewhat poor set of
string handling primitives, but you'll have a very hard time
implementing your own, unless you're willing to have all your strings
be of the same maximum length! In my opinion, this is a very, very
unpleasant circumstance.
STRING HANDLING IN PASCAL/3000 AND PASCAL/XL
A better string handling system is one of the conspicuous
improvements that HP put into PASCAL/3000.
The first new feature that you'd notice in PASCAL/3000 strings is
that A PASCAL/3000 STRING CONTAINS MORE THAN JUST CHARACTERS. When you
say
VAR S: STRING[256];
you're allocating more than just a PACKED ARRAY [1..256] OF CHAR.
You're essentially creating a record structure:
VAR S: RECORD
LEN: -32768..32767; { 2 bytes in /3000; 4 bytes in /XL }
DATA: PACKED ARRAY [1..256] OF CHAR;
END;
Now S isn't REALLY a PASCAL RECORD -- you can't just access its
subfields using "S.LEN" and "S.DATA". But internally, it is a record
structure, in that it contains both of these pieces of data
independently. When you say
S:='FOOBAR';
then not only will PASCAL move "FOOBAR" to the data portion, but also
set the length portion to 6. The LEN subfield contains the actual
current length of the data; there may be room for up to 256
characters, but in this case it indicates that the actual length is
only 6 characters.
A brief aside: Obviously, it's quite important to somehow keep
track of the current string length. For a fixed-length thing like the
8-character account name, we may not need it, but if we're doing, say,
text editing, we want to know the actual length of the line. Let me
point out, though, that keeping a separate length field is not
imperative for this; C uses a NULL character as a string terminator,
and many SPL programmers do similar things. In other words, just
because PASCAL/3000 represents strings this way, don't think that
that's the only way of doing it...
Back to the PASCAL approach. The representational change is the
most obvious difference in PASCAL/3000; but, as we saw earlier, it's
the OPERATORS rather than the REPRESENTATION that really make a data
type. PASCAL/3000 provides a pretty rich set of operators; in particular:
* You can extract and manipulate arbitrary substrings using STR:
STR(X,10,7)
returns a string containing the 7 characters starting with the
10th character. The result of the STR function can be used
anywhere a "real string" could be used; however, it cannot be
assigned to or passed as a by-reference parameter.
* You can concatenate two strings using "+":
S:='PURGE '+STR(FILENAME,1,10)+',TEMP';
* You can find out a string's length using the STRLEN procedure.
This is somewhat more convenient than in SPL; SPL strings don't
have a separate "length" field, so most SPL programmers end up
terminating their string data with some distinctive character
(often a carriage return, %15). Thus, an SPL programmer would
have to say
I:= SCAN STR UNTIL [8/%15,8/%15];
to scan through the string looking for a carriage return (a
relatively, though not very, slow process). The PASCAL programmer
would say
I:= STRLEN(STR);
* You can copy substrings using STRMOVE. (STRMOVE also works for
PACKED ARRAY OF CHARs.)
* You can easily edit a string using STRDELETE and STRINSERT to
delete/insert characters anywhere in the string.
* You can find the first occurrence of one string in another using
STRPOS.
* You can strip leading and trailing blanks using STRLTRIM and
STRRTRIM. Stripping trailing blanks is a particularly useful
operation.
* You can also do READLNs and WRITELNs into strings using STRREAD
and STRWRITE; this means that you can easily convert a string to
an integer and vice versa.
* You can have functions that return strings (many of the above,
including STR and STRRTRIM, are examples). Standard PASCAL doesn't
allow functions to return structured types, including arrays.
More importantly, you can now write a procedure
PROCEDURE STRREPLACE (VAR STR, FROMSTR, TOSTR: STRING);
{ Changes all occurrences of FROMSTR to TOSTR in STR. }
that will work for a string of ANY maximum length, because you
declared the parameters to be of type STRING, rather than STRING[256]
or some such fixed length. (Note, however, that only BY-REFERENCE
string parameters can be declared to be of type STRING.)
One fairly serious problem, though, that still afflicts PASCAL/3000
strings (PASCAL/XL fixes this) is the inability to dynamically
allocate a string of a size that is not known until run-time. For more
discussion of this, look at the "POINTERS" chapter of this manual.
STRING HANDLING IN C
The designers of C, of course, faced the same sorts of problems as
the designers of PASCAL/3000, but at least in the area of strings,
they attacked them in a somewhat different way.
As a matter of convention, C strings -- kept as simple arrays of
characters -- are terminated by a NULL ('\0', ascii 0) character. You
can, of course, have a
char x[10];
array, none of whose characters is a null, but all that means is that
you'll get screwy results if you pass it to a string manipulation
procedure. When you say
strcpy (x, "testing");
the C compiler will pass to STRCPY the address of the array X
(remember that in C saying "arrayname" gets you the address of the
array) and the address of the 8-character string which contains "t",
"e", "s", "t", "i", "n", "g", and NULL. Then STRCPY -- just because of
the way it's written, and because this is the useful thing to do --
will copy all the characters from the second string ("testing") into
the first string (X), up to and including the terminating NULL.
So here you see a fundamental representational difference between
PASCAL/3000 (and PASCAL/XL) and C strings:
* PASCAL/3000 keeps the current length of the string as a separate
field.
* C keeps it implicitly, having a null character terminate the
string's actual data.
The PASCAL/3000 approach clearly has some advantages:
* Determining the string length is much faster -- you need only
extract the first 2 bytes of the string array, and you've got it.
In C, you'd need to scan through each character until you find a
null.
* PASCAL/3000 strings can contain any character. C strings may not
contain a null, since that would be viewed as a terminator.
In practice, though, the second issue (strings that need to contain
nulls) doesn't arise, and the first issue isn't as important as one
would think. In fact, there are some compensating advantages to C's
approach, but I'll discuss them a bit later.
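The trade-off can be sketched in C itself; the PSTRING structure
below is my imitation of PASCAL/3000's layout, not anything either
language actually provides:

```c
#include <string.h>

/* Imitation of PASCAL/3000's layout: length kept separately. */
struct pstring {
    int len;            /* current length             */
    char data[256];     /* may contain ANY character  */
};

/* O(1): just fetch the stored length. */
int pstr_len(struct pstring *s)
{
    return s->len;
}

/* O(n): a C string's length must be found by scanning for NUL. */
int cstr_len(char *s)
{
    int n = 0;
    while (s[n] != '\0')
        n++;
    return n;
}
```

Note that a PSTRING can hold the three bytes 'a', NUL, 'b' with LEN =
3, but to a C string routine those same bytes look like a
one-character string, since the NUL is taken as a terminator.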
Given what we know about the different representational format,
what about the defined operations?
Kernighan & Ritchie is rather cavalier about this vital question,
and merely alludes to the "standard I/O library", which is said to
contain various string manipulation functions. Thus, it's not unlikely
that there'll be some non-trivial differences between various C
implementations in this area (although there'll also be a good deal of
similarity).
Therefore, I'll have to compare PASCAL/3000 and Draft ANSI Standard
C; keep in mind that the C functions might not be available on all
compilers.
Let's consider a (possibly) typical application. We need to write
two procedures:
* One that takes a file name (MPEX), group name (PUB), and account
name (VESOFT), and makes them into a fully-qualified filename
(MPEX.PUB.VESOFT).
* Another that does the opposite -- takes a fully-qualified
filename and splits it into its file part, group part, and
account part.
Here's what they'd look like, in PASCAL:
PROGRAM PROG (INPUT, OUTPUT);
TYPE TSTR256 = STRING[256];
TSTR8 = STRING[8];
VAR FILENAME, GROUP, ACCT: STRING[8];
FUNCTION FNAME_FORMAT (FILENAME, GROUP, ACCT: TSTR8): TSTR256;
BEGIN
FNAME_FORMAT := STRRTRIM(FILENAME) + '.' + STRRTRIM(GROUP) + '.' +
STRRTRIM(ACCT);
END;
PROCEDURE FNAME_PARSE (QUALIFIED: TSTR256;
VAR FILENAME, GROUP, ACCT: STRING);
VAR START_GROUP, START_ACCT: INTEGER;
BEGIN
START_GROUP := STRPOS (QUALIFIED, '.') + 1;
START_ACCT := STRPOS (STR (QUALIFIED, START_GROUP,
STRLEN(QUALIFIED)-START_GROUP-1), '.')
+ START_GROUP;
FILENAME := STR (QUALIFIED, 1, START_GROUP - 2);
GROUP := STR (QUALIFIED, START_GROUP, START_ACCT-START_GROUP-1);
ACCT := STR (QUALIFIED, START_ACCT,
STRLEN (QUALIFIED) - START_ACCT + 1);
END;
BEGIN
WRITELN (FNAME_FORMAT ('MPEX ', 'PUB ', 'VESOFT '));
FNAME_PARSE ('MPEX.PUB.VESOFT', FILENAME, GROUP, ACCT);
WRITELN (FILENAME, ',', GROUP, ',', ACCT, ';');
END.
and in C:
#include <stdio.h>
#include <string.h>
char *strrtrim (s)
char s[];
{
/* Strips trailing blanks from S; also returns S as the result. */
int i;
for (i = strlen(s); (i>0) && (s[i-1]==' '); i = i-1)
;
s[i] = '\0';
return s;
}
char *fname_format (filename, group, acct, qual)
char filename[], group[], acct[], qual[];
{
qual[0] = '\0';
strcat (qual, strrtrim (filename));
strcat (qual, ".");
strcat (qual, strrtrim (group));
strcat (qual, ".");
strcat (qual, strrtrim (acct));
return qual;
}
fname_parse (qual, filename, group, acct)
char qual[], filename[], group[], acct[];
{
char *start_group, *start_acct;
start_group = strchr (qual, '.') + 1;
start_acct = strchr (start_group, '.') + 1;
strncpy (filename, qual, start_group - qual - 1);
filename[start_group - qual - 1] = '\0';
strncpy (group, start_group, start_acct - start_group - 1);
group[start_acct - start_group - 1] = '\0';
strcpy (acct, start_acct);
}
main ()
{
char qual[256], filename[9], group[9], acct[9]; /* 8 chars + NUL */
printf ("%s\n",
fname_format ("mpex ", "pub ", "vesoft ", qual));
fname_parse ("mpex.pub.vesoft", filename, group, acct);
printf ("%s,%s,%s;\n", filename, group, acct);
}
What are the differences between these two, besides the fact that
one is upper-case and one is lower-case? Let's examine these programs
a piece at a time.
The FNAME_FORMAT procedure, which merges the three "file parts"
into a fully-qualified filename in PASCAL looks like this:
FUNCTION FNAME_FORMAT (FILENAME, GROUP, ACCT: TSTR8): TSTR256;
BEGIN
FNAME_FORMAT := STRRTRIM(FILENAME) + '.' + STRRTRIM(GROUP) + '.' +
STRRTRIM(ACCT);
END;
As you see, we're taking full advantage here of the fact that
PASCAL/3000 lets us say "A + B" to concatenate two strings. In C, this
is quite a bit more difficult:
char *fname_format (filename, group, acct, qual)
char filename[], group[], acct[], qual[];
{
qual[0] = '\0';
strcat (qual, strrtrim (filename));
strcat (qual, ".");
strcat (qual, strrtrim (group));
strcat (qual, ".");
strcat (qual, strrtrim (acct));
return qual;
}
Instead of just saying "A + B", we must say "STRCAT (A, B)", which
MODIFIES THE STRING A (appending the string B to it) rather than just
returning a newly-constructed string. In fact, STRCAT does return the
address of the modified A, so we could conceivably say:
strcat (strcat (strcat (strcat (strcat (qual,
strrtrim(filename)),
"."),
strrtrim (group)),
"."),
strrtrim (acct));
but for obvious reasons we don't. This is one advantage of having an
operator like "+" instead of a function like "strcat" -- it makes the
program look quite a bit cleaner, especially if we have to nest it.
Note also that the PASCAL string manipulators are quite willing to
"create" a new string, like + and STR do. The C string package, on the
other hand, can only modify parameters that are passed to it (like
"strcat" does to its first parameter).
This is an artifact of the fact that C functions can't return
arrays (like PASCAL functions, but unlike PASCAL/3000 functions).
[Actually, some C compilers, including Draft ANSI Standard, allow
functions to return structures, so we could have a structure that
"contains" only one subfield, which is an array -- but this is not
usually done.]
While we were writing FNAME_FORMAT, we needed some way of stripping
trailing blanks from the file name, group name, and account name
strings. In PASCAL, this was accomplished by calling the STRRTRIM
function; in C, no such function exists, so we had to write our own:
char *strrtrim (s)
char s[];
{
/* Strips trailing blanks from S; also returns S as the result. */
int i;
for (i = strlen(s); (i>0) && (s[i-1]==' '); i = i-1)
;
s[i] = '\0';
return s;
}
What we do is quite simple -- we find the end of the string using
STRLEN and then step back until we find a non-blank; then, we set the
first of the trailing blanks to a '\0', which is the string
terminator.
What this means, among other things, is that even if you aren't
satisfied with C's string library, or if you're using a C compiler
that doesn't come with a string library, you can write all your string
handling primitives quite easily. I'd guess that all of the Draft ANSI
Standard C string-handling routines (except perhaps the numeric
formatting/parsing ones, like "sprintf" and "sscanf") could be
implemented in 200 lines or less.
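As an illustration, here is one way "strstr" itself might be written
(the name MY_STRSTR is mine, to avoid colliding with the library's) --
hardly a dozen lines:

```c
#include <string.h>

/* Find the first occurrence of NEEDLE in HAY; returns a pointer
   to it, or NULL if it's absent -- a sketch of what "strstr"
   might do internally. */
char *my_strstr(char *hay, char *needle)
{
    int nlen = strlen(needle);

    if (nlen == 0)
        return hay;              /* empty needle matches at once */
    for (; *hay != '\0'; hay++)
        if (strncmp(hay, needle, nlen) == 0)
            return hay;          /* NEEDLE starts right here     */
    return NULL;
}
```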
On the other hand, of course, you'd rather not have to write even
that much yourself.
Continuing through our sample programs, we get to FNAME_PARSE. It's
a more complicated procedure -- we have to find the locations of the
two dots and then extract the three substrings that lie before,
between, and after the dots. In PASCAL, this would be:
PROCEDURE FNAME_PARSE (QUALIFIED: TSTR256;
VAR FILENAME, GROUP, ACCT: STRING);
VAR START_GROUP, START_ACCT: INTEGER;
BEGIN
START_GROUP := STRPOS (QUALIFIED, '.') + 1;
START_ACCT := STRPOS (STR (QUALIFIED, START_GROUP,
STRLEN(QUALIFIED)-START_GROUP-1), '.')
+ START_GROUP;
FILENAME := STR (QUALIFIED, 1, START_GROUP - 2);
GROUP := STR (QUALIFIED, START_GROUP, START_ACCT-START_GROUP-1);
ACCT := STR (QUALIFIED, START_ACCT,
STRLEN (QUALIFIED) - START_ACCT + 1);
END;
and in C:
fname_parse (qual, filename, group, acct)
char qual[], filename[], group[], acct[];
{
char *start_group, *start_acct;
start_group = strchr (qual, '.') + 1;
start_acct = strchr (start_group, '.') + 1;
strncpy (filename, qual, start_group - qual - 1);
filename[start_group - qual - 1] = '\0';
strncpy (group, start_group, start_acct - start_group - 1);
group[start_acct - start_group - 1] = '\0';
strcpy (acct, start_acct);
}
Again, there are both similarities and differences:
* Both PASCAL and Draft Standard C have functions that find a
character inside a string (STRPOS in PASCAL, "strchr" in C).
* PASCAL's STRPOS returns the INDEX of the character in the string,
but C "strchr" returns a POINTER to the character.
* PASCAL has a STR function that returns a substring, the room for
which is allocated on the stack. In C, on the other hand, one
would usually use a pointer and address directly into the
original string. That's why we say:
start_group = strchr (qual, '.') + 1;
start_acct = strchr (start_group, '.') + 1;
instead of
START_GROUP := STRPOS (QUALIFIED, '.') + 1;
START_ACCT := STRPOS (STR (QUALIFIED, START_GROUP,
STRLEN(QUALIFIED)-START_GROUP-1), '.')
+ START_GROUP;
As you see, we just passed START_GROUP (which is a pointer into
the string QUAL) directly to STRCHR, rather than having to
specially extract a substring, which would probably have been
inefficient as well as being somewhat more cumbersome. Note,
though, that when we pass START_GROUP, we don't pass a true
substring in the sense of "L characters starting at offset S";
rather, STRCHR sees all of QUAL starting at the location pointed
to by START_GROUP.
* Although the C routine manipulates pointers more than it does
offsets, we can say
p + 1
to refer to "a pointer that points 1 character after P", or
p - q
to refer to "the number of characters between the pointers P and
Q".
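A tiny sketch (the function name is mine) shows both operations at
work:

```c
/* Pointer arithmetic on a string: P + 1 and P - Q.
   Returns 1 if both behave as described in the text. */
int demo_ptr_arith(void)
{
    char s[] = "mpex.pub.vesoft";
    char *p = s + 4;             /* points at the first '.'      */

    if (*(p + 1) != 'p')         /* one character after P        */
        return 0;
    if (p - s != 4)              /* characters between P and S   */
        return 0;
    return 1;
}
```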
Finally, the calling sequences to the two procedures are quite
similar:
WRITELN (FNAME_FORMAT ('MPEX ', 'PUB ', 'VESOFT '));
FNAME_PARSE ('MPEX.PUB.VESOFT', FILENAME, GROUP, ACCT);
WRITELN (FILENAME, ',', GROUP, ',', ACCT, ';');
printf ("%s\n",
fname_format ("mpex ", "pub ", "vesoft ", qual));
fname_parse ("mpex.pub.vesoft", filename, group, acct);
printf ("%s,%s,%s;\n", filename, group, acct);
Note that in both PASCAL and C, you can pass constant strings (e.g.
"MPEX.PUB.VESOFT") to procedures -- a major improvement over SPL,
which can't do this.
I've already mentioned some of the built-in string handling
functions that Draft Standard C provides; here's a full list:
* STRCPY and STRNCPY copy one string into another (one copies the
entire string, the other copies either the entire string or a
given number of characters, whichever is smaller). They're
comparable to PASCAL/3000's string assignment and STRMOVE
functions. The "N" procedures -- STRNCPY, STRNCAT, STRNCMP --
coupled with C's ability to extract "instant substrings" ("&x[3]"
is the address of the substring of X starting with character 3)
are intended to compensate for the lack of a substring function
like PASCAL/3000's STR.
* STRCAT and STRNCAT concatenate one string to another; again the
"N" version (STRNCAT) will append not an entire string but rather
up to some number of characters of it. These functions are most
similar to PASCAL/3000's STRAPPEND, but can do the job of "+"
with a bit of extra difficulty (as we saw above).
* STRCMP and STRNCMP compare two strings, much like PASCAL/3000's
ordinary relational operators (<, >, <=, >=, =, <>) applied to
strings.
* STRLEN returns the length of a string (just like PASCAL/3000's
STRLEN).
* STRCHR and STRRCHR search for the first and last occurrence,
respectively, of a character in a string. STRSTR finds the first
occurrence of one string within another. STRSTR is the direct
equivalent of PASCAL/3000's STRPOS (which therefore can do what
STRCHR does, too). STRRCHR has no direct PASCAL/3000 equivalent.
* STRCSPN searches for the first occurrence of ONE OF A SET OF
CHARACTERS within a string. 'strcspn (x, "abc")' will find the
first occurrence of an "a", "b", OR "c" in the string X.
STRSPN searches for the first character that is NOT ONE OF A SET
OF CHARACTERS -- 'strspn (x, " 0.")' will skip all leading
spaces, zeroes, and dots in X, and return the index of the first
character that is neither a space, zero, nor dot.
These functions have no real parallels in PASCAL/3000.
* STRTOK is a pretty complicated-looking routine that allows you to
break up a string into "tokens" separated by delimiters. It has
no parallel in PASCAL/3000.
* SPRINTF and SSCANF are the equivalents of PASCAL/3000's STRWRITE
and STRREAD -- they let you do formatted I/O using a string
instead of a file.
* PASCAL/3000 procedures that don't have a direct equivalent in C
include: STRDELETE and STRINSERT (which can be emulated with some
difficulty using STRNCPY); STRLTRIM and STRRTRIM, which trim
leading/trailing blanks; and STRRPT, which returns a string
containing a given number of repetitions of some other string.
I intentionally gave this list AFTER the example comparing
FNAME_PARSE and FNAME_FORMAT in PASCAL and C. As we saw with STRRTRIM,
PASCAL/3000's standard string handling routines can be easily
implemented in C, and I'm sure that C's string handling routines can
be easily implemented in PASCAL/3000.
The important thing, I believe, is not the exact set of the
built-in string-handling procedures but rather the ease of
extensibility (which is good in both PASCAL/3000 and C, but very bad
in Standard PASCAL) and the general "style" of string-handling
programming (which, as you can see, is somewhat different in
PASCAL/3000 and C).
If you prefer C's pointer and null-terminated strings -- or,
conversely, if you prefer PASCAL/3000's "+" operator and the ability
of functions to return string results -- I'm sure that you'll have no
problems implementing whatever primitives you need in either language.
SEPARATE COMPILATION -- STANDARD PASCAL
As I'm sure you can imagine, the MPE/XL source code is NOT stored
in one source file. Neither, for that matter, is my MPEX/3000 or
SECURITY/3000, or virtually any serious program. Not only do I, for
instance, heavily use $INCLUDE files, I often compile various portions
of my program into various USLs, RLs, and SLs, and then eventually
link them together at compile and :PREP time. Obviously, this sort of
thing is imperative when your programs get into thousands or tens of
thousands of lines.
Now, what you may not be aware of is that
* STANDARD PASCAL DEMANDS THAT YOUR ENTIRE PROGRAM BE COMPILED FROM
A SINGLE SOURCE FILE.
That's right. In the orthodox Standard, you can't have your program
call SL procedures; you can't have it call RL procedures; you can't
have it call any procedures OTHER THAN THE ONES THAT WERE DEFINED IN
THE SAME SOURCE FILE. Believe it or not, this is true -- and it's one
of the major problems with Standard PASCAL.
Let's say that you want to keep all of your utility procedures --
ones that might be useful to many different programs -- in an RL (or
SL). This way, you won't have to copy them all into each program,
which would be a maintenance headache, and would slow down the
compiles substantially (my utility RL is 13,000 lines long; MPEX is
3,000 lines).
Standard PASCAL has a problem with this. Say that it encounters a
statement such as:
I:=MIN(J,80);
If it had previously seen a function definition such as:
FUNCTION MIN (X, Y: INTEGER): INTEGER;
BEGIN
IF X<Y THEN MIN:=X ELSE MIN:=Y;
END;
then it would realize that MIN is a function -- a function that takes
two by-value integer parameters and returns an integer -- and would
generate code accordingly. But what if MIN isn't in the same source
file? How does PASCAL know what to do?
Now, PASCAL might conceivably be able to decide that MIN is a
function -- after all, it couldn't be anything else. Still, what are
the function's parameters? Is J, for instance, a by-value or a
by-reference (VAR) parameter? PASCAL must know, because it would have
to generate different code in these cases. Are its parameters and
return value 32-bit integers or 16-bit integers? Again, PASCAL must
know. Does the function really have two integer parameters, or is the
programmer making a mistake? PASCAL wants to do type checking, but it
has no information to check against.
Essentially, what we have here is a "knowledge crisis":
* WHEN YOU TRY TO CALL A PROCEDURE THAT ISN'T DEFINED IN THE SAME
SOURCE FILE, PASCAL DOESN'T HAVE ENOUGH KNOWLEDGE ABOUT THE
PROCEDURE TO GENERATE CORRECT CODE FOR THE CALL. FURTHERMORE,
PASCAL's TYPE-CHECKING DESIRES ARE FRUSTRATED BY THIS VERY SAME
THING.
HOW HP PASCAL, PASCAL/XL, AND SPL HANDLE SEPARATE COMPILATION
Now this, of course, is by no means a new problem. Other languages
have to call separately compiled procedures, too, and they've managed
to work out some solutions. In HP's COBOL/68, for instance, if you say
CALL "MIN" USING X, 80.
the compiler will assume that the function has two parameters, X and
80, and that both of them are passed by reference, as word addresses.
If the parameters are passed by value, or as byte addresses, or if the
procedure returns a value -- why, that's just your tough luck. You
can't call this procedure from COBOL/68. COBOL/68 compatibility,
incidentally, is the reason why all the IMAGE intrinsics have
by-reference, word-addressed parameters.
FORTRAN/3000 adopted a similar but somewhat more flexible approach.
In FORTRAN, saying
CALL MIN (X, 80)
will also make FORTRAN assume that MIN has two parameters, each of
them by reference. However, if X is of a character type, FORTRAN will
pass it as a byte address, not a word address; if X is an integer or
any other non-character object, FORTRAN will pass it as a word
address. This gives the user more flexibility.
Furthermore, FORTRAN/3000 allows you to say:
CALL MIN (\X\, 80)
to tell the compiler that a particular parameter -- in this case X --
should be passed BY VALUE rather than by reference. Furthermore, if
you want MIN to be a function, you can say
I = MIN (\X\, \80\)
from which FORTRAN will deduce that MIN returns a result. The type of
the result, incidentally, is assumed to be an integer by FORTRAN's
default type conventions (anything starting with an M, or any other
character between I and N, is an integer). If you want to declare MIN
to be real, you can simply say:
REAL MIN
...
I = MIN (\X\, \80\)
Thus, we can see four possible components in a compiler's decision
about how a procedure is being called:
* STANDARD ASSUMPTIONS. Both COBOL68/3000 and FORTRAN/3000, for
instance, ASSUME that all parameters are by reference.
* ASSUMPTIONS DERIVED FROM A NORMAL CALLING SEQUENCE. How many
parameters does a procedure have? Both COBOL68 and FORTRAN guess
this from the number of parameters the user specified. Similarly,
FORTRAN determines whether or not a procedure returns a result
and also the word-/byte-addressing of the parameters from the
details of the particular call.
* CALLING SEQUENCE MECHANISMS BY WHICH A USER CAN OVERRIDE THE
COMPILER'S ASSUMPTIONS. In FORTRAN, the backslashes (\) around
by-value parameters allow a user to override the assumption that
the parameters are to be passed by reference. Similarly, in HP's
COBOL/74 (COBOLII/3000), you can say
CALL "MIN" USING @X, \10\.
which indicates that X is to be passed as a byte address and 10
is to be passed by value. Of course, if MIN expects X to be
by-value, too, this call won't give you the right result -- it's
your responsibility to specify the correct calling sequence.
* ONE-TIME DECLARATIONS THAT THE USER CAN SPECIFY. If a user says
REAL MIN
and then uses MIN as a function, the compiler will automatically
know that MIN returns a real result, regardless of the context in
which it is used.
Different compilers, as you see, use different combinations of the
above methods, and use them in different cases. COBOL/68, as I said, only
uses default assumptions and information that it can derive from the
standard calling sequence; FORTRAN/3000 uses all four of the above
methods to determine different things about how a procedure is to be
called.
HP PASCAL, PASCAL/XL, and SPL all take exactly the same approach.
They
* REQUIRE A USER TO DECLARE EVERYTHING THAT THE COMPILER NEEDS TO
KNOW ABOUT THE PROCEDURE CALLING SEQUENCE.
Unlike COBOL/3000 or FORTRAN/3000, they don't make any "educated
guesses"; but, on the other hand, they let you specify the calling
sequence in exact detail, thus allowing you to call procedures that
wouldn't be easily callable from either COBOL or FORTRAN.
In fact, SPL, HP PASCAL, and PASCAL/XL demand that you copy into
your program the PROCEDURE HEADER of every separately-compiled (also
known as "external") procedure that you call. For instance, if you
declared your SPL procedure MIN as
INTEGER PROCEDURE MIN (X, Y);
VALUE X, Y;
INTEGER X, Y;
BEGIN
IF X<Y THEN MIN:=X ELSE MIN:=Y;
END;
then an SPL program that wants to call MIN as an external would have
to say
INTEGER PROCEDURE MIN (X, Y);
VALUE X, Y;
INTEGER X, Y;
OPTION EXTERNAL;
The "OPTION EXTERNAL;" indicates that the compiler shouldn't expect
the actual body of MIN to go here; rather, the procedure itself will
be linked into the program later on.
Similarly, if you want to call the PASCAL/3000 procedure
FUNCTION MIN (X, Y: INTEGER): INTEGER;
BEGIN
IF X<Y THEN MIN:=X ELSE MIN:=Y;
END;
from another PASCAL/3000 program, you'd have to say:
FUNCTION MIN (X, Y: INTEGER): INTEGER;
EXTERNAL;
Here, just the word "EXTERNAL;" tells PASCAL that this is only a
declaration of the calling sequence; but, armed with this calling
sequence, PASCAL can both
* Generate the correct code, and
* Check the parameters you specify to make sure that they're really
what the procedure expects.
In other words, armed with these declarations, both SPL and PASCAL can
make sure that you specify the right number of parameters, and (in
PASCAL more than in SPL) that they are of the right types.
For instance, if MIN was declared with by-reference rather than
by-value parameters, the compiler would BOTH be sure to pass the
address rather than the value AND would make sure that you're really
passing a variable and not a constant or expression. Finally, since
the external declaration is an exact copy of the actual procedure's
header, you're sure that you can call ANY PASCAL procedure from
another program, even if it has procedural/functional parameters and
other arcane stuff.
SEPARATE COMPILATION IN K&R C
Where PASCAL/3000, PASCAL/XL, and SPL all follow the strict
"declare everything, assume nothing" approach, Kernighan & Ritchie C
does almost the exact opposite. Its solution is actually much like
FORTRAN's, only more general and more demanding on the programmer.
C will:
* Pass ALL parameters by value -- this isn't an assumption, it's a
requirement.
* Deduce the number of parameters the called procedure has from the
procedure call.
* Deduce the type of each parameter -- integer, float, or structure
-- from the procedure call.
* Allow you to declare the procedure's result type, but usually
assume it to be "integer" (some compilers make this assumption,
while others signal an error).
What's more, these are K&R C's assumptions EVEN IF THE PROCEDURE IS
DECLARED IN THE SAME FILE (i.e. not separately compiled). If you
declare MIN as
int min(x,y)
int x,y;
{
if (x < y) return (x);
else return (y);
}
and then call it as
r = min(17.0,34.0);
then the C compiler will merrily pass 17.0 and 34.0 as floating-point
numbers, although it "knows" that MIN expects integers. In other
words, the C compiler will neither print an error message nor
automatically convert the reals into integers; it'll pass them as
reals, leaving MIN to treat their binary representations as binary
representations of integers.
In fact, C won't even "KNOW" that MIN has two parameters; if you
pass three, it'll try to do it and let you suffer the consequences.
The only thing that C remembers about MIN is the same thing that it
would allow you to declare about MIN if MIN were an external procedure
-- that MIN's result type is "int" (or whatever else you might declare
it as).
Since external function call characteristics are thus pretty much
the same in K&R C as internal call characteristics, I describe them
elsewhere (primarily in the "C TYPE CHECKING" section of the "DATA
STRUCTURES" chapter). However, I'll mention a few of the most
important points here:
* To refer to an external procedure, all you need to do is say:
extern <type> <proc>();
The EXTERN indicates that <proc> is defined elsewhere; the "()"
indicates that <proc> is a procedure; <type> indicates the type
of the object returned by <proc>. An example might be:
extern float sqrt();
which declares SQRT to be an external procedure that returns a
FLOAT.
* If you want to pass a parameter by reference, you actually pass
its address by value. In other words, you'd declare your
procedure to be
int swap_ints (x, y)
int *x, *y;
{
int temp;
temp = *x;
*x = *y;
*y = temp;
}
-- a procedure that takes two BY-VALUE parameters, each of which
happens to be a pointer. Then, to call it, you'd say
swap_ints (&foo, &bar);
passing as parameters not FOO and BAR, but rather the expressions
"&FOO" and "&BAR", which are the addresses of FOO and BAR. If you
omit an "&" (i.e. say "FOO" instead of "&FOO"), the compiler will
neither warn you nor do what seems to be "the right thing"
(extract the address automatically); rather, it'll happily pass
the value of FOO, which SWAP_INTS will treat as an address
(boom!).
* Similarly, you must be meticulously careful with the number of
parameters you try to pass and the type of each parameter; as I
said before, if you say
SQRT (100)
and "sqrt" expects a floating point number, C won't automatically
do the conversion for you (because it doesn't know just what it
is that SQRT expects).
The only exceptions are the various "short" types that are
automatically converted to "int"s and "float"s which are
automatically converted to "double"s.
To summarize, K&R C saves you from having to include the procedure
header of every external procedure you want to call; however, it does
require you to specify the procedure's result type.
On the flip side, it can't check for what you don't specify, so it
will neither check for your errors nor automatically do the kinds of
conversions (e.g. automatically take the address of a by-reference
parameter) that SPL and PASCAL programmers take for granted.
DRAFT ANSI STANDARD C AND SEPARATE COMPILATION
Draft ANSI Standard C allows you to say
extern float sqrt();
or
extern int min();
just like you would in standard Kernighan & Ritchie C. One new feature
that it provides, though, is the ability to declare a "function
prototype" which specifies the types of the parameters that the called
procedure expects, thus letting the C compiler do some type checking
and automatic conversion.
I discuss this facility in some detail in the "DATA STRUCTURES --
TYPE CHECKING" sections of this manual, but I'll talk about it briefly
here. In Draft ANSI Standard C, you can say
extern float sqrt(float);
or
extern int min(int,int);
thus declaring the number and types of the procedures' parameters.
This imposes some type checking that's almost as stringent as
PASCAL's; the major differences between it and PASCAL's type checking
are:
* The sizes of array parameters are not checked. This -- as I
mention in the DATA STRUCTURES chapter -- is actually a good
thing.
* You can entirely prevent parameter checking for a given procedure
by simply using the old ("extern float sqrt()") standard K&R C
mechanism instead of the new method.
* You can declare a parameter to be a "generic by-reference object"
by declaring its type to be "void *", to wit
extern int put_rec (char *, char *, void *);
where the third parameter is declared to be a "void *" and can
thus have any type of array or record structure (or any other
by-reference object) passed in its place.
* Finally, you can use C's "type cast" mechanism to force an object
to the expected data type if for some reason the object isn't
declared the same way.
I happen to like this new Draft Standard C approach; it allows you
to implement strict type checking, but waive it whenever appropriate.
MORE ABOUT SEPARATE COMPILATION -- GLOBAL VARIABLES
What we talked about above is how the compiler can know about
procedures that are declared in a different source file. What about
variables? What if we want some procedures in the RL to share some
global variables with procedures in our main program?
In recent times, global variables have fallen somewhat out of
favor, and for good reason. A procedure is a lot clearer if its only
inputs and outputs are its parameters; you needn't be afraid that
calling
FOO (A, B);
will actually change some global variable I or J that isn't even
mentioned in the procedure call. I myself often ran into cases where
an apparently innocent procedure call did something I didn't expect
because it modified a global variable. As a rule, it's much better to
pass whatever information the called procedure needs to look at or set
as a procedure parameter.
For this reason, many of today's programming textbooks --
especially PASCAL and C textbooks -- counsel us to have as few global
variables as possible.
Unfortunately, though global variables are usually "bad programming
style", they are often necessary. For instance, I have several global
variables in my MPEX/3000 product and the supporting routines I keep
in my RL:
* CONY'HIT, a global variable that my control-Y trap procedure sets
whenever control-Y is hit. Since its "caller" is actually the
system, the only way it can communicate with the rest of my
program is by using a global variable.
* DEBUGGING. If this variable is TRUE, many of my procedures print
useful debugging information (e.g. information on every file I
open, parsing info, etc.). If I were to pass DEBUGGING as a
procedure parameter, virtually every one of my procedures would
have to have this extra parameter, either to use itself or to
pass to other procedures that it calls.
* VERSION, an 8-byte array that's set to the current version number
of my program. Whenever one of my low-level routines detects some
kind of logic error within my program (e.g. I'm trying to read
from a non-existent data segment), it prints an error message,
prints the contents of VERSION, and aborts. That way, if a user
gets an abort like this and sends a PSCREEN to us, we'll see what
version of the program he was running. Again, it would be a great
burden to pass this variable as a parameter to all my RL
procedures.
All I'm saying is that there are cases where global variables are
necessary and desirable, and it's important that a programming
language -- especially a systems programming language -- support them.
Now, how do PASCAL and C do them?
GLOBAL VARIABLES IN PASCAL
Since Standard PASCAL can't handle separate compilation anyway, it
certainly has no provisions for "cross-source file" global variables,
i.e. global variables shared by separately compiled procedures. You
can declare normal global variables, to wit
PROGRAM GLOBBER;
{ The variables between a PROGRAM statement and the first }
{ PROCEDURE statement are global. }
VAR DEBUGGING: BOOLEAN;
VERSION: PACKED ARRAY [1..8] OF CHAR;
...
PROCEDURE FOO;
VAR X: INTEGER; { this variable is local }
...
but these variables will only be known within this source file; even
if you can somehow call an external, separately-compiled procedure,
there's no way for it to know about these global variables.
PASCAL/3000 and PASCAL/XL, of course, had to face this problem just
like they had to face the problem of calling external procedures.
Their solution was this:
* One of the source files (the MAIN BODY) should have a $GLOBAL$
control card at the very beginning. Then, ALL of its global
variables become "knowable" by any other source files that are
compiled separately.
* Any of the other source files that wants to access a global
variable must declare the variable as global within it; it must
also have a $EXTERNAL$ control card at the very beginning.
* All the global variables defined in any of the $EXTERNAL$ source
files must also be defined in the $GLOBAL$ source file.
In other words, our main body might say
$GLOBAL$
PROGRAM GLOBBER;
{ These global variables are now accessible by all the }
{ separately compiled procedures. }
VAR DEBUGGING: BOOLEAN;
VERSION: PACKED ARRAY [1..8] OF CHAR;
...
Two other separately-compiled files might read:
$EXTERNAL$
$SUBPROGRAM$ { only subroutines here, no main body }
PROGRAM PROCS_1;
VAR DEBUGGING: BOOLEAN; { want to access this global var }
...
$EXTERNAL$
PROGRAM PROCS_2;
{ want to access this global var }
VAR VERSION: PACKED ARRAY [1..8] OF CHAR;
...
So, you see, the main body file DECLARES all the global variables that
are to be shared among all of the separately-compiled entities; all
other files can essentially IMPORT none, some, or all of the global
variables that were so declared.
If, however, some procedures in PROCS_1 and PROCS_2 wanted to share
some global variables, or even some procedures in a single file (e.g.
PROCS_1) wanted to share a global variable between them, this variable
would also have to be declared in the main body.
Also note that, unlike the external procedure declaration, which is
rather similarly implemented in many versions of PASCAL, this
$GLOBAL$/$EXTERNAL$ is a distinctly unusual construct that is unlikely
to be compatible with virtually any other PASCAL.
GLOBAL VARIABLES IN SPL
Global variable declaration and use is much the same in SPL as in
PASCAL. The "main body" program -- the one that'll become the outer
block of the program file -- should declare all the global variables
used anywhere within the program. This would look something like:
BEGIN
GLOBAL LOGICAL DEBUGGING:=FALSE;
GLOBAL BYTE ARRAY VERSION(0:7):="0.1 ";
...
END;
[Note that unlike PASCAL, you can initialize the variables when you
declare them; this isn't overwhelmingly important, but does save a bit
of typing.]
Then, each procedure in a separately-compiled file that wants to
"import" any of these global variables would say:
...
PROCEDURE ABORT'PROG;
BEGIN
EXTERNAL BYTE ARRAY VERSION(*);
...
END;
...
Note one difference between SPL and PASCAL -- in PASCAL, all the
imported global variables would be listed at the top of the source
file; in SPL, they'd be given inside each referencing procedure. On
the one hand, the SPL method means more typing; on the other, it
"localizes" the importations to only those procedures that explicitly
request them, thus making it clearer who is accessing global variables
and who isn't. In any case, this isn't much of a difference.
Note once again an interesting feature, present also in PASCAL: any
global variables used anywhere in the various separately-compiled
sources have to be declared in the main body (an exception in SPL is
OWN variables; more about them later).
This can be quite a burden; say you have an RL file whose
procedures use a bunch of local variables, often just to communicate
between each other (for instance, a bunch of resource handling
routines might need to share a "table of currently locked resources"
data structure).
Then, any program that calls these procedures -- whether it needs
to access the global variables or not -- would have to declare the
variables as GLOBAL. Not a good thing, especially if you like to view
your RL procedures as "black boxes" whose internal details should not
be cared about by their callers.
GLOBAL VARIABLES IN C
Both K&R C and Draft Standard C were designed with separate
compilation in mind; thus, they have standard provisions for global
variable declaration.
In general, you're required to do two things:
* In one of your separately compiled source files, you must declare
the variable as a simple global variable (i.e. outside of a
procedure declaration), e.g.
int debugging = 0;
char version[8] = "0.1 ";
Note that unlike PASCAL and SPL, C doesn't require that these
declarations occur in any particular source file, or even all in
one source file. You could, for instance, put these declarations
in your RL procedure source file -- then, if your main program
doesn't want to change these variables, it need never know that
they exist.
* Any source file that wants to access a global variable that it
didn't itself declare must define the variable as "extern":
extern int debugging;
extern char version[];
The EXTERN declaration may occur either at the top level (outside
a procedure), in which case the variable will be visible to all
procedures that are subsequently defined; alternatively, it may
occur inside a procedure, in which case the variable will be
known only to that procedure.
As I mentioned, the advantage of this sort of mechanism is that
there is no central place that is obligated to declare all the global
variables that are used in any of the separately compiled source
files. Rather, each variable may be declared and "extern"ed only where
it needs to be used.
This apparently unimportant feature can be useful in many cases.
Say you have two procedures, "alloc_dl_space" and "dealloc_dl_space",
that allocate space in the DL area of the stack. They need to keep
track of certain data, for instance, a list of all the free chunks of
memory. You can't just say:
alloc_dl_space()
{
int *first_free_chunk;
...
}
Not only will FIRST_FREE_CHUNK not be visible to DEALLOC_DL_SPACE, but
even a subsequent call to ALLOC_DL_SPACE won't "know" this variable's
value, since any procedure-local variable is thrown away when the
procedure exits. In C, however, you can say:
int *first_free_chunk = 0;
alloc_dl_space()
{
...
}
dealloc_dl_space()
{
...
}
Now, ALLOC_DL_SPACE and DEALLOC_DL_SPACE can both see and modify
the value of FIRST_FREE_CHUNK; also, since it is no longer a
procedure-local variable, the value will be preserved between calls to
these two procedures. Equally importantly,
* NOBODY WHO MIGHT CALL THESE TWO PROCEDURES WILL HAVE TO KNOW THAT
THEY USE THIS GLOBAL VARIABLE.
Contrast with the PASCAL and SPL approach, where each global variable
(and its type and size) would have to be known to the main program.
This both puts a burden on the programmer and adds the risk that this
variable -- really the property of the ALLOC_DL_SPACE and
DEALLOC_DL_SPACE procedures -- will somehow be changed by the main
program that has to declare it.
Note that this is the one case where the ability to initialize
variables (which C has, SPL has to some extent, and PASCAL doesn't
have) becomes really necessary. Since the variable is not known to the
main body, who's going to initialize it? The initialization clause
("int *first_free_chunk = 0") will do it.
Incidentally, SPL has a similar feature in its ability to have OWN
variables. These are also "static" variables -- i.e. ones that don't
go away when the procedure is exited -- but are only known within one
procedure. Thus, this is somewhat less useful than C's approach, but
better than PASCAL's, where all variables are either global (and thus
have to be declared in the main body) or non-static (i.e. get thrown
away whenever the procedure is exited).
PASCAL/XL'S MODULES -- A NEW (IMPROVED?) SEPARATE COMPILATION METHOD
PASCAL/3000, PASCAL/XL, SPL, and both species of C provide for
separate compilation. Essentially, any file may declare a procedure or
variable as "external", tell the compiler some things about it, and be
able to reference this procedure or variable. This sort of solution is
certainly necessary, and we can certainly live with it. But it is not
without its problems.
I keep about 300 procedures in my Relocatable Library (RL) file;
these are general-purpose routines that I call from all my various
programs. Say that I write a program in SPL or PASCAL. In order for my
program to be able to call any RL procedure, the program must include
an EXTERNAL declaration for the procedure, complete with declarations
of all the parameters. Even in C, I'd have to declare at least the
procedure's result type.
Now, none of my programs actually calls all 300 of the RL
procedures; most call only about 10 or 20, though some can call
hundreds. Even if the program only calls 20 RL procedures, having to
type all these EXTERNAL declarations can impose a substantial burden.
Twenty full procedure headers, each one of which must be exactly
correct, or else the program may very well fail in very strange ways
at run-time.
My solution to this problem was to create one $INCLUDE file -- C,
SPL, and PASCAL/3000 all have $INCLUDE-like commands -- that contains
the external declarations of each procedure in the RL. Then, every one
of my programs $INCLUDEs this file, thus declaring as external all of
the RL procedures, and making any one of them callable from the
program.
Thus, if my source file looks like:
BEGIN
INTEGER PROCEDURE MIN (I, J);
VALUE I, J;
INTEGER I, J;
BEGIN
IF I<J THEN MIN:=I ELSE MIN:=J;
END;
INTEGER PROCEDURE MAX (I, J);
VALUE I, J;
INTEGER I, J;
BEGIN
IF I>J THEN MAX:=I ELSE MAX:=J;
END;
...
END;
then my $INCLUDE file would look like:
INTEGER PROCEDURE MIN (I, J);
VALUE I, J;
INTEGER I, J;
OPTION EXTERNAL;
INTEGER PROCEDURE MAX (I, J);
VALUE I, J;
INTEGER I, J;
OPTION EXTERNAL;
...
As you see, one OPTION EXTERNAL declaration for each procedure. Now,
instead of manually declaring each such procedure in every file that
calls it, I can just $INCLUDE the entire "external declaration" file,
and have access to ALL the procedures in my RL. Similarly, I may have
a separate $INCLUDE file for various global constants, DEFINEs, and
variable declarations.
Still, the problem is obvious -- the procedure headers have to be
written at least twice, once where the procedure is actually defined
and at least once where the procedure is declared EXTERNAL. These two
definitions then have to be kept in sync, and any discrepancy may have
unpleasant consequences. The same goes for global variables, too.
Do you feel sorry for me yet? Well, if you don't, consider this. If
I write my procedures in PASCAL, then doubtless many of the parameters
will be of specially-defined types -- records, enumerations,
subranges, etc.
My EXTERNAL declarations must have EXACTLY THE SAME TYPES! This
means that not only does each external procedure need to be defined in
the external declarations, but so must any type that is used in any
procedure header. The same goes for any constant used in defining the
type (e.g. MAX_NAME_CHARS in the type 1..MAX_NAME_CHARS).
In other words, all of the "externally visible entities" of the
file containing the utility procedures must be duplicated in each
caller of the procedures (either directly or using a $INCLUDE). This
includes the externally-visible procedures, the externally-visible
variables, types, and constants. They have to be maintained together
with the procedure definitions themselves; any change in the external
appearance of the file must be reflected in the copy.
Thus, to summarize, if I have this source file:
CONST MAX_NAME_CHARS = 8;
TYPE FLAB_REC = RECORD ... END;
NAME = PACKED ARRAY [1..MAX_NAME_CHARS] OF CHAR;
...
PROCEDURE FLAB_READ (N: NAME; VAR F: FLAB_REC);
BEGIN
...
END;
...
PROCEDURE FLAB_WRITE (N: NAME; VAR F: FLAB_REC);
BEGIN
...
END;
...
FUNCTION FLAB_CALC_SECTORS (F: FLAB_REC): INTEGER;
BEGIN
...
END;
then at the very least I have to have an $INCLUDE file that looks like
this:
CONST MAX_NAME_CHARS = 8;
TYPE FLAB_REC = RECORD ... END;
NAME = PACKED ARRAY [1..MAX_NAME_CHARS] OF CHAR;
PROCEDURE FLAB_READ (N: NAME; VAR F: FLAB_REC);
EXTERNAL;
PROCEDURE FLAB_WRITE (N: NAME; VAR F: FLAB_REC);
EXTERNAL;
FUNCTION FLAB_CALC_SECTORS (F: FLAB_REC): INTEGER;
EXTERNAL;
So much for the tale of woe. I bet you're thinking: "Now Eugene's
going to tell us that PASCAL/XL solves all these problems." Well, I
wish I could, but I can't.
PASCAL/XL's MODULEs look something like the following:
$MLIBRARY 'FLABLIB'$
MODULE FLAB_HANDLING;
$SEARCH 'PRIMLIB, STRLIB'$
IMPORT PRIMITIVES, STRING_FUNCTIONS;
EXPORT
CONST FSERR_EOF = 0;
FSERR_NO_FILE = 52;
FSERR_DUP_FILE = 100;
TYPE FLAB_TYPE = RECORD
FILE: PACKED ARRAY [1..8] OF CHAR;
GROUP: PACKED ARRAY [1..8] OF CHAR;
...
END;
DISC_ADDR_TYPE = ARRAY [1..10] OF INTEGER;
FUNCTION FLABREAD (DISC_ADDR: DISC_ADDR_TYPE): FLAB_TYPE;
PROCEDURE FLABWRITE (DISC_ADDR: DISC_ADDR_TYPE; F: FLAB_TYPE);
IMPLEMENT
CONST FSERR_EOF = 0;
FSERR_NO_FILE = 52;
FSERR_DUP_FILE = 100;
MAX_FILES_PER_NODE = 16; { not exported }
TYPE FLAB_TYPE = RECORD
FILE: PACKED ARRAY [1..8] OF CHAR;
GROUP: PACKED ARRAY [1..8] OF CHAR;
...
END;
DISC_ADDR_TYPE = ARRAY [1..10] OF INTEGER;
DISC_ID = 1..256; { not exported }
FUNCTION ADDRCHECK (DISC_ADDR: DISC_ADDR_TYPE): BOOLEAN;
BEGIN
{ the implementation -- ADDRCHECK is not exported }
END;
FUNCTION FLABREAD (DISC_ADDR: DISC_ADDR_TYPE): FLAB_TYPE;
BEGIN
{ the actual implementation of the function... }
END;
PROCEDURE FLABWRITE (DISC_ADDR: DISC_ADDR_TYPE; F: FLAB_TYPE);
BEGIN
{ the actual implementation of the function... }
END;
END;
OK, let's look at this one piece at a time:
* First, we say "$MLIBRARY 'FLABLIB'$" followed by "MODULE
FLAB_HANDLING". These tell the compiler that this file
DEFINES A MODULE CALLED "FLAB_HANDLING" INSIDE THE
SPECIALLY-FORMATTED MODULE LIBRARY "FLABLIB".
All of the information about the "external interface" of this
module -- i.e. everything that is specified in the EXPORT section
-- will be stored into this module library.
* Then we say "$SEARCH 'PRIMLIB, STRLIB'$" followed by "IMPORT
PRIMITIVES, STRING_FUNCTIONS;". This essentially "brings in" the
external interface of the modules PRIMITIVES and STRING_FUNCTIONS
that are stored in the module library files PRIMLIB and STRLIB.
Practically speaking, this is exactly the same as if
ALL THE TEXT SPECIFIED IN THE "EXPORT" SECTION OF THE
"PRIMITIVES" AND "STRING_FUNCTIONS" MODULES WAS INCLUDED
DIRECTLY INTO THE FILE, WITH "EXTERNAL;" KEYWORDS PLACED RIGHT
AFTER THE FUNCTION DEFINITIONS.
Now the compiler "knows" about all the TYPEs, CONSTs, VARs,
PROCEDUREs, and FUNCTIONs that are defined by the PRIMITIVES and
STRING_FUNCTIONS modules, and will let you use them from within
the module that's being currently defined (FLAB_HANDLING).
* The EXPORT section defines the "external interface" of this
module. We've already explained what this really means --
whenever you IMPORT a module, the result is exactly the same as
if you had copied in the EXPORT section of the module (except
that "EXTERNAL;" keywords are automatically put after all the
functions).
Any CONSTants, VARiables, TYPEs, PROCEDUREs, or FUNCTIONs that
you define in this module but want other modules to use should be
declared in the EXPORT section.
* Finally, the IMPLEMENT section is the actual source code of your
file. It has to include all the declarations and
procedure/function definitions (EVEN THE ONES ALREADY MENTIONED
IN THE "EXPORT" SECTION!).
In fact, it's just like an ordinary PROGRAM, except that it's
missing the PROGRAM statement.
Thus, if you think about it, you could have -- instead of defining
a MODULE --
* Written the IMPLEMENT section as an ordinary program;
* Put the EXPORT information into a separate file (we'll call it
the "external declarations $INCLUDE file");
* And, instead of the IMPORTs, just used the $INCLUDE$ compiler
command to include the "external declarations $INCLUDE files" of
all of the IMPORTed modules.
Really, this is ALL THERE IS TO A MODULE. Its only advantages are:
* You can keep the EXPORT declarations and IMPLEMENT part in the
same file, so that when you change the definition of one of the
external objects, you can easily change the EXPORT declaration in
the same file.
* The compiler will check to make sure that the EXPORT declarations
are exactly the same as the actual implementation declarations
(i.e. that you didn't define the procedure one way in an EXPORT
and another way in the IMPLEMENT section).
* Finally, for honesty's sake, I ought to point out that you'd
actually need two $INCLUDE files per module if you wanted to use
them instead of MODULEs. You'd need one $INCLUDE file for the
TYPE/CONST/VAR declarations and one file for the
PROCEDURE/FUNCTION EXTERNAL declarations -- this is because ALL
of the TYPE/CONST/VAR declarations for all $INCLUDE$d modules
would have to go before ALL the EXTERNAL declarations.
The major flaw -- not compared with $INCLUDE$s but rather compared
to what they could have so easily done! -- is obvious:
* WHY FORCE US TO DUPLICATE ALL THE EXTERNAL OBJECT DECLARATIONS IN
THE "EXPORT" AND "IMPLEMENT" SECTIONS?
Why should I define the hundred-odd subfields of a file label twice?
Why should I specify the parameter list of all my external procedures
twice?
To me it seems simple -- just have all the declarations go into the
IMPLEMENT section, and then let me say:
EXPORT FSERR_EOF, FSERR_NO_FILE, FSERR_DUP_FILE,
FLAB_TYPE, DISC_ADDR_TYPE,
FLABREAD, FLABWRITE;
How simple! Think of all the effort and possibility for error that
we'd avoid. Even better -- I'd like to be able to say either
EXPORT_ALL;
to indicate that ALL the things I define should be exported, or
perhaps specify a $EXPORT$ flag near every definition that I want to
export. That way, if I have a file with 20 constants, 5 variables, 10
data types, and 30 procedures, I wouldn't have to enumerate them all
in an EXPORT statement at the top.
This way, not only do I have to enumerate all of them, but I have
to DUPLICATE ALL THEIR DEFINITIONS! Why?
Another quirk that you may or may not have noticed: obviously,
since I specify a module name AND a module library filename, there can
be more than one module per library. And yet, I define a different
library file for each of the modules FLAB_HANDLING, PRIMITIVES, and
STRING_FUNCTIONS. Why would I do a silly thing like that?
Well, a little paragraph hidden in my PASCAL/XL Programmer's Guide
says: "A module can not import another module from its own target
library; that is, the compiler options MLIBRARY and SEARCH can not
specify the same library."
Seems innocent, eh? But that means that any time a module must
import another module, the imported module must be in a different
library file! The only modules that CAN be stored in the same module
library file are ones that do NOT import one another.
Well, it makes perfect sense for FLAB_HANDLING to want to use
PRIMITIVES and STRING_FUNCTIONS -- after all, why do I define modules
if not to be able to IMPORT them into as many places as possible?
Similarly, STRING_FUNCTIONS will probably want to use PRIMITIVES. What
we get is a rather paradoxical situation in which
THE ONLY MODULES THAT CAN BE PUT INTO THE *SAME* MODULE LIBRARY
FILES ARE ONES THAT HAVE *NOTHING TO DO WITH EACH OTHER*!
Of course, that's a bit of hyperbole, but you get my point -- you make
modules so you can IMPORT them into one another, but the more you use
IMPORT the less able you are to have several related modules in the
same library.
STANDARD LANGUAGE OPERATORS
Everybody knows operators; all languages have them. +, -, *, / --
you pass them parameters and they return to you results. Imagine what
would happen if you couldn't multiply two numbers!
PASCAL's set of operators is, by standards of most computer
languages (like BASIC, FORTRAN, and COBOL), about average. It
includes:
monadic + monadic -
+ - * /
DIV MOD
NOT AND OR
< <= = <>
> >=
IN (the set operator)
^ (pointer dereferencing)
[ (array subscripting)
These include:
* The "monadic" operators, which take one parameter.
* The "dyadic" operators, which take two parameters.
* The arithmetic operators +, -, *, /, DIV, and MOD. These operate
on integer or real parameters.
* The logical operators NOT, AND, and OR. They work on booleans.
* The relational operators <, <=, =, <>, >, and >=. They take
numbers and return booleans.
* The SET operators IN, +, -, and *. They work on sets; note that
+, -, and * mean quite different (though conceptually similar)
things for sets than they do for integers -- + is set union, - is
set difference, and * is set intersection.
* Some other things that you may not think of as operators but
which most assuredly are. ^ and [, it seems to me, are just as
much operators as anything else.
Now, it takes no feats of analysis to construct this kind of list;
I just looked in the PASCAL manual. The C manual reveals to me that
the C operator set is -- at least in terms of number of operators --
quite a bit richer:
C operator PASCAL equivalent
---------- -----------------
monadic +, monadic - Same
+, -, *, / Same
% MOD
<, >, <=, >=, ==, != Same; == is "equal", != is "not equal"
!, &&, || NOT, AND, OR (respectively)
~, &, |, ^ BITWISE NOT, AND, OR, and EXCLUSIVE OR
<<, >> BITWISE LOGICAL SHIFTS (LEFT and RIGHT)
monadic *, e.g. *X pointer^
monadic &, e.g. &X None -- returns address of a variable
++X, --X None -- increments or decrements X by 1
(usually) and also returns the new value
(simultaneously changing X)
X++, X-- None -- increments or decrements X by 1
(usually) but returns the value that X
had BEFORE the increment/decrement!
X=Y := assignment, but can be used in an
expression (e.g. "X = 2+(Y=3)+Z").
X+=Y, X-=Y, X*=Y, X/=Y, None -- X+=Y means the same as
X&=Y, X|=Y, X^=Y, "X=X+Y", and so on for the other ops
X%=Y, X<<=Y, X>>=Y
X?Y:Z None -- an IF/THEN/ELSE that can be
used within an expression.
"(X<Y)?X:Y" returns the minimum of
X and Y (IF X<Y THEN X ELSE Y).
(X,Y,Z) None -- executes the expressions X, Y,
and Z, but only returns the result of Z!
sizeof X Returns the size, in bytes, of its
operand expression.
This is the rich operator set that many C programmers are so proud
of, and deride PASCAL and other languages for not possessing. Indeed,
several new categories of operators do exist. But are they really that
useful? Or can they be easily (and, perhaps, more readably) emulated
with conventional, more familiar, operators?
BIT MANIPULATION
PASCAL prides itself on being a High Level language. That's not
just high level as in "high-level" -- it's High Level, with a capital
H and a capital L. In PASCAL, much care is taken to insulate the
programmer from the underlying physical structure of the data. You
don't need to know how many bytes there are in a word, or how many
words there are in your data structure; you don't even need to know
that there is such a thing as a "bit" and that it is what is
manipulated deep down inside the computer.
Unfortunately, we do not live in a High Level world. If I want to
write a PASCAL program that can access, say, file labels, or PCB
entries, or log file records, I need to be able to access individual
bit fields.
This may mean that the system was badly designed in the first
place; perhaps too much attention was paid to saving space; perhaps the
operating system should have been written in PASCAL, too, so I could
just use the same record structure definitions to access the system
tables that the system itself uses.
For better or worse, there are plenty of cases where we need to
manipulate bit fields:
* In FOPEN, FGETINFO, and WHO calls;
* In accessing data in system tables and system log files;
* In writing one's own packed decimal manipulation routines;
* In compressing one's own data structures to save space, by no
means an unworthy goal;
* And many other cases, but primarily systems programming rather
than application programming.
Thus, although bit fields are admittedly not something you usually
use quite as often as, say, strings or integers, they must be
supported by any system programming language.
How do PASCAL, SPL, and C support bit fields?
TYPICAL OPERATIONS ON BITS
There are really three main classes of operations that you'd want
to perform with bits:
* BITWISE OPERATIONS ON ENTIRE NUMBERS. These view a quantity -- an
8-bit, a 16-bit, a 32-bit, or whatever -- as just a sequence of
bits. When you do a "bitwise not" of a number, each of the bits
in it is negated; a "bitwise and" of two numbers ands together
each bit in the two numbers -- bit 0 of the result becomes the
"logical and" of bit 0 of the first number and bit 0 of the
second; bit 1 of the result becomes the and of the two bits 1;
and so on.
* BIT SHIFTS. Bit shifts also view a number as a bit sequence. A
bit shift just takes all of the bits and "moves" them some number
to the left or to the right. For instance, if you shift 101 left
by 2 bits, this takes the bit pattern
00001100101
and makes it into
00110010100
which is 404.
* BIT EXTRACTS/DEPOSITS. Bit extracts take a particular sequence of
bits -- say "3 bits starting from bit #10" -- and extract the
value stored at those bits; in other words, they view a bit
string as an integer. Bit deposits allow you to set the 3 bits
starting at bit #10 to some value. In SPL, for instance,
RECFORMAT:=FOPTIONS.(8:2)
and
FOPTIONS.(8:2):=RECFORMAT
extract and deposit bit fields.
Since the computer's lowest-level data type is the bit, bit
operations can usually be performed quite easily and efficiently.
Virtually all computers have the BITWISE OPERATIONS and SHIFTS built
in as instructions, and some (like the HP3000 and the VAX, but not,
say, the Motorola 68000) have BIT EXTRACT/DEPOSIT instructions as
well.
An interesting fact is that although bitwise operations and shifts are
relatively hard and inefficient to emulate in software, bit extracts
and deposits are relatively easy. For instance,
X.(10:3) -- extract 3 bits starting from bit #10
(assuming 16-bit words, least significant bit=#15)
is the same as
X shifted left by 10 bits and shifted right by 13 bits
or
X shifted right by 3 bits and ANDed with 7.
In general,
I.(START:COUNT) = I shifted left by START and shifted right by
(#bits/word) - COUNT.
So, just because these three types of operations are available in
certain languages, it does not follow that all of them are necessary.
Some can be easily (or not so easily) emulated using the others, but,
more importantly, it may well be the case that some just aren't very
frequently useful. Bit extracts, for instance, are something that I've
frequently found myself doing; bitwise operations and especially
shifts are (at least for me) far rarer things.
SPL
SPL supports all three types of bit operations, and supports them
all in a big way.
* Single-word unsigned quantities (LOGICALs) can be bit-wise
negated (NOT), ANDed (LAND), inclusive ORed (LOR), and exclusive
ORed (XOR).
* Single-word quantities can have arbitrary CONSTANT bit substrings
extracted and deposited. In other words, you can say
"FOPTIONS.(10:3)", but you CAN'T say "CAPMASK.(CAPNUM:1)".
* Single-, double-, triple-, and quadruple-word quantities can be
shifted in a number of ways -- see the SPL Reference Manual for
details.
* Some other, less important features, like "Bit concatenation" are
also supported; see the SPL Reference Manual if you're curious.
Bit extractions are, naturally, vital in SPL. All sorts of things
-- FOPEN foptions and aoptions, WHO capability masks, system table
fields, etc. -- contain various bit fields, most of which do not cross
word boundaries (hence it's sufficient to have bit fields of single
words only) and have constant offsets and lengths (hence it's only
necessary to have constant bit field parameters).
Other operations are less frequent. Checking my 13,000-line RL file
-- which, of course, is completely representative of any SPL program
that anybody's ever written -- I find that I use shift operations in a
very distinct set of cases:
* Converting byte to word addresses and vice versa.
* Extracting variable bit fields (e.g. if I have a WHO-format
capability mask and a variable capability bit number).
* Extracting bit fields of doubles (from WHO masks, disc addresses,
and other double-word entities).
* Rarely, quick multiplies and divides by powers of two.
* Constructing integers and doubles out of bytes and integers.
Note that the majority of these cases are actually "work-arounds"
caused by compiler problems. If SPL had a C-like "cast" facility with
which I could convert a byte address to a word address and vice versa,
I wouldn't need to do an ugly and unreadable shift; if SPL's bit
extracts were more powerful, I could use them for variable fields and
doubles; if SPL's optimizer were better, I could always do multiplies
and divides and count on SPL to do the work.
Shifts, then, in my opinion are a classic case of something that might
not be needed in a "perfect" system; however, as we see, they can come
in handy to avoid the imperfections that are bound to exist in any
language.
Similarly, I find that I almost don't use bitwise operations at
all; the only cases I do use them are those where I need to implement
double-word bit fields and variable bit fields.
C
C's approach to bit operations is rather like that of SPL. Since
most computers have bit manipulation instructions, C has bit operators
built in to the language. Actually, C does not have bit extracts and
deposits, but it does have shifts and bitwise operators, which (as I
mentioned) can be used to emulate bit extractions.
However, C has another mechanism to handle bit fields, which is
both more and less usable than SPL's.
In C, structures can have fields that are explicitly declared to be
a certain number of bits long. For instance, consider the following
definition:
typedef struct {unsigned:2; /* Bits .(0:2); unused */
unsigned file_type:2; /* .(2:2) */
unsigned ksam:1; /* .(4:1) */
unsigned disallow_fileeq:1; /* .(5:1) */
unsigned labelled_tape:1; /* .(6:1) */
unsigned cctl:1; /* .(7:1) */
unsigned record_format:2; /* .(8:2) */
unsigned default_desig:3; /* .(10:3) */
unsigned ascii:1; /* .(13:1) */
unsigned domain:2; /* .(14:2) */
} foptions_type;
This defines a data type called "foptions_type" as a structure with
several subfields. However, each subfield occupies a certain number of
bits, and because of certain guarantees made by the C compiler (which
differ, by the way, among different Cs on different machines) we know
which bits they are. Thus, if we say:
foptions_type foptions;
...
fgetinfo (fnum,,foptions);
if (foptions.cctl == 1) ...
A lot clearer, you'll agree, than saying "FOPTIONS.(7:1)". And the
effect, of course, is exactly the same.
On the other hand, there are certain restrictions to this bit field
extraction mechanism:
* Just like in SPL, you can't extract bit fields whose offset and
length are not constant. To do this, you have to use the shift
operators or bitwise operations. For instance, instead of writing
CAPMASK.(BIT:2) (which you couldn't do in SPL anyway),
you'd say:
(capmask << bit) >> 14
or, perhaps,
(capmask >> (14-bit)) & 3 (3 = a binary 11)
You have to do similar (but uglier) stuff to set variable bit
fields.
* Another difference -- in which C falls short of SPL -- is that
the "structure subfield" approach of bit extraction only works
for getting bits from variables that were declared to be a
certain type. Say you want to extract bits (10:2) of something
that was declared as an "int", or, perhaps, an expression (like
"foptions | 4", which ORs foptions with 4, thus setting the ASCII
bit). You can't say
(foptions | 4).record_format
or even
((foptions_type) (foptions | 4)).record_format
(since you can't cast something to a structure type). Granted,
it's very rare that you'd want to extract a subfield of an
expression, and if you're trying to extract a subfield of a type
"int" variable, this might mean that you declared the variable
wrong. Still, the point here is that SPL's bit extract mechanism
is more flexible (though much less readable).
* Finally, a philosophical issue. Yes, you can use C record
structures to extract bit fields; but what if you want to use a
particular subfield only once? Do you need to declare a special
record structure just for extracting the field? Wouldn't it be
easier to do like you can in SPL, specifying the bit offset and
length directly, without encumbering yourself with a new
datatype?
This becomes an even more serious problem with PASCAL, in which
you can't even do bit shifts to emulate impromptu bit extraction.
The key point here is that for QUICK AND DIRTY, one-shot
operations, declaring a structure in order to be able to use bit
subfields may be more cumbersome than you'd like. Again, the old
trade-off of "Good Programming Style" versus ease of writing.
Thus, C supports bitwise operations and shifts (although not as
many shift operators as SPL does; on the other hand, many of SPL's
shift operators are of doubtful utility). It can also emulate bit
field manipulations using shifts, but, more importantly, can make bit
manipulations a lot easier and more readable using record structure
bit subfields. On the other hand, SPL's ".(X:Y)" operator has the
advantage of being usable on an ad-hoc basis, without needing to
declare a special record structure.
PASCAL
As I mentioned before, PASCAL, perhaps more for philosophical
reasons than anything else, does not explicitly support bits. "PASCAL:
An Introduction to Methodical Programming" (W. Findlay & D. A. Watt)
-- the book from which I learned PASCAL, and one that describes all
the Standard PASCAL features -- doesn't even mention "bits" in its
index.
Of course, the need for bit fields was recognized quite early, and
a fairly common consensus developed.
PACKED RECORDs are defined in Standard PASCAL as structures that
the compiler may -- at its option -- make use less space but slower to
access. Many PASCAL compilers use PACKED RECORDs as vehicles for
implementing bit subfields, much like C does.
For instance, say that you want to declare an "foptions" type much
like the one I showed for C. In PASCAL, it would be:
TYPE FOPTIONS_TYPE = PACKED RECORD
DUMMY: 0..3; { .(0:2) }
FILE_TYPE: 0..3; { .(2:2) }
KSAM: 0..1; { .(4:1) }
DISALLOW_FILEEQ: 0..1; { .(5:1) }
LABELLED_TAPE: 0..1; { .(6:1) }
CCTL: 0..1; { .(7:1) }
RECORD_FORMAT: 0..3; { .(8:2) }
DEFAULT_DESIG: 0..7; { .(10:3) }
ASCII: 0..1; { .(13:1) }
DOMAIN: 0..3; { .(14:2) }
END;
Note the most obvious feature here (which you may consider either
ingenious or utterly laughable, depending on your prejudices). Instead
of specifying the number of bits you're using explicitly, you specify
a range from 0 to
2 ^ NUMBITS - 1
The compiler then decides that the smallest number of bits it could
use to represent this is NUMBITS, and allocates that many. Remember
that PASCAL is very reluctant to let the programmer "see" any aspect
of the internal representation of its variable; therefore, it prefers
that bit fields be thus declared implicitly rather than explicitly.
Now, you can access the bit fields of an FOPTIONS_TYPE variable
just like you would in C:
VAR FOPTIONS: FOPTIONS_TYPE;
...
FGETINFO (FNUM, , FOPTIONS);
IF FOPTIONS.CCTL=1 THEN ...
If you don't like having to declare the bit fields with all those
powers of 2 (quick -- how many bits is "0..8191"?), you can just issue
the following declarations (usually in an $INCLUDE$ file):
TYPE BITS_1 = 0..1;
BITS_2 = 0..3;
BITS_3 = 0..7;
...
BITS_13 = 0..8191;
BITS_14 = 0..16383;
BITS_15 = 0..32767;
and then declare each subfield of, say, FOPTIONS_TYPE, as being of
type BITS_1 or BITS_3 or whatever.
So, the high-level means of accessing bit subfields exists in
PASCAL just as it does in C. What about the low-level means? What if
you want to do a shift or a bitwise AND? Or, more concretely, what if
you want to extract, say, a bit field that starts at a variable
offset?
Fortunately, HP PASCAL (unlike SPL and C) provides a nice built-in
mechanism for handling variable-offset bit fields. Consider our
classic example, a 32-bit "capability mask" (of the type that WHO
returns). We want to be able to retrieve, say, the BITNUMth bit of the
capability mask. In PASCAL, we say:
TYPE CAPABILITY_MASK_TYPE = PACKED ARRAY [0..31] OF 0..1;
VAR CAPABILITY_MASK: CAPABILITY_MASK_TYPE;
...
I:=CAPABILITY_MASK[BITNUM];
Simple! Because "PACKED" in PASCAL can apply to any kind of structure,
HP PASCAL (and many other PASCAL compilers) allow PACKED ARRAYs of
subranges that can fit in less than 1 byte (e.g. 0..1, 0..3, etc.) to
become arrays of bit fields.
Thus, "CAPABILITY_MASK[BITNUM]" extracts the BITNUMth bit of
CAPABILITY_MASK simply because, to HP PASCAL, the BITNUMth bit is the
BITNUMth element of an array of 32 bits. We could do the same,
incidentally, to an array of 48 bits, 64 bits, etc. On the other hand,
if we want to retrieve more than one bit, we can see the limitations
of the PASCAL approach:
* There is no simple way, for instance, to retrieve a variable
number of bits from an integer. Since we have neither a bit
extract nor a bit shift operator, we can't use them; our only
alternatives are using division and modulo by powers of 2 (quite
complicated and very inefficient) or using the PACK and UNPACK
built-in procedures in a rather esoteric way (equally complicated
and maybe more inefficient).
* We can't even, in general, extract, say, 2 bits from a variable
bit offset! We can only do this if we know that the bit offset
will be a multiple of 2 -- in that case, we can use a PACKED
ARRAY OF 0..3. Similar problems, of course, happen with
extracting 3-bit fields, 4-bit fields, etc. from variable
boundaries (although we can, for instance, extract nybbles from a
packed decimal number, because they always start on a 4-bit
boundary).
In light of all this, it should be obvious that, say, shifts or
bitwise operations are well-nigh impossible to do efficiently or
conveniently in PASCAL (although in HP PASCAL, bitwise operations can
be craftily and kludgily emulated using sets).
What we see here is, I believe, a common PASCAL syndrome. Those
things that the language supports -- to wit, fixed-offset bit fields
and arrays of single-bit fields -- it often supports quite well; you
can use these features very readably and efficiently. On the other
hand, anything that the language designers didn't think to give you --
like shifts and bitwise operators -- you have NO WAY of accessing.
If the compiler is too dumb to implement "X*8" as the much more
efficient "X left shifted by 3 bits", too bad; in SPL and C you can do
this yourself, but in PASCAL you can't. If the compiler doesn't have
variable-offset bit field support, SPL and C let you do it using
shifts; in PASCAL, it would be very difficult and very inefficient.
Thus, if you find PASCAL's bit handling mechanisms sufficient -- as
is quite probable, since the major features are there -- then you'll
have no problems. On the other hand, there's a very distinct limit on
what you can do in PASCAL, and PASCAL doesn't have the flexibility to
let you work around it easily.
PASCAL/XL
Just a brief comment about PASCAL/XL -- in PASCAL/XL, instead of
using PACKED RECORD and PACKED ARRAY for bit field support, you must
use CRUNCHED RECORDs and CRUNCHED ARRAYs. Remember this when you try
to write code that'll run both in PASCAL/XL and normal HP PASCAL.
Remember this and weep.
INCREMENT AND DECREMENT IN C
Another feature of C worth mentioning is the variable increment/
decrement set of operators. This is what allows C programmers to say
char a[80], b[80];
char *pa, *pb;
pa = &a[0];
pb = &b[0];
while (*pa != '\0')
*(pb++) = *(pa++);
This, of course, is either one of the most elegant pieces of code ever
written, or one of the most unreadable. Or both.
There are four operators like this in C:
* ++ PREFIX, i.e. "++X". This increments X by 1 (or sometimes 2 or
4, if it's a pointer -- more about this later) and returns X's
new value. In other words, if you say
int x, y;
x = 10;
y = 1 + (++x) + 2;
then X will be set to 11 and Y will be 14 (1 + 11 + 2).
* ++ POSTFIX, i.e. "X++". This increments X by 1 (or 2 or 4) and
returns X's OLD value, the one it had before the increment. Thus,
int x, y;
x = 10;
y = 1 + (x++) + 2;
then X will be set to 11, but Y will be 13 (1 + 10 + 2). In the
calculation of Y, the old value of X (10) was used.
* -- PREFIX ("--X"), just like ++, but decrements.
* -- POSTFIX ("X--"), just like --, but decrements.
Note that ++ and -- don't always increment/decrement by 1. If you
pass them a POINTER, the pointer will be incremented or decremented by
the SIZE OF THE OBJECT BEING POINTED TO. Thus, if "int" is a 4-byte
integer,
int a[10];
int *ap;
ap = &a[0];
++ap;
will increment AP by 4 bytes (or 2 words, depending on how the pointer
is represented internally). In other words, "++" and "--" of pointers
actually increment or decrement by one element; if AP used to point to
element 0 of A, now it points to element 1.
Now, one reason why I mention these operators is that they are a
non-trivial difference between C and PASCAL, and I have to say
something about them just so you'll think I'm thorough.
Another reason, though, is that beyond the seemingly simple (and,
in case of the postfix operators, counterintuitive) definition lurks a
fairly powerful construct that can be quite useful in many cases. On
the other hand, some say -- and not without reason -- that using these
kinds of operators makes code much more difficult to read and
understand.
One of the original reasons that these operators were introduced
was that some of the computers that C was first implemented on
supported these operations in hardware. Modern computers, like the VAX
and the Motorola 68000, for instance, have special "addressing modes"
on each instruction that allow you to store something into the
location pointed to by a register and then increment the register
(like postfix ++) or decrement the register and then store (like
prefix --).
A reasonable compiler, though, can know enough to translate, say,
X:=X-1;
into "decrement X"; even the 15-year-old SPL compiler can do this.
Today's reason for the increment/decrement operators -- besides saving
poor programmers' weary fingers -- is that in many cases they can very
directly represent what you're trying to do.
A classic case, for instance, is stack processing. Say you want to
implement your own stack data structure. The primary operations you
need are to PUSH a value onto the stack and to POP a value. In SPL,
you might have a pointer PTR that points to the top cell, and define
two procedures,
PROCEDURE STACK'PUSH (V);
VALUE V;
INTEGER V;
BEGIN
@PTR:=@PTR+1;
PTR:=V;
END;
INTEGER PROCEDURE STACK'POP;
BEGIN
STACK'POP:=PTR;
@PTR:=@PTR-1;
END;
In C, you can have PTR point to one cell AFTER the top cell, and say:
*(ptr++) = v; /* to push V onto the stack */
v = *(--ptr); /* to pop a value from the stack */
For stacks (and queues and other data structures), post-increment
and pre-decrement are EXACTLY what you need. Of course, a
full-function stack package would have to have many more features, but
many of them can profitably use post-increment/ pre-decrement and
other nice C features.
Other applications are, for instance, the case I showed as an
example earlier:
char a[80], b[80];
char *pa, *pb;
pa = &a[0];
pb = &b[0];
while (*pa != '\0')
*(pb++) = *(pa++);
What this actually does (isn't it obvious?) is copy the string stored
in the array A to the array B. PA is a pointer to the current
character in A; PB is a pointer to the current character in B. Since
all C strings are terminated by a null character ('\0'), the loop goes
through A, incrementing the pointers and copying characters at the
same time!
Similarly, you can say...
while (*(pa++) == ' ');
which will increment PA until it points to a non-blank; or,
while (*(pa++) == *(pb++));
which will increment PA and PB while the characters they point to are
equal -- very useful for a string comparison routine.
In a way, these features of C are rather like FOR loops that never
execute when the starting value is greater than the ending value. In
PASCAL, for instance, you can say
FOR INDEX:=CURRCOLUMN+1 TO ENDCOLUMN DO
...
and know that if CURRCOLUMN = ENDCOLUMN, the loop won't be executed at
all (which happens to be exactly what you want). In classic FORTRAN,
though,
DO 10 INDEX=CURRCOLUMN+1,ENDCOLUMN
...
will always execute the loop at least once, even if CURRCOLUMN is
equal to ENDCOLUMN; if you don't want this, you have to have an IF ...
GOTO around the loop. The point here is that the
"post/pre-increment/decrement" features are one of those things that
"just happen to come in handy" in a surprising number of cases. Just
by looking at them, you wouldn't think that they're so useful, but
there are a lot of applications where they are just the thing.
Fine, you've heard the "pro". ++ and -- let you write a lot of
elegant one-liners for handling stacks, strings, queues and the like.
Now, the con:
while (*pa != '\0')
{
*pb = *pa;
pb = pb + 1;
pa = pa + 1;
}
while (*pa == ' ')
pa = pa + 1;
while (*pa == *pb)
{
pa = pa + 1;
pb = pb + 1;
}
What are these? Well, these are the C loops that do essentially what
the above post/pre-increment examples do, but using conventional
operators. Are they more or less readable than the ++ mechanisms we
saw? Let us for the moment ignore performance; even if the compiler
doesn't optimize all these cases (and, to be fair, many compilers
won't), performance isn't everything. DO ++ AND -- CONSTRUCTS MAKE
YOUR CODE MORE OR LESS READABLE?
Now, I don't have any opinions on this matter; I just tell you the
two sides of the issue and let you decide. I am completely objective
(if you believe that, I've got some waterfront property in Kansas you
could have real cheap...). Readability isn't a black-and-white sort of
thing; everybody has his own standards. What do you think? Are these
"two-in-one" programming constructs elegant or ugly?
?: AND (,)
Let's say that you want to call the READX intrinsic, reading either
LEN words or 128 words, whichever is less. You know that your buffer
only has room for 128 words, and you don't want to overflow it if LEN
is too large; normally, though, if LEN<128, you want to read only LEN
words.
In PASCAL, you'd have to write this:
IF LEN<128 THEN
ACTUAL_LEN := READX (BUFFER, LEN)
ELSE
ACTUAL_LEN := READX (BUFFER, 128);
In C, however, you can instead say:
actual_len = readx (buffer, (len<128)?len:128);
(Think of all the keystrokes you save!) What this actually means is
that the second parameter to READX is the expression
(LEN<128) ? LEN : 128
This is a "ternary" operator -- an operator with three parameters:
* the first parameter (before the "?") is a boolean expression,
called the "test" -- in this case "LEN<128";
* the second parameter (between the "?" and ":") is an expression
called the "then clause", in this case "LEN";
* the third parameter (after the ":") is the "else clause", in this
case "128".
The behavior is quite simple -- if the "test" is TRUE, this operator
returns the value of the "then clause"; if the test is FALSE, the
operator returns the value of the "else" clause. Just like an
IF/THEN/ELSE statement, except that it returns the value of an
expression instead of just executing some statements.
The advantage should be clear. There are many cases where you need
to do one of two things, almost exactly identical except for one key
parameter. If the two tasks need a different statement in one case or
another (e.g. a call to READ instead of READX), you'd use an
IF/THEN/ELSE; if they need a different expression as a parameter
inside a statement, you'd use a ?: construct.
The trouble is, again, one of readability. Consider this example
(taken as an example of the "right way" of using ?/: from "C: A
Reference Manual," by Harbison & Steele):
return (x > 0) ? 1 : (x < 0) ? -1 : 0;
OK, quick, what does this do? Why, it determines the "signum" of a
number, of course! +1 if the number is positive, -1 if it's negative,
0 if it's zero. Which is more readable -- the above or
if (x > 0) return 1;
else if (x < 0) return -1;
else return 0;
Again, up to you to decide -- some would say that ?: is better, others
would side with the IF/THEN/ELSE. On the other hand, in this case, I
think there is a substantive thing to be said against "? :":
* ?: IS MORE THAN JUST AN OPERATOR; IT'S A CONTROL STRUCTURE IN
THAT IT INFLUENCES THE FLOW OF THE PROGRAM. ESPECIALLY WHEN ?:S
ARE NESTED (e.g. if you're testing two conditions and do one of
four things based on the result), THE FACT THAT THIS CONTROL
STRUCTURE IS DELIMITED BY TWO SPECIAL CHARACTERS (rather than,
say, IF, THEN, or ELSE) CAN MAKE THE PROGRAM DIFFICULT TO READ.
In other words, since "?" and ":" are just two special characters,
like many of the other special characters that occur in C statements,
you can often have a hard time finding out where the test starts and
where it ends, where one THEN or ELSE clause ends, and so on. This is
especially the case when you write code like
a = (x>0) ? ((y>0)?(x*y):(-x*y)) : ((y>0)?(-x+y):(error_trap()));
in which ?:s are nested within each other. Of course, you might say
that this code is badly written; perhaps it should be:
a = (x>0)
? ((y>0) ? (x*y) : (-x*y))
: ((y>0) ? (-x+y) : (error_trap()));
But then, why not just write it as
if (x>0 && y>0) a = x*y;
else if (x>0) a = -x*y;
else if (y>0) a = -x+y;
else a = error_trap();
In any case, this is mostly a matter of personal preferences. I'm in
favor of using ?: in #define's (where it is necessary -- see the
chapter on them, and on "(,)" below), such as
#define min(a,b) (((a)<(b)) ? (a) : (b))
#define abs(a) (((a)<0) ? -(a) : (a))
On the other hand, I try to avoid ?:s in normal code in almost all
cases. I prefer IF/THEN/ELSE statements instead.
(,)
Just like ?: is equivalent to an IF/THEN/ELSE,
(x,y,z)
is essentially equivalent to
{ /* begin */
x;
y;
z;
} /* end */
The difference, of course, is that "(x,y,z)" returns the value of Z.
For instance, say that we're looping through a file. We want to read a
bunch of records until we get an EOF (which, presumably, is a returned
by a call to the "get_ccode" procedure). We might write:
while ((len=fread(fnum,data,128), get_ccode()==2))
...
Instead of a simple expression, our loop test consists of two parts,
which are both evaluated (in the given order!) to determine the value
-- every time we do the loop test, we first call FREAD and then check
the result of GET_CCODE. This, of course, is identical to
len = fread(fnum,data,128);
while (get_ccode()==2)
{
...
len = fread(fnum,data,128);
}
but using the "(x,y)" construct, we don't have to write the FREAD
call twice.
A more common use of this is in FOR loops. The three parts of the
FOR loop -- the loop counter initialization, the loop termination
test, and the loop counter increment -- must all be single
expressions. Using the "," operator, you can fit several operations
into one expression, to wit:
for (pa=&a[0], pb=&b[0]; *pa!='\0'; pa++, pb++)
*pb=*pa;
This, of course, copies the string A (without the trailing zero) to
the string B. The "," operator isn't used here for its value (which
would be the value of "&b[0]" in the initialization portion and the
new value of "pb" in the increment portion); it's just used for
combining several expressions in a context where only one is allowed.
The major power of both the ?: operator and the "," operator is
manifested in #define's. Statements can only appear at the
"top-level", separated by semicolons; expressions, however, can appear
either inside statements or in place of statements. Thus,
#define min(a,b) if (a<b) a; else b;
won't work, because it will make
x=min(y,z)
expand into
x=if (y<z) y;
else z;
which is quite illegal. On the other hand,
#define min(a,b) (((a)<(b)) ? (a) : (b))
will translate
x=min(y,z)
into
x=((y)<(z)) ? (y) : (z);
which will do the right thing.
Similarly, say you have a record structure
typedef struct {real re; real im;} complex;
Then, you can have a #define
#define cbuild(z,rpart,ipart) (z.re=(rpart), z.im=(ipart), z)
When you say "CBUILD(Z,1.0,3.0)", this sets the variable Z's RE
subfield to 1.0, its IM subfield to 3.0, and returns the value Z. You
can use this in cases like
c = csqrt (cbuild(z,1.0,3.0));
If you didn't have the "," operator, you couldn't write a #define like
this. You could say:
#define cbuild(z,rpart,ipart) {z.re=rpart; z.im=ipart;}
but then it wouldn't be usable in an expression because C statement
blocks can't be parts of expressions.
Again, my personal attitude towards the "," operator is similar to
my opinion about ?:. It is necessary for #DEFINEs but best avoided in
normal code, the one exception being FOR loops, where it's used more
as a separator than to return a result. That's one reporter's opinion.
THE "COMPOUND ASSIGNMENTS" (+=, -=, ETC.);
ALSO, SOME MORE GENERAL COMMENTS ON EFFICIENCY AND READABILITY
Finally, C has one other set of interesting operators. These are
the "compound assignments", which perform an operation and do an
assignment at the same time. A possible example might be:
x[index(7)+3] += inc;
which is, of course, identical to
x[index(7)+3] = x[index(7)+3] + inc;
Similarly,
(*foo).frob |= 0x100;
means the same thing as:
(*foo).frob = (*foo).frob | 0x100;
(and happens to set bit 8 -- the ninth least-significant bit -- of
"(*foo).frob").
[Note: The above examples aren't actually EXACTLY the same because of
considerations pertaining to "double evaluation" of the expression
being assigned to; however, this isn't usually very relevant, and I
won't discuss it here.]
Now if I wanted to, I could stop here. Obviously "a x= b" means the
same as "a = a x b" (where "x" is pretty much any dyadic operator) --
now you know it and can make up your own minds about it.
But, what the hell -- I'm a naturally garrulous kind of guy. I
could run on for pages about these operators, and their
psychoscientific motivations! In fact, I think I might do just that,
because I think that there's something of deeper significance to them
than just a few saved keystrokes.
To put it simply, there are several "statements" that the presence
of these operators -- "+=", "-=", "++", "--", etc. -- makes. Whether
you take the "pro" side or the "con" on them will influence your
opinion on the utility of these operators:
* EFFICIENCY. Saying "X += Y" or "X++" will let the compiler
generate more efficient code than just "X = X+Y" or "X = X+1".
- PRO: The compiler "knows" that we're just incrementing a
variable (rather than doing an arbitrary add) and can generate
the more efficient instructions that most computers have for
this special case.
- CON: All -- or almost all -- modern compilers can easily deduce
this information even from a "X = X+Y" or "X = X+1". Even
SPL/3000, which is 15 years old, will generate an "INCREMENT BY
ONE" instruction if you say "X := X+1".
- MORE PRO (COUNTER-CON?): It's true that most compilers will
automatically generate fast code for increments, bit extracts,
etc. However, every compiler will have SOME flaw somewhere --
perhaps it won't recognize one particular case and will
generate inefficient code. Special operators that the compiler
ALWAYS translates efficiently can allow you to write efficient
code even if you're stuck with a silly compiler implementation.
* READABILITY. Saying "X += Y" is more readable than "X = X+Y".
- PRO: Consider one of the examples above:
x[index(7)+3] = x[index(7)+3] + inc;
Here, we're incrementing "x[index(7)+3]" -- but how does the
person reading the program know that? He has to look at the
fairly complex expressions on both sides of the assignment, and
make sure that they're identical! Similarly, when he's writing
the program, it's quite easy for him to make a mistake -- say
"x[index(3)+7]" instead of "x[index(7)+3]" on one side of the
assignment, and probably never see it because he "knows" that
it's just a simple increment. Saying
x[index(7)+3] += inc;
is actually MORE readable, since you don't have to duplicate
any code and thus introduce additional opportunity for error.
- CON: "x[index(7)+3] += inc". Can you read that? I can't read
that! The more special characters and operators a language has,
the harder it is to read. Everybody's USED to simple ":="
assignments, present in ALGOL, FORTRAN, PASCAL, SPL, ADA, etc.;
when we introduce a whole new bevy of operators, people are
likely to misunderstand them, or at least have to take extra
time and effort while reading the program.
* FLEXIBILITY. OK, so you don't like these operators -- don't use
them!
- PRO: Hey, this is a free country! Look at the entire rich
operator set and use only those that you find pleasant; at
least in C, you have the choice.
- CON: I might not be forced to WRITE programs with these
operators in them, but I may well be forced to READ them; 70%
of a program's lifetime is spent in maintenance, and I don't
want my programmers to write in a language that ENCOURAGES them
to write unreadably! A language should be restrictive as well
as flexible -- it should prevent wayward programmers from
writing unreadable constructs like:
x += (x++) + f(x,y) + (y++); /* can you understand this? */
- PRO again: Authoritarian fascist!
- CON again: Undisciplined hippie!
OK, break it up, boys. I think that the above issues are
particularly involved in evaluating C's rich (but perhaps
"undisciplined") operator set, and, to some extent, the differences
between PASCAL and C in general. I won't pretend to tell you which
attitude is correct -- I don't know myself. I just want to lay more of
the cards out on the table.
PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES -- SPL
One feature that SPL has is so-called "OPTION VARIABLE" procedures.
This is a procedure that looks like this:
PROCEDURE V (A, B, C);
VALUE A;
INTEGER A;
BYTE ARRAY B, C;
OPTION VARIABLE;
BEGIN
...
INTEGER VAR'MASK = Q-4;
...
IF VAR'MASK.(14:1)=1 THEN << was parameter B omitted? >>
...
END;
What does this mean? This means that when we say:
V (1);
or
V (,,BUFF);
or even simply
V;
the SPL compiler will NOT complain that we didn't specify the three
parameters that V expects. Rather, it will pass those parameters
you've specified, pass GARBAGE in place of the parameters you've
omitted, and will set the "Q-4" location in your stack to a bit mask
indicating exactly which parameters were specified and which were not.
As you see, we've declared the variable VAR'MASK to reside at this Q-4
location, and can now say
IF VAR'MASK.(x:1)=1 THEN
to check whether or not the parameter indicated by "x" was actually
specified. "x" has to be the bit number associated with the parameter,
counting from 15 (the last parameter) down. Thus, to check if C (the
last parameter) was specified, we'd say
IF VAR'MASK.(15:1)=1 THEN
To check for B (the second-to-last), we'd say
IF VAR'MASK.(14:1)=1 THEN
To check for A (the third-to-last), we'd say
IF VAR'MASK.(13:1)=1 THEN
Note the twin advantages of being able to omit parameters:
* It can make the procedure call a lot easier to write or read; the
FOPEN intrinsic has 13 parameters, all of them necessary for one
thing or another. Do you want to have to say:
MOVE DISC'DEV:="DISC ";
FNUM:=FOPEN (FILENAME, 1, %420, 128, DISC'DEV, DUMMY,
0, 2, 1, 1023D, 8, 1, 0);
or wouldn't you rather just type
FNUM:=FOPEN (FILENAME, 1, %420);
and let all the other parameters automatically default to the
right values? It's easier to write AND gives less opportunity for
error (did you notice that I accidentally specified blocking
factor 2 and 1 buffer instead of the default, which is the other
way around?).
* Furthermore, the very act of omitting or specifying a parameter
carries INFORMATION. For instance, FOPENing a file with DEV=LP
and the forms message parameter OMITTED is quite different than
passing any forms message. The very fact that the forms message
wasn't specified tells the file system something. Similarly,
omitting the blocking factor in an FOPEN makes the file system
calculate an "optimal" (actually it isn't) blocking factor for
the file.
Many examples -- FOPEN, FGETINFO, FCHECK, etc. -- can be given
where not having OPTION VARIABLE would make calling the procedure a
substantial burden.
While we talk about the advantages of OPTION VARIABLE procedures,
let's note also some of the problems with the way they're implemented
in SPL:
* If you declare a procedure to be OPTION VARIABLE, then SPL will
let its caller omit ANY parameter. Usually, some parameters are
optional, while others (e.g. the file number in an FGETINFO call)
are required, and you'd like the compiler to enforce this
requirement.
Otherwise, you'd either have to rely on the user (always a bad
idea), or check the presence of each of the required parameters
yourself at run-time (possible but somewhat cumbersome and
inefficient).
* As you saw, checking to see whether a parameter was actually
passed is not an easy job. Instead of saying
IF HAVE(FILENAME) THEN
you have to say
INTEGER VAR'MASK = Q-4;
...
IF VAR'MASK.(3:1)=1 THEN
knowing (as of course you do) that FILENAME is the 13th-to-last
procedure parameter and is thus indicated by VAR'MASK.(3:1).
* Often, a user's omission of a parameter simply means that some
default value should be assumed. Why not have the compiler take
care of this case for you instead of making you do it yourself?
For instance, if you were writing FOPEN, wouldn't you rather say:
INTEGER PROCEDURE FOPEN (FILE, FOPT, AOPT, RECSZ, DEV, ...);
VALUE FOPT, AOPT, RECSZ, ...;
BYTE ARRAY FILE (DEFAULT ""), DEV (DEFAULT "DISC ");
INTEGER FOPT (DEFAULT 0), AOPT (DEFAULT 0),
RECSZ (DEFAULT 128);
...
OPTION VARIABLE;
instead of
INTEGER PROCEDURE FOPEN (FILE, FOPT, AOPT, RECSZ, DEV, ...);
VALUE FOPT, AOPT, RECSZ, ...;
BYTE ARRAY FILE, DEV;
INTEGER FOPT, AOPT, RECSZ;
...
OPTION VARIABLE;
BEGIN
INTEGER VAR'MASK=Q-4;
...
IF VAR'MASK.(3:1)=0 THEN @FILE:=@DEFAULT'FILE;
IF VAR'MASK.(4:1)=0 THEN FOPT:=0;
IF VAR'MASK.(5:1)=0 THEN AOPT:=0;
IF VAR'MASK.(6:1)=0 THEN RECSZ:=128;
IF VAR'MASK.(7:1)=0 THEN @DEV:=@DISC'DEVICE;
...
END;
Not only is that easier on the author of FOPEN, but it could
also be more efficient at run-time -- instead of having a whole
bunch of run-time bit extracts and checks, the code generated by
a call such as:
FNUM:=FOPEN (TMPFILE);
might actually have all the default values built in to it (just
as if the user had explicitly specified them), saving a
non-trivial amount of time.
* Finally, another interesting concern. Say that I want to write a
procedure that's "plug-compatible" with the FOPEN intrinsic. In
my MPEX/3000, for instance, I have a SUPER'FOPEN procedure that
checks a global "debugging" flag, prints all of its parameters if
the flag is true, and then calls FOPEN. SUPER'FOPEN also calls
the ZSIZE intrinsic to make sure that FOPEN has as much stack
space as possible to work with; it might also detect and
specially handle certain error conditions, and so on.
In other words, what I want to have is an OPTION VARIABLE
procedure that does some things and then passes all of its
parameters to another OPTION VARIABLE procedure:
INTEGER PROCEDURE SUPER'FOPEN (FILE, FOPT, AOPT, RECSZ, ...);
...
OPTION VARIABLE;
BEGIN
...
SUPER'FOPEN := FOPEN (FILE, FOPT, AOPT, RECSZ, ...);
...
END;
The trouble here is that in my FOPEN call I want to OMIT ALL THE
PARAMETERS THAT WERE OMITTED IN THE SUPER'FOPEN CALL and SPECIFY
ONLY THOSE PARAMETERS THAT WERE SPECIFIED IN THE SUPER'FOPEN
CALL. In other words, in this case I DON'T KNOW WHICH PARAMETERS
I WANT TO OMIT UNTIL RUN-TIME. If I just say:
SUPER'FOPEN := FOPEN (FILE, FOPT, AOPT, RECSZ, ...);
passing all thirteen parameters, FOPEN will think that they're
all the ones I want, whereas many of them are garbage. I want to
say
INTEGER VAR'MASK = Q-4;
...
SUPER'FOPEN := VARCALL FOPEN, VAR'MASK (FILE, FOPT,
AOPT, RECSZ, ...);
somehow telling the compiler: "this isn't an ordinary call, where
you should figure out which parameters are specified and which
aren't; rather, pass to FOPEN the very same VAR'MASK parameter
that I myself was given".
PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES --
STANDARD PASCAL, PASCAL/3000, ISO LEVEL 1 STANDARD PASCAL
Standard PASCAL, PASCAL/3000, and ISO Level 1 Standard PASCAL do
not allow you to pass variable numbers of parameters to procedures.
Enough said?
Well, maybe not. As I've mentioned before, the mere fact that
language X has a feature that language Y does not doesn't mean that
language X is better than language Y. This isn't a basketball game
where you get 2 points for each feature, and 3 for each one that's
really far out. Maybe PASCAL has a point -- do you really need
procedures with variable numbers of parameters?
Well, the first thing you notice about, say, the CREATE, FOPEN, and
FGETINFO intrinsics -- conspicuous users of the OPTION VARIABLE
features -- is that they aren't very extensible.
Sure, FGETINFO has 20 parameters, and you can specify any and omit
any (except the file number); but what if a new file parameter is
introduced? Since there are all these thousands of programs that use
the old FGETINFO, we can't just add a 21st parameter, since that would
make them all incompatible.
This, in fact, is why the FFILEINFO intrinsic was created --
FFILEINFO takes a file number and five pairs of "item numbers" and
"item buffers". Each item number is a code indicating what piece of
information ought to be returned about a file. Thus, up to five
different pieces of information can be returned by a single FFILEINFO
call. If you need more than five (which is unlikely), you can call
FFILEINFO twice or however many times is necessary. A typical call can
thus look like:
FFILEINFO (FNUM, 8 << item number for "filecode" >>, CODE,
18 << item number for "creator id" >>, CREATOR);
instead of the FGETINFO call, which would be:
FGETINFO (FNUM,,,,,,,,CODE,,,,,,,,,,CREATOR);
Note another advantage of the FFILEINFO approach -- you no longer have
to "count commas" to make sure that your parameter is in the right
place; the item number (which you've presumably declared as a symbolic
constant) indicates what the item you want to get is.
So, instead of the 20-parameter OPTION VARIABLE FGETINFO intrinsic,
we have FFILEINFO. But FFILEINFO is still OPTION VARIABLE! Remember,
FFILEINFO takes up to five item number/item value pairs; in this case
we entered only two. Of course, we could have said:
FFILEINFO (FNUM, 8 << item number for "filecode" >>, CODE,
18 << item number for "creator id" >>, CREATOR,
0, DUMMY, 0, DUMMY, 0, DUMMY);
but who'd want to? Similarly, FFILEINFO might have been defined to
return only one piece of data at a time (and thus always have exactly
three parameters), but again that's not very good. Every FFILEINFO
call has some fixed overhead to it (for instance, finding the File
Control Block from the file number FNUM); why repeat it more often
than you have to?
Another example arises in the CREATEPROCESS intrinsic. The
CREATEPROCESS intrinsic was introduced when some new parameters --
;STDLIST, ;STDIN, and ;INFO -- had to be added to the CREATE
intrinsic.
The CREATE intrinsic, although OPTION VARIABLE, was initially
defined to have 10 parameters. This means that any compiled program
that uses the CREATE intrinsic expects it to have 10 parameters; if
you added three parameters to the CREATE intrinsic in the system SL,
all the old programs would stop working.
An additional problem with CREATEPROCESS is that it wasn't just a
"get me some information" intrinsic like FFILEINFO -- it actually
starts a new process. We can't just say "pass five process-creation
parameters at a time; if you need to pass more, just call it twice"
(like we did for FFILEINFO). All the parameters need to be known to
the CREATEPROCESS intrinsic at once.
The CREATEPROCESS intrinsic, although OPTION VARIABLE, doesn't
really need to be. You can just view it as a five-parameter procedure:
CREATEPROCESS (error, pin, program, itemnumbers, items);
The itemnumbers array contains the item numbers of all the
process-creation parameters; the items array contains the parameters
themselves (either the values or the addresses). Thus, to do the
equivalent of an old
CREATE (PROGRAM, ENTRY'NAME, PIN, PARM, 1, , , MAXDATA);
(which would create a process with entry ENTRY'NAME, ;PARM= PARM,
;MAXDATA= MAXDATA, and "load flags" 1), we'd say
INTEGER ARRAY ITEM'NUMS(0:4);
INTEGER ARRAY ITEMS(0:4);
...
<< Item 1 = entry name, 2 = parm, 3 = load flags, 6 = maxdata; >>
<< 0 terminates the list. >>
<< We probably want to have EQUATEs for these "magic numbers". >>
MOVE ITEM'NUMS:=(1, 2, 3, 6, 0);
ITEMS(0):=@ENTRY'NAME; << the address of the entry name >>
ITEMS(1):=PARM; << ;PARM= >>
ITEMS(2):=1; << load flags >>
ITEMS(3):=MAXDATA; << ;MAXDATA= >>
CREATEPROCESS (ERR, PIN, PROGRAM, ITEM'NUMS, ITEMS);
As you see, the non-OPTION VARIABLE approach may be more extensible,
but it certainly isn't easier to write or read.
Finally, let me point out that OPTION VARIABLE procedures, though
not easily extensible when you have COMPILED CODE that calls them, are
quite easy to extend when you have SOURCE CODE.
If you have your own OPTION VARIABLE procedure MYPROC, then adding
a new parameter to it is a piece of cake; in fact, it's easier than
adding a new parameter to a non-OPTION VARIABLE procedure (for which
you'd have to change all the calls to pass an extra dummy parameter).
All you need to do to extend an OPTION VARIABLE procedure is to
recompile both it and all its callers, so that the newly-compiled code
will appropriately reflect the new parameters of the called procedure.
PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES --
KERNIGHAN & RITCHIE C
One thing you may have noticed about C is that two of the most
important functions in C -- "printf", which outputs data in a
formatted manner, and "scanf", which inputs data -- have variable
numbers of parameters. An example of a call to "printf" might be:
printf ("Max = %d, min = %d, avg = %d\n", max, min, average);
This call happens to take 4 parameters -- the format string (in which
the "%d"s indicate where the rest of the parameters are to be put in)
and three integers (max, min, and average). Other calls might take
only one parameter (a constant formatted string, e.g. 'printf ("Hi
there!\n")') or two or twenty.
Now, PASCAL's input/output "procedures" (READ, READLN, WRITE, and
WRITELN) also take a variable number of parameters. Unfortunately,
PASCAL isn't being quite honest when it just calls them "procedures";
they can get away with things that ordinary procedures can't, such as
taking parameters of varying types, taking a variable number of
parameters, and even having special parameters prefixed with ":"s
(e.g. "WRITELN(X:10, Y:7:4)").
C, however, is serious when it calls "printf" and "scanf"
procedures. Their source is kept somewhere in some C library source
files; if you don't like them, you can rewrite them yourself, or write
your own procedures that take variable numbers of parameters.
Let's say that we want to do just that. Let's say that on our way
to work, we fall down and knock ourselves on the head. When we wake
up, we find that we've inexplicably fallen in love with FORTRAN and
want to make our "printf" format strings look exactly like FORTRAN
FORMAT statements. (For a slightly more plausible example, say that we
want to add some new directives, such as "%m" for outputting data in
monetary format, with ","s between each thousand -- standard C
"printf" doesn't allow this.)
Well, what we really want to do is write a "writef" procedure:
writef (fmtstring, ???)
char fmtstring[];
???;
{
???
}
Now the good news is that -- unlike PASCAL -- C won't get upset when
we call this procedure as:
writef ("I5,X,F7.2", inum, fnum);
on one line, and as
writef ("I4,X,S,X,I3", i1, s, i2);
on the next; C never checks the number or types of parameters anyway.
(Note that C allows us to omit parameters at the END of the parameter
list; unlike SPL, it doesn't let us omit them from the MIDDLE of the
parameter list.) The question is -- how do we write the "writef"
procedure? The caller might be able to specify a variable number of
parameters, but how will "writef" itself be able to access these
parameters?
This is where the trouble with having "OPTION VARIABLE"-type C
routines comes in. There's no universal, system-independent way for
"writef" and any such procedure to find out how many parameters were
actually passed, or access those parameters that were passed.
Different compilers have different conventions for this sort of
thing. Many compilers assure you that if the user passes, say, 3
parameters to a procedure that expects 10 parameters, then the first 3
procedure parameters will have the right values -- it's just that the
remaining 7 will be set to garbage. In this case, we could write
"writef" as:
writef (fmtstring, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10)
char fmtstring[];
int p1, p2, p3, p4, p5, p6, p7, p8, p9, p10;
Then if we call "writef" using:
writef ("I5,X,F7.2", inum, fnum);
the procedure can look at the format string -- which it knows will be
passed as "fmtstring" -- determine the number of parameters that the
format string expects (in this case, 2, one for the I5 and one for the
F7.2), and then look only at "p1" and "p2", not at "p3" through "p10",
which are known to be garbage.
Some other compiler might always assure you that the number of
parameters passed to a procedure would be kept in some register, which
could then be accessed using an assembly routine. On the other hand,
it might say that if 10 parameters were expected but only 3 were
passed, the actually passed parameters would be accessible as P8, P9,
and P10 instead of P1, P2, and P3. Then, WRITEF would have to call the
assembly routine to determine the number of passed parameters and
would then have to realize that since only 3 parameters were passed,
their data is stored in P8, P9, and P10.
As you see, there are two components here:
* Knowing the number of parameters passed (here determined by
looking at FMTSTRING).
* Being able to determine the value of each parameter that was
passed (here assured by knowing that any parameters that are
passed will become the first, second, etc. parameters of the
procedure).
Somehow -- by some compiler guarantee, or by a calling convention
(e.g. the number of parameters is indicated in FMTSTRING, or the
parameter list is terminated by -1), or by some assembly routine -- we
need to be able to do both of the above things.
Finally, let me point out one other factor. When we declare a
procedure as
writef (fmtstring, p1, p2, p3, p4, p5, p6, p7, p8, p9, p10)
we really don't want to access the last 10 parameters as P1 through
P10; we want to be able to view them all as elements of one big array,
so we could say something like:
for (i = 0; i < 10; i = i + 1)
process_parm (p[i]);
instead of having to say
process_parm (p1);
process_parm (p2);
process_parm (p3);
process_parm (p4);
process_parm (p5);
process_parm (p6);
process_parm (p7);
process_parm (p8);
process_parm (p9);
process_parm (p10);
The same, incidentally, arises with SPL -- we'd like to be able to
access SPL OPTION VARIABLE parameters as elements of one big
"parameters array", too. In SPL, it turns out, we can do that by
saying:
PROCEDURE FOO (P0, P1, P2, P3, P4, P5, P6, P7, P8, P9);
VALUE P0, P1, P2, P3, P4, P5, P6, P7, P8, P9;
INTEGER P0, P1, P2, P3, P4, P5, P6, P7, P8, P9;
OPTION VARIABLE;
BEGIN
INTEGER ARRAY PARMS(*)=P0;
...
END;
The PARMS array here is defined to start at the location occupied by
the by-value parameter P0; it so happens that the way parameters are
allocated on the HP3000, PARMS(3) would be equal to P3, and PARMS(7)
would be equal to P7. Similarly, in C you can say:
foo (p0, p1, p2, p3, p4, p5, p6, p7, p8, p9)
int p0, p1, p2, p3, p4, p5, p6, p7, p8, p9;
{
int *parms;
parms = &p0;
...
x = parms[i]; /* meaning parameter #I */
...
}
and this will work on those C compilers which allocate the parameters
the appropriate way on the stack. As you see, the conclusion here --
just like in the general question of writing OPTION VARIABLE-like
parameters -- is:
* It's probably doable on any particular C implementation, but it's
certainly not portable.
Thus, to summarize, C's support for procedures with variable numbers
of parameters is:
* Unlike PASCAL, C syntax allows you to specify a different number
of parameters in a call than the procedure actually has: you can
call a 10-parameter procedure using "P (1, 2, 3)".
* Unlike SPL, you can't omit any parameters in the middle of a call
-- "P (1,, 3,,, 6)" is illegal.
* Although the CALL is legal and portable, there's no way to
portably write a procedure that EXPECTS a variable number of
parameters.
* On the other hand, on most C compilers, there will be SOME way of
writing an OPTION VARIABLE-type procedure, although as I said
it's likely to be rather different from compiler to compiler.
* Finally -- something that I haven't mentioned yet but that is of
much relevance -- parameters of different types may occupy a
different amount of space on the call stack. If you pass a "long
float" to a procedure that's expecting "int" parameters, the
"long float" will end up occupying two parameters. This means
that the procedure must know when its parameters are "long
float"s (like "writef" can know by looking at the FMTSTRING
parameter), and kludge accordingly.
PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES --
DRAFT ANSI STANDARD C
Draft ANSI Standard C has a provision for passing variable numbers
of parameters. Like many good things, it's at the same time useful and
confusing. Let's have a look at it.
Calling an OPTION VARIABLE-type procedure in Draft Standard C is
quite similar to the way you'd do it in K&R C:
writef ("I5,X,F7.2", inum, fnum);
The one difference is that the compiler might (or might not) DEMAND
that you establish a function prototype (see "DATA STRUCTURES -- TYPE
CHECKING") to declare that this function takes a variable number of
parameters. The prototype for WRITEF would probably be:
extern int writef (char *, ...);
The "char *" says that there is one REQUIRED parameter, a character
array; the "..." -- literally, three "."s, one after the other --
means that there is a variable number of parameters after this.
Defining a procedure that can take a variable number of parameters
is trickier. Here's an example:
writef (char *fmtstring, ...)
{
va_list arg_ptr;
va_start (arg_ptr, fmtstring);
...
while (!done)
{
...
if (current_fmtstring_descriptor_is_I)
output_integer (va_arg (arg_ptr, int));
else if (current_fmtstring_descriptor_is_F)
output_float (va_arg (arg_ptr, float));
else if (current_fmtstring_descriptor_is_S)
output_string (va_arg (arg_ptr, char *));
...
}
...
va_end (arg_ptr);
}
Consider the components of this declaration one at a time:
* The "..." in the header indicates that besides the one required
parameter, this procedure takes an unknown number of optional
parameters.
* The "va_list arg_ptr" declares a variable called "arg_ptr", of
type "va_list" (which is defined in a special #INCLUDE file that
comes with the C compiler).
* The "va_start (arg_ptr, fmtstring)" call initializes "arg_ptr" to
point to the first variable parameter -- the one immediately
after "fmtstring". "va_start" must be passed the last required
parameter (in this case, "fmtstring"); among other things, this
means that every procedure must take AT LEAST ONE fixed parameter
-- you can't have all the parameters be optional.
* The procedure then (presumably) goes through FMTSTRING and finds
out what the types of the parameters are expected to be. As it
determines that the current format descriptor is, say, "I", or
"F", or "S", it "picks up" the next parameter. It does this by
saying
va_arg (arg_ptr, <type>)
The "arg_ptr" is the same variable that was declared using
"va_list arg_ptr"; the <type> indicates which type of object we
want to get (in our case, this may be an "int", a "float", or a
"char *", depending on which format descriptor we're on).
* Finally, at the end, we call
va_end (arg_ptr);
to do whatever stack cleanup needs to be done.
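To pull the pieces together, here's a compilable toy version of WRITEF. Instead of the real format descriptors it takes one letter per parameter -- 'I', 'F', 'S' -- and, to keep it testable, appends the formatted values to a caller-supplied buffer instead of printing them. The name "miniwritef" and this simplified format language are mine, not HP's.

```c
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* miniwritef: a toy version of the WRITEF discussed above.
   fmtstring contains one letter per parameter: 'I' = int,
   'F' = float (which arrives as a double), 'S' = string.  The
   formatted values are appended to outbuf, separated by spaces. */
void miniwritef(char *outbuf, char *fmtstring, ...)
{
    va_list arg_ptr;
    char *d;
    char piece[64];

    outbuf[0] = '\0';
    va_start(arg_ptr, fmtstring);
    for (d = fmtstring; *d != '\0'; d++) {
        if (*d == 'I')
            sprintf(piece, "%d", va_arg(arg_ptr, int));
        else if (*d == 'F')
            sprintf(piece, "%.2f", va_arg(arg_ptr, double));
        else if (*d == 'S')
            sprintf(piece, "%s", va_arg(arg_ptr, char *));
        else
            continue;           /* ignore unknown descriptors */
        if (outbuf[0] != '\0')
            strcat(outbuf, " ");
        strcat(outbuf, piece);
    }
    va_end(arg_ptr);
}
```

Note that the format string is doing double duty: it tells the routine both HOW MANY parameters were passed and WHAT TYPE each one is -- exactly the two pieces of information we said any such convention must supply.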
This method is guaranteed (heh, heh) to be portable across all
implementations of the Draft ANSI Standard. (Again, note that since
the Standard is only Draft, many existing implementations might have
no such facility or a slightly different one.) Note its advantages and
disadvantages:
* You can now portably access the optional parameters, and even
easily access them as elements of an array by making a WHILE loop
that calls VA_ARG and sticks the results into a local array.
* You can specify that some of the procedure parameters are
required, thus letting the compiler check that every call to the
procedure contains at least those parameters.
* On the other hand, there's still no way of figuring out exactly
how many parameters were passed to you -- you have to rely on the
user's telling you this, either as an explicit parameter or
implicitly (such as using a format string from which the number
of parameters can be deduced).
* You can't have a procedure where all of the parameters are
optional.
* You still can't have a procedure where a parameter in the MIDDLE
of a parameter list can be omitted (e.g. "P (1,,3,,,6)").
* Accessing parameters that are simply optional is somewhat harder
than in SPL, since you have to get them using VA_ARG rather than
referring to them by name, to wit:
create (char *progname, ...)
{
va_list ap;
char *entryname;
int *pin, param, flags, stack, dl, maxdata, pri, rank;
va_start (ap, progname);
entryname = va_arg (ap, char *);
pin = va_arg (ap, int *);
param = va_arg (ap, int);
flags = va_arg (ap, int);
stack = va_arg (ap, int);
dl = va_arg (ap, int);
maxdata = va_arg (ap, int);
pri = va_arg (ap, int);
rank = va_arg (ap, int);
...
}
As you see, you have to specially extract each optional
parameter, rather than just being able to access it directly like
you can in SPL.
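One mitigating trick: as suggested earlier, the procedure can loop over VA_ARG and stick the results into a local array, after which the optional parameters can at least be accessed "by number". Here's a sketch -- with an explicit count parameter, since C gives the callee no way to discover how many parameters were actually passed:

```c
#include <stdarg.h>

/* sum_by_number: copy up to `count` optional int parameters into
   a local array, so they can be indexed "by number" the way the
   text describes, then (as a demonstration) add them up. */
int sum_by_number(int count, ...)
{
    va_list ap;
    int parms[10];
    int i, total = 0;

    va_start(ap, count);
    for (i = 0; i < count && i < 10; i++)
        parms[i] = va_arg(ap, int);   /* parms[i] is "parameter #i" */
    va_end(ap);

    for (i = 0; i < count && i < 10; i++)
        total += parms[i];
    return total;
}
```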
PASSING VARIABLE NUMBERS OF PARAMETERS TO PROCEDURES -- PASCAL/XL
PASCAL/XL's support for procedures with optional parameters seems
to be really nice.
One mechanism that PASCAL/XL provides is "OPTION DEFAULT_PARMS".
PROCEDURE P (A, B, C, D, E, F: INTEGER)
OPTION DEFAULT_PARMS (A:=11, C:=22, E:=55, F:=66);
BEGIN
...
END;
This tells PASCAL/XL several things:
* In any call to P, the first (A), third (C), fifth (E), or sixth
(F) parameters may (or may not) be omitted. Thus, we can say:
P (, 22, ,44); { omitting A, C, E, and F }
P (11, 22, 33, 44); { omitting only E and F }
P (, 22, 33, 44, , 66); { omitting A and E }
or any such combination. Only the parameters without a
DEFAULT_PARMS declaration -- B and D -- must be specified.
* If any of A, C, E, and F are omitted, then when P tries to access
them, it will get their default values instead. To the procedure,
the parameter will look exactly as if it was specified as the
default value.
* HOWEVER, the procedure can (if it wants to) determine if a
parameter was ACTUALLY passed by saying something like
IF HAVEOPTVARPARM(C) THEN
{ C was actually passed }
ELSE
{ we're using C's default value };
"HAVEOPTVARPARM(X)" simply returns TRUE if parameter X was
actually passed to the procedure, and FALSE if parameter X was
not passed and X's value was simply defaulted.
Thus, we get the best of both worlds:
* You can specify a default value, so if the procedure wants to, it
can just see the parameter value as the default.
* If you need to, you can still find out if the parameter was
REALLY specified.
* Since the compiler knows which parameters are optional and which
are required, it can make sure that the required ones are really
specified (unlike SPL, in which any parameters of an OPTION
VARIABLE procedure may be omitted without an error).
Now, interestingly enough, PASCAL/XL also has a different mechanism
to achieve a similar goal. You can also say
PROCEDURE P (A, B, C, D, E)
OPTION EXTENSIBLE 3;
What this means is that all parameters after the first 3 -- in this
case, D and E -- are optional. You can actually combine this with
DEFAULT_PARMS to set default values for these "extension" parameters,
or even set default values for the "non-extension" parameters, thus
making them optional, too.
Practically speaking, saying
PROCEDURE P (A, B, C, D, E: INTEGER)
OPTION DEFAULT_PARMS (D:=NIL, E:=NIL);
will achieve pretty much the same goal (making both D and E
extensible). The advantage of EXTENSIBLE parameters is that their
implementation allows you to add new parameters to an OPTION
EXTENSIBLE procedure WITHOUT HAVING TO RE-COMPILE ANY OF ITS CALLERS!
Thus, if HP had written the CREATE intrinsic in PASCAL/XL, it could
have said:
PROCEDURE CREATE (VAR PROGRAM: STRING;
VAR ENTRY: STRING;
VAR PIN: INTEGER;
PARM, FLAGS, STACK, DL, MAXDATA,
PRI, RANK: INTEGER)
OPTION EXTENSIBLE 3
DEFAULT_PARMS (ENTRY:="");
This would have made PROGRAM and PIN required and all the other
parameters optional -- ENTRY because of the DEFAULT_PARMS and the rest
because of the OPTION EXTENSIBLE. Then, if HP wanted to add STDIN,
STDLIST, and INFO parameters, it could have just changed the
definition of CREATE to:
PROCEDURE CREATE (VAR PROGRAM: STRING;
VAR ENTRY: STRING;
VAR PIN: INTEGER;
PARM, FLAGS, STACK, DL, MAXDATA,
PRI, RANK: INTEGER;
VAR STDIN, STDLIST, INFO: STRING)
OPTION EXTENSIBLE 3
DEFAULT_PARMS (ENTRY:="");
Then, ALL OF THE PROGRAMS THAT WERE COMPILED REFERRING TO THE OLD
CREATE WOULD STILL WORK! You can add new parameters to an OPTION
EXTENSIBLE procedure without causing any incompatibility with
previously compiled callers.
SUMMARY
Thus, to summarize the differences in the ways the various
compilers handle optional parameters and procedures with variable
numbers of parameters: ["STD PAS" includes Standard PASCAL, ISO Level
1 Standard, and PASCAL/3000]
                                  SPL  STD PAS  PAS/XL  K&R C  STD C
CAN YOU HAVE A PROCEDURE          YES  NO       YES     YES    YES
WITH OPTIONAL PARAMETERS?
IS SUCH A PROCEDURE               N/A  N/A      N/A     NO     YES
DEFINITION PORTABLE?
CAN YOU OMIT PARAMETERS IN        YES  N/A      YES     NO     NO
THE MIDDLE OF A CALL?
CAN THE FIRST PARAMETER OF A      YES  N/A      YES     YES    NO
FUNCTION BE OPTIONAL?
CAN YOU DETECT IF A PARAMETER     YES  N/A      YES     NO     NO
WAS REALLY PASSED OR NOT?
CAN YOU SPECIFY SOME              NO   N/A      YES     NO     YES
PARAMETERS AS REQUIRED?
CAN YOU SPECIFY DEFAULT VALUES    NO   N/A      YES     NO     NO
FOR OPTIONAL PARAMETERS?
CAN YOU ADD NEW PARAMETERS        NO   N/A      YES     NO     NO
WITHOUT RE-COMPILING ALL CALLERS?
CAN YOU ACCESS PARAMETERS "BY     YES  N/A      NO      YES    YES
NUMBER" AS WELL AS "BY NAME"?
[This refers to the example we showed
where we wanted to reference, say, the
last 10 parameters as elements of an
array rather than as P1, P2, P3, ..., P10]
PROCEDURE AND FUNCTION VARIABLES
Say that you write a B-Tree handling package. B-Trees, as you know,
are the kind of data structure that KSAM is built on; they allow you to
easily find records either by key or in sequential order. Thus, if you
store your data in a B-Tree, you can, for instance, find a record
whose key starts with "JON", even though you don't know the exact key
value.
Now, you're a sophisticated programmer, and you know how to deal
with this sort of thing. You define a record structure type called,
say, BTREE_HEADER_TYPE, that contains the various pointers that your
B-Tree handling procedures need, and then write the following
procedures:
PROCEDURE BTREE_CREATE (VAR B: BTREE_HEADER_TYPE);
...
PROCEDURE BTREE_ADD (VAR B: BTREE_HEADER_TYPE;
K: KEY_TYPE; REC: RECORD_TYPE);
...
FUNCTION BTREE_FIND (VAR B: BTREE_HEADER_TYPE; K: KEY_TYPE):
RECORD_TYPE;
...
You get the drift -- you have all these routines, to which you pass
the appropriate data, and between them, they process the data
structure. No problem.
Now, we said that the B-tree allows you to retrieve records in
"sorted order". Sorted how? If the key is a string, you'd want it
sorted alphabetically; however, what if the key is an integer? Or a
floating point number? Comparing two strings is a different operation
from comparing integers or floating point numbers.
Now, you can write a different set of routines for B-Trees with
string keys, B-Trees with integer keys, B-Trees with packed decimal
keys, etc. You can, but of course you wouldn't want to duplicate the
code. Assume for a moment that you can get around PASCAL's type
checking so that you can pass an arbitrary-type key to the BTREE_ADD
and BTREE_FIND routines; how do you make sure that the procedures do
the appropriate comparisons for the various types?
Well, one possibility is this:
* Have a field in the BTREE_HEADER_TYPE data structure called
"COMPARISON_TYPE".
* Have BTREE_CREATE take a parameter indicating the comparison type
(string, integer, float, packed, etc.) needed; it can then put
this type into the COMPARISON_TYPE field.
* Each BTREE_ADD and BTREE_FIND call will interrogate this
COMPARISON_TYPE field, and do the appropriate comparison; for
instance,
PROCEDURE BTREE_ADD (VAR B: BTREE_HEADER_TYPE;
K: KEY_TYPE; REC: RECORD_TYPE);
BEGIN
...
IF B.COMPARISON_TYPE=STRING_COMPARE THEN
COMP_RESULT:=STRCOMPARE (K, CURRENT_KEY)
ELSE IF B.COMPARISON_TYPE=INT_COMPARE THEN
COMP_RESULT:=INTCOMPARE (K, CURRENT_KEY)
ELSE IF B.COMPARISON_TYPE=FLOAT_COMPARE THEN
COMP_RESULT:=FLOATCOMPARE (K, CURRENT_KEY)
ELSE IF B.COMPARISON_TYPE=PACKED_COMPARE THEN
COMP_RESULT:=PACKEDCOMPARE (K, CURRENT_KEY);
...
END;
Depending on the COMPARISON_TYPE field value, BTREE_ADD can do the
right thing.
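In C, the same big-IF approach might be sketched like this -- the types and names here are illustrative, not from any real B-Tree package:

```c
#include <string.h>

/* A C sketch of the COMPARISON_TYPE approach above: the header
   records which comparison to use, and compare_keys does the
   big IF.  All names are illustrative. */
enum comp_type { STRING_COMPARE, INT_COMPARE };

struct btree_header {
    enum comp_type comparison_type;
    /* ... pointers, counts, etc. ... */
};

/* returns <0, 0, or >0, like strcmp */
int compare_keys(struct btree_header *b, void *k1, void *k2)
{
    if (b->comparison_type == STRING_COMPARE)
        return strcmp((char *) k1, (char *) k2);
    else  /* INT_COMPARE */
        return *(int *) k1 - *(int *) k2;
}
```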
The trouble with this approach, though, is quite obvious. What if we
(like KSAM) support more than just these four types? What if we need
to add zoned decimal support -- will we have to change BTREE_ADD (and
BTREE_FIND and whatever other procedures do this)? What if we need to
add an EBCDIC collating sequence? We want to allow the B-Tree
package's USER to define his own comparison types without having to
change the source code of the package itself.
In other words, we don't just want to let the user pass us a value,
like a record structure or an integer. We want to let a user pass a
PIECE OF CODE, in this case the code that would do the comparison.
Then, instead of having a big IF (or CASE) statement, our BTREE_ADD
and BTREE_FIND procedures can simply call the code that was passed to
them to do the comparison.
PASCAL, of course, has a facility for doing this (as do SPL and C).
PASCAL lets you declare a parameter to be of type PROCEDURE (or
FUNCTION), and then call that parameter. A good example of this might
be the following procedure:
FUNCTION NUMERICAL_INTEGRATION (FUNCTION F(PARM:REAL): REAL;
START, FINISH, INCREMENT: REAL):
REAL;
VAR X, TOTAL: REAL;
BEGIN
X:=START;
TOTAL:=0.0;
WHILE X<FINISH DO
BEGIN
TOTAL:=TOTAL+F(X)*INCREMENT;
X:=X+INCREMENT;
END;
NUMERICAL_INTEGRATION:=TOTAL;
END;
(And you thought you'd never have to see this sort of thing again once
you finished college!) This procedure takes a function as a parameter
(a function that itself takes one parameter), and then calls that
function several times. The NUMERICAL_INTEGRATION procedure itself
might be called thus:
X:=NUMERICAL_INTEGRATION (SQRT, 0.0, 10.0, 0.01);
This, as you see, passes it the procedure "SQRT" as a parameter. The
same sort of thing, incidentally, can easily be done in SPL:
REAL PROCEDURE NUMERICAL'INTEGRATION (F, START, FINISH, INC);
VALUE START, FINISH, INC;
REAL PROCEDURE F;
REAL START, FINISH, INC;
...
or in C:
float numerical_integration (f, start, finish, inc)
float (*f)();
float start, finish, inc;
...
And, of course, this is the very sort of thing that we'd do to
implement our BTREE_ADD and BTREE_FIND:
PROCEDURE BTREE_ADD (VAR B: BTREE_HEADER_TYPE;
K: KEY_TYPE; REC: RECORD_TYPE;
FUNCTION COMP_ROUTINE (K1, K2: KEY_TYPE):
BOOLEAN);
...
FUNCTION BTREE_FIND (VAR B: BTREE_HEADER_TYPE; K: KEY_TYPE;
FUNCTION COMP_ROUTINE (K1, K2: KEY_TYPE):
BOOLEAN):
RECORD_TYPE;
...
These declarations, as you see, indicate that both of these procedures
expect a parameter that is itself a function (which takes two keys and
returns a boolean). A typical call might thus be:
BTREE_ADD (BHEADER, K1, R1, STRCOMPARE);
or
R:=BTREE_FIND (BHEADER, K, MY_OWN_EBCDIC_COMPARE_ROUTINE);
or whatever else the user wants to do.
Now, this paper purports to be a comparison between PASCAL, C, and
SPL, but so far we've only discussed a feature that exists -- and is
virtually identical -- in all three languages. What's the difference?
Well, note that we demanded that the user pass the comparison
routine in every call to one of the BTREE_ADD or BTREE_FIND
procedures. In turn, if a procedure called by BTREE_ADD or BTREE_FIND
needs to call the comparison routine, then BTREE_ADD or BTREE_FIND
must pass that procedure the comparison routine, too. This is
cumbersome and also error-prone -- what if the user passes one
procedure for BTREE_ADD and another for BTREE_FIND?
The logical solution seems to be to pass the procedure once to
BTREE_CREATE, i.e.
BTREE_CREATE (BHEADER, STRCOMPARE);
and then have the address of the procedure stored somewhere in the
BHEADER record structure. Then, when BTREE_ADD needed to do a
comparison, it would say something like:
COMP_RESULT:=BHEADER.COMPARE_ROUTINE (K, CURRENT_KEY);
This makes more sense. After all, even KSAM only requires you to
specify the key comparison sequence at file creation time, not on
every intrinsic call.
Examples of this sort of thing are plentiful:
* Trap routines (just like in MPE). MPE's XCONTRAP intrinsic, for
instance, expects you to pass it a procedure (actually, a
procedure's address); it saves this address in a special
location, and then when control-Y is hit, calls this procedure.
Similarly, let's say you're writing a package for packed decimal
arithmetic. You want to have a procedure called
PROCEDURE SET_PACKED_TRAP (PROCEDURE TRAP_ROUTINE);
which will set some global variable to the value passed as
TRAP_ROUTINE. Then, whenever your procedure detects some kind of
packed decimal arithmetic error, it'll call whatever routine was
set up as the trap routine. That way, the user will be able to do
what he pleases; he can set the trap routine to abort the
program, to print an error message, to return a default result --
whatever.
* Say that you have various procedures that do certain things --
build temporary files, set up locks, buffer I/O, etc. -- that
require special processing when the program is terminated. A
classic example of this is buffering your output to a file in
order to do fewer physical I/Os. If the program somehow dies, you
want to be able to flush all the buffered data to the file rather
than having it get lost.
What you'd like to do is have a procedure called ONEXIT, which
would take a single procedure parameter. Then, if your process
dies, the system would know enough to call this procedure,
letting it do whatever cleanup you want. For instance, you might
say
ONEXIT (FLUSH_BUFFERS);
to tell the system to call FLUSH_BUFFERS if the program
terminates for any reason; you might also want to say
ONEXIT (RELEASE_SIRS);
so that the system will release any SIRs (System Internal
Resources) that you may have acquired. In fact, you want ONEXIT
to keep what is essentially an array of procedures:
VAR ONEXIT_PROCS: ARRAY [1..100] OF PROCEDURE;
Then, the system terminate routine will say:
FOR I:=1 TO NUM_ONEXIT_PROCS DO ONEXIT_PROCS[I](); { call
the Ith procedure }
* The file system, for instance, has to keep track of a number of
different file types -- standard files, message files, KSAM
files, remote files, and so on. Although they all look like files
to the user, the various routines that read them, write them,
close them, etc. are quite different. A possible design for, say,
the file control block might be:
RECORD
FILENAME: PACKED ARRAY [1..36] OF CHAR;
FILE_READ_ROUTINE: PROCEDURE (...);
FILE_WRITE_ROUTINE: PROCEDURE (...);
FILE_CLOSE_ROUTINE: PROCEDURE (...);
...
END;
Then, the FREAD intrinsic might simply say
FCB.FILE_READ_ROUTINE (FNUM, BUFFER, LENGTH, ...);
thus calling the file read routine pointed to by the file control
block (this might be one of FREAD_STANDARD, FREAD_MSG,
FREAD_KSAM, etc.) -- this field would have been set by the FOPEN
call.
These are just examples to convince you that it makes sense not
just to be able to pass procedures and functions as parameters, but
also have variables that "contain" procedures and functions, or rather
pointers to them.
This is where the three languages differ. Standard PASCAL and
PASCAL/3000 have no such feature. There simply is no way of
* declaring a variable to be of type "pointer to a function or
procedure";
* setting a variable to point to a particular function/procedure;
* or calling a procedure/function pointed to by a variable.
None of the three above examples can be implemented in Standard or HP
PASCAL.
C, on the other hand, does support this feature. In C we might say:
typedef struct {...
int (*comp_proc)();
...
} btree_header_type;
thus declaring comp_proc to be a field pointing to a procedure that
returns an integer. Then, BTREE_OPEN might read like:
btree_open (b, proc)
btree_header_type *b;
int (*proc)();
{
...
b->comp_proc = proc;
...
}
(Note that B must be passed by pointer -- if the header were passed by
value, the assignment would only change BTREE_OPEN's local copy.)
and BTREE_ADD might say
btree_add (b, k, rec)
btree_header_type *b;
int k[], rec[];
{
...
comp_result = (*b->comp_proc) (k, current_key);
...
}
As you see, we simply use "(*b->comp_proc)" -- "the thing pointed to
by the COMP_PROC field of the header that B points to" -- in place of
a procedure name; C will then call this procedure, passing it the
parameters K and CURRENT_KEY (of course, doing no parameter checking).
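The file-system example from the previous section comes out just as cleanly in C. Here's a toy sketch -- all the names are mine, and the two "read routines" merely return distinguishable values so we can see the dispatch happen:

```c
/* A C sketch of the file-control-block example: each open file
   carries a pointer to the routine that knows how to read it.
   All names here are illustrative, not MPE's. */
typedef int (*read_routine)(int fnum);

struct fcb {
    char filename[36];
    read_routine file_read_routine;   /* set at "open" time */
};

static int fread_standard(int fnum) { return 100 + fnum; }
static int fread_msg(int fnum)      { return 200 + fnum; }

/* a toy FREAD: dispatch through the pointer stored in the FCB */
int my_fread(struct fcb *f, int fnum)
{
    return (*f->file_read_routine)(fnum);
}
```

The caller of my_fread neither knows nor cares what kind of file it has -- the FCB carries the right routine along with the data.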
Similarly, our ONEXIT routine (which, by the way, I think is a
singularly useful sort of procedure, and one that Draft Standard C has
defined in its Standard Library) might look like this:
int (*exit_routines[100])();
int num_exit_routines = 0;
onexit (proc)
int (*proc)();
{
exit_routines[num_exit_routines] = proc;
num_exit_routines = num_exit_routines + 1;
}
terminate ()
{
int i;
...
for (i=0; i<num_exit_routines; i++)
(*exit_routines[i]) (); /* call the Ith exit routine */
...
}
Clean and simple.
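For the skeptical, here is the same idea as a self-contained, compilable sketch, with two toy cleanup routines of my own invention that merely record that they ran:

```c
/* A self-contained version of the ONEXIT idea above.  The two
   cleanup routines just set flags so we can see that TERMINATE
   really called them. */
static int flushed = 0, released = 0;
static void flush_buffers(void) { flushed = 1; }
static void release_sirs(void)  { released = 1; }

static void (*exit_routines[100])(void);
static int num_exit_routines = 0;

void my_onexit(void (*proc)(void))
{
    exit_routines[num_exit_routines++] = proc;
}

void my_terminate(void)
{
    int i;
    for (i = 0; i < num_exit_routines; i++)
        (*exit_routines[i])();   /* call the ith exit routine */
}
```

(Draft Standard C's library version of this idea ended up under the name "atexit".)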
SPL's solution to this problem is somewhat dirtier. SPL can do it,
because SPL -- with its TOSes and ASSEMBLEs -- can do anything; but,
it can't do it very cleanly.
In SPL, what you'd do is save the procedure's address (actually,
its plabel, but for our purposes that's the same thing) in an integer
variable. Then, you'd use an ASSEMBLE statement to call the procedure.
INTEGER ARRAY EXIT'ROUTINES(0:99);
INTEGER NUM'EXIT'ROUTINES:=0;
PROCEDURE ONEXIT (PROC);
PROCEDURE PROC;
BEGIN
EXIT'ROUTINES(NUM'EXIT'ROUTINES):=@PROC;
NUM'EXIT'ROUTINES:=NUM'EXIT'ROUTINES+1;
END;
PROCEDURE TERMINATE;
BEGIN
INTEGER I;
...
FOR I:=0 UNTIL NUM'EXIT'ROUTINES-1 DO
BEGIN
TOS:=EXIT'ROUTINES(I);
ASSEMBLE (PCAL 0); << call the routine whose addr is on TOS >>
END;
...
END;
If you had to pass and/or receive parameters from this procedure, the
code would be even uglier. To pass, for instance, the integer arrays K
and CURRENT'KEY and to receive a result to be put into COMP'RESULT,
you'd have to say:
TOS:=0; << room for the result >>
TOS:=@K;
TOS:=@CURRENT'KEY;
TOS:=COMP'ROUTINE'PLABEL; << the plabel of the routine to call >>
ASSEMBLE (PCAL 0);
COMP'RESULT:=TOS; << get the return value >>
Ugly, but possible -- more than can be said for Standard PASCAL or
PASCAL/3000.
PASCAL/XL, on the other hand, has a solution rather comparable to
that of C's -- better, if you generally prefer PASCAL to C. In
PASCAL/XL, you could define a "procedural" or "functional" data type,
to wit:
TYPE EXIT_PROC = PROCEDURE; { no parms, no result }
COMPARE_PROC = FUNCTION (K1, K2: KEY_TYPE): BOOLEAN;
The declaration is much like what you'd put into a procedure header if
you want a parameter to be a procedure or function; however, this kind
of type allows your variables to be procedure/function pointers, too.
Thus, you'd write ONEXIT as:
VAR EXIT_ROUTINES: ARRAY [1..100] OF EXIT_PROC;
NUM_EXIT_ROUTINES: 0..100;
PROCEDURE ONEXIT (P: EXIT_PROC);
BEGIN
EXIT_ROUTINES[NUM_EXIT_ROUTINES]:=P;
NUM_EXIT_ROUTINES:=NUM_EXIT_ROUTINES+1;
END;
PROCEDURE TERMINATE;
VAR I: 0..100;
BEGIN
...
FOR I:=1 TO NUM_EXIT_ROUTINES DO
CALL (EXIT_ROUTINES[I]); { CALL is a special construct }
...
END;
Similarly, our BTREE_ADD procedure would be:
PROCEDURE BTREE_ADD (VAR B: BTREE_HEADER_TYPE;
K: KEY_TYPE; REC: RECORD_TYPE);
...
BEGIN
...
COMP_RESULT:=FCALL (B.COMP_ROUTINE, K, CURRENT_KEY);
...
END;
As you see, "CALL (proc, parm1, parm2, ...)" calls the procedure
pointed to by the variable "proc", passing to it the given parameters.
Similarly, "FCALL (proc, parm1, parm2, ...)" calls a function.
To summarize, then:
* Procedure and function variables -- though apparently rather
obscure -- can actually be very useful.
* Standard PASCAL and PASCAL/3000 support procedures and functions
as parameters, but not as variables; this is rather inadequate.
* SPL supports procedure/function variables, but in a rather
"dirty" fashion, which is clumsy and uses many TOSes and
ASSEMBLEs.
* C and PASCAL/XL have very clean support for this nifty feature.
C #DEFINEs
One thing in which C may be quite a bit superior to PASCAL is the
#define construct. It can have serious performance advantages, and
also avoid duplication of code in cases where ordinary procedures just
don't do the job.
The #define is a simple macro facility. References to it get
expanded into C code that is compiled in place of its invocation. In
other words, saying
#define square(x) ((x)*(x))
...
printf ("%d %d\n", a, square(a));
is identical to
printf ("%d %d\n", a, ((a)*(a)));
Other useful defines may include
#define min(x,y) (((x)<(y)) ? (x) : (y))
#define max(x,y) (((x)>(y)) ? (x) : (y))
#define push(val,stackptr) *(stackptr++) = (val)
and so on. Note that this is in the same spirit as SPL DEFINEs -- in
fact, if you don't have any parameters, C #define's and SPL DEFINEs
are one and the same -- but allows parameterization, which increases
the power immeasurably.
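Here is a compilable sketch exercising the macros above; the wrapper functions exist only so the expansions have somewhere to live. (Note, by the way, why every argument is parenthesized in the definitions -- so that an invocation like square(a+1) expands correctly.)

```c
/* The #defines from the text, exactly as given, exercised a little. */
#define square(x) ((x)*(x))
#define min(x,y) (((x)<(y)) ? (x) : (y))
#define max(x,y) (((x)>(y)) ? (x) : (y))
#define push(val,stackptr) *(stackptr++) = (val)

int demo_square(int a)     { return square(a); }
int demo_min(int a, int b) { return min(a, b); }
int demo_max(int a, int b) { return max(a, b); }

/* push two values onto a small stack, return its depth */
int demo_push(void)
{
    int stack[10];
    int *sp = stack;
    push(1, sp);
    push(2, sp);
    return (int)(sp - stack);
}
```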
One question that instantly comes to mind is: how are #define's
better than FUNCTIONs?
* First of all, on any computer, there is an overhead in PROCEDURE
and FUNCTION calls. For instance, an HP PASCAL program running on
a Mighty Mouse took about 140 milliseconds to execute 10,000
calls to a parameter-less PROCEDURE, and about 250 milliseconds
to do 10,000 calls to a PROCEDURE with 3 parameters.
Of course, this isn't a very large amount of time, and certainly
isn't bad enough to convince me to stop writing procedures and
repeat portions of code several times in my program. Still, it is
enough to give one pause in cases where performance is critical;
procedure calls are frequent enough that the total overhead piles
up.
#DEFINEs allow us to avoid code repetition without any of the
overhead of procedure calls. For small, very frequently used
procedures, they can be a very good solution.
* A #DEFINE can replace anything, including declarations, control
structures, etc. For instance, say that you think the C "for"
loop is too complicated, and you'd like to be able to do a
PASCAL-like "FOR". You could say
#define loop(var,start,limit) \
for ((var) = (start); (var) <= (limit); (var)=(var)+1)
Then,
loop (x, 1, 100)
printf ("%d %d %d\n", x, x*x, x*x*x);
would mean the same thing as
for (x = 1; x <= 100; x=x+1)
printf ("%d %d %d\n", x, x*x, x*x*x);
This use of #DEFINE, however, is more than just a sop to people
who are dissatisfied with C terminology and want to make it
look like PASCAL. The fact that #DEFINEs directly expand into
source code rather than just calling functions can be used to:
- Have operations that work with arbitrary datatypes. For
instance, our "min" define will work equally well for
"int"s, for "float"s, for "long"s, etc., since it expands
into a "<" comparison, which works for all those types.
- Define objects that can be stored into as well as fetched
from. For instance, if for some reason you keep all your
arrays in 1-dimensional format, you can say
#define element(array,rowsize,row,col) \
array[rowsize*row+col]
...
x = element (a, numcols, rnum, cnum);
...
element (a, numcols, rnum, cnum) = x;
Since you can't assign anything to a function call, you
couldn't do this if ELEMENT were a function; but, since it's
a macro, this ends up being a simple assignment to an array
element.
- Have #define's that define procedures. Consider the
following mysterious creature:
#define defvectorop(funcname,type,op) \
funcname(vect1,vect2,rvect,len) \
type vect1[], vect2[], rvect[]; \
int len; \
{ \
int i; \
for (i = 0; i<len; i++) \
rvect[i] = vect1[i] op vect2[i]; \
}
This #define allows you to easily define procedures of a
certain format -- to wit, those that operate element-wise on
two arrays (of a given type) to generate a third array. For
instance,
defvectorop(intmult,int,*)
defvectorop(intadd,int,+)
defvectorop(floatadd,float,+)
will define three functions that, respectively, multiply vectors
of integers, add vectors of integers, and add vectors
of floats. A call to
intmult(x1,x2,y,10);
will set elements y[0] through y[9] to x1[0]*x2[0] through
x1[9]*x2[9].
- Finally, using some even weirder constructs, you can deal
with "families" of variables which are identified by their
similar names. For instance, say your convention is that if
your "queue" data structure is stored in an array called
"x", then the header pointer is stored in a variable called
"x_head" and the tail pointer is stored in a variable called
"x_tail". A typical macro might look like
#define queueprint(queuename) \
printf ("Head = %d, Tail = %d\n", \
queuename ## _head, queuename ## _tail); \
print_array_data (queuename);
Note that with the special "##" macro operator, we can have
queueprint (myqueue)
expand into
printf ("Head = %d, Tail = %d\n",
myqueue_head, myqueue_tail);
print_array_data (myqueue);
thus deriving the head and tail variable names from the
queue variable name -- clearly something we can't do with a
procedure.
To summarize, the primary advantages of #define's are performance
and the additional flexibility that comes with direct text
substitution.
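As a concrete check that DEFVECTOROP really works, here it is again, instantiated and exercised. (I've written the generated function header in ANSI prototype style -- my own adjustment -- so the sketch compiles cleanly today.)

```c
/* The defvectorop macro from the text: each invocation expands
   into the definition of a whole element-wise vector function.
   The backslash continuations make the entire body one macro. */
#define defvectorop(funcname,type,op) \
    void funcname(type vect1[], type vect2[], type rvect[], int len) \
    { \
        int i; \
        for (i = 0; i < len; i++) \
            rvect[i] = vect1[i] op vect2[i]; \
    }

defvectorop(intmult, int, *)   /* defines intmult() */
defvectorop(intadd, int, +)    /* defines intadd() */
```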
PASCAL/XL INLINE PROCEDURES
SPL, of course, has DEFINEs, but they don't support parameters and
thus are severely limited. Standard PASCAL and HP PASCAL have nothing
like DEFINEs or #define's, either for performance's or functionality's
sake. PASCAL/XL, however, has a rather interesting feature called
"INLINE".
In HP PASCAL, you can write a procedure like this:
FUNCTION MIN (X, Y: INTEGER): INTEGER
OPTION INLINE;
BEGIN
IF X<Y THEN MIN:=X ELSE MIN:=Y;
END;
What does the "OPTION INLINE" keyword do? It commands the compiler:
whenever a MIN is seen, physically INCLUDE the code of the procedure
instead of simply compiling a procedure call instruction.
This is, of course, done for performance's sake -- to save the time
it would take to do that procedure call. This is thus somewhat
comparable to C's #define's. It isn't as flexible -- MIN is still a
procedure, with fixed parameter types, and so on -- but can be as fast
(or almost as fast).
Actually, I wouldn't be surprised if INLINE procedure calls were
still somewhat slower than #define's (although they don't have to be,
if the compiler is really smart). On the other hand, INLINE procedures
have some advantages:
* Since the compiler treats them like true procedures, it'll make
sure that
MIN(A,F(B))
won't evaluate F(B) twice (like our C #define would).
* Since these are real procedures, we're no longer restricted by
the rules about what can and can't go into an expression. For
instance, the procedure
FUNCTION FIND_NON_SPACE (X: STRING): INTEGER
OPTION INLINE;
VAR I: INTEGER;
BEGIN
I:=1;
WHILE (I<=STRLEN(X)) AND (X[I]=' ') DO I:=I+1;
FIND_NON_SPACE:=I;
END;
can't be written as a C #define, since C does not allow "while"
loops inside expressions. Similarly, we can declare local
variables and so forth.
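The double-evaluation point is easy to demonstrate. With the MIN #define, an argument that has a side effect may be evaluated twice -- something an INLINE procedure (or any real procedure) is guaranteed not to do:

```c
/* Demonstrating the double-evaluation hazard: with the #define
   version of MIN, an argument with a side effect can be
   evaluated twice. */
#define min(x,y) (((x)<(y)) ? (x) : (y))

static int calls = 0;
static int f(int b) { calls++; return b; }

int double_eval_demo(void)
{
    int r = min(10, f(5));  /* f(5) < 10, so f runs twice:
                               once in the test, once as the result */
    (void) r;
    return calls;           /* how many times f actually ran */
}
```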
In short, while PASCAL/XL INLINE procedures are in some respects
not quite as flexible as C #define's, they might be as efficient or
almost as efficient, and even more flexible in their own way. Whether
or not they really work depends on how good a job PASCAL/XL does of
optimization. If, for instance, it expands
A:=MIN(B,C)
into
TEMPPARM1:=B;
TEMPPARM2:=C;
IF TEMPPARM1<TEMPPARM2 THEN
RESULT:=TEMPPARM1
ELSE
RESULT:=TEMPPARM2;
A:=RESULT;
then this won't be a big savings. If, on the other hand, it's smart
enough to generate
IF B<C THEN
A:=B
ELSE
A:=C;
then, of course, it'll be every bit as efficient as the C #define.
Remember, though:
* INLINE procedures can only be used in the same kind of context in
which an ordinary procedure is used; i.e., you can't define a new
type of looping construct, declaration, etc.
* INLINE procedures are available only in PASCAL/XL, not in
Standard PASCAL or even PASCAL/3000.
* You rely (like you always do) on the compiler's intelligence in
generating efficient code. When you have MIN defined as a
#define, you KNOW that the compiler will generate EXACTLY the
same code for
min(x,y)
and
(x<y) ? x : y
This will probably be one test, two branches, and some stack
pushes. On the other hand, a call to an INLINE MIN procedure
might do exactly the same thing -- or, it might also build a
stack frame, allocate local variables, etc., taking almost as
much time as an ordinary non-INLINE call.
POINTERS: WHAT AND WHY
One major feature that C emphasizes more than PASCAL is support for
POINTERS. These creatures -- available in SPL as well as C -- are
often very powerful mechanisms, but they have also been accused of
making programs very difficult to read. I can't really objectively
comment on the readability aspect, but some discussion of pointers,
their advantages and disadvantages, is in order.
APPLICATION #1: DYNAMICALLY ALLOCATED STORAGE
If you declare some variable in SPL, C, or PASCAL, what you're
really declaring is a chunk of storage. If you declare a global
variable, the storage is allocated when you run the program and
deallocated when the program is done; if you declare a local variable,
the storage is allocated when you enter a procedure and deallocated
when the procedure is exited.
What if you want to declare storage that is allocated and
deallocated in some other way?
For instance, MPEX (or, for that matter, MPE) needs to read all
your UDC files and keep a table indicating which UDC commands are
defined in which file and at which record number. This table can be
any size from 0 bytes (no UDCs) to thousands of bytes. How do we
allocate it?
The trouble is that we don't know how large the UDC dictionary will
be, so we can't really declare it either as a local or global variable
(in SPL, local arrays can be of variable size, but the UDC dictionary
has to be global anyway). What we need to be able to do is DYNAMICALLY
ALLOCATE IT in the READ_UDC_FILES procedure -- somehow tell the
computer "I need X (a variable) bytes of storage now, and I want to
view it as an object of type so-and-so (say, an array of records)".
Now, there are two issues involved here:
* WE NEED A MECHANISM FOR DYNAMICALLY ALLOCATING STORAGE.
* WE NEED A WAY OF REFERRING TO THIS STORAGE ONCE IT'S ALLOCATED.
The need for a dynamic allocation procedure (e.g. PASCAL's NEW, SPL's
DLSIZE, or C's CALLOC) is obvious; but, the need for a reference
mechanism is equally important! After all, we can't very well declare
our UDC dictionary as
VAR UDC_DICT: ARRAY [1..x] OF RECORD ...;
Our very point is that we don't know the size of the array, and we
DON'T WANT THE COMPILER TO ALLOCATE IT FOR US, which is what the
compiler has to do if it sees an array declaration.
What we have to do is to declare UDC_DICT as a POINTER. A pointer
is an object that can be accessed in one of two modes:
* In one mode, it looks EXACTLY like an object of a given type
(say, an array, a record, a string, etc.). It can be assigned to,
it can have its fields extracted, etc. If UDC_DICT is a pointer
to an ARRAY of RECORDs, we could say
UDC_DICT^[UDCNUM].NAME:=UDC_NAME_STRING;
and assign something to the NAME subfield of the UDCNUMth element
of this ARRAY of RECORDs.
* In another mode, it is essentially an ADDRESS, which can be
changed to make the pointer point to (theoretically) an arbitrary
location in memory. When we say
NEW (UDC_DICT);
we don't really pass an ARRAY of RECORDs to NEW (remember, no
such array has been allocated yet); rather, we pass a variable
that will be set by NEW to a MEMORY ADDRESS, the address of a
newly-allocated array of records that can later be accessed using
"UDC_DICT^".
This two-fold nature is the key aspect of pointers -- they can be
viewed as ordinary pieces of data, OR they can be viewed as the
addresses of data, and thus changed to "point" to arbitrary locations
in memory.
The reason why I gave this definition in the context of a
discussion on "dynamic memory allocation" is that with dynamic memory
allocation, pointers are NECESSARY. If all your data is kept in global
and local variables, you might never need to use pointers, since all
the data can be accessed by directly referring to the variable name.
On the other hand, if you use things like NEW or CALLOC, you must
refer to the dynamically allocated data using pointers.
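In C, the UDC dictionary example comes out something like this -- a
hypothetical sketch (the record layout and names are my inventions, not
MPEX's actual ones), allocated at run time to exactly the size needed:

```c
#include <stdlib.h>

/* A hypothetical UDC dictionary entry, along the lines the text
   describes: the command name, which file it came from, and at
   what record number. */
typedef struct {
    char name[16];
    int  file_num;
    int  rec_num;
} udc_entry;

/* Allocate a dictionary of exactly n entries at run time --
   something a fixed global array declaration cannot do.
   Returns NULL if the system is out of memory. */
udc_entry *alloc_udc_dict(int n)
{
    return (udc_entry *) calloc((size_t) n, sizeof(udc_entry));
}
```

The pointer that CALLOC returns is then our only handle on the storage;
we index through it ("dict[i].rec_num") just as if it were a declared array.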
Let's take another example: We're building a Multi-Processing
Executive system. This'll be an eXtra Large variety, by the way, so we
might call it MPE/XL for short. This system will have lots of
PROCESSES; each process has to have a lot of data kept about it.
It makes a lot of sense for us to declare a special type of record:
TYPE PROCESS_INFO_REC =
RECORD
PROGRAM_NAME: PACKED ARRAY [1..80] OF CHAR;
CURRENT_PRIORITY: INTEGER;
TOTAL_MEMORY_USED: INTEGER;
FATHER_PROCESS: ???;
SON_PROCESSES: ARRAY [1..100] OF ???;
END;
{ Now, declare a pointer to this type }
PROCESS_PTR = ^PROCESS_INFO_REC;
[Let's not talk about whether or not this is a good design.] Now, we
can have a procedure to create a new process:
FUNCTION CREATE_PROCESS (PROGNAME: PROGNAME_TYPE): PROCESS_PTR;
VAR NEW_PROC: PROCESS_PTR;
BEGIN
NEW (NEW_PROC);
NEW_PROC^.PROGRAM_NAME:=PROGNAME;
NEW_PROC^.TOTAL_MEMORY_USED:=1234;
...
CREATE_PROCESS:=NEW_PROC;
END;
Note what this procedure returns -- it returns a POINTER to the newly
allocated PROCESS INFORMATION RECORD (remember, NEW_PROC is the
pointer, NEW_PROC^ is the record). Why does it return a pointer
instead of the record itself?
Remember that each process information record has to indicate who
the process's father is and who the process's sons are. To do
this, we have to have some kind of "unique process identifier" --
well, what better identifier than the POINTER TO THE PROCESS
INFORMATION RECORD?
Thus, our record really looks like this:
TYPE PROCESS_INFO_REC =
RECORD
PROGRAM_NAME: PACKED ARRAY [1..80] OF CHAR;
CURRENT_PRIORITY: INTEGER;
TOTAL_MEMORY_USED: INTEGER;
FATHER_PROCESS: ^PROCESS_INFO_REC;
SON_PROCESSES: ARRAY [1..100] OF ^PROCESS_INFO_REC;
END;
When we create a new process, we can just say:
NEW_PROC_INFO_PTR^.FATHER_PROCESS:=CURR_PROC_INFO_PTR;
All of our dynamically allocated process information records are thus
DIRECTLY LINKED to each other using pointers. To find out a process's
grandfather, for instance, we can just say
FUNCTION GRANDFATHER (PROC_INFO_PTR: PROCESS_PTR): PROCESS_PTR;
BEGIN
GRANDFATHER:=PROC_INFO_PTR^.FATHER_PROCESS^.FATHER_PROCESS;
END;
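For comparison, here's roughly what the same record and the GRANDFATHER
function look like in C -- a sketch using self-referencing struct
pointers:

```c
#include <stdlib.h>

#define MAX_SONS 100

/* A C sketch of the process information record.  The record refers
   to other records of the same type through pointers, which is what
   makes the pointers usable as process identifiers. */
typedef struct process_info_rec {
    char program_name[80];
    int  current_priority;
    int  total_memory_used;
    struct process_info_rec *father_process;
    struct process_info_rec *son_processes[MAX_SONS];
} process_info_rec;

/* Follow the father pointer twice, just like the PASCAL GRANDFATHER. */
process_info_rec *grandfather(process_info_rec *p)
{
    return p->father_process->father_process;
}
```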
Now of course, pointers aren't the only way to "point" to data. If,
for instance, all our Process Information Records were not allocated
dynamically, but rather taken out of some global array:
VAR PROCESS_INFO_REC_POOL: ARRAY [1..256] OF PROCESS_INFO_REC;
then we could just use indices into this pool as unique process
identifiers (in fact, that's what PINs in MPE/V are -- indices into
the PCB, an array of records that's kept in a system data segment).
But for true dynamically allocated data (allocated using PASCAL's NEW,
C's CALLOC, or SPL's DLSIZE), pointers are the way to go.
POINTERS BEYOND DYNAMIC ALLOCATION
Another reason why I first introduced pointers in the context of
dynamic allocation and NEW is that in PASCAL, that's all you can
really use pointers for.
In other words, NEW makes a pointer point to a newly-allocated
chunk of data; but there's NO WAY TO MAKE A POINTER POINT TO AN
EXISTING GLOBAL OR LOCAL VARIABLE (OR ARRAY ELEMENT). PASCAL's theory
was that any global or local variables can and should be accessed
without pointers (since presumably we know where they are at
compile-time).
Following with the UDC dictionary example I talked about earlier,
let me explain a bit about the workings of the UDC parser and executor
that I have in MPEX and SECURITY.
Both MPEX/3000 and SECURITY/3000's VEMENU need to be able to
execute UDCs. Unfortunately, MPE's COMMAND intrinsic can't execute
UDCs on my behalf (it can't even execute PREP or RUN!). Thus, I had to
do my own UDC handling.
* The first step in handling UDCs is finding out what UDC files the
user has set up. To do this I look in the directory and
COMMAND.PUB.SYS, which indicate all of the user's UDC files, and
then I FOPEN each one of these files.
* Next, I have to find out what UDCs these files contain and where
they contain them. I read the files and generate a record for
each UDC I find; this record contains the UDC's name, the file
number of the file where I found it, and the record number at
which I found it. (This is where the dynamic allocation I
discussed earlier fits in -- I'd like to be able to allocate the
UDC dictionary dynamically rather than just keep it around as a
fixed-size global variable.)
* Finally, when the time actually comes to execute a UDC,
- I parse the UDC invocation (e.g. "COBOL85 AP010S,AP010U");
- After finding the UDC name (COBOL85), I look it up in my UDC
dictionary to find out where and in which UDC file it's
defined;
- I read the header of the UDC from the UDC file -- it looks
something like "COBOL85 SRC,OBJ=$NEWPASS,LIST=$STDLIST";
- I determine the values of all the UDC parameters from the
header and the invocation -- SRC is AP010S, OBJ is AP010U, and
LIST is $STDLIST (the default);
- I then read the UDC, substituting all the UDC parameters in
each line and then executing it.
Not a trivial process, but a necessary one. The reason I bring it up
is that one aspect of the processing -- determining the UDC parameter
values and then substituting them into the UDC commands -- is quite
well-tailored to the use of POINTERS.
In order to determine the values of all the UDC parameters, I have
to parse the UDC invocation ("COBOL85 AP010S,AP010U") and the UDC
header ("COBOL85 SRC,OBJ=$NEWPASS,LIST=$STDLIST"). From the UDC
invocation I determine the values of the specified parameters (AP010S
and AP010U); from the UDC header I get the parameter names (SRC, OBJ,
and LIST), and the default values ($NEWPASS and $STDLIST). (For the
purposes of this discussion, I'm ignoring keyworded UDC invocation --
if you don't know what I mean by this, that's good; it's not relevant
here.)
Thus, my parsing has essentially generated three tables:
Parameter Names Values Given Default Values
SRC AP010S none
OBJ AP010U $NEWPASS
LIST none $STDLIST
Two of these tables -- the Values Given and Default Values -- I have
to merge into one table, the Actual Values table. If a value was
given, it becomes the Actual Value; if it wasn't, the default value is
used.
Let's look at the kind of data structure we'd use to do this sort
of thing:
TYPE STRING_TYPE = PACKED ARRAY [1..80] OF CHAR;
VAR PARM_NAMES: ARRAY [1..MAX_PARMS] OF STRING_TYPE;
VALUES_GIVEN: ARRAY [1..MAX_PARMS] OF STRING_TYPE;
DEFAULT_VALUES: ARRAY [1..MAX_PARMS] OF STRING_TYPE;
ACTUAL_VALUES: ARRAY [1..MAX_PARMS] OF STRING_TYPE;
As you see, we've declared four arrays of strings. Each one contains
up to MAX_PARMS (say, 16) strings, one for each UDC parameter.
Of course, if we do it this way, we'll be using 3*16*80 = 3840
bytes. Since actually each parameter could be up to 256 bytes long,
we'd actually need more like 12,000 bytes to fit all our data! What a
waste, especially, since all of these values were derived from two
strings -- the UDC invocation and UDC header -- each of which was at
most 256 bytes.
In other words,
* All the elements of VALUES_GIVEN are simply substrings of the
UDC_INVOCATION array;
* All the elements of PARM_NAMES and DEFAULT_VALUES are substrings
of the UDC_HEADER array.
Why actually copy these substrings out? It takes a lot of space and
more than a little time -- instead, let's just keep the INDICES and
LENGTHS of the substrings in their original arrays:
VAR PARM_NAME_INDICES: ARRAY [1..MAX_PARMS] OF INTEGER;
PARM_NAME_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
GIVEN_VALUE_INDICES: ARRAY [1..MAX_PARMS] OF INTEGER;
GIVEN_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
DEFAULT_VALUE_INDICES: ARRAY [1..MAX_PARMS] OF INTEGER;
DEFAULT_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
Note that, so far, this is a classical PASCAL solution; if you want
to "point" to data that's in your program's variables (rather than
dynamically allocated using NEW), you just keep indices instead of the
actual data. Then, you can say
SUBSTR(UDC_HEADER,PARM_NAME_INDICES[PNUM],PARM_NAME_LENGTHS[PNUM])
and thus refer to the PNUMth PARM_NAME (assuming your PASCAL has a
SUBSTR function); similarly, you can use
SUBSTR(UDC_HEADER,DEFAULT_VALUE_INDICES[PNUM],
DEFAULT_VALUE_LENGTHS[PNUM])
and
SUBSTR(UDC_INVOCATION,GIVEN_VALUE_INDICES[PNUM],
GIVEN_VALUE_LENGTHS[PNUM])
Remember, you KNOW where the substrings came from anyway, so with the
indices and the lengths you can always "reconstitute" them whenever
you like instead of having to keep them around in separate arrays.
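A C rendering of this index-and-length scheme might look like the
following sketch (the descriptor record and the names are my own, for
illustration only):

```c
#include <string.h>

/* The "keep indices and lengths" scheme: a substring is described by
   where it starts in its source buffer and how long it is, instead
   of being copied out into a separate array. */
typedef struct {
    int index;   /* starting offset within the source string */
    int len;     /* length of the substring */
} substr_desc;

/* Reconstitute a described substring into out (which must have room
   for len+1 bytes); the moral equivalent of the SUBSTR calls above. */
void substr_copy(const char *source, substr_desc d, char *out)
{
    memcpy(out, source + d.index, (size_t) d.len);
    out[d.len] = '\0';
}
```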
However, think about the ACTUAL_VALUE table. This contains the
ACTUAL VALUES of the UDC parameters, which might have come either from
the DEFAULT VALUES on the UDC header or the GIVEN VALUES on the UDC
invocation. How can you represent the actual values without having to
copy each one out into a separate string?
You see, you can't just keep the index of the actual value around,
since in this case, you're not sure WHAT string this is an index into.
You'd have to have a special array of flags:
VAR ACTUAL_VALUE_FROM: ARRAY [1..MAX_PARMS] OF (HEADER,INVOCATION);
VAR ACTUAL_VALUE_INDICES: ARRAY [1..MAX_PARMS] OF INTEGER;
VAR ACTUAL_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
and then use it like this:
PROCEDURE PRINT_ACTUAL_VALUES (NUM_PARMS: INTEGER);
VAR PNUM: 1..MAX_PARMS;
BEGIN
FOR PNUM:=1 TO NUM_PARMS DO
BEGIN
WRITE ('PARAMETER NUMBER ', PNUM:3, ' IS: ');
IF ACTUAL_VALUE_FROM[PNUM]=HEADER THEN
WRITELN (SUBSTR(UDC_HEADER,ACTUAL_VALUE_INDICES[PNUM],
ACTUAL_VALUE_LENGTHS[PNUM]))
ELSE
WRITELN (SUBSTR(UDC_INVOCATION,ACTUAL_VALUE_INDICES[PNUM],
ACTUAL_VALUE_LENGTHS[PNUM]));
END;
END;
The point I'm trying to make here is that:
* THERE ARE CASES IN WHICH YOU WANT TO HAVE A VARIABLE "POINTING"
INTO ONE OF SEVERAL ARRAYS -- OR, IN GENERAL, ONE OF A NUMBER OF
POSSIBLE LOCATIONS. TO DO THIS IN PASCAL, YOU HAVE TO KEEP TRACK
OF *BOTH* WHICH LOCATION IT'S POINTING TO AND WHAT ITS INDEX INTO
THAT LOCATION IS. THEN, TO REFERENCE IT, YOU'LL NEED AN "IF" OR A
"CASE".
Imagine, though, that in PASCAL you were able to set a pointer to
the address of a global or procedure local variable (and, presumably,
have some way of using that pointer as a string). Then, you could have
VAR PARM_NAMES: ARRAY [1..MAX_PARMS] OF ^STRING;
PARM_NAME_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
GIVEN_VALUES: ARRAY [1..MAX_PARMS] OF ^STRING;
GIVEN_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
DEFAULT_VALUES: ARRAY [1..MAX_PARMS] OF ^STRING;
DEFAULT_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
ACTUAL_VALUES: ARRAY [1..MAX_PARMS] OF ^STRING;
ACTUAL_VALUE_LENGTHS: ARRAY [1..MAX_PARMS] OF INTEGER;
Not only will you be able to say
SUBSTR(PARM_NAMES[PNUM]^,1,PARM_NAME_LENGTHS[PNUM])
instead of
SUBSTR(UDC_HEADER,PARM_NAME_INDICES[PNUM],PARM_NAME_LENGTHS[PNUM])
thus being able to forget where the PARM_NAMES, GIVEN_VALUES, and
DEFAULT_VALUES arrays happened to be derived from, but you could also
have ACTUAL_VALUES point into EITHER the header or the invocation,
thus reducing our "PRINT ACTUAL VALUES" procedure to:
PROCEDURE PRINT_ACTUAL_VALUES (NUM_PARMS: INTEGER);
VAR PNUM: 1..MAX_PARMS;
BEGIN
FOR PNUM:=1 TO NUM_PARMS DO
BEGIN
WRITE ('PARAMETER NUMBER ', PNUM:3, ' IS: ');
WRITELN (SUBSTR(ACTUAL_VALUES[PNUM]^,1,
ACTUAL_VALUE_LENGTHS[PNUM]));
END;
END;
Unfortunately, in PASCAL you can't do this because there was no way
to fill the various arrays of pointers (ACTUAL_VALUES et al.) with
data -- there's no way of determining the pointer to, say,
UDC_INVOCATION[33] or UDC_HEADER[2].
What I've been trying to convince you of is that this is a non-trivial
lack and there are cases in which it's desirable to be able to set a
pointer to point to any object in your data space, and then work from
that pointer rather than, say, indexing into an array.
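In C, where a pointer may legally point into any buffer, the
ACTUAL_VALUES idea comes out something like this sketch (the names are
hypothetical):

```c
#include <string.h>

#define MAX_PARMS 16

/* Each actual value is just a pointer and a length; the pointer may
   aim into EITHER the UDC header or the UDC invocation, so no
   "which buffer?" flag is needed. */
const char *actual_values[MAX_PARMS];
int actual_value_lengths[MAX_PARMS];

/* Record that parameter pnum's value is the len bytes starting at p,
   wherever p happens to point. */
void set_actual_value(int pnum, const char *p, int len)
{
    actual_values[pnum] = p;
    actual_value_lengths[pnum] = len;
}
```

A PRINT_ACTUAL_VALUES written against this structure needs no IF at all;
it just prints through the pointer.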
OTHER POINTER APPLICATIONS
There are other cases in which many C and SPL users use pointers.
In these cases, a PASCAL user can quite as readily use an index into a
string or an array; it's hard to tell which solution is better.
For instance, consider these three procedures: [Note: I've
intentionally avoided using certain language features like PASCAL
sets, SPL three-way <=, some automatic type conversions, etc. to make
the examples as similar as possible]
TYPE PAC256 = PACKED ARRAY [0..255] OF CHAR;
PROCEDURE UPSHIFT_WORD (S: PAC256);
{ Upshifts all the letters in S until a non-alphabetic }
{ character is reached; expects there to be at least one }
{ special character somewhere in S to act as a terminator. }
VAR I: INTEGER;
BEGIN
I:=0;
WHILE ('A'<=S[I]) AND (S[I]<='Z') OR ('a'<=S[I]) AND (S[I]<='z') DO
BEGIN
IF ('a'<=S[I]) AND (S[I]<='z') THEN
S[I]:=CHR(ORD(S[I])-32); { upshift character }
I:=I+1;
END;
END;
PROCEDURE UPSHIFT'WORD (S);
BYTE ARRAY S;
<< Upshifts all the letters in S until a non-alphabetic >>
<< character is reached; expects there to be at least one >>
<< special character somewhere in S to act as a terminator. >>
BEGIN
BYTE POINTER P;
@P:=@S;
WHILE "A"<=P AND P<="Z" OR "a"<=P AND P<="z" DO
BEGIN
IF "a"<=P AND P<="z" THEN
P:=BYTE(INTEGER(P)-32);
@P:=@P(1);
END;
END;
upshift_word (s)
char s[];
{
char *p;
p = &s[0];
while ('A'<=*p && *p<='Z' || 'a'<=*p && *p<='z')
{
if ('a'<=*p && *p<='z')
*p = (char) (*p - 32);
p = &p[1];
}
}
The first example is PASCAL, using indices into an array of
characters; the next two are SPL and C, using character pointers.
Which is better?
* In PASCAL, since you're using indices into an array whose size is
known, the compiler can do run-time bounds checking to make sure
that the index is valid; also, every "S[I]" reference makes it
clear where you're getting the data from.
* In SPL and C, instead of "S[I]" you just say "P" (in SPL) or "*P"
(in C). This is, incidentally, probably faster than PASCAL, since
it would typically require just an indirect memory reference
instead of an indirect reference with indexing (unless you have a
very smart compiler).
* Some people think that "S[I]" is more readable; others don't like
to always repeat the index (especially if it's something like
"MY_STRING[MY_INDEX]") and prefer "P" or "*P". This is where it
gets quite subjective; you've got to decide for yourself.
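For reference, here's the pointer-based C version made self-contained
and compilable (note that the character assignment must be "*p =", not
"p ="):

```c
/* Upshifts all the letters in s until a non-alphabetic character is
   reached; expects at least one special character somewhere in s to
   act as a terminator. */
void upshift_word(char *s)
{
    char *p = &s[0];
    while (('A' <= *p && *p <= 'Z') || ('a' <= *p && *p <= 'z'))
    {
        if ('a' <= *p && *p <= 'z')
            *p = (char) (*p - 32);   /* 'a'-'A' is 32 in ASCII */
        p = &p[1];                   /* i.e., p++ */
    }
}
```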
MORE ON DYNAMICALLY ALLOCATING MEMORY
As you recall, we started our discussion of pointers with an
example involving dynamic memory allocation. This is really good for
me, since I have things to say about dynamic memory allocation, and
I'd have a hard time sneaking them into any other chapter. Thus, with
this tenuous connection established, let's talk some more about
dynamic memory allocation.
I use "dynamic memory allocation" to mean allocating memory other
than what's automatically allocated for you in the form of GLOBAL or
PROCEDURE LOCAL variables. You typically use this mechanism when you
don't know at compile-time how much memory you'll need.
I've already given some examples of uses of dynamic memory
allocation:
* Allocating a "UDC command dictionary" that could be 0 bytes or
10,000 bytes.
* Allocating "process information records", which might themselves
be rather small, but of which there might be either none or a
thousand.
* Implementing commands like MPE's :SETJCW, with which you can
define any number of objects at the user's command.
Let's consider the latter example -- you're writing a command-
driven program, in which the user might use a "SETVARIABLE" command to
define a new variable and give it a value (say, an integer).
Naturally, you have some top-level prompt-and-input routine, which
then sends the user-supplied command to the parser (called, say,
PARSE'COMMAND): The parser identifies this as a SETVARIABLE command,
and calls this procedure:
PROCEDURE SETVARIABLE (VAR'NAME, VAR'LEN, INIT'VALUE);
VALUE VAR'LEN;
VALUE INIT'VALUE;
BYTE ARRAY VAR'NAME;
INTEGER VAR'LEN;
INTEGER INIT'VALUE;
BEGIN
...
END;
Now SETVARIABLE has all the data already at its disposal in its
parameters; however, it has to SAVE all this information somewhere, so
that it'll stay around long after the SETVARIABLE procedure and even
the PARSE'COMMAND procedure is exited. Presumably there is a
FINDVARIABLE procedure somewhere that, given the variable name, will
extract the value that was put into it using SETVARIABLE.
Where should SETVARIABLE put the data -- the variable name and
initial value? Well, we could have a global array:
BYTE ARRAY VARIABLES(0:4095);
This gives us up to 4096 bytes of room for our "user-defined
variables", names, data, and all.
Clearly, though, this solution is both inefficient and inflexible.
What if the user doesn't define any variables? We've wasted 4K bytes.
What if the user defines too many variables? He won't be able to.
What we want to do, thus, is to have SETVARIABLE request from the
system a chunk of memory containing VAR'LEN+3 bytes -- VAR'LEN for the
name, 1 for the name length, and 2 for the variable value. Then, we
can keep the addresses of all these chunks somewhere (perhaps in a
linked list), and FINDVARIABLE can then just go through these chunks
to find the variable it's looking for.
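Here's a C sketch of that whole scheme -- exact-size chunks on a linked
list, with a FINDVARIABLE that walks them (all the names here are my
own inventions, for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Each variable lives in its own exactly-sized chunk; the chunks are
   kept on a linked list. */
typedef struct var_node {
    struct var_node *next;
    int  value;
    char name[1];   /* actually allocated with room for the whole name */
} var_node;

static var_node *var_list = NULL;

/* Allocate a chunk just big enough for this name, and link it in. */
void setvariable(const char *name, int value)
{
    var_node *v = (var_node *) malloc(sizeof(var_node) + strlen(name));
    strcpy(v->name, name);
    v->value = value;
    v->next = var_list;
    var_list = v;
}

/* Walk the chunks looking for the name; return 1 and set *value
   if the variable is found, 0 otherwise. */
int findvariable(const char *name, int *value)
{
    var_node *v;
    for (v = var_list; v != NULL; v = v->next)
        if (strcmp(v->name, name) == 0) { *value = v->value; return 1; }
    return 0;
}
```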
* In SPL, the only easy way of dynamically allocating memory is by
calling DLSIZE. DLSIZE will get space from the "DL-DB" area;
unfortunately, it'll only get it in 128-word chunks (if you ask
for a 2-word chunk it'll give you 128 words). Also, if we ever
see a DELETEVAR command, there's no easy way of "returning space"
to the system (always a difficult task).
* In PASCAL, we'd use the NEW procedure. You pass to NEW a pointer
to a given data type, and it will set that pointer to point to a
newly allocated variable of that data type. Thus, you'd say:
TYPE NAME_TYPE = PACKED ARRAY [1..80] OF CHAR;
VAR_REC = RECORD
CURRENT_VALUE: INTEGER;
NAME_LEN: INTEGER;
NAME: NAME_TYPE;
END;
VAR_REC_PTR = ^VAR_REC;
...
FUNCTION SETVARIABLE (VAR NAME: NAME_TYPE;
NAME_LEN: INTEGER;
INIT_VALUE: INTEGER): VAR_REC_PTR;
VAR RESULT: VAR_REC_PTR;
BEGIN
NEW (RESULT);
RESULT^.CURRENT_VALUE:=INIT_VALUE;
RESULT^.NAME_LEN:=NAME_LEN;
RESULT^.NAME:=NAME;
SETVARIABLE:=RESULT;
END;
Simple, eh?
* In C, you'd do almost the same thing. I say almost because the
only real difference is that C's equivalent of NEW, called
CALLOC, takes the number of elements to allocate (in our case 1,
since this is just a record and not an array of records) and the
size of each element.
typedef struct {int current_value;
int name_len;
char name[0];} var_rec;
typedef var_rec *var_rec_ptr;
...
var_rec_ptr setvariable (name, name_len, init_value)
char name[];
int name_len, init_value;
{
var_rec_ptr result;
/* Allocate an object, cast it to type "VAR_REC_PTR"; one extra */
/* byte leaves room for the NUL that strcpy appends. */
result = (var_rec_ptr) calloc (1, sizeof(var_rec) + name_len + 1);
result->current_value = init_value;
result->name_len = name_len;
strcpy (result->name, name);
return result;
}
Compare PASCAL and C; ignore the small differences like the STRCPY
call (it just copies one string into another). The important
difference is in the CALLOC and NEW calls, and it's an important one
indeed!
* IN PASCAL, A CALL TO "NEW" ALLOCATES A NEW OBJECT OF A GIVEN
DATATYPE. THE SIZE OF THE OBJECT IS UNIQUELY DEFINED BY THE
DATATYPE. WHAT ABOUT STRINGS???
NEW is just great for allocating fixed-size objects, like the Process
Information Records we talked about earlier. But what about
variable-length things?
When we defined the VAR_REC data type, we defined NAME to be a
PACKED ARRAY [1..80] OF CHAR. This, however, isn't quite precise. What
this means to us is that NAME can be UP TO 80 characters long. But
when NEW allocates new objects of type VAR_REC, it will ALWAYS
allocate them with room for 80 characters in NAME! Never mind that we
know how long NAME should REALLY be -- we have no way of telling this
to NEW.
C's CALLOC, on the other hand, allows us to specify the number of
bytes we need to allocate. The disadvantage of this is that we need to
figure out this number; this, however, isn't hard -- we just say
sizeof (<datatype>)
e.g.
sizeof (var_rec)
[Note that VAR_REC was cunningly defined to have the NAME field be 0
characters long -- since C never does bounds checking anyway, this
won't cause any problems, but will make SIZEOF return the size of only
the fixed-length portion of VAR_REC.]
The great advantage of CALLOC is that for variable-length objects we
can EXACTLY indicate how much space is to be allocated. Since space
savings is one of the major reasons we do dynamic memory allocation,
this advantage of CALLOC -- or, more properly, disadvantage of NEW --
becomes very serious indeed. Not only does it waste space, but it also
impairs flexibility, since in trying to save space we restrict the
maximum variable name length to 80 bytes, when we should really make
it virtually unlimited.
Thus, to summarize, C and PASCAL both have relatively easy-to-use
dynamic memory allocation mechanisms (as well as deallocation
mechanisms, called DISPOSE in PASCAL and CFREE in C). They both work
very well for allocating fixed-length objects (or, parenthetically,
so-called "variant records" which are really variable-length in that
they can have one of several distinct formats). However, if you want
to allocate variable-length objects -- e.g. strings or records
containing strings -- PASCAL CAN'T DO IT WITHOUT WASTING INORDINATE
AMOUNTS OF MEMORY!
DYNAMIC MEMORY ALLOCATION -- PASCAL/3000 AND PASCAL/XL
Naturally, I'm not the first one to notice this kind of problem.
PASCAL/XL has a rather nice solution to it (I only wish the Standard
PASCAL authors thought of it):
P_GETHEAP (PTR, NUM_BYTES, ALIGNMENT, OK);
Using this built-in procedure, you can set PTR (which can be a pointer
of any type) to point to a newly-allocated chunk of memory NUM_BYTES
long. ALIGNMENT indicates how to physically align this chunk (on a
byte, half-word, word, double-word, or page boundary), and OK
indicates whether or not this request succeeded (another thing that
Standard PASCAL NEW doesn't give you). The counterpart to DISPOSE here
is
P_RTNHEAP (PTR, NUM_BYTES, ALIGNMENT, OK);
These procedures seem every bit as good as CALLOC and CFREE -- they
let you allocate EXACTLY as much space as you need.
PASCAL/3000 does not have the P_GETHEAP and P_RTNHEAP procedures;
however, hidden away in Appendix F of the PASCAL/3000 manual there is
a subsection called "PASCAL Support Library" (did you see this section
when you read the manual?). Here, with the strong implication that
these procedures are to be used from OTHER languages rather than
PASCAL, are documented two procedures:
GETHEAP (PTR, NUM_BYTES, OK);
and
RTNHEAP (PTR, NUM_BYTES, OK);
It appears that these two procedures do pretty much the same thing
as P_GETHEAP and P_RTNHEAP -- they allocate an arbitrary amount of
space, and return a pointer to it in the variable PTR, which can be of
any type.
Again, I'm not sure whether they were even INTENDED to be called
from PASCAL or only from other languages; however, it appears that
they ought to work from PASCAL, too.
PASCAL/XL AND POINTERS
As I mentioned earlier, Standard PASCAL allows pointers in one case
and one case alone -- pointers to dynamically allocated (NEWed) data.
There's no way to make a pointer to point to a global or
procedure-local variable. PASCAL/3000 shares this
lack; PASCAL/3000 lets you get the address of an arbitrary variable
(by calling WADDRESS, e.g. "WADDRESS(X)"), but it doesn't allow you to
do the inverse -- go from the address (which is an integer) back to
the value.
SPL and C, of course, allow you to do both. In SPL, you can say
INTEGER POINTER IP;
INTEGER I, J;
@IP:=@I; << set IP to the address of I >>
J:=IP+1; << get the value pointed to by IP >>
In C, you'd write
int *ip;
int i, j;
ip = &i; /* set ip to the address of i */
j = *ip + 1;
Note that C provides you two operators -- "&" to get the address of a
variable, and "*" to get the value stored at a particular address.
PASCAL/XL, in essence, allows you to do much the same thing. Its
ADDR operator determines the address of an arbitrary variable and
returns it as a pointer. Thus, you can say:
VAR IP: ^INTEGER;
I, J: INTEGER;
IP:=ADDR(I);
J:=IP^+1;
A very simple addition, but it allows you to do virtually all of the
pointer manipulation described earlier in the section -- you can have
pointers that point to one of several local arrays, pointers that step
through a string, etc. To revive an example from before, PASCAL/XL
lets you write:
TYPE PAC256 = PACKED ARRAY [0..255] OF CHAR;
PROCEDURE UPSHIFT_WORD (S: PAC256);
{ Upshifts all the letters in S until a non-alphabetic }
{ character is reached; expects there to be at least one }
{ special character somewhere in S to act as a terminator. }
VAR P: ^CHAR;
BEGIN
P:=ADDR(S);
WHILE ('A'<=P^) AND (P^<='Z') OR ('a'<=P^) AND (P^<='z') DO
BEGIN
IF ('a'<=P^) AND (P^<='z') THEN
P^:=CHR(ORD(P^)-32); { upshift character }
P:=ADDTOPOINTER(P,SIZEOF(CHAR));
END;
END;
Compare this with the corresponding C code:
upshift_word (s)
char s[];
{
char *p;
p = s;
while ('A'<=*p && *p<='Z' || 'a'<=*p && *p<='z')
{
if ('a'<=*p && *p<='z')
*p = (char) ((int)*p - 32);
p = &p[1];
}
}
As I mentioned before, one can legitimately say that you should be
indexing into the string (e.g. S[I]) rather than using a pointer -- in
fact, since you can't use pointers to local variables in Standard
PASCAL, you'd have to use indexing. On the other hand, many people
prefer using pointers and, as you see, PASCAL/XL allows you to do this
as easily as in C.
SPL AND ITS LOW-LEVEL ACCESS MECHANISMS
SPL, being a language designed explicitly for the HP3000 and for
nitty-gritty systems programming, has a lot of "low-level" access
mechanisms. These include:
* The ability to execute any arbitrary machine instruction (using
ASSEMBLE).
* The ability to push things onto and pop things off the stack
(using TOS).
* The ability to examine (PUSH), set (SET), and reference data
relative to various system registers (Q, DL, DB, S, X, etc.).
Standard C and PASCAL, naturally, do not have such mechanisms; neither
does PASCAL/XL. CCS, Inc.'s C/3000 does have "ASSEMBLE"- and
"TOS"-like constructs (although their ASSEMBLE is more difficult to
use than SPL's); however, it's by no means certain that C/XL will have
them.
Now, if we were simply counting features, things would be simple.
We'd credit SPL with 3 new statements (ASSEMBLE, PUSH, and SET) and 5
new addressing modes (TOS, X register, DB-relative, Q-relative, and
S-relative), and that'd be that. Score: SPL 37, PASCAL 22, C 31.
Of course, not every feature is worth as much as any other feature.
Many people complain that it's BAD for SPL to have these features;
that whatever performance advantages you can get aren't worth the
additional complexity and opacity; that, in general, PASCAL and C are
better off without them.
Now this may end up being a moot point, especially if C/XL ends up
not having ASSEMBLE and similar constructs. On the other hand, it
might be nice to consider any cases there may be where such constructs
are really necessary -- if only for old times' sake.
THE ARGUMENTS AGAINST ASSEMBLE, TOS, AND FRIENDS
Before we go further, let me outline the arguments -- most of them
perfectly valid -- that have been made against SPL's (and other
languages') low-level constructs:
* If you're using low-level constructs for performance's sake,
you're wasting your time. Most compilers these days are good
enough that they generate very efficient code, and you can't do
much better using assembly. On the other hand, assembly is much,
much harder to write, read, and maintain than high-level code.
It's just not worth it.
* If you're using low-level constructs for functionality, all that
means is that the system isn't providing you with enough
fundamental primitives that you could use in place of assembly.
For instance, the old trick of going through the stack markers, Q
register, etc. to get your ;PARM= and ;INFO= -- there should have
been an intrinsic to do that in the first place.
* If the language has low-level programming constructs, people will
use them out of thoughtlessness or a misguided sense of
efficiency, and thus produce awful, impossible-to-maintain code.
Languages with ASSEMBLE, TOS, and the like, are like a loaded
gun, an open invitation for anybody to shoot himself in the foot
(or worse).
* Finally, the more sophisticated (and efficient) the compiler, the
more likely that it CAN'T let you do low-level stuff. How can you
use register-manipulation code if you don't know what registers
the compiler uses for itself (and it can use different ones in
different cases)? How can you get things off the stack if you
don't know whether the compiler is keeping them on the stack or
in registers? How can you trace back the stack markers if the
compiler may do special call instructions or place code inline at
its own discretion?
The first of these arguments, in my opinion, is on the whole very
sound. Very rarely do I find it desirable to use low-level constructs
for efficiency's sake. Compared to programmer and maintainer time,
computer time is cheap. On the other hand, when efficiency is really
very important -- and it's often the case that 5% of the code uses 95%
of the CPU time -- using low-level constructs for performance's sake
may be quite necessary.
The fourth argument -- that a smart modern compiler can't assure
you about the state of the world and thus can't let you muck around
with it -- is very potent as well. In SPL, you can always say
TOS:=123;
<< now, execute an instruction that expects something on TOS >>
What if on Spectrum you have an instruction that expects a value in
register 10? You can't just say
R10:=123;
<< execute the instruction >>
What if the compiler stored one of your local variables in R10? How
does it know that it has to save it before your operation? Will saving
it damage the machine state (e.g. condition codes) enough that the
operation won't work anyway? The classic example of this in SPL is
condition codes -- saying
FNUMS'ARRAY(I):=FOPEN (...);
IF <> THEN
...
The very act of saving the result of FOPEN in the Ith element of
FNUMS'ARRAY resets the condition code, thus making the IF <> THEN do
something entirely unexpected! Now, an SPL user can know which
operations change the condition code, and which don't (maybe), and
thus avoid this sort of error -- but what about a PASCAL/XL user? Will
HP be obligated to tell all the users which registers and condition
codes each operation modifies?
The third of the arguments has some merit, too. In fact, all you
need to do is to look at a certain operating system provided by the
manufacturer of a certain large business minicomputer, to see how
dangerous TOSes and ASSEMBLEs are. I've seen pieces of code that are
utterly impossible to understand, where something is pushed onto the
stack only to be popped 60 lines and 10 GOTOs later -- ugh! There are
some language constructs that are just plain DANGEROUS.
It's the second argument -- that all the cases where you need to
use low-level code shouldn't have existed in the first place -- that I
don't quite buy. What SHOULD be and what IS are two different things.
I personally wish that every case where I needed to use low-level code
was already implemented for me by a nice, readable, easy-to-use HP
intrinsic. Unfortunately, that's not always the case, and I shudder to
think what would have happened if I DIDN'T have a way of doing all the
dirty assembler stuff myself.
Thus, the point here is: every system SHOULD provide you all you
want, and it SHOULD optimize your code well (perhaps even better than
you could do it yourself using assembler). Unfortunately, it often
DOESN'T, and you need some way of getting around it to do things
yourself.
EXAMPLES OF THINGS YOU NEED LOW-LEVEL OPERATIONS FOR
SETTING CONDITION CODES
For better or worse, HP decided that its intrinsics indicate some
part of their return result as the "condition code". This value,
actually 2 bits in the STATUS register, can be set to the so-called
"less than", "greater than", and "equal" values; to see what the
current condition code value is, you can say in SPL:
IF <> THEN << or <, >, <=, >=, or = >>
or, in FORTRAN:
IF (.CC.) 10, 20, 30 << go to 10 on <, 20 on =, 30 on > >>
(A similar mechanism exists in COBOL II.)
Now, have you ever wondered how HP's intrinsics actually SET this
condition code? You can't just say:
CONDITION'CODE:=<;
to set it to "less than". What can you do?
Now, one can say -- and quite correctly -- that it's not a good
thing for a procedure to return condition codes. Condition codes are
volatile things; they're changed by almost every machine instruction;
for instance, as I mentioned before,
FNUMS(I):=FOPEN (...);
IF <> THEN
...
won't do what you expect, since the instructions that index into the
FNUMS array reset the condition code. Thus, if you have the choice,
you ought to return data as, say, the procedure's return value, or a
by-reference parameter.
Still, sometimes it's necessary to return a condition code. Say,
for instance, that you have a program that's been written to use FREAD
calls, and you decide to change it to call your own procedure called,
say, MYREAD. MYREAD might, for instance, do MR NOBUF I/O to speed
things up, or whatever -- the important thing is that you want it to
be "plug compatible" with FREAD. You just want to be able to say
/CHANGE "FREAD","MYREAD",ALL
and not worry about changing anything else.
Well, in C or PASCAL, you'd be USC (that's Up Some Creek). In SPL,
though, you can do it. You have to know that the condition code is
kept in the STATUS register, a copy of which is in turn kept in your
procedure's STACK MARKER at location Q-1. When you do a return from
your procedure to the caller, the EXIT instruction sets the status
register to the value stored in the stack marker. Thus, you just need
to set the condition code bits in Q-1 (something you can't do in any
language besides SPL) before returning from the procedure:
INTEGER PROCEDURE MYREAD (FNUM, BUFFER, LEN);
VALUE FNUM, LEN;
INTEGER FNUM, LEN;
ARRAY BUFFER;
BEGIN
INTEGER STATUS'WORD = Q-1;
DEFINE CONDITION'CODE = STATUS'WORD.(6:2) #;
EQUATE CCG=0, CCL=1, CCE=2; << possible CONDITION'CODE values >>
...
CONDITION'CODE:=CCE;
END;
Relatively clean, as you see, but not doable without "low-level" access
(in this case, Q-relative addressing).
Incidentally, don't think this is just a speculative example that
doesn't happen in real life. I usually avoid using condition codes in
my RL (most of my procedures return a logical value indicating success
or failure), but I have several just like this -- FREAD
plug-compatible replacements.
Also, I've sometimes had to write SL procedures that exactly mimic
intrinsics like READX, FREAD, FOPEN, etc. so that I can patch a
program file to call this procedure instead of the HP intrinsic.
(VESOFT's hook mechanism, which implements RUN, UDCs, super REDO, and
MPEX commands from within programs like EDITOR, QUERY, etc. works like
this.)
One can say that HP should have provided a SET'CCODE intrinsic in
the first place to do this; my only response is that it didn't, and I
have to somehow get my job done in spite of it.
SYSTEM TABLE ACCESS
System table access, another thing that I like to do in a
high-level fashion, with as few ASSEMBLEs, TOSes, EXCHANGEDBs, etc. as
possible, sometimes needs low-level access.
The classic example is accessing system data segments, e.g.
TOS:=@BUFFER;
TOS:=DSEG'NUMBER;
TOS:=OFFSET;
TOS:=COUNT;
ASSEMBLE (MFDS 4); << Move From Data Segment >>
Originally, in SPL, this was the ONLY way to access a system data
segment (like the PCB, JMAT, your JIT, etc.). I didn't have an OPTION
-- do it this way or some other, somewhat slower way; it was either
use TOS and ASSEMBLE (which I was scared to death of) or not do it at
all.
Now, of course, SPL has the MOVEX statement, with which I can say
MOVEX (BUFFER):=(DSEG'NUMBER,OFFSET),(COUNT);
to do exactly the same thing without any unsightly ASSEMBLEs.
Remember, though, that this construct is a recent addition to SPL;
when I started hacking on the HP3000 in 1979, it wasn't there, so I
had to do without it.
Another example of system table access is access to the PXGLOB, a
table that lives in the DL-negative area of your stack. Most
languages can't access this area, and even SPL doesn't give you a
direct way of getting to it; but, with the PUSH statement, it can be
done:
INTEGER ARRAY PXGLOB(0:11);
INTEGER POINTER DL'PTR;
...
PUSH (DL);
@DL'PTR:=TOS;
MOVE PXGLOB:=DL'PTR(-DL'PTR(-1)),(12);
We use SPL's ability to set a pointer to point to any location in the
stack (in this case, the location pointed to by the DL register), and
then index from there. Again, there's no way of doing this without
using PUSH and TOS.
OTHER APPLICATIONS OF LOW-LEVEL CONSTRUCTS
Some other cases where ASSEMBLE, TOS, etc. are necessary to do
things:
* If you look in the "CONTROL STRUCTURES" section of this paper,
you'll see a PASCAL/XL construct called TRY .. RECOVER. For
reasons that I explain in that section, I think that it's a very
useful construct, and I've implemented it in SPL for the benefit
of my SPL programs.
Note that in any other language, I couldn't do this; only in SPL,
with its register access and especially the ability to do
Q-relative addressing to access stack markers, could I implement
this entirely new control structure.
* Whenever you write an SPL OPTION VARIABLE procedure, you need to
be able to access its "option variable bit mask", which indicates
which parameters were passed and which were omitted. This
information is stored at Q-4 (and also sometimes at Q-5); with
SPL's Q-relative addressing, you can access it. Again, maybe SPL
should have a built-in construct that lets you find out the
presence/absence of an OPTION VARIABLE parameter; however, it
does not.
* To determine your run-time ;PARM= or ;INFO= value, you need to
look at your "Qinitial"-relative locations -4, -5, and -6.
Qinitial refers to the initial value of the Q register; to get to
it, you have to go through all your stack markers, which requires
Q-relative addressing. HP's new GETINFO intrinsic does this for
you; it was released in 1987, whereas the HP3000 was first put
out in 1972.
* HP's LOADPROC intrinsic dynamically loads a procedure from the
system SL and returns you the procedure's plabel. How do you call
the loaded procedure? You have to push all the parameters onto
the stack and then do an ASSEMBLE, to wit:
TOS:=0; << room for the return value >>
TOS:=@BUFF; << parameter #1 >>
TOS:=I+7; << parameter #2 >>
TOS:=PLABEL; << the plabel returned by LOADPROC >>
ASSEMBLE (PCAL 0);
RESULT:=TOS; << collect the return value >>
* Say that, in the middle of executing a procedure, you need to
allocate X words of space. You can try allocating it in your
DL area, but then you'll have a hard time deallocating it (unless
you want to write your own free space management package). If you
need this space only until the end of the procedure, you can
simply say:
INTEGER S0 = S-0; << S0 now refers to the top of stack >>
INTEGER POINTER NEWLY'ALLOCATED;
...
@NEWLY'ALLOCATED:=@S0+1;
TOS:=X; << the amount of space to allocate >>
ASSEMBLE (ADDS 0); << allocate the space on the stack >>
NEWLY'ALLOCATED now points to the X words of newly allocated
stack space. Exiting the procedure will deallocate the space, as
will saying
TOS:=@NEWLY'ALLOCATED-1;
SET (S);
* XCONTRAP sets up a procedure as a control-Y trap procedure; when
the user hits control-Y, the procedure is called. However, if the
control-Y is hit at certain times (say, in the middle of an
intrinsic call), there'll be some junk left on the stack that the
trap procedure will have to pop. The way the SPL manual suggests
you do this is by saying:
PROCEDURE TRAP'PROC;
BEGIN
INTEGER SDEC=Q+1; << indicates the amount of junk to pop >>
...
TOS:=%31400+SDEC; << build an EXIT instruction! >>
ASSEMBLE (XEQ 0); << execute the value in TOS! >>
END;
This is, of course, incredibly ugly code -- you build an
instruction on top of the stack and then execute it! -- and HP
should certainly have designed its control-Y trap mechanism some
other way. On the other hand, I can't do anything about it; I'm
stuck with it, and I have to have some way of dealing with it.
CONCLUSION: HOW BAD IS LOW-LEVEL?
As you see, for all the bad things that have been said about
ASSEMBLEs and TOSes, sometimes they are necessary to get things done.
Almost by definition, every case where they are necessary indicates
something wrong with the operating system. In every case shown above,
I SHOULDN'T have to stoop to ASSEMBLEs et al.; there SHOULD be HP
intrinsics to set the condition code, get the ;PARM= and ;INFO=, move
things to/from data segments, access DL-negative area, allocate things
on the stack, do TRY .. RECOVER, etc.
And, as you see, many of the problems I discussed above HAVE been
fixed -- in new versions of MPE, of SPL, of PASCAL/XL. The root of the
problem, though, remains the same:
* HP WILL NEVER THINK OF EVERYTHING.
The users' needs will always outstrip HP's clairvoyance. The big
advantage of SPL was that it gave you the tools to satisfy almost any
need you had (at a substantial cost in blood, sweat, toil, and tears).
I only hope that on Spectrum, HP has some mechanism -- an ASSEMBLE
construct in PASCAL/XL or C/XL, or, perhaps, a separate assembler that
is accessible to the users -- with which Spectrum users can attack
problems that HP hasn't thought of.
THE STANDARD LIBRARIES IN DRAFT STANDARD C
One of C's features that its fans are justifiably proud of is the
tendency of many C compilers to provide lots of useful built-in
"library functions", which do things like I/O, string handling,
mathematical functions (exp, log, etc.), and more. In addition to
getting "the language" itself, C proponents say, you also get a lot of
nice functionality that you COULD have implemented yourself, but would
rather not have to.
Now, Standard PASCAL has some such functions (mostly in the field
of I/O and mathematics); PASCAL/3000 and PASCAL/XL add more (mostly
strings and more I/O); SPL, being fixed to the HP3000, relies on the
HP3000 System Intrinsics.
Kernighan & Ritchie C, to be honest, is actually INFERIOR to
Standard PASCAL insofar as built-in functions go -- although it
defines a standard set of I/O functions, it doesn't define any
standard mathematical functions, nor does it define standard string
handling functions (which Standard PASCAL doesn't, either).
However, many C compilers quickly evolved their own sets of
supported library functions, and the Draft Proposed C Standard
enumerates and standardizes them all. Remember, though, the
considerations involved in relying on the Draft Proposed C Standard:
* On the one hand, the Standard is new and not even finalized yet,
so most existing compilers are likely to differ from it in quite a
few respects. In fact, it'll probably be years before most C
compilers fully conform to the new Standard.
* On the other hand, the Draft Standard is not created from
scratch. All or most of the functionality that it sets down has
already been implemented in one or more of the existing C
compilers. In particular, at least string handling packages and
mathematical functions are available in virtually all C compilers
(although not necessarily entirely standardized).
The question of these sorts of built-in support functions is not an
earth-shaking one; almost by definition of C, any function described
here can be implemented by any C programmer, and most of these
functions are probably ordinary C-written functions that just happened
to have been provided by the compiler writer.
However, I think that it is somewhat important to mention these
functions simply because although you CAN write them, you'd rather not
write anything you don't have to. Any time the standard is thoughtful
enough to provide date and time support (how many thousands of various
personal implementations of date handlers are there? and how many of
them actually work?) or built-in binary search or sort mechanisms,
that's something to be thankful for.
INTERESTING FUNCTIONS PROVIDED BY DRAFT STANDARD C
* RAND, a random number generator. Quite simple to implement yourself.
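How simple? A minimal sketch, along the lines of the classic
linear-congruential generator that the C standard documents themselves
show as a portable sample (the my_ names below are mine, just to avoid
colliding with the library's own rand and srand):

```c
/* A minimal linear-congruential random number generator, much like
   the portable sample shown in the C standard documents.  The my_
   names are invented here to avoid clashing with the library's
   rand/srand. */
static unsigned long next_seed = 1;

static int my_rand(void)
{
    next_seed = next_seed * 1103515245 + 12345;
    /* discard the low-order bits, which cycle with a short period */
    return (int)((unsigned)(next_seed / 65536) % 32768);
}

static void my_srand(unsigned seed)
{
    next_seed = seed;   /* same seed -> same sequence, handy for tests */
}
```

Reseeding with the same value replays the same sequence, which is
exactly what you want for reproducible test runs.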
* ATEXIT, which allows you to specify one or more functions that are
to be called when the program terminates normally. These may release
various resources, flush buffers to disc, etc.
Actually, this is a very useful construct, one that I think should
be available in any language on any operating system. The major
problem here is that you'd like the ATEXIT functions to be called
whenever the program terminates, whether normally or abnormally.
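A sketch of how ATEXIT is meant to be used (the handler names here are
invented for illustration):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical cleanup handlers -- in a real program these might
   flush buffers to disc or release various resources. */
static void flush_logs(void)
{
    fflush(NULL);           /* flush every open output stream */
}

static void release_resources(void)
{
    /* close files, release locks, etc. */
}

/* Register the handlers early on.  They run in REVERSE order of
   registration, and only on NORMAL termination (return from main
   or a call to exit) -- which is exactly the limitation noted
   above. */
static int register_handlers(void)
{
    if (atexit(release_resources) != 0) return -1;
    if (atexit(flush_logs) != 0) return -1; /* flush_logs runs first */
    return 0;
}
```

Note the reverse-order rule: the last handler registered is the first
one called, like unwinding a stack.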
* BSEARCH, which does a binary search of a sorted in-memory array. A
nice thing, especially since it often isn't provided by the
underlying operating system (for instance, there's no intrinsic to
do this on the HP3000). Note, however, that this is quite limited in
application, since you usually want to search files or databases,
not just simple arrays.
* QSORT, which sorts an in-memory array. Again, rather nice, but
limited because it only works on in-memory arrays and not on files
or databases.
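A sketch of how QSORT and BSEARCH work together -- both take the same
user-supplied comparison function, so one comparator serves for
sorting and for searching (sort_and_find is my own wrapper name):

```c
#include <stdlib.h>

/* Comparator shared by qsort and bsearch: negative, zero, or
   positive, like a three-way compare. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);   /* avoids the overflow of plain x - y */
}

/* Sort t[0..n-1] in place, then return a pointer to the element
   equal to key, or NULL if it isn't there. */
static int *sort_and_find(int *t, size_t n, int key)
{
    qsort(t, n, sizeof t[0], cmp_int);
    return bsearch(&key, t, n, sizeof t[0], cmp_int);
}
```

BSEARCH, of course, is only guaranteed to work on an array that is
already sorted by the same comparison function.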
* MEMCPY and MEMMOVE, which can copy arrays very fast (presumably
faster than a normal FOR loop). This is comparable with, but
different from, PASCAL/XL's MOVE_FAST, MOVE_L_TO_R, and MOVE_R_TO_L.
Note that this is somewhat different in spirit from the string
handling functions like STRCPY and STRNCPY -- this is intended for
arbitrary arrays, and doesn't care about, say, '\0' string
terminators.
* MEMCHR finds a character in an array; MEMSET sets all elements of an
array to a given character. Again, note the emphasis here on speed
(if the computer supports special fast search/fast set instructions
(like the HP3000 does), these functions ought to use them) and on
working with ARRAYS rather than STRINGS (neither function cares
about '\0' string terminators).
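A small sketch that makes the arrays-not-strings point concrete: the
MEM* functions work purely by byte count, so an embedded '\0' is just
another byte to them (the demo name is mine):

```c
#include <string.h>

/* Zero a buffer, copy raw bytes into it (including an embedded
   '\0'), then locate a byte.  None of these calls care about string
   terminators -- only about the counts you pass them. */
static const char *demo(char *buf, size_t buflen)
{
    memset(buf, 0, buflen);            /* fast fill with zero bytes */
    memcpy(buf, "AB\0CD", 5);          /* copies all 5 bytes, '\0' too */
    return memchr(buf, 'D', buflen);   /* finds 'D' PAST the '\0' */
}
```

STRCHR, by contrast, would have stopped looking at the embedded '\0'.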
* Built-in DATE and TIME handling functions:
- Return the current date and time.
- Convert the internal date/time representation to a structure
containing year, month, day, hour, minute, and second; convert
backwards, too.
- Compute the difference between two days and times.
- Convert an internal time into a text string of an arbitrary
user-specified format. You can, for instance, say,
strftime (s, 80, "%A, %d %B %Y; %I:%M:%S %p", &time);
and the string S (whose maximum length was given as 80) would
contain a representation of "time" as, for instance:
Friday, 29 February 1968; 04:50:33 PM
The third parameter to STRFTIME is a format string; "%A" stands
for the full weekday name, "%d" for the day of the month, "%B" for
the full month name, etc. As you can see, this is a non-trivial
feature, one that many operating systems (e.g. MPE) don't provide,
and one that you'd rather not have to implement yourself.
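Here is the example above fleshed out into a complete sketch. I fill
in a fixed struct tm by hand so the output is predictable (a real
program would start from the time and localtime functions instead);
note that %A simply reads the tm_wday field, so the weekday below is
whatever we put there:

```c
#include <time.h>

/* Format a fixed, known date so the result is predictable; the
   function name is invented for illustration. */
static size_t format_sample(char *s, size_t max)
{
    struct tm t = {0};
    t.tm_year = 68;    /* years since 1900 -> 1968 */
    t.tm_mon  = 1;     /* February -- months count from 0 */
    t.tm_mday = 29;
    t.tm_hour = 16;    /* 4 PM, in 24-hour form */
    t.tm_min  = 50;
    t.tm_sec  = 33;
    t.tm_wday = 5;     /* Friday -- %A reads this field directly */
    return strftime(s, max, "%A, %d %B %Y; %I:%M:%S %p", &t);
}
```

In the default "C" locale this produces the English names shown above;
strftime returns the number of characters it wrote (zero if the buffer
was too small).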
* Character handling functions, such as "isalpha" (is a character
alphabetic or not?), "isdigit" (is it a digit?), "toupper" (convert
a character to uppercase), etc.
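These are trivial to use; a small sketch (count_alpha is my own
example function, not a library one):

```c
#include <ctype.h>

/* Count the alphabetic characters in a string.  The ctype functions
   expect an unsigned char value (or EOF), hence the cast. */
static int count_alpha(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++)
        if (isalpha((unsigned char)*s))
            n++;
    return n;
}
```

isdigit, toupper, and the rest follow the same pattern, and on
machines with fancy character-test instructions the library is free
to use them.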
* If you care about these sorts of things, Draft Standard C provides
for "native language" support (called "localization" in the C
standard). This means that "isalpha", string comparisons, the
time-handling functions, etc. are defined to return whatever is
appropriate for the local language and character set, be it English,
Dutch, Czech, or Swahili (well, maybe not Swahili).
SUMMARY
If you have not yet guessed, I am by nature a loquacious man. For
every issue I've raised, I've spent pages providing examples, giving
arguments, discussing various points of view.
This was all intentional; rather than just presenting my own
opinions, I wanted to give as many of the facts as possible and let
you come to your own conclusions. However, this resulted in a paper
that was 200-odd pages long -- not, I would conjecture, the most
exciting and titillating 200 pages that were ever written.
In this section I want to present a summary of what I think the
various merits and demerits of SPL, PASCAL, and C are. All of the
things I mention are discussed in more detail elsewhere in the paper,
so if you want clarification or evidence, you'll be able to find it. I
hope, though, that these lists themselves might put all the various
arguments and alternatives in better perspective.
Remember, however, as you read this -- if this all sounds
opinionated and subjective, all the evidence is elsewhere in the
paper, if you want to read it!
THE TROUBLE WITH SPL
[This section includes all those things that make SPL hard to work
in. This isn't just "features that exist in other languages but not in
SPL" -- these are what might be considered drawbacks (serious or not),
things that you're likely to run into and regret while programming in
SPL.]
* SPL IS COMPLETELY NON-PORTABLE. There is no HP-supplied Native
Mode SPL on Spectrum, and certainly not on any other machine.
(Note: Software Research Northwest intends to have a Native Mode
SPL compiler released by MAY 1987.)
* SPL's I/O FACILITY FRANKLY STINKS. Outputting and inputting
either strings or numbers is a very difficult proposition -- I
think this is the major reason why more HP3000 programmers
haven't learned SPL.
* SPL HAS NO RECORD STRUCTURES. This is a severe problem, but not
fatal -- there are workarounds, though none of them is very
clean. See the "DATA STRUCTURES" section for more details.
THE TROUBLE WITH STANDARD PASCAL
* STANDARD PASCAL's PROCEDURE PARAMETER TYPE CHECKING IS MURDEROUS:
- YOU CAN'T WRITE A GENERIC STRING PACKAGE OR A GENERIC
MATRIX-HANDLING PACKAGE because the same procedure can't take
parameters of varying sizes! That's right -- you either have to
have all your strings be 256-byte arrays (or some such fixed
size), or have a different procedure for each size of string!
Try writing a general matrix multiplication routine; it's even
more fun.
- YOU CAN'T WRITE A GENERIC ROUTINE THAT HANDLES DIFFERENT TYPES
OF RECORD STRUCTURES OR ARRAYS FOR ARGUMENTS. Say you want to
write a procedure that does a DBPUT and aborts nicely if
you get an error; or does a DBGET; or does anything that might
cause it to want to take a parameter that's "AN ARRAY OR A
RECORD OF ANY TYPE". You can't do it! You must have a different
procedure for each type!
- YOU CAN'T WRITE A WRITELN-LIKE PROCEDURE THAT TAKES INTEGERS,
STRINGS, OR FLOATS (perhaps to format them all in some
interesting way).
* IN STANDARD PASCAL, YOUR PROGRAM AND ALL THE PROCEDURES IT CALLS
MUST BE IN THE SAME FILE! That's right -- if your program is
20,000 lines, it must all be in one file, and all of it must be
compiled together.
* STANDARD PASCAL HAS NO BUILT-IN STRING HANDLING FACILITIES, AND
NO MECHANISM FOR YOU TO IMPLEMENT THEM. Not only are simple
things like string comparison, copying, etc. (which are built
into SPL) missing; you can't write generic string handling
routines of your own unless all your strings have the same length
and occupy the same amount of space (see above)!
* STANDARD PASCAL's I/O FACILITY IS ABYSMAL.
- YOU CAN'T WRITE A STRING WITHOUT CAUSING THE OUTPUT DEVICE
TO GO TO A NEW LINE (i.e. you can't just "prompt the user for
input" and have the cursor remain on the same line).
- YOU CAN'T OPEN A FILE FOR APPEND OR INPUT/OUTPUT ACCESS.
- YOU CAN'T OPEN A FILE WITH A GIVEN NAME. So you want to prompt
the user for a filename and open that file? Tough cookies --
Standard PASCAL has no way of letting you do that.
- IF YOU PROMPT THE USER FOR NUMERIC INPUT, THERE'S NO WAY FOR
YOUR PROGRAM TO CHECK IF HE TYPED CORRECT DATA. Say you ask him
for a number and he types "FOO"; what happens? The program
aborts! It doesn't return an error condition to let you print
an error and recover gracefully -- it just aborts!
- SIMILARLY, IF YOU TRY TO OPEN A FILE AND IT DOESN'T EXIST (or
some such file system error occurs on any file system
operation), YOU DON'T GET AN ERROR CODE BACK -- YOU GET
ABORTED! What a loss!
- YOU CAN'T DO "DIRECT ACCESS" -- READ OR WRITE A PARTICULAR
RECORD GIVEN ITS RECORD NUMBER.
Think about it -- how can you write any kind of disc-based data
structure (like a KSAM- or IMAGE-like file) without some direct
access facility? Even FORTRAN IV has it!
* OTHER, LESS PAINFUL, BUT STILL SIGNIFICANT LIMITATIONS INCLUDE:
(These are things which you can certainly live without, unlike
some of the problems above, which can be extremely grave.
However, although you can live without them, they are still
desirable, and in Standard PASCAL -- partly because of its
restrictive type checking -- you CAN'T emulate them with any
degree of ease. Their lack, incidentally, is felt particularly in
writing large system programming applications.)
- STANDARD PASCAL DOESN'T ALLOW YOU TO DYNAMICALLY ALLOCATE A
STRING OF A GIVEN SIZE (where the size is not known at
compile-time). PASCAL talks much about its NEW and DISPOSE
functions, which dynamically allocate space at run-time.
These functions are certainly very useful, and are in fact
essential to many systems programming applications. However,
say you want to allocate an array of X elements, where X is not
known at compile-time -- YOU CAN'T! You can allocate an array
of, say, 1024 elements, forbidding X to be greater than 1024
and wasting space if X is less than 1024; you CAN'T simply say
"give me X bytes (or words) of memory".
- STANDARD PASCAL HAS NO REASONABLE MECHANISM FOR DIRECTLY
EXITING OUT OF SEVERAL LAYERS OF PROCEDURE CALLS. Say that your
lowest-level parsing routine detects a syntax error in the
user's input and wants to return control directly to the
command input loop (the larger and more complicated your
application, the more common it is that you want to do
something like this).
"Un-structured" as this may seem, it can be quite essential
(see the "CONTROL STRUCTURES -- LONG JUMPS" chapter), and
Standard PASCAL provides only very shabby facilities for doing
this.
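For contrast, C's library (via <setjmp.h>, standardized in the Draft
Standard) does provide exactly this escape. A sketch, with invented
names -- the low-level parsing routine bails straight back to the
command loop, however many calls deep it is:

```c
#include <setjmp.h>

/* The command loop records a "jump target"; any routine below it
   can unwind straight back there.  All names here are invented
   for illustration. */
static jmp_buf command_loop;

static void parse_term(const char *input)
{
    /* ... deep inside the parser; on a syntax error: */
    if (input[0] != '(')
        longjmp(command_loop, 1);   /* unwind every frame at once */
}

static int run_command(const char *input)
{
    if (setjmp(command_loop) != 0)
        return -1;                  /* longjmp lands back here */
    parse_term(input);              /* may be many levels deep */
    return 0;                       /* parsed successfully */
}
```

The one rule to respect is that longjmp must only target a setjmp in
a routine that is still active on the call stack.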
- STANDARD PASCAL HAS NO WAY OF HAVING VARIABLES THAT POINT TO
FUNCTIONS AND PROCEDURES. Strange as this may seem, variables
that point to procedures/functions can be VERY useful -- see
the chapter on "PROCEDURE AND FUNCTION VARIABLES" for full
details. Interestingly, even Standard PASCAL recognizes their
utility by implementing PARAMETERS that point to procedures and
functions, but it doesn't go all the way and let arbitrary
VARIABLES do it.
If you respond that many PASCALs fix many of these drawbacks, I'll
agree -- BUT WHAT HAPPENS TO PORTABILITY? If you use PASCAL/3000's
string handling package (a pretty nice one, too), how are you going to
port your program to, say, a PC implementation that has a different
string handling package? What's more, some implementations -- like
PASCAL/3000 itself -- don't solve many of the most important problems
listed above!
THE TROUBLE WITH KERNIGHAN & RITCHIE C
* WHERE STANDARD PASCAL's PROCEDURE PARAMETER TYPE CHECKING IS TOO
RESTRICTIVE, K&R C's IS NON-EXISTENT! If you write a procedure
p (x1, x2, x3)
int x1;
char *x2;
int *x3;
and call it by saying
p (13.0, i+j, 77, "foo")
the compiler won't utter a peep. Not only won't it automatically
convert 13.0 to an integer -- it won't print an error message
about that, OR that "I+J" is not a character array, OR that 77 is
not an integer pointer (which probably means that P expects X3 to
be a by-reference parameter and you passed a by-value parameter),
OR EVEN THAT YOU PASSED THE WRONG NUMBER OF PARAMETERS!
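For contrast, this is what the Draft Standard's "function prototypes"
buy you: declare the parameter types once, and the compiler checks
(and converts) every call. A sketch, with a trivial made-up body for
P so the example is complete:

```c
/* A Draft-Standard-C prototype: the parameter types are part of
   the declaration, so any call that disagrees with it -- wrong
   types, wrong count -- is rejected at compile time. */
int p(int x1, char *x2, int *x3);

static int demo(void)
{
    int i = 5;
    char name[] = "foo";
    return p(13, name, &i);   /* a correct, checked call */
}

/* A trivial made-up definition, just so the sketch links. */
int p(int x1, char *x2, int *x3)
{
    return x1 + (int)*x2 + *x3;
}
```

Under K&R rules, by contrast, the declaration `int p();` tells the
compiler nothing about the parameters, which is why the bad call above
sails through.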
* WHILE NOT AS BAD AS PASCAL's, K&R C's I/O FACILITIES STILL HAVE
SOME MAJOR LACKS. Most serious are:
- NO DIRECT ACCESS (read record #X).
- NO INPUT/OUTPUT ACCESS.
* K&R C, THOUGH FAIRLY STANDARD, HAS NO STANDARD STRING PACKAGE.
Unlike in Standard PASCAL, though, it's fairly easy to write,
since you CAN write a C procedure that takes, say, a string of
arbitrary length.
* C IS UGLY AS SIN. At least that's what some PASCAL programmers
say; C programmers obviously disagree. People complain that C is
just plain UGLY and thus (subjectively) difficult to read; they
talk about everything from the "{" and "}" that C uses instead of
"BEGIN" and "END" to C's somewhat arcane operators, like "+=" and
"--". I'm not saying that this is either TRUE or FALSE;
unfortunately, it's much too subjective to discuss in this paper.
However, don't be surprised if you decide that on all the merits,
C is superior but you can't stand writing with all these funny
special characters; or, that PASCAL is the best, but it's much
too verbose for you!
IS ISO LEVEL 1 STANDARD PASCAL ANY BETTER THAN THE ANSI STANDARD?
* THE ONLY DIFFERENCE BETWEEN ISO LEVEL 1 STANDARD PASCAL AND THE
ANSI STANDARD (what I call simply "Standard PASCAL") IS THAT IT
ALLOWS YOU TO WRITE PROCEDURES THAT TAKE ARRAYS OF VARIABLE
SIZES. This eliminates one of the worst problems in Standard
PASCAL -- that you can't write a generic string handling package,
or a matrix multiplication routine, etc.; however, all the other
problems (lack of separate compilation, bad I/O, etc.) still
remain.
* NOTE THAT IT'S NOT CLEAR HOW MANY NEW PASCAL COMPILERS WILL
FOLLOW THE ISO LEVEL 1 STANDARD. The ISO Standard document makes
it clear that implementing this feature is optional (without
them, a compiler will conform only to the "ISO Level 0
Standard"); PASCAL/3000 doesn't implement it, but PASCAL/XL does.
IS PASCAL/3000 ANY BETTER THAN ANSI STANDARD PASCAL?
* PASCAL/3000 supports:
- A PRETTY GOOD STRING PACKAGE.
- IMPROVED, though still somewhat difficult, I/O.
- THE ABILITY TO COMPILE A PROGRAM IN SEVERAL PIECES.
- THE ABILITY TO WRITE A PROCEDURE OR FUNCTION THAT TAKES A
STRING (BUT NOT ANY OTHER KIND OF ARRAY) OF VARIABLE SIZE.
* The remaining problems still include:
- PARAMETER TYPE CHECKING STILL WAY TOO TIGHT. You still can't
write a procedure that takes an integer array of arbitrary size
or an arbitrary array/record (say, to do DBGETs or DBPUTs
with); you still can't write, say, a matrix multiplication
routine (just as an example).
- I/O STILL HAS PROBLEMS:
* IT'S STILL VERY DIFFICULT (not impossible, but still very
painful) TO TRAP ERROR CONDITIONS, SUCH AS FILE SYSTEM ERRORS
OR INCORRECT NUMERIC INPUT.
* YOU CAN'T OPEN A FILE USING "FOPEN" AND THEN USE THE PASCAL
I/O SYSTEM WITH IT. This means that any time you need to
specify a feature that PASCAL's OPEN doesn't have (such as
"open temporary file", "build a new file with given
parameters", "open a file on the line printer", etc.), you
can't just call FOPEN and then use PASCAL's I/O facilities.
You either have to do all FOPEN/FWRITE/FCLOSEs, or you have
to issue a :FILE equation, which is difficult and still
doesn't give you all the features you want.
- THE "LESS IMPORTANT BUT STILL SUBSTANTIAL" LIMITATIONS STILL
EXIST -- it's hard to allocate variable-size strings, you can't
immediately exit several levels of nesting, and you can't have
variables that point to functions or procedures.
* REMEMBER -- YOU CAN'T COUNT ON "A PARTICULAR IMPLEMENTATION" TO
SAVE YOU HERE! If you could live with Standard PASCAL's
restrictions by knowing that, say, string handling or a good I/O
facility would surely be implemented by any particular
implementation, remember: PASCAL/3000 is a particular
implementation! If you run into a restriction with PASCAL/3000,
that's it; you either have to work around it or use a different
language.
IS PASCAL/XL ANY BETTER THAN THE ANSI STANDARD?
Surprisingly, yes. ALL OF THE MAJOR PROBLEMS I POINTED OUT IN
STANDARD PASCAL SEEM TO HAVE BEEN FIXED IN PASCAL/XL! The only words
of caution are:
* IT MAY BE GREAT, BUT IT'S NOT PORTABLE -- NOT EVEN TO
PRE-SPECTRUMS! HP still hasn't announced when (if ever) it'll
implement all of PASCAL/XL's great features on the pre-Spectrum
machines. As long as it doesn't, you'll have to either avoid
using all of PASCAL/XL's wonderful improvements, or be stuck with
code that won't run on pre-Spectrum 3000's!
* BE SKEPTICAL. "New implementations" always look great, precisely
because we haven't had the chance to really use them. For all we
know, the compiler may be riddled with bugs, or it might be
excruciatingly slow in compiling your program, or it might
generate awfully slow code! Even more likely, there may be
serious design flaws that make programming difficult -- it's just
that we won't notice them until we've programmed in it for
several months! As I said, BE SKEPTICAL.
IS DRAFT ANSI STANDARD C BETTER THAN KERNIGHAN & RITCHIE C?
Again, it seems it might be! It's standardized the I/O and string
handling facilities (and they're pretty good ones at that), AND it's
implemented some nice-looking parameter checking. Still, beware:
* BEING A "DRAFT STANDARD", IT MIGHT BE YEARS (OR DECADES) BEFORE
ALL OR MOST C COMPILERS HAVE ALL OF ITS FEATURES. Note, however,
that most modern C compilers already include some of the Draft
Standard's new features, except for the strengthened parameter
checking, which is still relatively rare.
* IF YOU THOUGHT KERNIGHAN & RITCHIE C WAS UGLY, YOU'LL STILL THINK
THIS ABOUT DRAFT STANDARD C. I don't want to imply that K&R C IS
ugly -- it's just that many old SPL, PASCAL, and ALGOL
programmers think so. It may not be objectively demonstrable, or
even objectively discussible; however, that's the reaction I've
seen in some (more than a few!) people. All I can say is this --
if you suffer from it, the Draft Standard still won't help you.
NICE FEATURES THAT SOME LANGUAGES DON'T HAVE AND OTHERS DO
The "PROBLEMS WITH" sections discussed things that could make
programming in SPL, PASCAL, or C a miserable experience. They emphasized
some things that were show-stoppers and others that simply grated on
the nerves; one thing they conspicuously EXCLUDED was the good features
that you could live without, but would rather have. The following is a
summary of all these, plus some of the things we've already mentioned
above.
[Legend: "STD PAS" = Standard PASCAL or ISO Level 1 Standard;
"STD C" = Draft Proposed ANSI Standard;
"YES" = good implementation of this feature;
"YES+" = excellent or particularly nice implementation;
"YES-" = OK, so they've got it, but it's rather ugly;
"NO" = no;
"HNO" = Hell, no!;
"---" = Major loss! No support of REALLY IMPORTANT feature]
                                    STD   PAS/  PAS/  K&R   STD
                                    PAS   3000  XL    C     C     SPL
RECORD STRUCTURES                   YES   YES   YES   YES   YES   NO
STRINGS                             ---   YES+  YES+  YES-  YES+  YES
ENUMERATED DATA TYPES               YES   YES   YES   NO    YES-  NO
  (see "DATA STRUCTURES")
SUBRANGE TYPES                      YES   YES   YES   NO    NO    NO
  (see "DATA STRUCTURES"; may not
  be all that useful)
OPTIONAL PARAMETER/VARIABLE NUMBER  NO    NO    YES+  YES-  YES   YES
  OF PARAMETERS SUPPORT
  (like SPL "OPTION VARIABLE")
NUMERIC FORMATTING/INPUT            YES-  YES-  YES   YES+  YES+  YES-
FILE I/O                            YES-  YES   YES+  YES-  YES   YES
  (see "FILE I/O" chapter for more)
BIT ACCESS                          NO    YES   YES   YES   YES   YES+
  (see "OPERATORS")
POINTER SUPPORT                     NO    NO    YES   YES   YES   YES
THE ABILITY TO WRITE PROCEDURE-LIKE NO    NO    YES   YES   YES   NO
  CONSTRUCTS THAT ARE COMPILED
  "IN-LINE", FOR MAXIMUM EFFICIENCY
  PLUS MAXIMUM MAINTAINABILITY
LOW-LEVEL ACCESS                    HNO   HNO   HNO   NO    NO    YES
  (ASSEMBLEs, TOS, registers --
  often useless, sometimes vital!)
REALLY NICE FEATURES TO PAY ATTENTION TO
Just some interesting things, mostly implemented in only one of the
three languages. I just wanted to draw your attention to them, because
they can be quite nice:
* PASCAL/XL'S TRY/RECOVER CONSTRUCT. A really nifty contraption --
see the "CONTROL STRUCTURES" chapter for more info.
* C's "FOR" LOOP. You might think it's ugly, but it's quite a bit
more powerful -- in some very useful ways -- than SPL's or
PASCAL's looping constructs.
* C's "#define" MACRO FACILITY. I wish that PASCAL and SPL had it
too; it lets you do procedure-like things without the overhead of
a procedure call AND without the maintainability problems of
writing the code in-line. ALSO, it lets you add interesting new
constructs to the language (like defining your own looping
constructs, etc.).
* SPL's LOW-LEVEL SYSTEM ACCESS. Although you'd rather not have to
worry about registers, TOSs, ASSEMBLEs, etc., sometimes you need
to be able to manipulate them -- SPL lets you do it.