WHAT TO DO WHEN YOUR PROGRAM ABORTS
by Eugene Volokh, VESOFT
1135 S. Beverly Dr.
Los Angeles, CA 90035 USA
(213) 282-0420
INTRODUCTION
Your program aborts.
ABORT :FOOBAR.TEST.PROD.%1.%7331
PROGRAM ERROR #24 :BOUNDS VIOLATION
What happened? Where did the abort take place? Your program is three
thousand lines long -- which one of them has the bug? You might put in
DISPLAY statements to figure out exactly where the abort happened, but
that could take a whole lot of DISPLAYs, and requires at least one
recompilation. (Sometimes, though not often, the very act of putting
in a debugging display might make the problem go away -- it will then
reappear as soon as you take the debugging display out!)
How do you figure out where your program is aborting WITHOUT
putting in debugging displays, WITHOUT recompiling, and WITHOUT having
to learn DEBUG, breakpoints, Q registers, indirect addressing, and all
that nonsense?
WHAT THIS PAPER WILL TELL YOU
Debugging a program can take up to an order of magnitude more
time than actually writing it. As any programmer can tell you, it is
often a harrowingly frustrating experience. You'd like to have lots of
tools (such as symbolic debuggers or program analyzers) that will help
you debug your programs, but unfortunately few such tools are
available on the HP 3000. In particular, standard DEBUG does NOT allow
you to look at and modify variables symbolically (by name), easily put
breakpoints at given procedures or statements, do symbolic stack
tracebacks, or anything like that.
This paper will NOT tell you how to use DEBUG to watch what your
program is doing or examine the state of your variables or anything
like that. Experience has shown that DEBUG is often too complicated
even for highly skilled programmers -- not because it somehow requires
a lot of smarts to understand, but just because it is so cumbersome
(to look at your variables, you have to know their machine addresses;
to set a breakpoint at a given statement, you have to know both the
starting address of a procedure and the code address of the statement
inside the procedure).
What this paper will explain is how to solve a specific problem:
* WHEN A PROGRAM ABORTS (with a bounds violation, stack overflow,
integer overflow, QUIT, library procedure error, or whatever),
HOW CAN ONE FIGURE OUT WHERE -- in what procedure, and perhaps at
what statement -- THE ABORT OCCURRED?
This turns out to be a not so very complicated problem after all, and
once you master a few key concepts, you'll be able to unerringly
pinpoint the locations of your aborts with very little difficulty.
THE FPMAP CAN BE YOUR FRIEND
Let's look at that abort message again:
ABORT :FOOBAR.TEST.PROD.%1.%7331
PROGRAM ERROR #24 :BOUNDS VIOLATION
It includes, of course, the name of the program and the type of error
that occurred (BOUNDS VIOLATION). It also includes two numbers -- %1
and %7331 -- which are the keys to solving our problem.
The two numbers are simply the segment (%1) and location (%7331) at
which the error happened. They are, however, quite unlikely to be very
informative to you. What you really want is the NAME of the procedure
in which this location (%1.%7331) resides (actually, what you'd really
like is the actual source code line number, but that's much harder to
get).
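To make the two numbers concrete, here is a minimal Python sketch (mine, not part of MPE or MPEX) that pulls the program name, segment and location out of the first line of the abort message. It assumes the message has exactly the shape quoted above; the name parse_abort is made up for illustration.

```python
import re

# First line of the abort message as quoted in this paper:
# "ABORT :<program file>.%<segment>.%<location>", both numbers in octal.
ABORT_RE = re.compile(
    r"ABORT\s*:(?P<program>.+?)\.%(?P<seg>[0-7]+)\.%(?P<loc>[0-7]+)\s*$"
)

def parse_abort(line):
    """Return (program, segment, location) from an abort line, or None."""
    m = ABORT_RE.search(line.strip())
    if m is None:
        return None
    # The "%" prefix marks octal, so convert with base 8.
    return m.group("program"), int(m.group("seg"), 8), int(m.group("loc"), 8)

program, seg, loc = parse_abort("ABORT :FOOBAR.TEST.PROD.%1.%7331")
print(program, "%%%o.%%%o" % (seg, loc))
```

The second line of the message (PROGRAM ERROR #24 ...) does not match the pattern, so parse_abort returns None for it.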
Once upon a time (about T-MIT, if my memory serves me right), a
certain new keyword was added to the MPE :PREP command. This keyword
is called ;FPMAP, and what it really means is that the program file
created by the :PREP command will have inside it -- stored in a
specially-formatted table -- the NAMES and STARTING ADDRESSES of all
the procedures in the file. Therefore, if your procedure MUNGARRAY is
stored in location %7105 through %7472 of segment %1 in your FOOBAR
program, the program file will contain this information.
The purpose of this ;FPMAP parameter was, in fact, ease of
debugging -- given a segment number and segment-relative address of an
instruction in a program file, you could now get the name of the
procedure which contains this instruction (assuming the program file
was :PREPped with ;FPMAP). Of course, it would make sense for the MPE
code that prints the BOUNDS VIOLATION abort message to look at the
FPMAP and automatically decode the segment and location into something
sensible; but, unfortunately, this was not to be. However, if you want
to, you can do this decoding yourself.
Remember, the first thing we must do is :PREP the program with
;FPMAP:
:PREP FOOUSL,FOOBAR;FPMAP;RL=MYRL;MAXDATA=27000
Now, we run the program and get the abort message, which tells us that
the abort occurred at segment %1, location %7331.
If we look at our MPE System Intrinsics Manual (provided, of
course, that we have a recent enough edition), we find a procedure
called FINDPMAPADDR:
              IV    IV   IV   IA      IV    IR
FINDPMAPADDR (FNUM, SEG, LOC, RECORD, SIZE, STATUS);
FNUM   = the file number of the program file (which must have
         been :PREPped with ;FPMAP and FOPENed with MR NOBUF)
SEG    = the segment number of the instruction to find
LOC    = the instruction's segment-relative location
RECORD = the specially-formatted FPMAP record that FINDPMAPADDR
         returns -- it describes the segment and procedure that
         contain the location given by SEG and LOC
SIZE   = the amount of room you've allocated in your stack for
         RECORD; should be 36 words
STATUS = the value returned by FINDPMAPADDR that indicates how
         things went; 0 if all OK, non-zero in case some kind of
         error occurred
Thus, if you FOPEN your program file MR NOBUF and then call
FINDPMAPADDR, you'll be able to translate the segment and location
into the appropriate segment name and procedure name.
You can easily write a program that uses FINDPMAPADDR to decode a
segment and location; I already have -- MPEX has a command called
PROGINST:
%PROGINST FOOBAR, %1.%7331
(this command may, of course, be executed either from MPEX
directly or from EDITOR, QUERY, etc. using the MPEX hook)
Segment UTILSEG, procedure MUNGARRAY, offset from start of code %224
Now we know that the bounds violation occurred in the MUNGARRAY
procedure. Of course, we'd really like to know the location in more
detail (for instance, down to the line number in the procedure), but
this should give us a pretty good idea of what's happening.
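The lookup itself is simple once you have a table of procedure starting addresses. The Python sketch below illustrates the idea only. It is not how FINDPMAPADDR or PROGINST is implemented, and it does not read a real FPMAP. The table layout and the name find_procedure are my assumptions; MUNGARRAY's address range is the example given earlier, and EXAMPLEPROC is invented.

```python
from bisect import bisect_right

# Hypothetical stand-in for an FPMAP: segment number -> list of
# (start, end, procedure name), sorted by start. Addresses are octal.
# MUNGARRAY's range (%7105 through %7472 of segment %1) is the paper's
# example; EXAMPLEPROC is invented purely for illustration.
FPMAP = {
    1: [
        (0o6000, 0o7104, "EXAMPLEPROC"),
        (0o7105, 0o7472, "MUNGARRAY"),
    ],
}

def find_procedure(seg, loc):
    """Return (procedure name, offset from procedure start), or None."""
    procs = FPMAP.get(seg, [])
    starts = [start for start, _end, _name in procs]
    i = bisect_right(starts, loc) - 1  # last procedure starting at or before loc
    if i < 0:
        return None
    start, end, name = procs[i]
    if loc > end:
        return None  # loc falls past the end of that procedure
    return name, loc - start

name, offset = find_procedure(1, 0o7331)
print("procedure %s, offset %%%o" % (name, offset))
```

With these numbers the offset works out to %7331 - %7105 = %224, the same offset that PROGINST reported.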
TRACING BACK THE ENTIRE CALLING SEQUENCE
So far, we've taken the segment and location number that are
printed in the abort message and decoded them into a segment name and
a procedure name.
Let's say, though, that MUNGARRAY is a very commonly used utility
procedure. It might have been called from any one of a dozen other
procedures, which in turn might have been called from one of a number
of places. What we'd really like to do is to determine not just the
name of the procedure in which the abort occurred, but also the
procedure that called it, the procedure that called the procedure that
called it, and so on, all the way up to the main body of the program.
Now, although the abort message normally only displays the segment
and location at which the abort actually took place, there is a way to
make aborts trace back the entire calling sequence of the program.
This is done using the little-known :SETDUMP command:
:SETDUMP
That's all there is to it -- no options, no parameters. (There are
some parameters that you can specify if you'd like; however, they
display information that can only be interpreted using the variable
map and DEBUG.) Now watch what happens when we run FOOBAR and get it
to abort:
:SETDUMP
:RUN FOOBAR.TEST.PROD
ABORT :FOOBAR.TEST.PROD.%1.%7331
PROGRAM ERROR #24 :BOUNDS VIOLATION
*** ABORT STACK ANALYSIS ***
S=005522 DL=176650 Z=012262
Q=005526 P=007331 LCST= 001 STAT=U,1,1,R,0,0,CCE X=176652
Q=005515 P=004672 LCST= 003 STAT=U,1,1,L,0,0,CCG X=176652
Q=005012 P=004366 LCST= 000 STAT=U,1,1,L,0,1,CCG X=176652
Q=004010 P=002640 LCST= 001 STAT=U,1,1,L,0,1,CCG X=176706
Q=003517 P=001670 LCST= 001 STAT=U,1,1,L,0,0,CCL X=176662
Q=003265 P=000307 LCST= 002 STAT=U,1,1,L,0,0,CCL X=000001
Q=003222 P=000004 LCST= 002 STAT=U,1,1,L,0,0,CCG X=000000
Q=003210 P=177777 LCST= S024 STAT=P,1,0,L,0,0,CCG X=000000
*DEBUG* 1.7332
?E@
PROGRAM TERMINATED IN AN ERROR STATE. (CIERR 976)
Immediately after the abort message, MPE prints an "ABORT STACK
ANALYSIS", which indicates (in its own rather cryptic fashion) the
state of all the "ancestors" of the currently-executing procedure.
Each of the lines that start with "Q=" represents one calling
procedure.
The important numbers in each line are the LCST value (which stands
for Logical Code Segment Table index), which is really a segment
number, and the P value, which is a segment-relative address (both
values are in octal). As you see, the first line is LCST=001 and
P=007331, which is of course our %1.%7331. The next line -- which is
the procedure that called the procedure that caused the abort -- is
LCST=003, P=004672 (%3.%4672); the line after that (the procedure that
called the %3.%4672 procedure) is %0.%4366, and so on. The bottom-most
line, you'll find, has LCST=S024, which means "System SL segment %24".
This is the address of the system procedure that first called your
program's outer block and thus began the execution of your program.
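If you'd rather not pick the numbers out by eye, the extraction is easy to automate. The following Python sketch assumes the stack analysis lines look exactly like the ones shown above; the name parse_frames is made up, and the sample is a shortened copy of the dump from this paper. It skips any line whose LCST value starts with "S", since those are system SL segments and not part of your program file.

```python
import re

# One "Q=" line of the ABORT STACK ANALYSIS, in the layout shown in this paper.
FRAME_RE = re.compile(
    r"Q=[0-7]+\s+P=(?P<p>[0-7]+)\s+LCST=\s*(?P<lcst>S?[0-7]+)"
)

def parse_frames(dump):
    """Return [(segment, location), ...] for the program's own frames."""
    frames = []
    for line in dump.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue  # not a "Q=" line (for example the S= DL= Z= line)
        lcst = m.group("lcst")
        if lcst.startswith("S"):
            continue  # system SL segment, not in the program file
        frames.append((int(lcst, 8), int(m.group("p"), 8)))
    return frames

DUMP = """\
S=005522 DL=176650 Z=012262
Q=005526 P=007331 LCST= 001 STAT=U,1,1,R,0,0,CCE X=176652
Q=005515 P=004672 LCST= 003 STAT=U,1,1,L,0,0,CCG X=176652
Q=003210 P=177777 LCST= S024 STAT=P,1,0,L,0,0,CCG X=000000
"""
for seg, loc in parse_frames(DUMP):
    print("%%%o.%%%o" % (seg, loc))
```

Each pair that comes out is in the segment.location form that PROGINST expects.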
Now that we have the raw data, we can use MPEX's %PROGINST (which
in turn uses the FINDPMAPADDR intrinsic) to interpret it:
%PROGINST FOOBAR.TEST.PROD, %1.%7331
Segment UTILSEG, procedure MUNGARRAY, offset from start of code %224
%PROGINST FOOBAR.TEST.PROD, %3.%4672
Segment PROCESSDATA, procedure SEARCHFILE, offset from start of code %32
%PROGINST FOOBAR.TEST.PROD, %1.%2640
Segment UTILSEG, procedure NEXTPARM, offset from start of code %251
%PROGINST FOOBAR.TEST.PROD, %1.%1670
Segment UTILSEG, procedure PARSECOM, offset from start of code %1023
%PROGINST FOOBAR.TEST.PROD, %2.%307
Segment MAINSEG, procedure MAINLOOP, offset from start of code %102
%PROGINST FOOBAR.TEST.PROD, %2.%4
Segment MAINSEG, procedure OB', offset from start of code %4
As you see, :SETDUMP is a rather powerful tool for this sort of
thing. I always do a :SETDUMP in my OPTION LOGON UDC -- when
everything goes well, the :SETDUMP doesn't interfere, but when a
program aborts I automatically get the ABORT STACK ANALYSIS, which I
want. Watch out for one thing, though: after the procedure call
traceback is printed, you may be (if you have the right capabilities)
dropped into DEBUG. This is useful if you know DEBUG and want to use
it (perhaps to display the parameters of one of the called
procedures); if you don't want to do anything in DEBUG, just type
?E@
and the program will terminate as usual. Do NOT just type "?E", since
that will resume executing the program (despite the abort), and will
most probably result in another abort shortly afterwards.
FINDING OUT AT WHICH STATEMENT IN THE PROCEDURE AN ABORT OCCURRED
So far, we've managed to figure out the name of the procedure in
which the abort occurred and the names of all its ancestors (which
might actually be more useful for finding out exactly where the bug
is). However, a procedure can be many lines long -- how can we find
out exactly where in the procedure the abort happened?
The answer is: "with great difficulty". Finding out the procedure
names was simple since MPE keeps track of which procedure starts
where, and even makes it easy for us to get this information (using
the FINDPMAPADDR procedure). Unfortunately, MPE does NOT keep track of
which location each statement starts at.
If you're willing to keep all your source code listings (either on
paper or online), you can refer to them to find out exactly where each
statement in a procedure starts. SPL outputs this information as a
matter of course:
 3  00000 1  INTEGER PROCEDURE MIN (I, J);
 4  00000 1  VALUE I, J;
 5  00000 1  INTEGER I, J;
 6  00000 1  BEGIN
 7  00000 2    IF I<J THEN
 8  00003 2      MIN:=I
 9  00003 2    ELSE
10  00006 2      MIN:=J;
11  00010 2  END;
The number in the second column is the starting address of this line's
code, relative to the start of the procedure. Thus, the code for line
7 (IF I<J THEN) starts at location 0 in the procedure; the code for
line 8 (MIN:=I) starts at location 3, the code for line 10 (MIN:=J)
starts at location 6, and the code for the END statement (procedure
return) starts at location 10 (all numbers are in octal).
If our abort message indicates that, say, a bounds violation
occurred at location 5 in procedure MIN (not very likely considering
the code involved), we'd know that the error is in the statement
"MIN:=I".
Similar outputs are provided by FORTRAN's $CONTROL LOCATION control
card, COBOL's $CONTROL VERBS, and PASCAL's $CODE_OFFSETS ON$.
Well, this is all well and good -- IF you keep all your source code
listings! Remember, you need to either print out the entire listing
every time you recompile the program (a listing that's even a little
bit out of date is likely to have incompatible statement starting
addresses) or keep the entire listing online for every program.
You may consider one of the above alternatives worthwhile (keeping
the entire listing online seems preferable); if you do, then all you
need to do to exactly locate an abort location is:
* Use PROGINST to convert the segment number and location into a
procedure name and a procedure-relative address.
* Find the procedure in your source listing.
* Find the line in your source listing whose code offset is LESS
than or equal to the procedure-relative address returned by
PROGINST and whose following line has a code offset GREATER than
PROGINST's procedure-relative address.
* The instruction at which the abort occurred is contained in the
line you've just found.
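The search through the listing can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the (line number, code offset) pairs are copied by hand from the SPL listing of MIN shown earlier, and the name candidate_lines is made up. Where several consecutive lines share one code offset, the listing alone cannot tell them apart, so the sketch returns all of them.

```python
# (line number, code offset) pairs copied from the SPL listing of MIN in
# this paper; offsets are octal and relative to the start of the procedure.
LISTING = [
    (3, 0o0), (4, 0o0), (5, 0o0), (6, 0o0), (7, 0o0),
    (8, 0o3), (9, 0o3), (10, 0o6), (11, 0o10),
]

def candidate_lines(listing, address):
    """Return the source lines that may contain the given address."""
    offsets = [off for _line, off in listing if off <= address]
    if not offsets:
        return []
    best = max(offsets)  # largest code offset not exceeding the address
    # Every line carrying that offset is a candidate.
    return [line for line, off in listing if off == best]

print(candidate_lines(LISTING, 0o5))
```

For address 5 this yields lines 8 and 9 (MIN:=I and ELSE), which agrees with the MIN:=I example above. Note that the sketch has no idea where the procedure ends, so an address beyond the last statement is simply attributed to the last line.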
I personally do not keep all my source listings, either online or
on paper. One reason is disc space; another is that all my procedures
are quite small in any case, a programming style that I find has many
advantages. When the location of the abort in a procedure is not
immediately obvious, I use the following trick:
* I use a decompiler to disassemble the code around the abort
location. Although this requires some knowledge of HP assembly
code, it's not as hard as it seems -- often, there are some
procedure calls (for which a good disassembler will display the
names of the procedures being called) right near the abort
location; by referring back to the source code, we can find the
lines on which these procedures are being called and thus, by
inference, find the approximate location of the abort.
Similarly, if I see an ADD 15 instruction right next to the abort
location and there's an "I+15" in my procedure source code, I can
assume that the source line in which the abort occurred is right
next to the "I+15".