BASIC file formats

From Ninerpedia
Revision as of 17:53, 29 March 2015 by Mizapf (talk | contribs) (→‎Header)

Overview

BASIC programs written for TI BASIC and Extended BASIC are not stored as plain text in memory. This differs from assembler programs, which are edited as text files and then assembled into a Tagged Object Code file.

Plain text would not be appropriate for BASIC. If the program were stored as plain text, the BASIC interpreter would have to parse each line at run time, identifying the commands and their arguments, before it could execute the line. This is typical for today's scripting languages, but it would be far too slow here, and we know well that TI BASIC and Extended BASIC are already quite slow compared with other platforms.

BASIC lines are tokenized. For each command or special character or character sequence that has a meaning in BASIC there is a one-byte code, the token. Example:

Command                Token (hex)
NEW                    00
SAVE                   07
EDIT                   09
PRINT                  9c
&                      b8
"..." (quoted string)  c7
SEG$                   d8
VALIDATE               fe

You can find a complete table here.

So let us take a simple BASIC line like

PRINT "HELLO"

There will not be a string like "PRINT" in memory, because the parser recognized this word as a command and replaced it with its token. Second, there is a string following the command, which is enclosed in quotes. The contents can be anything, so the parser must copy it into memory as is.

Finally, the line is converted to the following byte sequence:

09           9c     c7     05             48  45  4c  4c  4f  00
line length  PRINT  "..."  string length  H   E   L   L   O   end
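
To make the encoding concrete, here is a minimal Python sketch that rebuilds the byte sequence above. The helper name encode_print_line is made up for this illustration, and it covers only the single statement form PRINT "..."; it is not meant to describe how the TI parser itself is implemented.

```python
# Token values taken from the token table above.
TOKEN_PRINT = 0x9C
TOKEN_QUOTED_STRING = 0xC7


def encode_print_line(text: str) -> bytes:
    """Tokenize the statement PRINT "<text>" (hypothetical helper)."""
    chars = text.encode("ascii")
    # Token for PRINT, token for a quoted string, string length, characters,
    # and the 00 byte that ends the line.
    body = bytes([TOKEN_PRINT, TOKEN_QUOTED_STRING, len(chars)]) + chars + b"\x00"
    # The leading length byte counts everything that follows it.
    return bytes([len(body)]) + body


encoded = encode_print_line("HELLO")
print(encoded.hex(" "))
```

The printed bytes can be compared directly with the sequence shown above.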

Sample program

Let's have a look at a real Extended BASIC program. This is the output of TIImageTool, showing the contents of a PROGRAM file.

000000: 00 3f 37 a7 37 98 37 d7 00 28 37 a9 00 1e 37 ac     .?7.7.7..(7...7.
000010: 00 14 37 b2 00 0a 37 ca 02 8b 00 05 96 52 4f 57     ..7...7......ROW
000020: 00 17 a2 f0 b7 52 4f 57 b3 c8 01 31 b6 b5 c7 04     .....ROW...1....
000030: 54 45 53 54 b4 52 4f 57 00 0e 8c 52 4f 57 be c8     TEST.ROW...ROW..
000040: 01 31 b1 c8 02 32 30 00                             .1...20.

The numbers on the left (xxxxxx:) are offsets from the beginning of the file. On the right we see the ASCII representation of the bytes, with unprintable characters shown as dots. The offsets and the ASCII column are not part of the file; they are added for better readability.

No commands are visible, which is exactly what we should expect after reading the paragraphs above.

First we cut away the offsets and the ASCII column, and we add some line breaks so that the file structure becomes visible. We also join pairs of bytes that belong to the same 16-bit word.

003f 37a7 3798 37d7 
0028 37a9 
001e 37ac
0014 37b2 
000a 37ca 
02 8b 00
05 96 52 4f 57 00 
17 a2 f0 b7 52 4f 57 b3 c8 01 31 b6 b5 c7 04 54 45 53 54 b4 52 4f 57 00 
0e 8c 52 4f 57 be c8 01 31 b1 c8 02 32 30 00
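
The clean-up step can also be done mechanically. The following Python sketch assumes the dump layout shown above (offset, colon, hex column, then the ASCII column after a run of spaces). The name dump_to_bytes is a hypothetical helper for this illustration and is not part of TIImageTool.

```python
import re

# The dump from above, copied verbatim.
DUMP = """\
000000: 00 3f 37 a7 37 98 37 d7 00 28 37 a9 00 1e 37 ac     .?7.7.7..(7...7.
000010: 00 14 37 b2 00 0a 37 ca 02 8b 00 05 96 52 4f 57     ..7...7......ROW
000020: 00 17 a2 f0 b7 52 4f 57 b3 c8 01 31 b6 b5 c7 04     .....ROW...1....
000030: 54 45 53 54 b4 52 4f 57 00 0e 8c 52 4f 57 be c8     TEST.ROW...ROW..
000040: 01 31 b1 c8 02 32 30 00                             .1...20.
"""


def dump_to_bytes(dump: str) -> bytes:
    """Recover the raw file bytes from an offset/hex/ASCII dump."""
    data = bytearray()
    for line in dump.splitlines():
        if not line.strip():
            continue
        # Drop the "xxxxxx:" offset in front of the first colon.
        _, rest = line.split(":", 1)
        # Hex bytes are separated by single spaces; the ASCII column starts
        # after a run of two or more spaces, so keep only the first part.
        hex_column = re.split(r"\s{2,}", rest.strip(), maxsplit=1)[0]
        data += bytes.fromhex(hex_column)
    return bytes(data)


program = dump_to_bytes(DUMP)
print(len(program), "bytes")
print(program[:8].hex(" "))
```

The first eight bytes printed at the end are the four words of the first row in the listing above.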

The content is still the same. We can now analyse the file. The memory locations are the addresses where the portions of the program will reside once we load it into memory with OLD.

Meaning            Memory location  Contents
Header                              003f 37a7 3798 37d7
Line Number Table  3798 - 379b      0028 37a9
                   379c - 379f      001e 37ac
                   37a0 - 37a3      0014 37b2
                   37a4 - 37a7      000a 37ca
Program lines      37a8 - 37aa      02 8b 00
                   37ab - 37b0      05 96 52 4f 57 00
                   37b1 - 37c8      17 a2 f0 b7 52 4f 57 b3 c8 01 31 b6 b5 c7 04 54 45 53 54 b4 52 4f 57 00
                   37c9 - 37d7      0e 8c 52 4f 57 be c8 01 31 b1 c8 02 32 30 00

Header

The file starts with a header, containing four 16-bit words. When you carefully look at the table above, you can already deduce how those numbers are calculated.

  • The second word (37a7) is the end of the line number table, that is, the address of its last byte.
  • The third word (3798) is the start of the line number table.
  • The fourth word is the end of available memory. Programs are always loaded so that their last byte falls on the highest available address.

The first word is calculated as the XOR of the addresses of the start and end of the line number table:

     37a7 = 0011011110100111
XOR  3798 = 0011011110011000
     -----------------------
     003f = 0000000000111111
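
The same check can be written in a few lines of Python. This is only a sketch using the header bytes of the sample file; the words are read as big-endian, matching the way they are printed above.

```python
import struct

# First eight bytes of the sample file.
header = bytes.fromhex("00 3f 37 a7 37 98 37 d7")

# Split the header into four big-endian 16-bit words.
check, second, third, fourth = struct.unpack(">4H", header)

# The first word should equal the XOR of the second and third words.
print(f"{second:04x} XOR {third:04x} = {second ^ third:04x}")
assert check == second ^ third
```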

If this word is negated, the BASIC program is protected and cannot be listed. This is only available in Extended BASIC.

Line number table

The next block is the line number table (LNT). Again, when you look carefully you see that we have a list of entries, each of which contains two 16-bit words:

  • The first word is the line number.
  • The second word is the location of the BASIC line in memory.

To be precise, the memory location points to the second byte of a BASIC line, which is the byte right after the length byte. For example, the fourth entry (000a 37ca) tells us that the line starting in memory at address 37c9 is BASIC line 10.

Another interesting point is that the LNT is sorted, with the highest line number appearing at the low end, and the lowest number at the high end. In our example, the line numbers are 000a (10), 0014 (20), 001e (30), and 0028 (40). Moreover, the BASIC lines seem to be sorted in the same way, with the contents of line 10 near the memory end, and later lines growing towards lower memory.
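
As an illustration, the following Python sketch decodes the four entries of the sample table and checks the ordering described above. It works on the 16 table bytes copied from the example and assumes nothing beyond the two-word entry layout.

```python
import struct

# Line number table of the sample file (memory 3798 to 37a7).
lnt = bytes.fromhex("00 28 37 a9 00 1e 37 ac 00 14 37 b2 00 0a 37 ca")

# Each entry is a pair of big-endian words: line number, then pointer.
entries = [struct.unpack_from(">2H", lnt, pos) for pos in range(0, len(lnt), 4)]

for line_number, pointer in entries:
    # The pointer refers to the second byte of the line, so the length
    # byte sits one address below it.
    print(f"line {line_number:2d}: pointer {pointer:04x}, line starts at {pointer - 1:04x}")

# Entries run from the highest line number to the lowest.
numbers = [number for number, _ in entries]
assert numbers == sorted(numbers, reverse=True)
```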

BASIC line

What remains are the tokenized lines. Again, here are two things we can quickly find out.

  • The last byte of each line is 00.
  • The first byte is a length byte. The length byte does not count itself, but includes the 00 byte at the end.
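
These two rules are enough to split the program area into lines without consulting the line number table. The Python sketch below does this for the sample bytes; split_lines is a hypothetical helper name chosen for this example.

```python
# Tokenized lines of the sample file, in file order (lines 40, 30, 20, 10).
lines_area = bytes.fromhex(
    "02 8b 00 "
    "05 96 52 4f 57 00 "
    "17 a2 f0 b7 52 4f 57 b3 c8 01 31 b6 b5 c7 04 54 45 53 54 b4 52 4f 57 00 "
    "0e 8c 52 4f 57 be c8 01 31 b1 c8 02 32 30 00"
)


def split_lines(area: bytes) -> list[bytes]:
    """Split the program area into line bodies using the length bytes."""
    lines = []
    pos = 0
    while pos < len(area):
        length = area[pos]
        # The body is everything after the length byte, including the 00.
        body = area[pos + 1 : pos + 1 + length]
        if length == 0 or len(body) != length or body[-1] != 0x00:
            raise ValueError(f"malformed line at offset {pos}")
        lines.append(body)
        pos += 1 + length
    return lines


for body in split_lines(lines_area):
    print(len(body), body.hex(" "))
```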

Now it is time to find out what the program lines consist of. We will replace the tokens by their respective texts, which is what happens in the computer when we execute the LIST command. As a first step, we assign the line numbers from the LNT.

Memory       Line number  Length  Contents                                                           End
37a8 - 37aa  40           02      8b                                                                 00
37ab - 37b0  30           05      96 52 4f 57                                                        00
37b1 - 37c8  20           17      a2 f0 b7 52 4f 57 b3 c8 01 31 b6 b5 c7 04 54 45 53 54 b4 52 4f 57  00
37c9 - 37d7  10           0e      8c 52 4f 57 be c8 01 31 b1 c8 02 32 30                             00


Next, we replace all tokens in the lines as we find them in the table.


Memory       Line number  Length  Contents                                                           End
37a8 - 37aa  40           02      END                                                                00
37ab - 37b0  30           05      NEXT 52 4f 57                                                      00
37b1 - 37c8  20           17      DISPLAY AT ( 52 4f 57 , c8 01 31 ) : c7 04 54 45 53 54 ; 52 4f 57  00
37c9 - 37d7  10           0e      FOR 52 4f 57 = c8 01 31 TO c8 02 32 30                             00


We already mentioned above that a quoted string is a sequence of characters enclosed in quotes. The quoted string is introduced by the c7 byte, followed by a length byte.

Similarly, an unquoted string is a sequence of characters, but without quotes. It is introduced by c8 and, like the quoted string, is followed by a length byte. In our program the numeric constants are stored this way: c8 01 31 is the number 1, and c8 02 32 30 is the number 20. We can now replace the bytes in the table accordingly.


Memory       Line number  Length  Contents                                         End
37a8 - 37aa  40           02      END                                              00
37ab - 37b0  30           05      NEXT 52 4f 57                                    00
37b1 - 37c8  20           17      DISPLAY AT ( 52 4f 57 , 1 ) : "TEST" ; 52 4f 57  00
37c9 - 37d7  10           0e      FOR 52 4f 57 = 1 TO 20                           00

Only one thing is left. There are sequences of ASCII characters (52 4f 57 = "ROW") that are marked neither as quoted or unquoted strings nor as commands. For some obscure reason, TI decided to indicate variable names simply by the characters alone. No marker, no length byte.

Obviously, our program is

 10 FOR ROW=1 TO 20
 20 DISPLAY AT(ROW,1):"TEST";ROW
 30 NEXT ROW
 40 END
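
The whole walk-through can be condensed into a small Python sketch that lists the sample file. It is an illustration under several assumptions, not a full lister: the token table contains only the tokens that occur in this example, the spacing around keywords is chosen simply to match the listing above, and the function names detokenize and list_program are made up for this sketch.

```python
import struct

PROGRAM = bytes.fromhex(
    "00 3f 37 a7 37 98 37 d7 00 28 37 a9 00 1e 37 ac "
    "00 14 37 b2 00 0a 37 ca 02 8b 00 05 96 52 4f 57 "
    "00 17 a2 f0 b7 52 4f 57 b3 c8 01 31 b6 b5 c7 04 "
    "54 45 53 54 b4 52 4f 57 00 0e 8c 52 4f 57 be c8 "
    "01 31 b1 c8 02 32 30 00"
)

# Only the tokens used in the sample; spacing is an assumption of this sketch.
TOKENS = {
    0x8B: "END", 0x8C: "FOR ", 0x96: "NEXT ", 0xA2: "DISPLAY ",
    0xB1: " TO ", 0xB3: ",", 0xB4: ";", 0xB5: ":", 0xB6: ")",
    0xB7: "(", 0xBE: "=", 0xF0: "AT",
}
QUOTED, UNQUOTED = 0xC7, 0xC8


def detokenize(body: bytes) -> str:
    """Turn one tokenized line (without its length byte) into text."""
    out = []
    pos = 0
    while pos < len(body) and body[pos] != 0x00:
        byte = body[pos]
        if byte in (QUOTED, UNQUOTED):
            # Marker, length byte, then the characters themselves.
            length = body[pos + 1]
            text = body[pos + 2 : pos + 2 + length].decode("ascii")
            out.append(f'"{text}"' if byte == QUOTED else text)
            pos += 2 + length
        elif byte in TOKENS:
            out.append(TOKENS[byte])
            pos += 1
        else:
            # Anything else is treated as a character of a variable name.
            out.append(chr(byte))
            pos += 1
    return "".join(out)


def list_program(program: bytes) -> list[str]:
    """Rebuild the listing from header, line number table and lines."""
    # Header words; the table occupies 3798 to 37a7 in the sample.
    _check, lnt_end, lnt_start, _top = struct.unpack_from(">4H", program, 0)

    def offset(address: int) -> int:
        # In the file the table directly follows the 8-byte header.
        return address - lnt_start + 8

    listing = []
    for address in range(lnt_start, lnt_end, 4):
        number, pointer = struct.unpack_from(">2H", program, offset(address))
        start = offset(pointer)          # second byte of the line
        length = program[start - 1]      # length byte just before it
        listing.append((number, detokenize(program[start : start + length])))
    return [f"{number} {text}" for number, text in sorted(listing)]


for line in list_program(PROGRAM):
    print(line)
```

A real lister would need the complete token table and the actual spacing rules of the LIST command, which this sketch does not attempt to reproduce.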

TODO: continue