Geneve video wait states
Latest revision as of 00:38, 13 January 2024
The Geneve video wait state handling is controlled by the PAL chip on the board. Unfortunately, there are no public specifications on that custom chip, so all we can do is experiment.
(Update: We do have the PAL equations, found by reverse engineering, but the effects mentioned below still cannot fully be explained.)
Video wait states are created after the access has occurred. On that occasion, 14 wait states are created by the counter in the PAL, pulling down the READY line for that period of time. Before this, one or two wait states are inserted since the video access is an external one.
The long wait state period starts only after the access: the access itself terminates normally, and the wait state period then applies to the commands that follow. This means that those wait states are completely ineffective in the following case:
 // Assume workspace at F000
 F040  MOVB @VDPRD,R3
 F044  NOP
 F046  DEC  R1
 F048  JNE  F040
The instruction at address F040 requires 5 cycles: get MOVB @,R3, get the source address, read from the source address, 1 WS for this external access, write to register 3. Then the next command (NOP) is read, which is located in the on-chip RAM (check the addresses). For that reason, the READY pin, which has been pulled low in the meantime, is ignored. NOP takes three cycles, as does DEC R1, still ignoring the low READY. JNE also takes three cycles. If R1 is not 0, we jump back, and the command is executed again.
Obviously, the counter is now reset, because we measure that the command again takes only 5 cycles, despite the fact that the 14 WS from last iteration are not yet over.
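For an emulation, this reset behaviour suggests a simple down-counter that is reloaded on every video access. The following Python sketch is a hypothetical model (the class and its names are mine, not taken from the PAL equations); it reproduces only the behaviour observed so far:

```python
class VideoWaitCounter:
    """Hypothetical emulator model of the PAL's video wait state logic."""

    VIDEO_WS = 14  # wait states observed after a VDP read access

    def __init__(self):
        self.counter = 0

    def video_access(self):
        # Every video access reloads the counter; wait states still
        # pending from the previous access are simply discarded.
        self.counter = self.VIDEO_WS

    def tick(self):
        # One CPU clock cycle passes. On-chip accesses ignore READY,
        # so the counter runs down "in the background" during them.
        if self.counter > 0:
            self.counter -= 1

    @property
    def ready(self):
        # READY is pulled low while the counter is running.
        return self.counter == 0
```

In the loop above, the cycles spent on NOP, DEC and JNE (and on the internal parts of the MOVB) tick the counter down without stalling anything, and the next MOVB at F040 simply reloads it.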
However, if we add an access to SRAM, we will notice a delay:
 // Assume workspace at F000
 F040  MOVB @VDPRD,R3    * 5 cycles
 F044  MOVB @SRAM,R4     * 15 cycles (4 cycles + 11 WS)
 F048  DEC  R1           * 3 cycles
 F04A  JNE  F040         * 3 cycles
But we only get 11 WS, not 14. This is reasonable if we remember that the access to SRAM does not occur immediately after the VDP access but after writing to R3 and getting the next command. This can be shown more easily in the following way. ADDR is the constant, symbolizing the memory location; *ADDR means an access to the memory location at address ADDR, < is reading, > is writing. PC is the program counter. The opcode, which is read in the first cycle, contains information about the types of source and destination locations, as well as the register numbers.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MOVB @,R3 | VDPRD | wait | < *VDPRD | > R3 | ||||||||||
MOVB @,R4 | SRAM | wait | wait | wait | wait | wait | wait | wait | wait | wait | wait | wait | < *SRAM | > R4 |
DEC R1 | < R1 | > R1 | ||||||||||||
JNE | < PC | > PC |
There are several activities going on while the wait states are active "in the background", as symbolized by the blue background. Only when the next access to external memory occurs do the wait states become effective. As said above, the read operations only terminate when the READY line goes high again, so we put the termination of the read access in the last box. You should keep in mind that the read access starts right after the last activity.
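Only the wait states that outlast this background activity delay the next external access, which can be put into one line (a simplification; the function name is mine):

```python
def effective_waits(total_ws, background_cycles):
    # Wait states that still delay the next external access after
    # background_cycles have already passed on-chip.
    return max(0, total_ws - background_cycles)

# Diagram above: writing R3 (1 cycle), then fetching the next opcode
# and the SRAM address (2 cycles) pass in the background, so 3 of the
# 14 wait states go unnoticed and 11 remain for the SRAM read.
print(effective_waits(14, 3))  # 11
```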
We can spend so much time inside the on-chip address space that all wait states pass without effect. The following code is executed at the same speed, whether the video wait state bit is set or not:
 // Assume workspace at F000
 F040  MOVB @VDPRD,R3
       NOP
       DEC  @>F006
       DEC  @>F006
       MOVB @SRAM,R4
       DEC  R1
       JNE  F040
as you can see here:
1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|
MOVB @,R3 | VDPRD | wait | < *VDPRD | > R3 |
NOP | < PC | > PC | ||
DEC @ | F006 | < *F006 | > *F006 | |
DEC @ | F006 | < *F006 | > *F006 | |
MOVB @,R4 | SRAM | < *SRAM | > R4 |
DEC R1 | < R1 | > R1 | ||
JNE | < PC | > PC |
Now an example of writing to the VDP.
 F040  MOVB R3,@>VDPWD
       NOP
       NOP
       NOP
       MOVB @SRAM,@>F000
       DEC  R1
       JNE  F040
Again as a diagram. Note that box 5 in the first line is not backgrounded: The READY line must be high, or the command will not be completed. We assume that the data are written to VDPWD in box 4 already.
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|
MOVB R3,@ | < R3 | VDPWD | > *VDPWD | wait | ||||
NOP | < PC | > PC | ||||||
NOP | < PC | > PC | ||||||
NOP | < PC | > PC | ||||||
MOVB @,@ | SRAM | wait | wait | wait | wait | < *SRAM | F000 | > *F000 |
DEC R1 | < R1 | > R1 | ||||||
JNE | < PC | > PC |
There is one peculiarity: We now have 15 wait states, not 14. Indeed, as my measurements prove, the complete code requires 29 cycles, and by using commands in the on-chip RAM as shown in the previous example one can see that the second MOVB line requires 9 cycles, including 4 wait states. As it seems, video writing requires one wait state more than reading. This may be a design issue of the PAL, but it won't cause any trouble.
Finally we should look at the case when we get wait states by different reasons. Assume we set up the system to use video wait states and extra wait states.
 F040  MOVB R3,@>VDPWD
       NOP
       NOP
       DEC  @>F000
       MOVB @SRAM,R4
       DEC  R1
       JNE  F040
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|
MOVB R3,@ | < R3 | VDPWD | > *VDPWD | wait | wait | |||
NOP | < PC | > PC | ||||||
NOP | < PC | > PC | ||||||
DEC @ | F000 | < *F000 | > *F000 | |||||
MOVB @,R4 | SRAM | wait | wait | wait | < *SRAM | > R4 | ||
DEC R1 | < R1 | > R1 | ||||||
JNE | < PC | > PC |
First thing to notice is that the first line gets another wait state, since we have requested additional wait states, and VDPWD is an external address. Also, we have 15 wait states from video (blue).
What happens to the additional wait states for the SRAM access in line 5? They occur at the same time as the video wait states and are therefore not effective. As it seems, the wait state counters do not add up. The SRAM access is still delayed by 3 wait states (not 2) because there is one wait state left from the video access. After this one is over, the access occurs.
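For this one measurement, the cycle budget can be reconstructed as plain arithmetic. This is only bookkeeping with names of my own choosing, not a general rule for how the two counters interact:

```python
VIDEO_WS_WRITE = 15     # a VDP write produces 15 wait states (see above)
EXTRA_WS = 2            # extra wait states requested for external accesses

# Cycles between the end of the VDP write and the SRAM access: the two
# waits in line 1, two NOPs, DEC @>F000, opcode + address fetch.
background = 2 + 3 + 3 + 4 + 2          # 14

video_ws_left = max(0, VIDEO_WS_WRITE - background)   # 1 left over
delay = video_ws_left + EXTRA_WS        # 3 wait states, as observed
```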
Unsolved questions
There are some unsolved questions, unfortunately. They are not really a problem, but they make it difficult to understand the actual implementation and how this could be used within an emulation.
VDP write causes one more wait state
Why do VDP write accesses cause 1 more wait state than read accesses?
 F040  MOVB R3,@>VDPWD
       NOP
       NOP
       NOP
       NOP
       MOVB @SRAM,R4
       DEC  R1
       JNE  F040
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
MOVB R3,@ | < R3 | VDPWD | > *VDPWD | wait | |||||||
NOP | < PC | > PC | |||||||||
NOP | < PC | > PC | |||||||||
NOP | < PC | > PC | |||||||||
NOP | < PC | > PC | |||||||||
MOVB @,R4 | SRAM | wait | < *SRAM | > R4 |||||||
DEC R1 | < R1 | > R1 | |||||||||
JNE | < PC | > PC |
When you turn off the video wait states, the above code runs at 27 cycles. With video wait states, it needs 28 cycles. This clearly indicates that all but one wait state have passed in the background, and that last wait state delays the reading of the SRAM memory location. If we assumed that only 14 wait states were produced, the code should not have slowed down by one cycle.
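The 15 wait states follow directly from the measured totals; here is the arithmetic spelled out (the cycle counts are the measured ones quoted above):

```python
with_video_ws = 28       # measured cycles per iteration
without_video_ws = 27    # measured with video wait states turned off

effective_ws = with_video_ws - without_video_ws   # 1 wait state is actually felt

# Background cycles between the end of the VDP write and the SRAM read:
# four NOPs at 3 cycles each, plus the opcode and address fetch of the
# final MOVB (2 cycles).
background = 4 * 3 + 2   # 14

total_ws = background + effective_ws   # 15: one more than the documented 14
```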
Fewer wait states when the next access is a write?
Compare this:
 F040  MOVB @>VDPRD,R3
       MOVB @SRAM,R4
       DEC  R1
       JNE  F040
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MOVB @ | VDPRD | wait | < *VDPRD | > R3 ||||||||||
MOVB @,R4 | SRAM | wait | wait | wait | wait | wait | wait | wait | wait | wait | wait | wait | < *SRAM | > R4 |
DEC R1 | < R1 | > R1 | ||||||||||||
JNE | < PC | > PC |
with this one:
 F040  MOVB @>VDPRD,R3
       MOVB R4,@SRAM
       DEC  R1
       JNE  F040
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MOVB @ | VDPRD | wait | < *VDPRD | > R3 ||||||||
MOVB R4,@ | < R4 | SRAM | > *SRAM | wait | wait | wait | wait | wait | wait | wait | wait | wait |
DEC R1 | < R1 | > R1 | ||||||||||
JNE | < PC | > PC |
It seems as if the number of wait states depends on the next memory access type, whether read or write. Remember that the last wait state in the second example must show a high READY line, or the operation will not be complete. Accordingly, in the first example, READY is low for 14 cycles, while in the second example we cannot have more than 13 cycles.
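A hypothetical way to model the observation (again my own formulation, not derived from the PAL equations): the type of the access that ends the wait period determines how many cycles READY can actually stay low.

```python
def ready_low_cycles(total_ws, next_access):
    # A read only terminates once READY is sampled high again, so all
    # wait states show up as READY-low cycles. A write must already see
    # READY high in its last wait-state slot to complete, so one slot
    # fewer can hold READY low.
    assert next_access in ("read", "write")
    return total_ws if next_access == "read" else total_ws - 1

print(ready_low_cycles(14, "read"))   # 14 (first example)
print(ready_low_cycles(14, "write"))  # 13 (second example)
```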
One more wait state when acquiring instructions?
If we put the code in SRAM and access video, the following command should be delayed by 14 WS. However, it turns out to require 15 WS. Have a look:
First we assume that the instructions are in on-chip memory, as are the registers.
 F040  MOVB @>VDPRD,R3
       NOP
       MOVB @SRAM,R4
       DEC  R1
       JNE  F040
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
MOVB @,R3 | VDPRD | wait | < *VDPRD | > R3 | |||||||
NOP | < PC | > PC | |||||||||
MOVB @,R4 | SRAM | wait+ | wait | wait | wait | wait | wait | wait | wait | < *SRAM | > R4 |
DEC R1 | < R1 | > R1 | |||||||||
JNE | < PC | > PC |
+ At this point the processor attempts to fetch the contents of the memory location at SRAM. The access is delayed until READY goes high.
Tests show that the code as listed above requires 26 cycles. Without video wait states we would get 18 cycles.
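These figures are consistent with 14 wait states for a read, as this arithmetic check shows (the cycle counts are the measured ones from the text):

```python
with_video_ws = 26       # measured cycles per iteration
without_video_ws = 18    # measured without video wait states

effective_ws = with_video_ws - without_video_ws   # 8 wait states are felt

# Background cycles between the VDP read terminating and the SRAM
# access: writing R3 (1), the NOP (3), opcode and address fetch of
# the MOVB (2).
background = 1 + 3 + 2   # 6

total_ws = background + effective_ws   # 14, as documented for reads
```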
Now have a look at this sample. We assume that registers are in on-chip memory and the program code is now in SRAM. This also means that for each command and operand acquisition we need two memory accesses.
 A040  MOVB @>VDPRD,R3
       NOP
       MOVB @SRAM,R4
       DEC  R1
       JNE  A040
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
---|---|---|---|---|---|---|---|---|---|
MOVB @,R3 | VDPRD | wait | < *VDPRD | > R3 | |||||
wait+ | wait | wait | wait | wait | wait | wait | wait | wait | wait |
wait | wait | wait | wait | NOP | < PC | > PC | |||
MOVB @,R4 | SRAM | < *SRAM | > R4 ||||||
DEC R1 | < R1 | > R1 | |||||||
JNE | < PC | > PC |
+ At this point the processor tries to acquire the NOP command.
If this code is executed in the on-chip RAM it takes 25 cycles. When executed in SRAM it requires 39 cycles. At first this seems to be a good number, since we said that reading causes 14 wait states. But we forgot that one wait state passes while writing to R3 (on-chip), so we should only have 13 effective WS.
The command acquisition from SRAM begins in the second line but completes exactly when the READY line goes high again, reading the first and then the second byte of the NOP. We must therefore assume 15 WS.
Still, cause comes before effect: the PAL cannot "know" after line 1 that it must set the counter one higher. The PAL may, however, know that the current read cycle is an instruction acquisition, as the IAQ line is connected to one of its pins. When acquiring an opcode, the line is asserted (high), unlike for operand acquisitions, where it remains low.
Instruction prefetch as a possible reason
There may also be an effect from the processor's instruction prefetch capability. As mentioned, the TMS9995 is highly efficient in its usage of clock cycles, which is only possible because it can get the next instruction during internal operations. The additional wait state is probably caused by the fact that external accesses are still blocked by the READY line, so one more cycle may be required.