Computer Maintenance Hell!

Back in October 2018, our PDP10-KI went down, and it didn’t want to come back up. I ran all the normal diagnostics, and they all passed, but TOPS-10 would hang when I tried to boot it. That is the definition of Computer Maintenance Hell: everything works, but the operating system won’t run!

Running the normal diagnostics sounds like an easy thing, but that isn’t always the case! The first bunch of diagnostics run from paper tape, and that is pretty easy. As we continue past DBKAG, the tapes don’t fit well in the reader, so we switch over to getting them off of DECTape. Herein lies the rub: the TD10 DECTape controller on the KI is almost always broken when I need it.

After much gnashing of teeth and tearing of hair, there was enough blood on the floor for the dust bunnies to leave tracks that pointed to what was wrong with the TD10, and we were off once again. I ran the rest of the usual diagnostics, and they all passed! It still didn’t boot.

I had plenty of things to keep me occupied, so our poor PDP10-KI didn’t get a lot of my attention. During our group session bringing up the KATIA, we played with the KI some, and found that the KI didn’t like its memory! The KA liked it, but the KI didn’t! It would run the DDMMD memory diagnostic for about 10 or 15 minutes, then fail. The KA would happily run the KI’s memory well past where the KI would fail. Looking at the errors, it appeared as if things were getting confused about which particular bit of memory it was talking to. It would always start failing at location 0374000, where either it hadn’t inverted the contents of those locations, or it had done it twice.

Now it didn’t fail all the time. The part of the test that failed was going through memory incrementing the address by a more significant bit than the LSB, then wrapping around to the LSB. When it started with the LSB, bit 35, everything was fine. It worked when it did bit 34. It had to get up to bit 25 before we had problems: it would fail between bit 25 and bit 21, while bits 20 through 18 worked too.
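To make the bit numbers concrete: the PDP-10 numbers the bits of its 36-bit word from 0 (most significant) to 35 (least significant), so an 18-bit address occupies bits 18 through 35. The small Python sketch below converts a bit number into the address increment it stands for. The helper name `stride_for_bit` is hypothetical and exists only for this illustration.

```python
def stride_for_bit(bit):
    """Address increment for a PDP-10 bit number (bit 35 is the LSB)."""
    if not 18 <= bit <= 35:
        raise ValueError("an 18-bit address uses bits 18 through 35")
    return 1 << (35 - bit)

# Print the step size, in octal, for the bit positions mentioned in the text.
for bit in (35, 34, 25, 21, 18):
    print(f"bit {bit}: step {stride_for_bit(bit):o} (octal)")
```

Under this numbering, the failing range of bits 25 through 21 corresponds to address steps of 2000 through 40000 octal, that is, 1K to 16K words.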

I spent quite a while trying to write a diagnostic that did what DDMMD was doing, but in a quick and repeatable way. I believe I got pretty close, but nothing I wrote would tickle the problem… bother!
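For readers who want to see the shape of such a test, here is a minimal Python model. It is a sketch only, neither DDMMD nor the diagnostic mentioned above. It assumes the test fills memory, complements every location in the stepped address order, then verifies. The rotate-style address order, the use of the address as fill data, the tiny 64-word memory, and the `misroute` hook for simulating an addressing fault are all assumptions made for illustration.

```python
ADDR_BITS = 6                  # tiny model; the real address is 18 bits wide
SIZE = 1 << ADDR_BITS
WORD_MASK = (1 << 36) - 1      # PDP-10 words are 36 bits

def stepped_addresses(shift):
    """Visit every address once, stepping by 2**shift and wrapping into the low bits."""
    for i in range(SIZE):
        yield ((i << shift) | (i >> (ADDR_BITS - shift))) & (SIZE - 1)

def run_test(shift, misroute=None):
    """Return the addresses that fail verification."""
    memory = list(range(SIZE))               # pass 1: fill bottom to top (data = address, assumed)
    for addr in stepped_addresses(shift):    # pass 2: complement in stepped order
        target = misroute(addr) if misroute else addr
        memory[target] = ~memory[target] & WORD_MASK
    # pass 3: verify bottom to top, expecting the complement of the fill data
    return [a for a in range(SIZE) if memory[a] != (~a & WORD_MASK)]

print(run_test(3))
# Simulated fault: the complement meant for address 40 lands on address 70.
print([oct(a) for a in run_test(3, lambda a: 0o70 if a == 0o40 else a)])
```

With correct addressing nothing fails. Sending a single complement to the wrong location leaves one word never inverted and another inverted twice, which is the same pair of symptoms the real diagnostic reported.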

Months have now passed, and I broke down and plugged in the logic analyzer. Most of the time, I use an oscilloscope as my main debug tool. ‘Scopes don’t lie as much as logic analyzers do! If the logic is working, producing 1’s and 0’s as it should, a logic analyzer is a good tool. When things are broken, sometimes you get a half, or a third, instead of a one or zero, and this is where the ‘scope is better about telling the truth, and the logic analyzer will lie. Here the machine was pretty much working; at least the diagnostics thought so.

Here is one of the first logic analyzer traces I took, just showing the logic analyzer sample number, the memory operation, and the address:

1581 wr 626415
1601 wr 626435
1621 wr 626455
1641 wr 626475
1661 wr 626515
1681 wr 626535

I did a bunch of work with PERL to go through the 100MB of data that came out of the logic analyzer, and boil it down to what you see here.

Now it turns out that this part of DDMMD fills memory from the bottom to the top stepping by 1, complements each location using the funny addressing pattern, then verifies from bottom to top normally. I added the top 8 bits from the CPU’s MA (Memory Address) register to the logic analyzer:

104873 rd 377774, 376
104903 rd 377775, 376
104933 rd 377776, 376
104963 rd 377777, 376
104989 rd 777000, 400 ***
105019 rd 400001, 400
105047 rd 400002, 400
105075 rd 400003, 400
105103 rd 400004, 400

This is where it is doing the final verification, and you can see something funny here: the upper bits from the MA register incremented like I expected them to, but when a whole bunch of them changed, the address going to the actual memory didn’t follow as quickly! Instead of going from 377777 to 400000, it went to 777000! Here we get into a bit of logic called the “Pager”.
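The same check can be done mechanically. The Python sketch below parses trace lines in the format shown above and flags any sample where the memory address disagrees with the MA bits. It assumes, judging only from the traces in this post, that the 8 MA bits are displayed left-aligned in three octal digits, so they should equal the top three octal digits of the address with the lowest bit cleared. The names `LINE` and `mismatches` are hypothetical, and the `***` marker is left off the sample data.

```python
import re

TRACE = """\
104873 rd 377774, 376
104903 rd 377775, 376
104933 rd 377776, 376
104963 rd 377777, 376
104989 rd 777000, 400
105019 rd 400001, 400
105047 rd 400002, 400
"""

# sample number, operation, octal address, octal MA bits
LINE = re.compile(r"(\d+)\s+(rd|wr)\s+([0-7]+),\s*([0-7]+)")

def mismatches(trace):
    """Return (sample, op, address, ma_top) for lines where address and MA disagree."""
    bad = []
    for line in trace.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        sample, op = int(m.group(1)), m.group(2)
        addr, ma_top = int(m.group(3), 8), int(m.group(4), 8)
        # Top 8 of 18 address bits, left-aligned in 9 bits (assumed display format).
        if ((addr >> 9) & 0o776) != ma_top:
            bad.append((sample, op, addr, ma_top))
    return bad

for sample, op, addr, ma_top in mismatches(TRACE):
    print(f"{sample} {op} {addr:06o}, {ma_top:03o}  <-- address does not match MA")
```

On the excerpt above, only sample 104989 is flagged.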

A PDP10 can really only talk to 256K words of memory at one time. How can the KI use 4MW of memory? That is the Pager’s job! The Pager’s job is to translate the logical address that the CPU provides into a physical address of a hopefully larger memory. While running diagnostics, the Pager should be turned off, resulting in a maximum of 256KW of memory directly addressed from the MA register to the address lines going to memory. Something was going wrong here!
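As a rough illustration of what a pager does, here is a minimal Python sketch. It assumes 512-word pages (a 9-bit offset within the page) and models the page table as a plain dictionary. Treat the page size, the dictionary, and the names `translate` and `page_map` as assumptions for this example, not as a description of the real hardware.

```python
PAGE_BITS = 9                          # assumed: 512-word pages
PAGE_MASK = (1 << PAGE_BITS) - 1

def translate(logical, page_map=None):
    """Map an 18-bit logical address to a physical one; page_map=None means pager off."""
    if not 0 <= logical < (1 << 18):
        raise ValueError("logical addresses are 18 bits")
    if page_map is None:
        return logical                 # pager off: the address passes straight through
    virtual_page = logical >> PAGE_BITS
    physical_page = page_map[virtual_page]   # a KeyError stands in for a page failure
    return (physical_page << PAGE_BITS) | (logical & PAGE_MASK)

print(1 << 18)                         # words reachable with 18 address bits (256K)
print(1 << 22)                         # words reachable with 22 address bits (4M)
print(f"{translate(0o400000):o}")      # pager off
print(f"{translate(0o400123, {0o400: 0o17777}):o}")  # pager on, one mapped page
```

With the pager off, 377777 should be followed by 400000 on the memory address lines, which is exactly what the trace did not show.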

I added another set of 8 probes from the logic analyzer, and started moving backwards from the physical address going to the memory, to where the MA register fed into the Pager. When I got to the output of the CAMs, there was something I didn’t understand.

What is a CAM, you ask? CAM stands for “Content Addressable Memory”. What you do is give it the logical address that you want, and it will tell you if it knows about that, with a single match line for each location inside itself. All four of them.
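In software terms, a CAM behaves something like the toy Python class below: every stored entry is compared against the lookup value at once, and each entry reports its own match line. This is a sketch of the idea only. The class name `TinyCAM` and its methods are hypothetical and say nothing about how the real boards are built.

```python
class TinyCAM:
    """Toy content-addressable memory with one match line per entry."""

    def __init__(self, entries=4):
        self.tags = [None] * entries   # None marks an empty entry

    def store(self, slot, tag):
        """Load a tag (for example, a logical page number) into one entry."""
        self.tags[slot] = tag

    def match_lines(self, tag):
        """Compare the tag against every entry; return one boolean per entry."""
        return [stored == tag for stored in self.tags]

cam = TinyCAM()
cam.store(2, 0o367)
print(cam.match_lines(0o367))   # only entry 2 matches
print(cam.match_lines(0o366))   # nothing matches
```

As long as each tag is stored in at most one entry, a lookup raises at most one match line.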

I got lucky: the first group of 8 output bits looked like this:

536691 wr 360631, 360, 400
536793 wr 362631, 362, 100
536844 wr 363631, 362, 040
536896 wr 364631, 364, 020
536947 wr 365631, 364, 010
536999 wr 366631, 366, 004
537052 wr 367631, 366, 006

Near as I can tell, there should be only a single 1 in the right column. It is octal, so we can watch which location in the CAM has the data as the addresses change, and when we get to 367631, we get two ones! I believe that output should have been a 002, not 006!
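The “only a single 1” rule is easy to check in code. The Python sketch below applies a one-hot test to the address and right-hand column of the trace above. The function name `is_one_hot` and the hand-copied `SAMPLES` list are just for this illustration.

```python
def is_one_hot(value):
    """True when exactly one bit is set."""
    return value != 0 and (value & (value - 1)) == 0

# (address, CAM output) pairs copied from the trace, both in octal
SAMPLES = [
    (0o360631, 0o400),
    (0o362631, 0o100),
    (0o363631, 0o040),
    (0o364631, 0o020),
    (0o365631, 0o010),
    (0o366631, 0o004),
    (0o367631, 0o006),
]

for addr, cam_out in SAMPLES:
    if not is_one_hot(cam_out):
        print(f"{addr:06o}: CAM output {cam_out:03o} has {bin(cam_out).count('1')} bits set")
```

Only the last sample, at address 367631, is reported.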

That output came from board 2PR09, so I swapped it and 2PR08, and I couldn’t run the diagnostic at all due to a “Page Fail Trap Error”! Ah, I think we are very close here! I checked the inventory, and we didn’t have a record for an M260 board, so I stole one from one of the machines that came in in September, and Voila, the memory test passed! It can even run TOPS-10 if we don’t try to initialize its serial ports. This could be correct since we stole a bunch of its serial lines to use on KATIA while the KI was asleep.

OK, since the KI, the KA, and the CDC are all working, I seem to have made it out of Computer Maintenance Hell for now. Give them a little while, and one of them will fail.

Bruce Sherry