Chasing the Pesky Ratio!

It seems like I did something really silly! I had to come up with some goals for 2018. I hate this time of year, I think everybody does. OK, what can I put down that is measurable and achievable? How about keeping the CDC6500 running more than 50% of the time? That might work. Oops, did I hit the send button?

“Hey, Daiyu: How do I tell what users have been on the machine?” Daiyu Hurst is my systems programmer, who lives Back East somewhere. If it is on the other side of Montana from Seattle, it is just “Back East” to me. She lives in one of those “I” states, Indiana, or Illinois, not Idaho, I know where that is. After a short pause, she found the appropriate incantations for me to utter, and we have a list of who was on the machine, and when they logged in. I had to use Perl to filter out those lines, but that was pretty easy.

What is all this other gobble-de-gook in this file:

 03.14.06.AAAI005T. UCCO, 4.096KCHS. 
 03.16.09.AAAI005T. UCCO, 4.096KCHS. 
 03.18.11.AAAI005T. UCCO, 4.096KCHS. 
 03.20.12.AAAI005T. UCCO, 4.096KCHSUCCO, 4.096KCHS. 
 05.00.30.SYSTEM S. ARSY, 1, 98/01/24.
 05.00.30.SYSTEM S. ADPM, 11, NSD.
 05.00.30.SYSTEM S. ADDR, 01, LCM, 40.
 05.00.44.SYSTEM S. SDCI, 46.603SECS.:
 05.00.44.SYSTEM S. SDCA, 0.032SECS.:
 07.32.30.SYSTEM S. ARSY, 1, 98/01/24.
 07.32.30.SYSTEM S. ADPM, 11, NSD.
 07.32.30.SYSTEM S. ADDR, 01, LCM, 40.
 07.33.07.AAAI005T. ABUN, BRUCE, LCM.:
 07.33.37.SYSTEM S. SDCI, 116.108SECS.:
 07.33.37.SYSTEM S. SDCA, 0.078SECS.:
 07.33.37.SYSTEM S. SDCM, 0.005KUNS.:
 07.33.37.SYSTEM S. SDMR, 0.004KUNS.:

The line with “ARSY” in it is when I booted the machine, at 5:00 this morning, from home. It crashed before I got in, and I booted it again at 7:32. Then we get to 7:33:07, and the “ABUN” line, where I login from telnet.

From the first few lines we can see that the machine appeared to still be running and putting things in its accounting log at 3:20, but it crashed before it could print a message about 3:22.

OK from this, I can mutter a few incantations at PERL, and come up with something like:

1054 Booted on 98/01/23 @ 07.39.30
 Previous uptime: 0 days 5 hours 59 minutes
 Down time: 0 days 17 hours 28 minutes
 1065 Booted on 98/01/23 @ 13.38.30
 Previous uptime: 0 days 1 hours 23 minutes
 Down time: 0 days 4 hours 35 minutes
 1068 Booted on 98/01/23 @ 14.12.30
 Previous uptime: 0 days 0 hours 0 minutes
 Down time: 0 days 0 hours 33 minutes
 1392 New Date:98/01/24
 1498 Booted on 98/01/24 @ 05.00.30
 Previous uptime: 0 days 13 hours 7 minutes
 Down time: 0 days 1 hours 40 minutes
 1503 Booted on 98/01/24 @ 07.32.30
 Previous uptime: 0 days 0 hours 0 minutes
 Down time: 0 days 2 hours 31 minutes

Last uptime: 0 days 0 hours 1 minutes

Total uptime: 2 days, 1 hours 37 minutes in: 0 months 7 days 0 hours 14 minutes
Booted 15 times, upratio = 0.29

Here is where the hunt for the Pesky Ratio comes in: See that last line? In the last week, the CDC has been running 29% of the time. That isn’t even close to 50%. I KNOW the 6000 series were not the most reliable machines of their time, but really: 29%?

What has been going on? A week ago, I was having trouble keeping the machine going for more than a couple of minutes. Finally, it occurred to me I might see how the memory was doing, and it wasn’t doing well. It took me a while to find why bit 56 in bank 36 was bad. I had to explore the complete wrong end of the word for a while, before I realized that end worked, and I should have been looking at the other end. I chased it down to Sense Amplifier (PS) module 12M40. When I put it on the extender, the signal would come and go, as I probed different places. I noticed that I had re-soldered a couple of via rivets before, so I re-soldered ALL the via rivets on the module.

What do I mean “via rivets”? In those days, either one of two things were true: either they didn’t have plated through holes in printed circuit boards, or they were too expensive. None of the CDC 6500 modules I have looked at have plated through holes. Most of the modules do have traces on both sides of the two PCBs that the module is made with. How did they get a signal from one side to the other? They put in a tiny brass rivet! Near as I can tell, all the soldering was done from the outside of the module, and most of the time the solder would flow to the top of the rivet somehow. Since I have found many of these rivets not conducting, I have to assume that the process wasn’t perfect.

After soldering all the rivets on this module, I put it back in the machine, and we were off and running. Monday, I booted the machine at 8:11, and it ran till 2:11. When I got in yesterday, the machine wouldn’t boot. Testing memory again found bit 56 in bank 36 bad again! I put module 12M40 on the extender, and the signal wasn’t there. I poked a spot with the scope, and it was there. I poked, prodded, squeezed, twisted and tweaked, and I couldn’t get it to fail.

This is three times for This Module! I like to keep the old modules if I can, but my Pesky Ratio is suffering here! I took the machine back down, and brought it back up with only 64K of memory, and pulled out the offending module:

There are 510 of these PS modules in the machine, three for each of the 170 storage modules, or about 10% of all the modules in the machine. Having a spare would be nice. My next task will be to make about 10 new PS modules.

In the time I have been writing this post, the display on the CDC has gone wonky again. This appears to happen when the Perpheral Processors (PP’s) forget how to skip on zero for a while. Once this happens, I can’t talk coherently to channel 10 or PP11. I have a few little tests that copy themselves to all the PP’s, and they will all work, except the last one: PP11.

I have yet to write a diagnostic that can catch the PP’s making the mistake that I can see on the logic analyzer once a day or so. Right now the solution seems to be to wait a while, and the problem will go away again. This is another reason while the Pesky Ratio is so difficult to hunt, but I fix what I can, when I can.

Onward: One bug at a time!