Adventures in PDP10 land

The PDP10-KI went down sometime in the fall, maybe October. This is the machine just to the right of the CDC6500 as you come in the second floor computer room. I noticed this fairly quickly and tried to reboot it several times, but it hung every time.

OK, it must be time to run diagnostics, which I proceeded to do. It passed all the diagnostics from DBKAA to DBKAH, which are the ones on paper tape, but the TD10 DECtape controller, just to the right of the console, had quit again, as it has done almost every time I need to use it.

This TD10 and I just do not get along very well. It always kicks me around the block several times before it will let me know what is wrong, so I can fix it. Because of this history, it sometimes takes a while for me to generate the gumption to work on it. This time was no exception, since I was busy trying to get the KA working. If you don’t know about gumption and gumption traps, you should read the classic “Zen and the Art of Motorcycle Maintenance”, which doesn’t really talk about motorcycles, or Zen much.

Back to our story: We fixed the KA, moved it down to the second floor computer room, fixed it again, and had a small gathering to boot it the first time. The CDC was being its normal self, but was refusing to be down when I arrived at work, which is my signal to drop everything and work on it.
I was running out of reasons to avoid working on the KI.

Contrary to normal behavior, it only took a few hours to figure out the problem with the TD10, so I could run the diagnostics that are very awkward to load from paper tape.

All the normal diagnostics ran except for DBKAL. I think it took a few days to remember that there is a diagnostic for which the binary we have is wrong, and I have to patch a couple of locations after loading in order for it to work. Now DBKAL and DBKAM work. On to some more obscure ones. All the CPU ones seem to pass; does it boot now? No, it still hangs after the OS is loaded and we type “GO” to fire up timesharing.

What else can we test? We ran DDRHA, which tests the RH10 disk controller, and it passed. We then ran DDRPI, which tests the disk drives. Now we don’t really use the disk drives it is expecting to test; we use our MDE (Massbus Disk Emulator). We have been using this MDE for about 5 years now, but there could still be a bug hiding in there somewhere.

DDRPI looked like everything was fine for about 20 minutes, while it was doing register tests, seek tests, all ones and zeros tests. When it got to testing the surface of the disk, things started to go wrong. It would get an error where it looked like the data was misplaced, like it was reading the wrong sector or something like that.

How could that be? This thing has been working fine for over 5 years, in fact when we ran DDRPI from the KA, using the KI’s Memory, RH10s, DAS33, and MDE, EVERYTHING was FINE, even the surface test!

How about the memory? It passes my little MARCH memory test, from the KI or the KA. Dragging out the DECTape again, we loaded up DDMMD, which is one of the memory diagnostics. We fired it up, and it ran fine for about 15 minutes, whereupon it started spewing out errors. We have run this a BUNCH, and the errors usually seem to start at location 374000, and the data seems to be inverted from what it should be. The test complains about address bit 24.

We run the same test from the KA against the KI’s memory, and it works fine! It is handy to have the KA right across the room to enable this kind of testing.

What is really going on here? Let’s look at the console:

OK, TN=2, that means it is doing “Address” test. AS=F24 turns out to mean that it was doing fast addressing on address bit 24. “What does that mean” you may ask? I did! After much grovelling over the DDMMD listing, and consulting with Rich Alderson, I found that they would fill memory with the address and its complement, and then go through reading a location, verifying it was correct, and writing the complement in it, then go back and read and verify that they all had the complement. Ah, but what about that F24 part? When they are reading and complementing the data they start skipping locations, by changing which address bit they increment first. The first time they do this, they use bit 35 as the lsb, but next time through they shift the lsb over to bit 34, then 33 etc. When they get to bit 24, we run into this problem.
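To make the access pattern concrete, here is a rough reconstruction in Python of the order in which a “fast addressing” pass visits memory (my interpretation of the listing, not anything DEC shipped; the real diagnostic keeps its inner loops in the fast AC’s):

# Hedged reconstruction of the "fast addressing" sweep order, not DDMMD's actual code.
# PDP-10 bits are numbered 0 (MSB) through 35 (LSB), so using bit 24 as the LSB means
# consecutive references are 2^(35-24) = 2048 (0o4000) words apart.
def fast_address_order(fast_bit, addr_bits=18):
    step = 1 << (35 - fast_bit)          # bit 35 -> step 1 (normal order), bit 24 -> step 0o4000
    size = 1 << addr_bits
    for start in range(step):            # wrap around to pick up the locations skipped over
        for addr in range(start, size, step):
            yield addr

# With fast_bit=24 the sweep visits 0, 0o4000, 0o10000, 0o14000, ... before wrapping.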

How do I figure out what is really going on here? I decided to write a version of my MARCH that does this, MRCHFA. It took a while, but I finally got it to work on the KA, and tried it on the KI: Unfortunately the KI passed it too! What else are they doing differently? OK, more grovelling over the listing: They are stuffing all their inner loops down in the Fast AC’s to speed them up. On to MRCHF3, which pushes the inner loops down to the Fast AC’s. Does the KI fail that one? Nope!

I’m running out of ideas; where do we go from here? I decide to just watch it for a while, and see what happens AFTER it starts to fail. I see it fail bit 24 from 374000 to 377777, then bits 23 and 22 over the same range, then it starts failing location 700000 in the same way. Shortly thereafter, the program gives up in disgust, stops printing the results, and starts ignoring the errors. Now I just watch the lights on the ARM10 memory.

I got used to the way the lights blink while working on MRCHFA and MRCHF3: as the LSB moves up the address, the slowly blinking address follows it.

But wait, that isn’t what I am seeing: I see the lights increment from the bottom, do the FA thing, then increment from the bottom again, and then shift the LSB over. What is going on here? Is there anything on the ARM10 that will tell me anything? Yes, there are the read and write lights. While writing, both lights come on, but a read only lights the read light. After the FA thing, it just reads! Back to grovelling over the listing some more.

OK, what they do is: fill memory from the bottom to the top, check and complement using the shifting LSB, and then check from the bottom to the top. So I wrote another new test, called THEIRS, which of course doesn’t catch the problem either. I am running out of hair to pull out here!
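For reference, the whole sequence boils down to something like this sketch (again my reconstruction, using fast_address_order from above; the 36-bit mask stands in for the complemented data):

MASK = 0o777777777777                    # 36-bit word

def theirs_pass(mem, fast_bit, addr_bits=18):
    size = 1 << addr_bits
    for a in range(size):                # 1: fill from the bottom to the top
        mem[a] = a
    for a in fast_address_order(fast_bit, addr_bits):
        assert mem[a] == a, "miscompare at %o" % a
        mem[a] = ~a & MASK               # 2: check and complement, using the shifted LSB
    for a in range(size):                # 3: check the complements, bottom to top
        assert mem[a] == ~a & MASK, "miscompare at %o" % a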

As I write this, both machines are running DDMMD against the other’s memory, happily, no errors. No happy ending… YET!

A Journey Into the Ether: Debugging Star Microcode

Back in January I unleashed my latest emulation project, Darkstar, upon the world. At that time I knew it still had a few areas that needed more refinement, and a few that were very rough around the edges. The Star’s Ethernet controller fell into that latter category: No detailed documentation for the Ethernet controller has been unearthed, so my emulated version of it was based on a reading of the schematics and diagnostic microcode listings, along with a bit of guesswork.

Needless to say, it didn’t really work: The Ethernet controller could transmit packets just fine but it wasn’t very good at receiving them. I opted to release V1.0 of Darkstar despite this deficiency — while networking was an important part of Xerox’s computing legacy, there were still many interesting things that could be done with the emulator without it. I’d get the release out the door, take a short break, and then get back to debugging.

Turns out the break wasn’t exactly short — sometimes you get distracted by other shiny projects — but a couple of weeks back I finally got back to working on Darkstar and I started with an investigation of the Receiver end of the Ethernet interface — where were things going wrong?

The first thing I needed to do was come up with some way to see what was actually being received by the Star, at the macrocode level. While I lack sources for the Interlisp-D Ethernet microcode, I could see it running in Darkstar’s debugger, and it seemed to be picking up incoming packets, reading in the words of data from these packets and then finally shuffling them off to the main memory. From this point things got very opaque — what was the software (in this case the operating system itself) doing with that data, and why was it apparently not happy with it?

The trickiest part here was finding diagnostic software to run on the Star that could show me the raw Ethernet data being received. After a long search through the available Viewpoint, XDE, and Interlisp-D tools, and finding nothing that met my needs, I decided to write my own in Interlisp-D. The choice to use Interlisp-D was mainly due to the current lack of XDE compilers, but also because the Interlisp-D documentation covered exactly what I needed to accomplish, using the ETHERRECORDS library. I wrote some quick and dirty code to print out the contents of any packets coming in, and got… crickets. Nothing. NIL, as the Lisp folks say.

Hmm.

So I went back and watched the microcode read a packet in and while it was indeed pulling in data, upon closer inspection it was discarding the packet after the first few words. The microcode was checking that the packet’s Destination MAC address (which begins each Ethernet packet’s header) matched that of the Star’s MAC address, and it was concluding that the packet in question wasn’t addressed to it. This is reasonable behavior, but the packets it was receiving from my test harness were all Broadcast packets, which use a destination address of ff:ff:ff:ff:ff:ff and which are, by definition, destined for all machines on the network — which is when I finally noticed that hey wait a minute… the words the microcode is reading in for the destination address aren’t all FF’s as they should be… and then I slapped my forehead when I saw what I had done:

Whoops.

I’d accidentally used the “PayloadData” field (which contains just the actual data in the packet) rather than the “Data” field (which contains the full packet including the Ethernet header). So the microcode was never seeing Ethernet headers at all, instead it was trying to interpret packet data as the header!
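To put it in code (a hypothetical sketch, in Python rather than Darkstar’s actual C#, and not the capture library’s real API): the bytes handed to the emulated controller have to start at the Ethernet header, not at the payload.

# Hypothetical helper illustrating the fix: the microcode's destination-address check
# only works if the frame it sees begins with the 14-byte Ethernet header.
def frame_for_microcode(dest_mac, src_mac, ethertype, payload):
    header = dest_mac + src_mac + ethertype.to_bytes(2, "big")   # the full header
    return header + payload             # the whole frame, i.e. the "Data" field

broadcast = bytes([0xFF] * 6)           # ff:ff:ff:ff:ff:ff
frame = frame_for_microcode(broadcast, bytes(6), 0x0600, b"hello")
assert frame[:6] == broadcast           # the microcode's address check now sees all FF's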

I fixed that and things were looking much, much better. I was able to configure TCP/IP on Interlisp-D and connect to a UNIX host and things were generally working, except when they weren’t. On rare occasions the Star would drop a single word (two bytes) from an incoming packet with no fanfare or errors:

The case of the missing words. Note the occasional loss of two characters in the above directory listing.

This was puzzling to say the least. After some investigation it became clear that the lost word was randomly positioned within the packet; it wasn’t lost at the beginning or end of the packet due to an off-by-one error or something not getting reset between packets. Further investigation indicated that without fail, the microcode was reading in each word from the packet via the ←EIData function (which reads the next incoming word from the Ethernet controller and puts it on the Central Processor’s X Bus). On the surface it looked like the microcode was reading each word in properly… but then why was one random word getting lost?

It was time to take a good close look at the microcode. I lack source code for the Interlisp-D Ethernet microcode but my hunch was that it would be pretty similar to that used in Pilot since no one in their right mind rewrites microcode unless they absolutely have to. I have some snippets of Pilot microcode, fortunately, and as luck would have it the important portions of it matched up with what Interlisp was using, notably the below loop:

{main input loop}
EInLoop: MAR ← E ← [rhE, E + 1], EtherDisp, BRANCH[$,EITooLong], c1;
MDR ← EIData, DISP4[ERead, 0C], c2;
ERead: EE ← EE - 1, ZeroBr, GOTO[EInLoop], c3, at[0C,10,ERead];
E ← uESize, GOTO[EReadEnd], c3, at[0D,10,ERead];
E ← EIData, uETemp2 ← EE, GOTO[ERCross], c3, at[0E,10,ERead];
E ← EIData, uETemp2 ← EE, L6←L6.ERCrossEnd, GOTO[ERCross], c3, at[0F,10,ERead];

The code starting with the label EInLoop (helpfully labeled “main input loop”) loads the Memory Address Register (MAR) with the address where the next word from the Ethernet packet will be stored; and the following line invokes ←EIData to read the word in and write it to memory via the Memory Data Register (MDR). The next instruction then decrements a word counter in a register named EE and loops back to EInLoop (“GOTO[EInLoop]”). (If this word counter underflows then the packet is too large for the microcode to handle and is abandoned.)

An important diversion is in order to discuss how branches work in Star microcode. By default, each microinstruction has an INIA (Initial Next Instruction Address) field that tells the processor where to find the next instruction to be executed. Microinstructions need not be ordered sequentially in memory, and in fact, generally are not (this makes looking at a raw dump of microcode highly entertaining). At the end of every instruction, the processor looks at the INIA field and jumps to that address.

To enable conditional jumps, a microinstruction can specify one of several types of Branches or Dispatches. These cause the processor to modify the INIA of the next instruction by OR’ing in one or more bits based on a condition or status present during the current instruction. (This is then referred to as NIA, for Next Instruction Address). For example, the aforementioned word counter underflow is checked by the line:

ERead:    EE ← EE - 1, ZeroBr, GOTO[EInLoop], c3, at[0C,10,ERead];

The EE register is decremented by 1 and the ZeroBr field specifies a branch if the result of that operation was zero. If that was the case, then the INIA of the next instruction (at EInLoop) is modified — ZeroBr will OR a “1” into it.

EInLoop:    MAR ← E ← [rhE, E + 1], EtherDisp, BRANCH[$,EITooLong],    c1;

This branch is specified by the BRANCH[$,EITooLong] assembler macro, which names the two possible destinations of the branch. The dollar sign ($) indicates that in the case of no branch, the next sequential instruction should be executed, and that that instruction needs no special address. In the case of a branch (indicating underflow) the processor will jump to EITooLong instead.

Clear as mud? Good! So how does this loop exit under normal conditions? In the microcode instruction at EInLoop there is the clause EtherDisp. This causes a microcode dispatch — a multi-way jump — based on two status bits from the Ethernet controller. The least-significant bit in this status is the Attn bit, used to indicate that the Ethernet controller has something to report: A completed packet, a hardware error, etc. The other bit is always zero if the Ethernet controller is installed. (If it’s not present, the bit is always 1).

Just like a conditional branch, a dispatch modifies the INIA of the next instruction by ORing those status bits in to form the final NIA. The instruction following EInLoop is:

MDR ← EIData, DISP4[ERead, 0C],    c2;

The important part to us right now is the DISP4 assembler macro: this sets up a dispatch table starting with the label ERead, which it places at address 0x0C (binary: 1100). Note how the lower two bits in this address are clear, to allow branches and dispatches to OR modified bits in. In the case where EtherDisp specifies no special conditions (all bits zero) the INIA of this instruction is unmodified and left as 0x0C and the loop continues. In the case of a normal packet completion, EtherDisp will indicate that the Attn bit is set, ORing in 1, resulting in an NIA of 0x0D (binary: 1101).
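The address arithmetic is simple enough to show in a few lines (a Python sketch of the mechanism described above, not Darkstar’s internals):

# NIA is just INIA with the branch/dispatch condition bits ORed in.
def next_instruction_address(inia, condition_bits):
    return inia | condition_bits

EREAD = 0x0C                                            # dispatch table base, low two bits clear
assert next_instruction_address(EREAD, 0b00) == 0x0C    # no conditions: keep looping
assert next_instruction_address(EREAD, 0b01) == 0x0D    # Attn set: packet complete
assert next_instruction_address(EREAD, 0b10) == 0x0E    # page cross (more on this below)
assert next_instruction_address(EREAD, 0b11) == 0x0F    # both at once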

This all looked pretty straightforward and I didn’t see any obvious way a single word could get lost here, so I looked at the other ways this loop could be exited — how do we get to the instruction at 0x0E (binary: 1110) from the dispatch caused by EtherDisp? At first this left me scratching my head — as mentioned earlier, the second bit masked in by EtherDisp is always zero! The clue is in what the instruction at 0x0E does: it jumps to a Page Cross handler for the routine.

This of course requires another brief (not so brief?) diversion into Central Processor minutiae. The Star’s Central Processor contains a simple mechanism for providing virtual memory via a Page Map, which maps virtual addresses to physical addresses. Each page is 256 words in size, and the CP has special safeguards in place to trap memory accesses that might cross a page boundary both to prevent illegal memory accesses and so the map can be maintained. In particular, any microinstruction that loads MAR via an ALU operation that causes a carry out of the low 8 bits (i.e. calculating an address that crosses a 256-word boundary) results in any memory access in the following instruction being aborted and a PageCross branch being taken. This allows the microcode to deal with Page Map-related activities (update access bits or cause a page fault, for example) before resuming the aborted memory access.
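The detection itself boils down to a carry out of the low 8 bits of the address calculation; a minimal sketch of the rule as described above:

# A 256-word page means the page offset lives in the low 8 bits of the address.
def crosses_page(address, increment=1):
    return (address & 0xFF) + increment > 0xFF

assert crosses_page(0xFF)        # 0xFF + 1 carries out of the low 8 bits: PageCross branch
assert not crosses_page(0xFE)    # stays inside the page: the memory access proceeds normally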

Whew. So, back to the code in question:

{main input loop}
EInLoop: MAR ← E ← [rhE, E + 1], EtherDisp, BRANCH[$,EITooLong], c1;
MDR ← EIData, DISP4[ERead, 0C], c2;
ERead: EE ← EE - 1, ZeroBr, GOTO[EInLoop], c3, at[0C,10,ERead];
E ← uESize, GOTO[EReadEnd], c3, at[0D,10,ERead];
E ← EIData, uETemp2 ← EE, GOTO[ERCross], c3, at[0E,10,ERead];
E ← EIData, uETemp2 ← EE, L6←L6.ERCrossEnd, GOTO[ERCross], c3, at[0F,10,ERead];

Imagine (if you will) that register E (the Ethernet controller microcode gets two whole CPU registers of its very own and their names are E and EE) contains 0xFF (255) and the processor is running the instruction at EInLoop.  The ALU adds 1 to it, resulting in 0x100 — this is a carry out of the low 8 bits and so a PageCross branch is forced during the next instruction.  A PageCross branch will OR a “2” into the INIA of the next instruction.

The next instruction attempts to store the next word from the Ethernet’s input FIFO into memory via the MDR←EIData operation but this store is aborted due to the Page Cross caused during the last instruction.  And at last, a 2 is ORed into INIA, causing a dispatch to 0x0E (binary: 1110).  So in answer to our (now much earlier) question:  The routine at 0x0E is invoked when a Page Cross occurs while reading in an Ethernet packet.  (How the code gets to the routine at 0x0F is left as an exercise to the reader.)

And as it turns out, it’s the instruction at 0x0E that’s triggering the bug in my emulated Ethernet controller. 

E ← EIData, uETemp2 ← EE, GOTO[ERCross],    c3, at[0E,10,ERead];

Note the E←EIData operation being invoked — it’s reading in the word from the Ethernet controller for a second time during this turn through the loop, and remember that the first time it did this, it threw the result away since the MDR←EIData operation was canceled.  This second read is done with the intent to store the abandoned word away (in register E) until the Map operation is completed.

So what’s the issue here?  On the real hardware, those two ←EIData operations return the same data word rather than reading the next word from the input packet.  This is in fact one of the more clearly spelled-out details in the Ethernet schematic — it even explains why it’s happening! — one that I completely, entirely missed when writing the emulation:

Seems pretty clear to me…

Microinstructions in the Star’s Central Processor are grouped into clicks of three instructions each; a click’s worth of instructions execute atomically — they cannot be interrupted.  Each instruction in a click executes in a single cycle, referred to as Cycle 1, Cycle 2, and Cycle 3 (or c1, c2, and c3 for short).  You can see these cycles notated in the microcode snippet above.  Some microcode functions behave differently depending on what cycle they fall on.  ←EIData only loads in the next word from the Ethernet FIFO when executed during a c2; an ←EIData during c1 or c3 returns the last word loaded.  I had missed this detail, and as a result, my emulation caused any invocation of ←EIData to pull the next word from the FIFO.  As demonstrated above this nearly works, but causes a single word to be lost when a packet read crosses a page boundary.
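The fix amounted to making ←EIData’s result depend on the cycle it executes in. A minimal sketch of the idea (in Python for brevity; Darkstar itself is written in C# and the names here are hypothetical):

from collections import deque

class EthernetInput:
    def __init__(self):
        self.fifo = deque()              # words received from the (emulated) wire
        self.last_word = 0

    def eidata(self, cycle):
        # Only a c2 ←EIData pops the next word from the input FIFO;
        # a c1 or c3 ←EIData just returns the word loaded by the last c2.
        if cycle == 2 and self.fifo:
            self.last_word = self.fifo.popleft()
        return self.last_word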

I fixed the ←EIData issue in Darkstar and at long last, Ethernet is working properly.  I was even able to connect to one of the machines here at the museum:

The release on Github has been updated; grab a copy and let me know how it works for you!

If you’re interested in learning more about how the Star works at the microcode level, the Hardware Reference and Microcode Reference are a good starting point. Or drop me a line!

Letting the cat out of the bag!

Back in September 2018, we got a new addition to the LCM+L computer collection, but we didn’t talk about it. This was something that we had been looking for for 10 years or more. We knew where a couple of these machines were, but were not able to convince the owner of that collection to part with one. We kept looking, but the ones this collector had were the only ones we could find. Well, that isn’t quite true: Stephen Jones and I went down to the San Francisco Bay Area to visit the Computer History Museum and photograph one they had in their warehouse, but they weren’t willing to part with it. We thought maybe we could build one.

Stephen went to Australia on a hunt, and came back with some bits of one of these, but not a whole machine.

Stephen finally got the owner of the big collection in Sweden on the phone, and got him, Peter Lothberg, to talk to us again about his collection in late spring last year. A bunch more negotiations happened. Paul Allen was brought in, and a contract was signed.

A couple of our Archivists, Cynde Moya and Amelia Roberts, along with Stephen Jones and Jeff Kaylin from Engineering, went to Stockholm in August to go over the whole collection, pack up everything Peter was willing to sell us, put it in containers, and get them on a ship.

In September, the containers came in, and around 8 one morning, the first container arrived at our loading dock. Stephen sent Paul Allen a picture of the open container, and he was here by 8:30. He was very excited, because he had been waiting for one of these for over 10 years. He asked Stephen “Can you have it working in 3 weeks?” Stephen said something like “How about by the end of the year?”

Meet KATIA:

KATIA: a PDP10-KA, Serial Number 175.

The first thing we had to do to KATIA to get it running was to upgrade the power supplies. For a lot of our older mainframes, we have decided to leave the old supplies in place, and just move the wiring over to new efficient and reliable switching power supplies. David Cameron and I spent around a week figuring out how to do this, this time around, and I think it looks pretty good!

KATIA Bay 1 power supplies (sideways).

The silver bits are the new supplies and their mounting plates, mounted either in front of or next to the original power supply filter capacitors.

We also had to upgrade the power supplies in the memory. We had designed new modules to replace the old ones, but the parts were scarce, so we just replaced all the old aluminum electrolytic capacitors in the existing modules.

We re-configured the main blowers to run on lazy American electrons (120V), as opposed to those very energetic (240V) European electrons that they were set up for.

We powered up the machine, and I started into the debug process. I started by adjusting all those new power supplies so that all the correct voltages appeared at the correct places, and they all shared the load as best I could.

Now, what can it do? One of the first things we would need is to be able to read diagnostics on the paper tape reader. With the KI, I had discovered this process requires the adder part of the CPU to work, and so I fired up the Way-Back machine to recover how I had tested the KI back then. It only took two instructions, an ADDI (add immediate) and a JRST (jump). But wait, they have to be stored somewhere! I had to fix some memory.

In PDP10’s there are two kinds of memory: Accumulators (the first 16 locations of memory) and main memory. The Accumulators in most PDP10’s live in special logic inside the CPU, called “Fast Accumulators”. I had to steal a board from one of the other 2 KA’s that came with KATIA in order to get the Fast AC’s to work.

I then started poking at the MG10 128K word memory box we had hooked up to KATIA. I got 32K of it working, and that was enough to run the paper tape diagnostics. I also had to replace the light bulb in the card reader.

Now to check the adder: I loaded the two instructions with the front panel, and no, the adder wasn’t working. It couldn’t propagate a carry past bit 20. After a bunch of poking around I found that there was some kind of a connection problem between cards 2A29 and 2A30, which are in chassis 2 (the middle one), on the top row, about 3/4 of the way from the left. Swapping them and swapping them back seems to fix it.

After about another 4 weeks, with plenty of tearing of hair and gnashing of teeth, lots of board swapping and transistor changing, I finally got KATIA to run the KA versions of diagnostics that we run from paper tape on the KI! From that point the tapes get too big for the paper tape reader to handle easily, so on the KI we run those from the DECtape drives.

It took another couple of weeks to get the DECtape drives and controllers working on both the KA and KI. We needed the KI to work so that we could write the proper diagnostics onto a DECtape for KATIA to read. The problem with that is that the KI hasn’t been able to boot in a few months. It passes all the paper tape diagnostics, but there is something fishy with the disk interfaces. Bother!

I fixed the rest of the MG10 so we had 128K words of memory working.

While I am working on the KI on a Thursday afternoon, it gets decided we should get KATIA down from the 3rd floor and on display in the 2nd floor computer room before opening on Friday! This involves detaching the DECtape cabinet from one end, the memory from the other end, and disconnecting a whole bunch of cables that go between chassis 3 and the other two chassis. We also have to make space by moving the IBM 360/30 out of where we want to put KATIA.

We start by disconnecting all the required stuff Thursday afternoon and getting the 360/30 put back together enough to move, and we move everything.

By Friday all KATIA’s boxes have power again, and I carefully put all the cables back where they came from the day before. Half of the Museum shows up for powering up KATIA in its (her?) new home. I turn it on, and poke the Read-In button to load up my simple memory diagnostic, and… nothing happens!

Has the adder stopped working again? Well, yes. I wiggle the cards, and voila, KATIA can add again. Yay! I poke the Read-In button again and… not nothing! The tape goes through the reader, but the lights keep incrementing, even after the end of the tape has fallen out of the reader. THAT is not Correct!

Finally on Friday morning two WEEKS later, I have grovelled over the gospel according to DEC, and abased myself before the computer gods to almost understand how Read-In is supposed to work. Hmm, is that a string to pull on?

Read-In works by loading a BLKI instruction into the instruction register of the machine and executing it. PDP10’s are NOT Reduced Instruction Set Computers! The BLKI instruction reads a memory location (a pointer), adds 1 to each half of the data there, and puts it back in memory. It then does the I/O transfer implied by the “I”, using the right half of that memory location to point at where in memory to put the data it read from the paper tape reader. Then it looks at the left half of the pointer: if it is zero, it skips the next instruction; if it wasn’t, it executes the next instruction.
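Modeling that in a few lines helped keep the halves straight. This is a sketch that follows the description above (36-bit words, 18-bit halves), not a cycle-accurate KA10:

HALF = (1 << 18) - 1                                   # an 18-bit half word

def blki(memory, pointer_address, word_from_reader):
    pointer = memory[pointer_address]
    left  = ((pointer >> 18) + 1) & HALF               # add 1 to each half of the pointer
    right = ((pointer & HALF) + 1) & HALF
    memory[pointer_address] = (left << 18) | right     # put it back in memory
    memory[right] = word_from_reader                   # deposit the word from the tape reader
    return left == 0                                   # skip the next instruction if the left half is zero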

As I was watching it try to do this with the oscilloscope, it didn’t appear that the part where it was reading the pointer location was taking as long as it should have, considering that the pointer was located in core memory at the time, and reading and writing it should have taken a whole microsecond!

The I/O instruction was decoded over in chassis 3, but the instruction timing and memory control were over in chassis 1. Eventually I noticed that the part of the instruction it was skipping was the memory access, which was supposed to be started by a signal called “IOT BLK”, which wasn’t making it to chassis 1. I could see it in chassis 3! The chassis 3 end of the cable had been undone to move the machine, so let’s take a look!

Cable 3E06. Note broken resistor near the bottom middle of the card.

I replaced the broken resistor, but it still didn’t work! OK let’s look farther up the cable, did anything else happen in the move?

3E06 to 1L44 cable.

That doesn’t look right! I suspect that somehow, in the move, this cable escaped momentarily enough to get caught under a caster and rubbed away, because it used to work, and the signal in question is one of the top two signals in the cable. The consequences of doing things in a hurry.

Do I pull this cable from one of the other machines we have or do I try to repair it?

Cable repaired.

As soon as this was done, KATIA went back to behaving for me. I ran all the diagnostics we had run, and a few more that we ran from DECtape on the KI, till I got to the one that also failed on the KI. We found the locations in this one that we had to patch for the KI, and it now runs too!

Did I get it done in 3 weeks? Unfortunately no, so Paul Allen didn’t get to see it run, and play with it. He must have known back when he asked about 3 weeks, because it was very close to three weeks from then that he left us.

KATIA: For you Paul! Sorry I didn’t get it working in time.

Bruce Sherry

Bendix G-15 – Solder Degradation

In the process of restoring the Bendix G-15, we have discovered a phenomenon that degrades the electrical connections which provide bias and signal flow, rendering the computer non-functional.

Failed Connections

Below is a group of photos which illuminate this failure mode, called “electromigration”. This process is caused by a continuous DC potential applied to a metal junction. Metal ions migrate in the direction of current flow. For most new machines, this is not a problem, as this process takes quite a number of years to progress to the point where the electrical connection is broken. At LCM+L, we get machines after they have run for a long time. Worse yet, since it is our intent to restore and run the machines for as long as we can, it is necessary to find a solution that requires this maintenance only every decade or so.

A failed connection (as verified with an ohmmeter). Please note the circular crack running around and just above the base of the circular conductor.
A similar failed connection.
This one hadn’t quite failed. You can see just a small connection at around 260 degrees. This connection will fail in a fairly short period of time.

Failing Solder

This same phenomenon plays out in the metal structure of the solder itself. The photos below show the before and after of solder restoration. In the first photo, the solder looks dull and mottled. This is due to the tin having migrated out, leaving only lead in the Tin/Lead solder formulations used until the early 2000’s. The modern formulations are Tin/Silver/Copper and are much less likely to have metal ion migration.

The large resistors at the top show the effects of tin migration.
After removing the old solder and replacing it with a modern formulation, you can see the solder is smooth and bright, indicating good integrity.

Long, Repetitive Work

This restoration process took quite a while. After determining that all the tube modules in the machine were affected in this way, we simply set about removing a module and then removing and replacing the solder in all the high, continuous-current sections.

An interesting article on solder, covering some of the topics mentioned here, can be found at: https://en.wikipedia.org/wiki/Solder

Xerox ALTO – Interesting Issue

In the process of restoring the Xerox ALTO, an interesting issue came up.

Background

We received our first ALTO in running condition and, after evaluation and testing, put it on the exhibit floor, available to the public. One afternoon about a year later, the machine suddenly froze and stopped functioning. It was taken off the floor and evaluated in one of our labs. When it became clear that power supply current was not flowing into random parts of the backplane, the focus shifted to the power supply rails. It was there we were confronted with this phenomenon.

The ALTO Was A Prototype

Certain production and test details were left out of the ALTO. The amount of current running through individual pins supplying regulated DC to the logic is unusually high. Most of the time, power supply current is fed through as many pins as possible to reduce the total current running through any one pin. Because the ALTO was a prototype, the designers only used the minimum number of pins to do the job. This resulted in a phenomenon called “electro-migration”. It is the reverse of the process used for electroplating. In this instance, tin ions migrate away from the solder joints carrying the power supply current. The six pins in the center of the first photo show a mottled (instead of smooth) surface, and one of the pins has a dark ring around it indicating where the solder has totally migrated away from the connection (lower right pin). The second photo shows six pins where the tin has migrated away.

Example of electro-migration on Xerox ALTO power supply bus.
Another example. Here all six connections in the middle of this photo are compromised.

Confronted with preventing this in the future, LCM Engineering increased the surface area of the connections by soldering brass buss rails to all of the ALTO backplane power supply pins. This is shown in the photo below:

Brass buss rails soldered to ALTO backplane to increase power supply current capacity. The buss rails are the vertical elements running through the backplane.

Once this fix was applied, the ALTO was put back in service, and this phenomenon has not repeated itself.

The ALTO will be monitored to see if this phenomenon shows up again. This chapter has also been instructive for some of the other machines we are restoring. In these instances, it is extreme age, rather than something done for a prototype, that is the causative factor.

Introducing Darkstar: A Xerox Star Emulator

Star History and Development

The Xerox 8010 Information System (“Star”)

In 1981, Xerox released the Xerox 8010 Information System (codenamed “Dandelion” during development), commonly referred to as the Star. The Star took what Xerox learned from the research and experimentation done with the Alto at Xerox PARC and attempted to build a commercial product from it.  It was envisioned as the centerpiece of the office of the future, combining high-resolution graphics with the now-familiar mouse, Ethernet networking for sharing and collaborating, and Xerox’s Laser Printer technology for faithful “WYSIWYG” document reproduction.  The Star’s operating system (called “Star” at the outset, though later renamed “Viewpoint”) introduced the Desktop Metaphor to the world.  In combination with the Star’s unique keyboard it provided a flexible, intuitive environment for creating and collaborating on documents and mail in a networked office environment.

The Star’s Keyboard

Xerox later sold the Star hardware as the “Xerox 1108 Scientific Information Processor” – in this form it competed with Lisp workstations from Symbolics, LMI, and Texas Instruments in the burgeoning AI workstation market, and while it wasn’t quite as powerful as any of their offerings it was considerably more affordable – and sometimes much smaller.  (The Symbolics 3600 workstation, c. 1983, was the size of a refrigerator and cost over $100,000).

The Star never sold well – it was expensive ($16,500 for a single workstation and most offices would need far more than just one) and despite being flexible and powerful, it was also quite slow. Unlike the IBM PC, which also made its debut in 1981 and would eventually sell millions, Xerox ended up selling somewhere in the neighborhood of 25,000 systems, making the task of finding a working Star a challenge these days.

Given its history and relationship to the Alto, the Star seemed appropriate for my next emulation project. (You can find the Alto emulator, ContrAlto, here). As with the Alto a substantial amount of detailed hardware documentation had been preserved and archived, making it possible to learn about the machine’s inner workings… except in a few rather important places:


From the March 1982 edition of the Dandelion Hardware Manual.  Still waiting for these sections to be written…

Fortunately, Al Kossow at Bitsavers was able to provide extra documentation that filled in most of the holes.  Cross-referencing all of this with the available schematics, it looked like there was enough information to make the project possible.

The Dandelion Hardware

The Star’s Central Processor (CP). Note the ALU (4xAM2901, top middle) and 4KW microcode store (bottom)

Much like the Alto, the Dandelion’s Central Processor (referred to as the “CP”) is microcoded, and, again like the Alto, this microcode is responsible for controlling various peripherals, including the display, Ethernet, and hard drive.  The CP is also responsible for executing bytecode macroinstructions.  These macroinstructions are what the Star’s user programs and operating systems are actually compiled to.  The CP is sometimes referred to as the “Mesa” processor because it was designed to efficiently execute Mesa bytecodes, but it was in no way limited to implementing just the Mesa instruction set: The Interlisp-D and Smalltalk systems defined their own microcode for executing their own bytecodes, custom-tailored and optimized to their environments.

Mesa was a strongly-typed “high-level language.” (Xerox hackers loved their puns…) It originated on the Alto but quickly grew too large for it (a smaller, stripped-down Mesa called “Butte” (i.e. “a small Mesa”) existed for the Alto but was still fairly unwieldy.)  The Star’s primary operating system was written in Mesa, which allowed a set of very sophisticated tools to be developed in a relatively short period of time.

The Star architecture offloaded the control of lower-speed devices (the keyboard and mouse, serial ports, and the floppy drive) to an 8-bit Intel 8085-based I/O processor board, referred to as the IOP.  The IOP is responsible for booting the system: it runs basic diagnostics, loads microcode into the Central Processor and starts it running.  Once the CP is running, it takes over and works in tandem with the IOP.

Emulator Development

The Star’s I/O Processor (IOP). Intel 8085 is center-right.

Since the IOP brings the whole system up, it seemed that the IOP was the logical place to begin implementing the emulator.  I started with an emulation of the 8085 processor and hooked up the IOP ROMs and RAMs.  Since the first thing the IOP does at power up or reset is execute a vigorous set of self-tests, the IOP was, in effect, testing my work as I progressed, which was extremely helpful.  This is one important lesson Xerox learned from the Alto and applied to the Star: on-board diagnostics are a good thing.  The Alto had no diagnostic facilities built in, so if anything failed that prevented the system from running, the only way to determine the fault was to get out the oscilloscope and the schematics and start probing.  On the Star, diagnostics and status are reported through a 4-digit LED display, the “Maintenance Panel” (or MP for short).  If the IOP finds a fault during testing, it presents a series of codes on this panel.  During a normal system boot, various codes are displayed to indicate progress.  The MP was the first I/O device I emulated on the IOP, for obvious reasons.

Development on the IOP progressed nicely for several weeks (and the codes reported in the emulated MP kept increasing, reflecting my progress in a quantitative way) and during this time I implemented a source-level debugger for the IOP’s 8085 code to help me along.  This was invaluable in working out what the IOP was trying to do and why it was failing to do so.  It allowed me to step through the original code, place breakpoints, and investigate the contents of the IOP’s registers and memory while the emulated system was running.

The IOP Debugger

Once the IOP self-tests were passing, the IOP emulation was running to the point where it attempted to actually boot the Central Processor!  This meant I had to shift gears and switch over to implementing an emulation of the CP and make it talk to the IOP. This is where the real fun began.

For the next couple of months I hunkered down and implemented a rough emulation of the CP, starting with the system’s 16-bit ALU (implemented with four 4-bit AM2901 ALU chips chained together).  The 2901 (see top portion of the following diagram) forms the nexus of the processor; in addition to providing the processor’s 16 registers and basic arithmetic and logical operations, it is the primary data path between the “X bus” and “Y bus.”  The X Bus provides inputs to the ALU from various sources: I/O devices, main memory, a handful of special-purpose register files, and the Mesa stack and bytecode buffer.  The ALU’s output connects to the Y bus, providing inputs back into these same components.

The Star Central Processor Data Paths

The major issue I was confronted with nearly immediately when writing the CP emulation was fidelity: how faithful to the hardware does this emulation need to be? This issue arose specifically because of two hardware details related to the ALU and its inputs:

  1. The AM2901 ALU has a set of flags that get raised based on the result of an ALU operation (for example, the “Carry” flag gets raised if the result of an operation causes a carry out from the most-significant bits). For arithmetic operations these flags make sense but the 2901 also sets these flags as the result of logical operations. The meaning of the flags in these cases is opaque and of no real use to programmers (what does it mean for a “carry” flag to be set as a result of a logical OR?) and exist only as a side-effect of the ALU’s internal logic. But they are documented in the spec sheet (see the picture below).
  2. With a 137ns clock cycle time, the CP pushes the underlying hardware to its limits. As a result, some combinations of input sources requested by a microinstruction will not produce valid results because the data simply cannot all make it to its destination on time. Some combinations will produce garbage in all bits, but some will be correct only in the lower nibble or byte of the result, with the upper bits being undefined. (This is due to the ALU in the CP being comprised of four 4-bit ALUs chained together.)
Logic equations for the “NOT R XOR S” ALU operation’s flags. What it means is an exercise left to the reader.

I spent a good deal of time pondering and experimenting. For #1, I decided to implement my ALU emulation with the assumption that Xerox’s microcode would not make use of the condition flags for non-arithmetic operations, as I could see no reason to make use of them for logical ops and implementing the equations for all of them would be computationally expensive, making the emulation slower. This ended up being a valid assumption for all logical ops except for OR — as it turns out, some microcode assumed that the Carry flag would be set appropriately for this class of operation. When this issue was found, I added the appropriate operations to my ALU implementation.

For #2 I assumed that if Xerox’s microcode made use of any “invalid” combinations of input sources, that it wouldn’t depend on the garbage portion of the results. (That is, if code made use of microinstructions that would only produce valid results in the lower 4 or 8 bits, the microcode would also only depend on the lower 4 or 8 bits generated.) Thus the emulated ALU always produces a complete, correct result across all 16-bits regardless of input source. This assumption appears to have held — I have encountered no real-world microcode that makes assumptions about undefined results thus far.

The above compromises were made for reasons of implementation simplicity and efficiency. The downside is that it is possible to write microcode that will behave differently on the emulation than on the real hardware. However, going through the time, trouble, and expense of a 100% accurate emulation did not seem worth it when no real microcode would ever require this level of accuracy. Emulation is full of trade-offs like this. It would be great to provide an emulation that is perfect in every respect, but sometimes compromises must be made.

I implemented a debugger and disassembler for the CP similar to the one I put together when emulating the IOP.  Emulation of the various X bus-related registers and devices followed, and slowly but surely the CP started passing boot diagnostics as I fixed bugs and implemented missing hardware.  Finally it reached the point where it moved from the diagnostic stage to executing the first Mesa bytecodes of the operating system – the Star was now executing real code!  At that time it seemed appropriate to implement the Star’s display controller so I could see what the Star was trying to tell me – and a few days and much debugging of the central processor later I was greeted with this display from the install floppy (and there was much rejoicing):

The emulated Star says “Hello” for the very first time

Following this I spent two weeks of late nights hacking — implementing the hard disk controller and fixing bugs.  The Star’s hard drive controller doesn’t use an off-the-shelf controller chip as this wasn’t an option at the time the Star was being developed in the late 1970s. It’s a very clever, minimal design with most of the heavy lifting being done in microcode rather than hardware. Thus the emulation has to work at a very low level, simulating (in a sense) the rotation of the platters and providing data from the disk as it moves under the heads, one word at a time (and at just the right time.)
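Conceptually, the emulated drive behaves something like this sketch (hypothetical names and made-up geometry; the real controller emulation also has to model sector marks, headers, and exact timing):

class SpinningDisk:
    WORDS_PER_TRACK = 1000               # made-up geometry, purely for illustration

    def __init__(self, image):
        self.image = image               # image[cylinder][head][word]
        self.position = 0                # rotational position, in words

    def next_word(self, cylinder, head):
        # Called once per emulated word time: return whatever is passing under
        # the head right now, then advance the platter by one word.
        word = self.image[cylinder][head][self.position]
        self.position = (self.position + 1) % self.WORDS_PER_TRACK
        return word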

During this period I also got to learn how Xerox’s hard disk formatting and diagnostic tools worked.  This involved some reverse engineering:  Xerox didn’t want end-users to be able to do destructive things with their hard disks so these tools were password protected.  If you needed your drive reformatted you called a Xerox service engineer and they came out to take care of it (for a minor service charge).  These days, these service engineers are in short supply for some reason.

Luckily, the passcodes are stored in plaintext on the floppy disk so they were easy to unearth.  For future reference, the password is “wizard” or “elf” (if you’re so inclined):

Having solved The Mystery of the Missing Passwords I was at last able to format a virtual hard disk and install Viewpoint, and after waiting nervously for the installation to finish I was rewarded with:

Viewpoint, at long last!

Everything looked good, until the hard disk immediately corrupted itself and the system crashed!  It was very encouraging to see a real operating system running (or nearly so), and over the following weeks I hammered out the remaining issues and started on a design for a real user interface for the emulator. 

I gave it a name: Darkstar.  It starts with a “D” (thus falling in line with the rest of the “D-Machines” produced by Xerox) contains “Star” in the name, and is also a nerdy reference to a cult-classic sci-fi film.  Perfect. 

Getting Darkstar

Darkstar is available for download on our Github site and is open source under the BSD 2-Clause license.  It runs on Windows and on Unix systems using the Mono runtime.  It is still very much a work in progress.  Feedback, bug reports, and contributions are always welcome.

Fun with the Star

You’ve downloaded and installed Darkstar and have perused the documentation – now what?  Darkstar doesn’t come with any Xerox software, but pre-built hard disk images are available on Bitsavers (and for the more adventurous among you, piles of floppy disk images are available if you want to install something yourself).  Grab http://bitsavers.org/bits/Xerox/8010/8010_hd_images.zip — this contains hard disk images for Viewpoint 2.0, XDE 5.0, and The Harmony release of Interlisp-D. 

You’ll probably want to start with Viewpoint; it’s the crowning achievement of the Star and it invented the desktop metaphor, with icons representing documents and folders. 

To boot Viewpoint successfully you will need to set the emulated Star’s time and date appropriately – Xerox imposed a very strict licensing scheme (referred to as Product Factoring) typically with licenses that expired monthly.  Without a valid license code, Viewpoint grants users a 6-day grace period, after which all programs are deactivated. 

Since this is an emulation, we can control everything about the system so we can tell the emulated Star that it’s always just a few hours after the installation took place, bypassing the grace period expiration and allowing you to play with Viewpoint for as long as you like.  Set the date to Nov. 10, 1990 and start the system running.

Now wait.  The system is running diagnostics.

Keep waiting.  Viewpoint is loading.

Go get a coffee.

Seriously, it takes a while for Viewpoint to start up.  Xerox didn’t intend for users to reboot their Stars very often, apparently.  Once everything is loaded a graphic of a keyboard will start bouncing around the screen:

The Bouncing Keyboard

Press any key or click the mouse to get started and you will be presented with the Viewpoint Logon Option Sheet:

The Logon Option Sheet

You can login with user name “user” with password “password”.  Hit the “Next” key (mapped to “Home” on your computer’s keyboard) to move between fields, or use the mouse to click in them.  Click on the “Start” button to log in and in a few moments, there you are:

Initial Viewpoint Desktop

The world is your oyster.  Some things work as you expect – click on things to select them, double-click to open them.  Some things work a little differently – you can’t drag icons around with the mouse as you might expect: if you want to move them, use the “Move” key (F6) on your keyboard; if you want to copy them, use the “Copy” key (F4).  These two keys apply to most objects in the system: files, folders, graphical objects, you name it.  The Star made excellent use of the mouse, but it was also very keyboard-centric and employed a keyboard designed to work efficiently with the operating system and tools.  Documentation for the system is available online – check out the PDFs at http://bitsavers.org/pdf/xerox/viewpoint/VP_2.0/, as they’re worth a read to familiarize yourself with the system. 

If you want to write a new document, you can open up the “Blank Document” icon, click the “Edit” button and start writing your magnum opus:

Plagiarism?

One can change text and paragraph properties – font type, size, weight, and all other sorts of groovy things – by selecting text with the mouse (use the left mouse button to select the starting point, and the right button to define the end) and pressing the “Prop’s” key (F8):

Mad Props

If you’re an artist or just want to pretend that you are one, open up the “Blank Canvas” icon on the desktop:

MSPaint.exe’s Great Grandpappy

Need to do some quick calculations?  Check out the Calculator accessory:

I’m the Operator with my Pocket Calculator
Help!

There are of course many more things that can be done with Viewpoint, far too many to cover here.  Check out the extensive documentation as linked previously, and also look at the online training and help available from within Viewpoint itself (check the “Help” tab in the upper-right corner.)

Viewpoint is only one of a handful of systems that can be run on Darkstar. Stay tuned for future installments, covering XDE and Interlisp-D!

Bendix G-15 Vacuum Tubes

Early in the restoration and troubleshooting of the Bendix G-15 it was noted that tube filament failures occur with some regularity. It is not possible to observe working filaments on all the tube modules, as at least half the tubes have what is called a “getter coating” at the top of the tube, obscuring the filaments.
We hosted a subject matter expert to aid with troubleshooting the G-15, and he indicated that tube filament failures were the principal cause of machine downtime, usually about once per week. This invariably entailed up to a day of troubleshooting to find the offending tube(s).
Due to the above information and our own experience, it was decided to engineer a sensor and indicator system which would allow quick identification of the offending tube or tubes.
The configuration decided upon was a Hall effect sensor coupled to a passive magnetic field concentrator (wound ferrite core) placed in the current path of each individual vacuum tube filament, which would light an LED when the tube filament was functional. Up to six sensors (the largest complement of filaments in a tube module) are packaged on a substrate which fits on each tube module and are powered by the filament voltage entering each module.

We gave the sensor package the acronym “FICUS”. It breaks down to FI = Filament, CU = Current, and S = Sensor.

Here is what a hall effect sensor mated to a wound ferrite core looks like:

Below is a photo of a FICUS module in the process of being assembled:

Four element FICUS Module

Here is a completed FICUS module:

Note the mating connector and wires ready to attach to the vacuum tube module.

Here is the FICUS after the wires have been attached to the vacuum tube module:

Here is the completed vacuum tube/FICUS module ready to be plugged into the Bendix G-15:

Here is the view of the Vacuum Tube/FICUS module oriented as it would be in the G-15:

And finally, a couple of Vacuum Tube/FICUS modules in an operating G-15:

CDC Cooling update

In my last blog on the CDC, we were having a slow refrigerant leak in Bay 1, and we were waiting for parts. The new parts came, but they were wrong, then they came again, and they were wrong again. Eventually the right parts were delivered, so we took the CDC down yesterday morning to work on the cooling system.

Cole, from Hermanson, arrived at 7AM, and we proceeded to shuffle some of the chiller plumbing around so that I will always be able to restart the computer after a power failure. It took about an hour, and it seems to work fine!

The old compressor in the basement, showing where the power goes in.

A little later, Jeff Walcker, also from Hermanson, arrived to install the new compressor parts. We opened the 3-phase 60Hz breaker for Bay 1, and Jeff started un-hooking the power. We had discovered that it was leaking where the compressor power wires go into the case. As Jeff was taking it apart, all the little bits were turning into powder. The compressor is 52 years old. It was decided that we needed a new compressor. Bummer!

Contrary to recent experience, not only was a replacement compressor available, it was in stock, IN SEATTLE! Jeff was back from getting it almost before we were done deciding that was what we should do! Compared to 102 days for the chiller, this is amazing.

He did a bunch of wrestling with the compressors to get the new one in, and we don’t have the motor cooling loop hooked up yet, because the pipe didn’t want to go on the new one quite the same as it came off the old one.

New compressor installed.

You can see the disconnected copper cooling loop wrapped around the compressor in the above picture. Jeff will bring a slightly longer hose when he comes for our quarterly maintenance next month.

The CDC is happy again, and hopefully not leaking!

DEC Computer Power Supply Module Retrofit

In the process of troubleshooting our earliest machines, we had to replace large components called electrolytic capacitors. These are located in all the power supplies for any computer. We successfully replaced these devices and got the machines running. Recently, though, we have started to see these devices fail once more. They have a finite life of at most 14 years. That means that we have to replace these devices every 10 to 14 years. Also, the larger capacitors are no longer manufactured, but can still be special ordered. As it is our mission to have our computing hardware last for a lot longer than that, we did our research and engineered a replacement for the power supply modules these capacitors are found in. Our goal was to provide several decades of service without having to service these modules. The photos and descriptions below show the process:

Below is what the original power supply circuit board looked like:

When we strip out the circuit board and remove the heat sink, we get this:

We created, using a CAD program and a 3D printer, a plastic component mounting for the new components.

As you can see, the plastic mount fit perfectly into the old power module frame.

After populating the mount with all the components, it looks like this:

Now we attach the modified heat sink to the original module frame.

Installing the assembled component mount in the frame along with the modified heat sink completes the new power module.

One of the features of the module is that it has no solder connections; all of them are compression connections. Each wire is compressed into a square cross section by a stainless steel screw, which provides very high reliability.

The upshot of all this work (there were 38 modules in various machines) is power supplies that are more efficient and have a rated MTBF (mean time between failures) of 40 years. These power supply modules draw 2/3 less power and produce 2/3 less heat, reducing the heat load on all the components in a machine. In addition, as a result of these changes, the total power savings per year is 250,000 kilowatt hours. Electricity rates in this area of Seattle are about 8 cents per kilowatt hour. That means a direct cost savings on our electric bill of $20,000 a year.
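For anyone who wants to check my arithmetic, here is the calculation, using only the figures quoted above (a back-of-the-envelope sketch, not a metered measurement):

```python
# Back-of-the-envelope check of the savings quoted above, using the
# figures from the paragraph rather than any separate measurement.
energy_saved_kwh_per_year = 250_000     # total power savings per year
rate_dollars_per_kwh = 0.08             # roughly 8 cents/kWh in this part of Seattle
savings_per_year = energy_saved_kwh_per_year * rate_dollars_per_kwh
print(f"${savings_per_year:,.0f} per year")   # -> $20,000 per year
```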

Life with a 51-year-old CDC 6500

The CDC 6500 has led a rough life over the last 6 months or so: way back on the afternoon of July 2, 2018, I got an email from the CDC’s Power Control PLC telling me that it had to turn off the computer because the cooling water was too hot! A technician came out and found that the chiller was low on refrigerant. He brought it back up to the proper level, and went away. Next morning it was down again.

After much gnashing of teeth and tearing of hair, it was determined that the compressor in the chiller was bad. “We’ll have a new one in 5 weeks!” The new one turned out to be bad too, and so another, easier-to-get one was ordered: only about 3 weeks, instead of the 8 that the official one took. That worked for a few weeks, and then the CDC went down again because the water was too hot.

This time it was very puzzling because as long as the technician was here, it worked fine. He spent most of a morning watching it, decided it was OK, and left, but he didn’t make it to the freeway before it went down again. He came back, and watched for the rest of the afternoon, and found that the main condenser fan would overheat, and shut down, causing the backup fan to come on. The load wasn’t very high, so the backup fan had to cycle on and off, while the main fan motor cooled off. This would go on for a while, till both motors were off at the same time, and then the compressor would go over pressure because the condenser fans were off, and the chiller would stop cooling, resulting in the “Water too HOT” computer shutdown.

Another week went by waiting for replacement fan motors from the chiller manufacturer, with no luck. Eventually we gave up, got new fan motors locally, and installed them, and the chiller has been working since. While the CDC didn’t seem to mind being off for 102 days for the compressor problem, it didn’t like being off for 3 weeks while we fiddled with the fans.

Both when it was off for 102 days, and this time, we found that Bay 1 was low on refrigerant. The first time we just filled it up, but the second time we looked closer and found that there is a small leak where the power wires go into the Bay 1 compressor. The compressor manufacturer, the same guys that made the chiller’s compressor, will gladly sell us a new compressor, but the parts for the 50-year-old R12 compressor are no longer available. We are working on that, but I haven’t heard that we have found the parts yet.

Back to more recent times: now that the chiller is chillin’, and the CDC’s cooling system is coolin’, why isn’t the computer computing?

Let’s run some diagnostics and see what happens: I try to run my CDC diagnostic tape, but the machine complains that Central Memory doesn’t seem to be available. No, I didn’t run the real tape drives; I ran the imaginary one that uses a disk file on a PC to pretend to be a tape drive. Anyway, that didn’t work, so I flip the zillion or so Dead Start switches in my emulated Dead Start panel to fire up my Central Processor based memory test, and get no display at all! This is distinctly unusual. Let’s try my PP based Central Memory test: that seems to work until it finishes writing all of memory, then the display goes blank. Is there a pattern here?

I put a scope probe on the memory request line inside the memory controller in Chassis 3, and find that someone is requesting memory every possible cycle. There are four possible requestors: the Peripheral Processors as a group get a request line, each Central Processor gets a request line, and the missing Extended Core Memory gets one too. Let’s find out who it is: the PPs aren’t doing it, neither of the CPs is doing it, and the non-existent ECM isn’t doing it. Huh? Nobody wants memory, but ALL of it is getting requested!

I am going to step back a little bit, and try to explain why it sometimes takes me a while to fix this beast. This machine was designed before there were any standards about logic diagrams. Every manufacturer had to come up with their own scheme for schematics. Here is one where I found a problem, but we will get to that in a bit.

Now when there are two squares, one above the other, with arrows from each going to the other, those are flip-flops. When you have a square or a circle with multiple arrows going into it, that is a gate. Which one is an “or” gate, and which one is an “and” gate? Sorry, you have to figure that out for yourself, because the CDC documentation says either one can be either one. The triangle with a number in it is a test point on the edge of the module. The two overlapping circles, kind of like an ellipsis, indicate a coax cable receiver, as opposed to a regular twisted pair signal. A “P” followed by a number indicates a pin of the module.

This module receives the PP read and write signals from the PPs in chassis 1, on pins P19 and P24. On the right side of the diagram, you can see where all the pins connect. If we look at pin 24, we can see it connects to W07 wire 904, and pin 19 is connected to W07-903. The W “jacks” are coax cables; the other letter signals go somewhere inside this chassis.

Really, what we are looking at here is that a circle or a square is the collector pull-up resistor of one or more silicon NPN transistors. The arrowheads are the bases of the transistors, and the line coming into the head has a base resistor in it. If there are three arrows coming into a square, like at the bottom, those three 2N2369 transistors all have their collectors tied together, with one pull-up resistor. Maybe I am slow, but it took about 6 months before I felt I was at all fluent in reading these logic diagrams.
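If it helps, here is my own shorthand for what that circuit does (just my illustration, not anything out of the CDC manuals): with the collectors tied to one pull-up, the output is high only when every transistor is off. In positive logic that reads as a NOR, and if you treat the signals as active-low, the very same circuit reads as an AND, which is part of why the documentation can say either one can be either one.

```python
# My illustration only: several NPN transistors sharing one collector
# pull-up resistor.  Any input that turns its transistor on pulls the
# common collector low, so the output is high only when all inputs are low.
def wired_gate(*inputs: bool) -> bool:
    return not any(inputs)   # NOR in positive logic; AND for active-low signals

# wired_gate(False, False, False) -> True; wired_gate(True, False, False) -> False
```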

Now we have to talk about the Central Memory architecture a bit. The CDC has 32 banks of 16K words of memory. Each of these banks is separate, and they can be interleaved any way the 4 requestors ask for them. At the moment, I am only running half of them, because there is something wrong with the other half. Each of these banks does a complete cycle in 1µs. The memory controller in chassis 3 can put out an address every 100ns, along with whether it is for a read or a write. This request goes to all banks in parallel. If the bank it points to is available, it will send back an “accept” pulse, and that requestor is free to ask for something else. If the controller doesn’t get an “accept”, it will ask again in about 400ns. There is a bunch of modules involved in this dance, and it is a big circle.
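To make that dance a little more concrete, here is a small sketch of the request/accept handshake (my own simplification for illustration, not how the hardware is built; the timing numbers are the ones from the paragraph above, and the bank-selection rule is just an assumption):

```python
# Minimal sketch (not actual CDC logic): the request/accept handshake
# between the chassis-3 memory controller and the banks.  Times are in
# nanoseconds, taken from the description above.

BANK_CYCLE = 1000     # each bank is busy for 1 microsecond per access
ISSUE_INTERVAL = 100  # controller can put out a new address every 100 ns
RETRY_DELAY = 400     # retry roughly 400 ns after a missing "accept"
NUM_BANKS = 32

bank_busy_until = [0] * NUM_BANKS    # time at which each bank frees up

def request(addr, now):
    """Broadcast a request; return (accepted, time of next attempt)."""
    bank = addr % NUM_BANKS          # assumed interleave: low bits pick the bank
    if bank_busy_until[bank] <= now: # bank available -> "accept" pulse
        bank_busy_until[bank] = now + BANK_CYCLE
        return True, now + ISSUE_INTERVAL
    return False, now + RETRY_DELAY  # no accept -> ask again later

# Sequential addresses land in different banks, so accesses overlap nicely:
t = 0
for a in range(8):
    ok, t = request(a, t)
    print(f"addr {a:3o} accepted={ok} next request at {t} ns")
```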

A little more background: this machine was designed before there was such a thing as plated-through holes on printed circuit boards. The two boards in each module were double sided. What they did when they needed to get to the other side of a PCB was put a tiny brass rivet in the via hole and solder both sides.

What I eventually found was that the signal from P23 of the module in 3L34 wasn’t making it to pin 15! There was a via rivet that wasn’t making its connection to the other side of the board. I re-soldered all the vias on that module, and now we were only requesting memory when someone wanted it!

Now that we can request memory and have a reasonable chance of it responding correctly, it is on to testing memory. I loaded up my CP based test, and it ran… for a while. Then it quit, with a very strange error. The test uses a single bit, and its complement, to check the existence of every location of memory. It will read a location, compare it with what should be there, and put the difference in a second register. Normally I would expect a single bit error, or maybe 12 bits if a module failed that way. The result looked like 59 bad bits, with the error being exactly the same as what it read. Usually this is because the CPU that is running the test is mis-executing the compare instruction.
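Here is roughly the idea behind that test, as I read my own description above (a sketch in Python, not the actual CP code; the exact pattern sequence is my assumption):

```python
# Sketch only (not the real CP test): write a single-bit pattern to every
# location, read it back, and put the difference (an XOR, which is what a
# one-word compare amounts to) in a "second register".

WORD_MASK = (1 << 60) - 1            # CDC 6500 has 60-bit words
MEM_SIZE = 32 * 16 * 1024            # 32 banks of 16K words
memory = [0] * MEM_SIZE              # stand-in for Central Memory

def check(pattern):
    for addr in range(MEM_SIZE):
        memory[addr] = pattern
    for addr in range(MEM_SIZE):
        got = memory[addr]
        diff = got ^ pattern         # normally 0; one set bit = one bad bit
        if diff:
            # a diff equal to the value read back is the strange
            # 59-bad-bit case described above
            print(f"addr {addr:o}: read {got:020o} diff {diff:020o}")

for bit in range(60):                # a single bit, then its complement
    check(1 << bit)
    check(~(1 << bit) & WORD_MASK)
```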

While I was thinking about that, I ran the Exchange Jump Test to see what that said. A PP can cause a CP to swap all its registers, including the Program Counter, with the contents of some memory that the PP points to. This is called an Exchange Jump. The whole process happens in about 2.6µs, as it requests 16 banks of memory in sequence. This works the memory pretty hard. Exchange Jump Test (EJT) would fail after a while, and as I looked at the results, I noticed that it was usually failing on a certain bit in bank 7. I checked and noticed it was an original memory module. I looked on my bench and found I didn’t have any new ones assembled, so I had to put the sides on a couple of finished PCB assemblies and test them. I then swapped out the old memory in bank 7 with a new semiconductor memory, and EJT passed!
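In code terms, an Exchange Jump boils down to a straight swap between the CP’s register set and a block of Central Memory (my illustration of the idea only; the 16-word package size matches the 16 bank requests mentioned above, but the real hardware has a lot more going on):

```python
# Illustration only, not CDC microcode: a PP points at an "exchange
# package" in Central Memory, and the CP swaps its entire register state
# (program counter included) with those words, one bank request at a time.
def exchange_jump(cp_registers, memory, package_addr, package_words=16):
    for i in range(package_words):
        memory[package_addr + i], cp_registers[i] = (
            cp_registers[i], memory[package_addr + i])
```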

I then checked to see if my CP based memory test worked, and it did too. We are back in business after over 5 months. I am keeping my fingers crossed in the hope that the chiller stays alive for a while.

Bruce Sherry