Taming an Inclinometer

A project we are working on requires a MEMs inclinometer to assure coplanarity between a flat surface and a sensor.

The Problem

The sensor has a high gain section before an analog-to-digital converter feeds data to the microprocessor. Unless the sensor is located in a Faraday cage, the 120 hz hum generated causes severe instability in the digital data. It was determined that the center-line of this signal is the stable signal output.

Showing Sensor Level

At the extreme high and low outputs, the AC signal is compressed, but the center-line signal is still correct.

Sensor Tilted Back
Sensor Tilted Forward

The code takes 20 asynchronous X and Y axis readings, finds the highest and and lowest, determines their difference, and divides by two to get the current X and Y inclination.

This code snippet just covers the search for highest and lowest value. We have left off the hardware code as it is different for each platform.

    tiltDataXhiAccum = tiltDataX[0];
    tiltDataYhiAccum = tiltDataY[0];
    tiltDataXloAccum = tiltDataX[0];
    tiltDataYloAccum = tiltDataY[0];

  //this one finds the highest value
   for ( j = 1; j < 20; j++ )
  {
       if ( tiltDataXhiAccum < tiltDataX[j] )
     {
            tiltDataXhiAccum = tiltDataX[j];
     }
       if ( tiltDataYhiAccum < tiltDataY[j] )
     {
            tiltDataYhiAccum = tiltDataY[j];
     }
  }

    //this one finds the lowest value
   for ( j = 1; j < 20; j++ )
  {
       if ( tiltDataXloAccum > tiltDataX[j] )
     {
            tiltDataXloAccum = tiltDataX[j];
     }
       if ( tiltDataYloAccum > tiltDataY[j] )
     {
            tiltDataYloAccum = tiltDataY[j];
     }
  }

  tiltDataXAccum = ( tiltDataXhiAccum - tiltDataXloAccum ) / 2;
  tiltDataYAccum = ( tiltDataYhiAccum - tiltDataYloAccum ) / 2;

This solution allowed us to go from a realtime bobble rate of over 5 per second with differences of 10 or more, to a stable reading that might change every few seconds with a worst case difference of 1.

At Home With Josh Part 8: Lisp System Installation

In our last installment our intrepid adventurer had gotten the Interphase 2181 SMD controller running again, and had used it to do a low-level format of a gigantic 160mb Fujitsu hard drive. This left him with all the ingredients needed to put together a running LMI Lambda system, at least in theory.

Tapes and Tape Drives

I had intended to wait until I’d found the proper 9-track tape drive for the system before attempting to go through the installation process. As you might recall, the Qualstar drive I have on the system is functional but extremely slow; it takes several minutes to find and load tiny diagnostic programs from tape. As system installation requires copying 20-30 megabytes from tape (i.e. a lot of data), it seemed to me that doing an installation from the Qualstar would simply take too long to be practical.

But on the other hand, the drive was functional and it occurred to me that possibly it was just the SDU “tar” utility’s simplicity that might be causing the extremely slow transfer rate: if it was overly conservative in its reads from tape, on an unbuffered drive like the Qualstar it might end up being very inefficient. Maybe the “load” tool would be a bit more intelligent in its tape handling. Or perhaps not — but there’s no harm in trying, right? And while I’d tracked down a proper Cipher F880 tape drive, it would require waiting until the current quarantine was lifted to go and pick it up. I demanded instant gratification so off I went. Except…

Shiny new 9-track tapes! Ok, they’re still twenty years old. That’s pretty new!

The other pressing issue was one of tapes. I have a small pile of blank (or otherwise unimportant) 9-track tapes here at home but all of them were showing signs of shedding and none of them worked well enough for me to write out a complete Lambda Install tape. Despite a few cleaning passes, eventually enough oxide would shed off the tape to gum up the heads and cause errors. Clearly I would need to find some better tapes, so I hit up eBay and found a stack of 5 tapes, new-old stock (apparently from NASA). And waited patiently for them to arrive.

[About a week passes…]

The Actual Installation

With new tapes in hand I was finally able to write out the “Install” tape without errors. And thus, with my fingers crossed and a rabbit’s foot in my pocket I started the installation process. The “load” utility is used to set up new hard disks and can copy files to and from tape to do installation and maintenance tasks. Here’s a transcription of the operation:

SDU Monitor version 102 
>> disksetup 
What kind of disk do you have? 
Select one of { eagle cdc-515 t-302 micro-169 cdc-9766 }: micro-169

>> /tar/load
using 220K in slot 9
load version 307
(creating block-22 mini-label)
(creating mini-label)
Disk is micro-169
Loading "/tar/bigtape"
Loading "/tar/st2181"
Disk unit 0 needs to be initialized:
Disk has no label, or mini-label is wrong.
Create new unit 0 label from scratch? (y/n) y
Creating lisp label from scratch.
How many LAMBDA processors: 1
Type "?" for command list.
load >

The initial steps above tell the SDU that I have a “micro-169″ disk (the 8-inch equivalent of the giant 14” Fujitsu I actually have installed). This is necessary to allow the load program to know the characteristics of the system’s disk. /tar/load is then executed and since it finds an empty disk, it sets up the disk’s label, the LMI’s equivalent of a partition table — information written to the beginning of the disk that describes the disk and slices the its space into partitions that can be used to hold files or entire filesystems. Even though this Lambda is a “2X2” system (with two LAMBDA processors) it would be a tight squeeze to run both of them in the the 160mb capacity of the drive, so for now I will only be running one of the two processors. Or trying to, anyway. (Oooh, foreshadowing!)

Continuing on:

load > install

*****************************************************
The new backup label track number is 16340.
Record this number and keep it with the machine.
*****************************************************
Writing unit 0 label
Using half-inch tape
Installing track-0 disk driver ...
copying 10 blocks from "/tar/disk" to "disk"
copy done

Tape-ID = "FRED gm 7/23/86 12:33:34 522520414 "
File is "SDU5 3.0 rev 14"; 1500 blocks.
"SDU5 3.0 rev 14" wants to be loaded into UNX6.
reading 1500 blocks into UNX6.
copying 1500 blocks from "bigtape" to "UNX6"
copy done

Next file ...

File is "ULAMBDA 1764"; 204 blocks.
Default partition to load into is LMC3
reading 204 blocks into LMC3.
copying 204 blocks from "bigtape" to "LMC3"
copy done

Next file ...

File is " 500.0 (12/8)"; 23189 blocks.
Default partition to load into is LOD1
reading 23189 blocks into LOD1.
copying 23189 blocks from "bigtape" to "LOD1"
copy done
Next file ...

End of tape.
Writing unit 0 label
load >

There are three tape files that the install process brings in; you can see them being copied above. The first (“SDU5 3.0 rev 14”) contains a set of tools for the SDU to use, diagnostics and bootstrap programs. The second (“ULAMBDA 1764″) contains a set of microcode files for use by the Lambda processor. The Lambda CPU is microcoded, and the SDU must load the proper microcode into the processor before it can run. The final file (cryptically named ” 500.0 (12/8)” is a load band. (The Symbolics analogue is a “world” file). This is (roughly) a snapshot of a running Lisp system’s virtual memory. At boot time, the load band is copied to the system’s paging partition, and memory-resident portions are paged into the Lambda’s memory and executed to bring the Lisp system to life.

Loading from the installation tape on the Qualstar

As suspected the tape drive’s throughput was higher during installation than during diagnostic load. But not by much. The above process took about two hours and as you can see it completed without errors, or much fanfare. But it did complete!

Time now for the culmination of the last month’s time and effort: will it actually boot into Lisp? Nervously, I walk over to the LMI’s console, power it on, and issue the newboot command:

The “newboot” herald, inviting me to continue…

Newboot loaded right up and prompted me for a command. To start the system, all you need to do is type boot. And so I did, and away it went, loading boot microcode from disk and executing it, to bring the Lisp system in from the load band. Then the breaker tripped. Yes, I’m still running this all off a standard 15A circuit in my basement, and the addition of the Fujitsu drive has pushed it to its limit. Don’t do this at home, people.

I unplugged the tape drive to reduce the power load a bit, reset the breaker and turned the Lambda on again. Let’s have us another go, shall we?

(I apologize in advance for the poor quality of the videos that follow. One of the side-effects of being stuck at home is that all I have is a cellphone camera…)

And awayyyyyy we go!

(Warning, the above video is long, and also my phone gave out after 3:12. Just watch the first 30 seconds or so and you’ll get the gist of it.)

Long story short: about two minutes after the video above ended, the screen cleared. This normally indicates that Lisp is starting up, and is a good sign. And then… nothing. And more nothing. No disk activity. I gave it another couple of minutes, and then I pinged my friend Daniel Seagraves, the LMI expert. He told me to press “META-CTRL-META-CTRL-LINE” on the keyboard (that’s the META and CTRL keys on both the left and right side of the keyboard, and the LINE key, all held down at once). This returns control to the SDU and to newboot; at this point the “why” command will attempt to provide context detailing what’s going on with the Lambda CPU:

Tell me why, I gotta know why!

Since Daniel knows the system inside and out, he was able to determine exactly where things were going off the rails during Lisp startup. The error being reported indicated that a primitive operator expected an integer as an operand and was getting some other type. This hints at a problem inside the CPU logic, that either ended up loading a bogus operand, or that reported a valid operand as having a bogus type.

Out of superstition, I tried rebooting the system to see if anything changed but it failed identically, with exactly the same trace information from “why.”

In the absence of working diagnostics, schematics, or even detailed hardware information, debugging this problem was going to be an interesting endeavor.

But all was not lost. This is a 2×2 system, after all. There’s a second set of CPU boards in the chassis just waiting to be tested…

This time, after the screen clears (where the video above starts) you can see the “run lights” flashing at the bottom of the screen. (These tiny indicators reflect system and CPU activity while the system is running). Then the status line at the bottom loaded in and I almost fell over from shock. Holy cow, this thing is actually working after all this time!

I have one working Lambda CPU out of the two. I’m hoping that someday soon I can devise a plan for debugging the faulty processor. In particular, I think the missing “double-double” TRAM file opined about in Part 6 of this series has turned up on one of the moldy 9-track tapes I rescued from the Pennsylvania garage — this should hopefully allow me to run the Lambda CPU diagnostics, but it will have to wait until I have a larger disk to play with, as this file resides in a UNIX partition that I don’t currently have space for.

In the meantime since I have a known working set of CPU boards (recall from Part 2 that the Lambda processor consists of four boards), it was a simple matter to isolate the fault to a single board by swapping boards between the sets one at a time. The issue turns out to be somewhere on the CM (“Control Memory”) board in CPU 0.

Meanwhile, not everything is exactly rosy with CPU 1… what’s with the system clock?

Time keeps slippin’ into the future…

System beeps are high-pitched squeaks and the wall clock on the status line counts about 4x faster than it should. Daniel and I are unsure exactly what the cause is at this time, but we narrowed it down to the RG (“ReGisters”) board. In many systems there is a periodic timer, sometimes derived from the AC line frequency (60Hz in the US) that is used to keep time and run the operating system’s process scheduler. The LMI uses something similar, and clearly it is malfunctioning.

Another fairly major issue is the lack of a working mouse. Way back in Part 2 I noted that the RJ11 connector had corroded into a green blob. This still needs repair and as it turns out, getting a working mouse on this system ended up being a journey all its own…

But that’s for my next installment. Until then, keep on keepin’ on!

Lookin’ good, LMI. Lookin’ good.

At Home With Josh Part 7: Putting the “Mass” in “Mass Storage”

Continuing from the conclusion of my last post, I had gotten to the point of testing the LMI’s Interphase SMD 2181 disk controller, but was getting troubling looking diagnostic output:

SDU Monitor version 102
>>/tar/2181 -C
Initializing controller
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 10 test 0 no completion (either ok or error) from iopb status
iopb: cyl=0 head=0 sector=0 (TRACK 0)
87 11 00 00 00 00 00 00 00 00 00 00 10 00 c5 62 00 40 00 00 00 00 c5 3a

My immediate suspicion was that this was truly indicating a real failure with the controller. The “gave up waiting for IO completion” message was the canary in the coal mine here. The way a controller like this communicates with the host processor (in this case the SDU) is via a block of data in memory that the controller reads, this is the “iopb” (likely “I/O Program Block”) mentioned in the output above. The iopb contains the command to the controller, the controller executes that command then returns the status of the operation in the same iopb, and may interrupt the host processor to let it know that it’s done so. (More on interrupts later.)

What the above diagnostic failure appears to be indicating is that the SDU is setting up an initialization command in the iopb and waiting for the 2181 to return a result. And it waits. And it waits. And it waits. And then it gives up after a few milliseconds because the response has taken too long: the 2181 is not replying, indicating a hardware problem.

But the absence of any real documentation or instructions for these diagnostics or the 2181 controller itself left open other possibilities. The biggest one was that I did not at that time have an actual disk hooked up to the controller. The “-C” option to the 2181 diagnostic looked like it was supposed to run in the absence of a disk, but that could be an incorrect assumption on my part. It may well be that the 2181 itself requires a disk to be connected in order to be minimally functional, though based on experience with other controllers this seemed to me to be unlikely. But again: no documentation, anything could be possible.

The lack of a disk was a situation I could rectify. The Lambda’s original disk was a Fujitsu Eagle (model M2351), a monster of a drive storing about 470mb on 10.5″ platters. It drew 600 watts and took up most of the bottom of the cabinet. At the time of this writing I am still trying to hunt one of these drives down. The Eagle used the industry-standard SMD interface, so in theory another SMD drive could be made to work in its stead. And I had just such a drive lying dormant…

If the Eagle is a monster of a drive, its predecessor, the M2284 is Godzilla. This drive stores 160MB on 14″ platters and draws up to 9.5 Amps while getting those platters spinning at 3,000 RPM. The drive itself occupies the same space as the Eagle so it will fit in the bottom of the Lambda. It has an external power supply that won’t, so it’ll be hanging out the back of the cabinet for awhile. It also has a really cool translucent cover, so you can watch the platters spinning and the heads moving:

The Fujitsu M2284, freshly installed in the Lambda.

The drive is significantly smaller in capacity than the Eagle, but it’s enough to test things out with. It also conveniently has the same geometry as another, later Fujitsu disk that the SDU’s “disksetup” program knows about (the “Micro-169”), which makes setup easy. I’d previously had this drive hooked up to a PDP-11/44 and was working at that time. With any amount of luck, it still is.

Only one thing needed to be modified on the drive to make it compatible with the Lambda — the sector size. As currently configured, the drive is set up to provide 32 sectors per track; the Lambda wants 18 sectors. This sector division is provided by the drive hardware. The physical drive itself provides storage for 20,480 bytes per track. These 20,480 bytes can be divided up into any number of equally sized sectors (up to 128 sectors per track) by setting a bank of DIP switches inside the drive. Different drive controllers or different operating systems might require a different sector size.

The 32 sector configuration was for a controller that wanted 512-byte sectors — but dividing 20,480 by 32 yields 640. Why 640? Each sector requires a small amount of overhead: among other things there are two timing gaps at the beginning and end of each sector, as well as an address that uniquely identifies the sector, and a CRCs at the end of the sector. The address allows the controller to verify that the sector it’s reading is the one it’s expecting to get. The CRC allows the controller to confirm that the data that was read was valid.

What a single sector looks like on the Fujitsu.

The more sectors you have per track, the more data space you lose to this overhead. The Lambda wants 1024-byte sectors, which means we can fit 18 sectors per track. 20,480 divided by 18 is approximately 1138 bytes — 114 bytes are used per sector as overhead. The configuration of the DIP switches is carefully described in the service manual:

Everyone got that? There will be a quiz later. No calculators allowed.

Following the instructions and doing the math here yields: 20,480 / 18 = 1137.7777…, so we truncate to 1137 and add 1, yielding 1138. Then we subtract 1 again (Fujitsu enjoys wasting my time, apparently) and configure the dip switches to add up to 1137. 1137 in binary is 10 001 110 001 (1024 + 64 + 32 + 16 + 1), so switches SW1-1, SW1-5, SW1-6, SW1-7 are turned on, along with SW2-4. Simple as falling off a log!

With that rigamarole completed, I hooked the cables up, powered the drive up and set to loading the Interphase 2181 diagnostic again:

SDU Monitor version 102
>>/tar/2181 -C
Initializing controller
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 10 test 0 no completion (either ok or error) from iopb status
iopb: cyl=0 head=0 sector=0 (TRACK 0)
87 11 00 00 00 00 00 00 00 00 00 00 10 00 c5 62 00 40 00 00 00 00 c5 3a

Darn. Looks like having a drive present wasn’t going to make this issue go away.

About that time, a local friend of mine had chimed in and let me know he had a 2181 controller in his collection. It had been installed in a Sun-1 workstation at some point in its life, and was a slightly different revision. I figured that if nothing else, comparison in behavior between his and mine might shed a bit of light on my issue so I went over to his house to do a (socially distanced) pickup.

Annoyingly, the revisional differences between his 2181 and mine were fairly substantial:

You can see the commonality between the two controllers, but there are many differences, especially with regard to configuration jumpers — and since (as I have oft repeated) there is no documentation, I have no idea how to configure the newer board to match the old.

So this is a dead end, the revisional differences are just too great. I did attempt to run diagnostics against the new board, but it simply reported a different set of failures — though at least it was clear that the controller was responding.

Well it was well past the time to start actually thinking about the problem rather than hoping for a deus ex machina to swoop in and save the day. I wasn’t going to find another 2181, and documentation wasn’t about to fall out of the sky. As with my earlier SDU debugging expedition, it seemed useful to start poking at the 2181’s processor, in this case an Intel 8085. This is an 8-bit processor, an update of the 8080 with a few enhancements. Like with the SDU’s 8088, looking at power, clock and reset signals was a prudent way to start off.

Unlike with the SDU, all three of these looked fine — power was present, the clock was counting out time, and the processor wasn’t being reset. Well, let’s take a look at the pinout of the 8085 and see what else we might be able to look at:

8085 pinout, courtesy Wikimedia Commons (https://commons.wikimedia.org/wiki/File:Anschlussbelegung_8085.gif)
Oscillation overthruster

The AD0 through AD7 and A15 pins are the multiplexed address/data bus: When the 8085 is addressing memory, AD0-AD7 plus the A8-A15 pins form the 16-bit memory address; when a read or write takes place, AD0-AD7 contain the 8-bits of data being read or written. Looking for activity on these pins is a good way to see if the CPU is actually running — a running CPU will be accessing and addressing memory constantly — and sure enough, looking with an oscilloscope showed pulsing on these pins.

The TRAP, RST7.5, RST6.5, RST5.5, and INTR signals are used to allow external devices to interrupt the 8085’s operation and are typically used to let software running on the CPU know that a hardware event has occurred: a transfer has completed or a button was pushed, for example. When such an interrupt occurs, the CPU jumps to a specific memory location (called an interrupt vector) and begins executing code from it (referred to as an interrupt service routine), then returns to where it was before the interrupt happened. If any of these signals were being triggered erroneously it could cause the software running on the CPU to behave badly.

Probing the RST7.5, 6.5 and 5.5 signals revealed a constant 3.5V signal at RST7.5, a logic “1” — something connected to the 8085 was constantly interrupting it! This would result in the CPU running nothing but the interrupt service routine, over and over again. No wonder the controller was unable to respond to the Lambda’s SDU.

Now the question is: what’s connected to the RST7.5 signal? It could potentially come from anywhere, but the most obvious source to check on this controller is one chip, an Intel 8254 Programmable Interval Timer. As the name suggests, this device can be programmed to provide timing signals — it contains three independent clocks that can be used to provide precise timing for hardware and software events. The outputs of these timers are often connected to interrupt pins on microprocessors, to allow the timers to interrupt running code.

The Intel 8254 Programmable Interrupt Timer

And, as it turns out, pin 17 (OUT 2) of the 8254 is directly connected to pin 7 (RST7.5) of the 8085. OUT 2 is the data output for the third counter, and goes high (logic “1”) when that timer elapses. Based on what I’m seeing on the oscilloscope, this signal is stuck high, likely indicating that the 8254 is faulty. Fortunately it’s socketed, so it’s easy to test that theory. I simply swapped the 8254s between my controller and the one I’m borrowing from my friend and…

Success! Probing RST7.5 on the 8085 now shows a logic “0”, the CPU is no longer constantly being pestered by a broken interval timer and is off and running. The diagnostic LEDs on the board reflect this change in behavior — now only one is lit, instead of both. This may still indicate a fault, but it’s at least a different fault, and that’s always exciting.

Well, the controller is possibly fixed, and I already have a disk hooked up and spinning… let’s go for broke here and see if we can’t format the sucker. The “-tvsFD” flags tell the controller to format and test the drive, doing a one-pass verify after formatting. Here’s a shaky, vertically oriented video (sorry) of the diagnostic in action:

Look at that text!

And here’s the log of the output:

SDU Monitor version 102
>> reset
>> disksetup
What kind of disk do you have?
Select one of { eagle cdc-515 t-302 micro-169 cdc-9766 }: micro-169
>> /tar/2181 -tvsFD
Initializing controller
2181: status disk area tested is from cyl 0 track 0 to cyl 822 track 9
2181: status format the tracks

Doing normal one-pass format ...
2181:at test 0 test reset                 passed
2181: test 1 test restore                 passed
2181: test 2 test interrupt               passed
failedginning of cyl 159 ...              at beginning of cyl 0 ...
2181: error 18          test 4 header read shows a seek error
iopb: cyl=0 head=0 sector=0 (TRACK 0)
00 00 82 12 00 00 00 00 00 00 00 12 10 00 c5 62 00 40 00 00 00 00 c5 3a
2181: error 18          test 4 header read shows a seek error
The 1 new bad tracks are:...
bad: track 1591; cyl=159 head=1
        ... mapped to track 8229; cyl=822 head=9

There were 1 new bad tracks
Number of usable tracks is 8228 (822 cyls).
(creating block-10 mini-label)
Disk is micro-169
2181: test 5 read random sectors in range   passed
2181: status read 500 random sectors
2181: test 6 write random sectors in range  passed
2181: status write to 500 random sectors
2181: test 8 muliple sector test            passed
2181: test 9 iopb linking test              passed
2181: test 10 bus-width test                passed
2181: test 0 test reset                     0 errors
2181: test 1 test restore                   0 errors
2181: test 2 test interrupt                 0 errors
2181: test 4 track verify                   2 errors
2181: test 5 read random sectors in range   0 errors
2181: test 6 write random sectors in range  0 errors
2181: test 8 muliple sector test            0 errors
2181: test 9 iopb linking test              0 errors
2181: test 10 bus-width test                0 errors
>>

And some video of the drive doing its thing during the verification pass:

Look at those random seeks!

As the log indicates, one bad track was found. This is normal — there is no such thing as a perfect drive (modern drives, both spinning rust and SSD have embedded controllers that automatically remap bad sectors from a set of spares, providing the illusion of a flawless disk). Drives in the era of this Fujitsu actually came with a long list of defects (the “defect map”) from the factory. A longer verification phase would likely have revealed more bad spots on the disk.

Holy cow. I have a working disk controller. And a working disk. And a working tape drive. Can a running system be far off? Find out next time!

Designed to Wear Out / Designed to Last – Pt. 3

Here we are going to take a look at current technologies’ strengths and weaknesses in relation to the past.

Increasing Active Elements

For perspective, there has been a monumental increase in the number of active elements available in an integrated circuit compared to decades ago. This has allowed an increase in device complexity and speed.

Because of this complexity, there has been a necessary increase in the complexity of testing and test tools. Modern chips have many redundant operating sections and a JTAG ( test bus ). This allows all sections on a chip to be 100% functional tested. Sections found to be bad are mapped out. This has become necessary because, at current densities, not all of the elements on a newly manufactured IC are functional. Failed functional units are mapped out using methods such as read-only-memory elements or other means. Without this practice, the cost of a working chip would be too high. Above a certain threshold of failed sections, the chip is simply discarded. This practice started as densities rose in memory chips. There were always failed banks of memory elements, so building redundant elements and mapping them in or out made sense. Otherwise the chip yield would be too low. Chips with mapped out sections could be sold as lower density chips, assuring some revenue, especially at the time a product was new and yields were low.

Increasing Pin Count

An increase in complexity also means an increase in die size and the number of pins to a single IC package. When I started in this business, 40 pins was the largest number. The largest number these days is the 3647 FCLGA package. That indeed is 3,647 pins !

Decreasing Repairability

Inevitably, this pin count increase means that replacement of a single XSI (extreme scale integration) chip is nearly impossible, and certainly not feasible. As far as chip lifetime or reliability, this is unknown, other than the MTBF claimed in the datasheet. Electromigration is the likely cause of long term chip failure in these instances. The circuit board on which you find high count ICs is a throwaway item.

The Pause in Moores’ Law

No doubt, something will come along to allow Moores’ law to again provide speed and density improvements, but for the last 5 years (as of April, 2020), we have been stuck at a particular minimum feature size of a transistor on a modern processor chip. There is a good deal of work on making more efficient designs of logic elements, but currently nothing close to the multiplicative improvements of Moores’ law.

I mention Moores’ Law in the faint hope that cooler heads will prevail upon us to, at least, make complex ICs reusable/reprogrammable.

Note: A short list of possible improvements

  1. 3D Stacking of elements
  2. New materials
  3. Innovative logic configurations

Designed to Wear Out / Designed to Last – Pt. 2

Designed to Last

There are many examples of technologies which are designed to function for decades into multiple centuries.

What goes into an item that can make it last decades or even centuries ? Here at LCM, a lot of the original logic we are running in our collection, was not rated as long life when originally used in our machines. Expected lifetime data had the best of them lasting for a few decades, maximum.

This has generated a set of rules we abide by: If it looks like stone or amber, it can last a long time. If it looks like plastic (or is plastic), it is not going to last. One particular exception is epoxy packages. Certain epoxy resins are similar in structure to tree resins, which so far, hold the record ( millions of years ) for preserving ancient insects, pollen, and plants. )

  1. Semiconductors and Integrated Circuits – Life data in databooks for IC expected lifetimes from the era in which they were created, typically have these devices failing more than a decade ago. Yet here, they are still performing their function. Staying functional tends to favor ceramic packages (stone) and specific epoxy packages (amber)(See “Notes on Epoxy Packages for Semiconductors” below.
  2. How Much Heat – Systems designed to have a higher internal ambient operating temperature, tended to have a much higher failure rate. Datasheets at the time ( and today ) show a direct correlation between ambient operating temperature and operating life.
  3. Cooling Fans – These sit right in the middle of the life curve. Although the bearing has a low wear rate, 30 years seem to be the upper limit. ( It would be cool if someone could come up with a frictionless magnetic bearing.)
  4. A Special Note About Fans and Heat: A lot of our power supplies have an intimate relationship with fans, meaning, if the fan fails, the power supply will fail, as well. When we re-engineer a power supply to replace an older or failed unit, we specify that the new supply can keep operating even though the fan has failed. This is accomplished by the fact that the replacement power supply components have a higher efficiency ( thus generating less heat performing their function ) and can tolerate heat better than the old power supply components. ( This has the added advantage of lower air conditioning costs )

Winners In The Longevity Game

The absolute winners we have found and utilized in our systems are what are known as “bricks”. These are power supply modules which are fully integrated. They come in various sizes which determine their power ratings. Full bricks top out around a kilowatt. Half-Bricks are around 500 watts. Quarter-Bricks around 250 watts.

Lifetimes (MTBF) at full load and temperature for “bricks” go from around 40 years to around an astounding 500 years. ( That is not a typo. The part in question is a Murata UHE-5/5000-Q12-C. The whole UHE series has this rating. Price $61.90) These devices, as you may have already guessed, are epoxy encapsulated.

Designed-In Longevity

This refers to what we have encountered upon restoring the machines in our collection. There is a definite intent at work when one examines the component choices. For example, DECs PDP-10 KL series power supplies have filter capacitors at four times the necessary capacitance. The amount of capacitance declines with age pretty linearly till the end of lifetime (around 14 years). This means these particular components will still allow power supply function 4 times longer. That’s at least 3 times beyond the machine’s commercial life rating ( 5 -7 years). We got these machines 15 to 20 years after their last turn-on, and they ran for most of a year before we had cascading failures of the filter capacitors.

Notes on Epoxy Packages for Semiconductors

Epoxy packaging for semiconductors became popular in the mid-1960’s. It replaced ceramic, as it was less expensive. Epoxy, unfortunately, can be made with different resins and other ingredients that give it different material characteristics. There is a correlation between cost and moisture intrusion. Lower cost, more moisture. This gave epoxy a bad name as it was used to make ICs’ more competitive in the market. This led to a number of market loss moments for certain manufacturers, as the moisture intrusion occurred at a predictable rate depending on ambient humidity for a particular region of the country.

It was a multi-faceted problem. The moisture intrusion occurred where the IC lead connects with the package. Moisture intrusion into the epoxy and poor metal quality ( tin alloy ) of the lead frame causes corrosion of the lead, which allows moisture into the IC cavity and changes the bulk resistivity of the IC die. The electrical specs go off a cliff and the IC fails.

( Note: If the lead frame had been made from a different alloy or the epoxy was a higher grade, this failure had little or no chance of occurring. I find it hard to justify the cost differential given the ultimate cost to the end users and the manufacturer )

(I was a field engineer in the mid to late 1970’s and spent many an hour replacing ICs with this problem.)

The only epoxy packages that have made it to the present day used better materials and thus are still functional. There are thousands functioning ICs on circuit boards in the Living Computers collection heading toward their 50th anniversary and a spares inventory of thousands.

Mechanical Switches

Whether toggle, pushbutton, slide, micro, or rotary, mechanical switches are all over the lifetime map. In the commercial world, switches are rated at the maximum number of actuations at a specified current. For the most part, the switches I have encountered meet or exceed the actuation specification. There is a limitation, though.

Aging of Beryllium Copper

If the internal mechanical design use a beryllium copper flat or coil spring, it has almost surely failed by the time we at the museum have encountered this type of switch. Beryllium copper goes from supple and springy to brittle after 30 years or so.

The result, as you may have guessed, is a non-functioning machine due to switches that operate intermittently or not at all. ( We had a whole line of memory cabinets, with hundreds of bright shiny toggle switches for memory mapping, that wouldn’t function till we replaced the switches.

Circuit Breakers

These fall into the beryllium copper spring family, so they are pretty much failed, or in addition to not closing, their trip point has typically shifted so you get a premature trip or a trip well above the trip point ( sometimes no trip and the protected circuit burns up ).

Slide Switches

We’ve had one surprising winner in the switch longevity department, and that is slide switches. With few exceptions ( usually due to mechanical damage to the sliding element ), an intact slide switch can be quickly resurrected with cleaning and a little light oil.

Rotary Switches

The runner up in the longevity game is the rotary switch. Typically all a wonky rotary switch needs is a spray from an alcohol cleaner and contact lubricant, and it functions, no matter what the age. ( we have hardware going back to the 1920’s whose rotary switches are still functional ) You can consider the switch failed if the contact wafer is cracked or broken. My guess as to longevity involves the phenolic wafer getting significant mechanical strength by being riveted.

Relays

These devices are essentially an electrically actuated switch. Their most common configuration is some kind of leaf spring. If the spring is beryllium copper based, you of course, have a failed relay 30 years hence. Rotary relays tend to be fairly reliable, but require a little more maintenance to keep them going. Mercury wetted relays are fairly reliable ( we still have some running in a couple of pieces of hardware ), but are not recommended because of their mercury content along with the mercury being contained in a fragile glass envelope.

Designed to Wear Out / Designed to Last – Intro

This is an article series whose purpose is to shine a comprehensive light on one important aspect of technology that only gets passing mention: Our ability to determine how long a component or system can function based on the engineering decisions made in the interests of monetary consideration and/or reliability. This is especially germane today, as decisions about discarding a technology item at “end of life” now impinge on how much toxic waste we are loading the environment with. ( I have yet to find an “obsolete” cellphone or computer that didn’t work perfectly when discarded, for other than mechanical or liquid immersion damage. ) The definition of obsolete depends on who you ask. The new smartphone you buy today is only a few percent actual new technology compared to the old smartphone your are discarding.

So we seem to have uncovered the operating model that most companies producing and selling technology have adopted:

The model for tech from approximately the 1920’s to today involves having key components in your product which have a known MTBF ( mean time before failure ). Thus you can predict a known replacement rate for your product. Start a second product stream offering replacement parts for the ones you know are going to predictably fail.

Unless or until the majority of your users are technically savvy enough to realize this is a way to keep selling products in a saturated market, one is assured of continuing and predictable sales and profit

Vacuum Tubes

We start with old technology.

One could argue that vacuum tubes were inherently unreliable, so they made them as robust as possible and just accepted their limitations. This turns out to be a bit wide of the truth.

In a step-wise fashion, the manufacturing process evolution for the vacuum tube went something like this:

  1. The first vacuum tubes were handmade with limited production.
  2. As soon as production ramped up to feed demand, it was inevitable that the manufacturer would tweek the processes used to make a vacuum tube to minimize the amount of time it takes to assemble the finished product from raw materials, to ready to assemble, and assemble (using hands, jigs, and other production equipment).
  3. As soon as all of relevant costs have been driven out of the system of producing vacuum tubes, the manufacturer rightly looks to other means to make a profit. (Process tweaks and labor input reductions) Sooner or later the ecology of manufacturers who produce vacuum tubes reaches an equilibrium which, despite their best efforts, doesn’t allow any greater profitability.
  4. Product lifetime now looms large. Sell more product over time, make more profit.

( Note: this describes the manufacturing process evolution for most “tech” products. )

Ways To Sell More Vacuum Tubes

  1. I only learned about this particular lifetime determinant recently. It seems that in order to get the filament temperature high enough for electron emission, the tungsten had to be alloyed with thorium, which raises the melting point to that of a tungsten/thorium alloy (well above the temperature where the alloy emits electrons. The amount of alloying is directly proportional to the time a DC current-over-time applied in a plating tank. The tube life is directly proportional to the time it takes for the thorium to boil off. More thorium – More lifetime. As the final boil-off occurs, the filament temperature is rises above the melting point of tungsten which results in the filament overheating and burning out. Vacuum tubes manufactured for the US Military were larded up with thorium, and thus met the extended lifetime specifications the US military demanded.
  2. getter is a deposit of reactive material that is placed inside a vacuum system, for the purpose of completing and maintaining the vacuum. ( https://en.wikipedia.org/wiki/Getter ) The quality and amounts of various elements is adjustable by the manufacturer and will determine vacuum tube life. I addition, you can get a short lifetime tube just by eliminating the getter. (We have plenty of examples without getters)
  3. In order assure that parts were available as they wore out “normally”, the tube manufacturers put a tube tester (stocked with replacement tubes ) in every convenient location throughout a geographic area. (Drug and hardware stores typically)
  4. Make and sell more end products using the same (or slightly improved) technology. This is accomplished by the “Longer, Lower, Wider” paradigm used by the auto industry (and now applied to radios, televisions, etc)starting around the 1940’s. This involves making a greater variety of the same (or slightly modified) product in a different package year to year. Anyone who has read those old ads can see this methodology in action.

Some of the environmental results of the vacuum tube era were:

  1. Mounds of glass along with refined metals (tungsten, thorium, steel, and others) and minerals (mica) added to landfills. It is unclear if any of the glass was recycled.
  2. Some of the metals and minerals were leached into the ground by rain.
  3. Large amounts of scrap-metal ( chassis ) and wood ( cabinets ) ended up in landfill and were either recycled or rotted into the ground.

At Home With Josh Part 6: Diagnostic Time!

In our last exciting episode, after a minor setback I got the Lambda’s SDU to load programs from 9-track tape. Now it’s time to see if I can actually test the hardware with the available diagnostics.

Tape Images

Tape images of the Lambda Release and System tapes are available online. Daniel Seagraves has been working on updating the system and has his latest and greatest are available here. A tape image is a file that contains a bit-for-bit copy of the data on the original tape. Using this file in conjunction with a real 9-track drive allows an exact copy of the original media to be made. In my case, I have an HP 7980S 9-track drive connected to a Linux PC for occasions such as these. At the museum we have an M4 Data 9-track drive set up to do the same thing. The old unix workhorse tool “dd” can be used to write these files back to tape, one at a time:

$ dd if=file1 of=/dev/nst0 bs=1024

(Your UNIX might name tape devices differently, consult your local system administrator for more information.)

Data on 9-track tapes is typically stored as a sequence of files, each file being separated by a file mark. The Lambda Release tape contains five such files, the first two being relevant for diagnostics and installation, and the remainder containing Lisp load bands and microcode that get copied onto disk when a Lisp system is installed.

The first file on tape is actually an executable used by the SDU — it is a tiny 2K program that can extract files from UNIX tar archives on tape and execute them. Not coincidentally, this program is called “tar.” The second tape file is an actual tar archive that contains a variety of utility programs and diagnostics. Here’s a rundown of the interesting files we have at our disposal:

  • 3com – Diagnostic for the Multibus 3Com Ethernet controller
  • 2181 – Diagnostic for the Interphase 2181 SMD controller
  • cpu – Diagnostic for the 68010 UNIX processor
  • lam – Diagnostic for the Lambda’s Lisp processors
  • load – Utility for loading a new system: partitioning disks and copying files from tape.
  • ram – Diagnostic for testing NuBus memory
  • setup – Utility for configuring the system
  • vcmem – Diagnostic for testing the VCMEM (console interface) boards.

The unfortunate thing is: there is no documentation for most of these beyond searching for strings in the files that might reveal secrets. Daniel worked out the syntax for some of them while writing his LambdaDelta emulator, but a lot of details are still mysterious.

In case you missed it, I summarized the hardware in the system along with a huge pile of pictures of the installed boards in an earlier post — it might be helpful to reacquaint yourself to get some context for the following diagnostic runs. Plus pictures are pretty.

I arbitrarily decided to start by testing the NuBus memory boards, starting with the 16mb board in slot 9 (which I’d moved from slot 12 since the last writeup). The diagnostic is loaded and executed using the aforementioned tar program as below. The “-v” is the verbose flag, so we’ll get more detailed output. the “-S 9” indicates to the diagnostic that we want to test the board in slot 12.

SDU Monitor version 102
>> reset
>> /tar/ram -v -S 9
ram: error 6 test 1 bad reset state, addr=0xf9ffdfe0, =0x1, should=0x4
ram: error11 test 3 bad configuration rom
ram: error 1 test 6 bad check bits 0xffff, should be 0xc, data 0x0
ram: error 1 test 7 bad check bits 0xffff, should be 0xc, data 0xffffffff
ram: error 7 test 8 for dbe w/flags off, DBE isn't on
ram: error 7 test 9 for dbe w/flags off, DBE isn't on
ram: status fill addr 0xf9000000
ram: status fill addr 0xf9002000
... [elided for brevity] ...
ram: status fill addr 0xf903c000
ram: status fill addr 0xf903e000
ram: status fill check addr 0xf9000000
ram: status fill check addr 0xf9002000
... [elided for brevity] ...
ram: status fill check addr 0xf903c000
ram: status fill check addr 0xf903e000

Well, the first few lines don’t look exactly promising what with all the errors being reported. The test does continue on to fill and check regions of the memory but only up through address 0xf907e000 (the first 512KB of memory on the board, that is). Thereafter:

ram: status fill check addr 0xf907c000
ram: status fill check addr 0xf907e000
ram: status block of length 0x4000 at 0xf9000000
ram: status stepsize 4 forward
ram: error 4 test 16 addr 0xf9000004 is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf9000008 is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf900000c is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf9000010 is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf9000014 is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf9000018 is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf900001c is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf9000020 is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf9000024 is 0xffffffff sb 0x0 (data f/o)
ram: error 4 test 16 addr 0xf9000028 is 0xffffffff sb 0x0 (data f/o)

And so on and so forth, probably across the entire region from 0xf9000000-0xf907ffff. This would take a long time to run to completion (remember, this output is coming across a 9600bps serial line — each line takes about a second to print) so I wasn’t about to test this theory. The output appears to be indicating that memory reads are returning all 1’s (0xffffffff) where they’re supposed to be 0 (0x0).

So this isn’t looking very good, but there’s a twist: These diagnostics fail identically under Daniel’s emulator. After some further discussion with Daniel it turns out these diagnostics do not apply to the memory boards I have installed in the system (or that the emulator simulates). The Memory boards that were available at the time of the Lambda’s introduction were tiny in capacity: Half megabyte boards were standard and it was only later that larger (1, 2, 4, 8, and 16mb boards) were developed. The only memory boards I have are the later 4 and 16mb boards and these use different control registers and as a result the available diagnostics don’t work properly. If there ever was a diagnostic written for these newer, larger RAM boards, it has been lost to the ages.

This means that I won’t be able to do a thorough check of the memory boards, at least not yet. But maybe I can test the Lisp CPU? I slotted the RG, CM, MI and DP boards into the first four slots of the backplane and started up the lam diagnostic program:

SDU Monitor version 102
>> reset
>> /tar/lam -v
/tar/lam version 6
compiled by wer on Wed Mar 28 15:24:02 1984 from machine capricorn
setting up maps
initializing lambda
starting conreg = 344
PMR
passed ones test
passed zeros test
TRAM-ADR
passed ones test
passed zeros test
TRAM
passed ones test
passed zeros test
loading tram; double-double
disk timed out; unit=0x0 cmd=0x8F stat=0x0 err=0x0
disk unit 0 not ready
can't open c.tram-d-d
SPY:
passed ones test
passed zeros test
HPTR:
Previous uinst destination sequence
was non-zero after force-source-code-word
during lam-execute-r
Previous uinst destination sequence
was non-zero after force-source-code-word
during lam-execute-r
Previous uinst destination sequence
was non-zero after force-source-code-word
... [and so on and so forth] ...

Testing starts off looking pretty good — the control registers and TRAM (“Timing RAM”) tests pass, and then it tries to load a TRAM file from disk. Aww. I don’t have a disk connected yet, and even if I did it wouldn’t have any files on it. And to add insult to injury, as it turns out even the file it’s trying to load (“double-double”) is unavailable — like the later RAM diagnostics, it is lost to the ages. The TRAM controls the speed of the execution of the lisp processor and the “double-double” TRAM file causes the processor to run slowly enough that the SDU can interrogate it while running diagnostics. Without a running disk containing that file I won’t be able to proceed here.

So, as with the memory I can verify that the processor’s hardware is there and at least responding to the outside world, but I cannot do a complete test. Well, shucks, this is getting kind of disappointing.

The vcmem diagnostic tests the VCMEM board — this board contains the display controller and memory that drives the high-resolution terminals that I restored in a previous writeup. It also contains the serial interfaces for the terminal’s keyboard and mouse. Perhaps it’s finally time to test out the High-Resolution Terminals for real. I made some space on the bench next to the Lambda and set the terminal and keyboard up there, and grabbed one of the two console cables and plugged it in. After powering up the Lambda, I was greeted with a display full of garbage!

Isn’t that the most beautiful garbage you’ve ever seen?

This may not look like much, but this was a good sign: The monitor was syncing to the video signal, and the display (while full of random pixels) is crisp and clear and stable. The garbage being displayed was likely due to the video memory being uninitialized: Nothing had yet cleared the memory or reset the VCMEM registers. There is an SDU command called “ttyset” that assigns the SDU’s console to various devices; currently I’d been starting the Lambda up in a mode that forces it to use the serial port on the back as the console, but by executing

>> ttyset keytty

The SDU will start using the High-Resolution terminal as the console instead. And, sure enough, executing this caused the display to clear and then:

It lives!

There we are, a valid display on the screen! The keyboard appeared to work properly and I was able to issue commands to the SDU using it. So even without running the vcmem diagnostic, it’s apparent that the VCMEM board is at least minimally functional. But I really wanted to see one of these diagnostics do its job, so I ran it anyway:

SDU Monitor version 102
/tar/vcmem -v -S 8
vcmem: status addr = 0xf8020000
vcmem: status fill addr 0xf8020000
... [elided again for brevity] ...
vcmem: status fill addr 0xf803e000
vcmem: status fill check addr 0xf8020000
vcmem: status fill check addr 0xf8022000
vcmem: status fill check addr 0xf8024000
vcmem: status fill check addr 0xf8026000
...
vcmem: status fill check addr 0xf8036000
vcmem: status fill check addr 0xf8038000
vcmem: status fill check addr 0xf803a000
vcmem: status fill check addr 0xf803c000
vcmem: status fill check addr 0xf803e000

As the test continued, patterns on the screen slowly changed, reflecting the memory being tested. Many different memory patterns are tested over the next 15 minutes.

vcmem: status movi block at 0xf803c000
vcmem: status movi stepsize 2 forward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
vcmem: status movi stepsize 2 backward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
vcmem: status movi stepsize 4 forward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
vcmem: status movi stepsize 4 backward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
vcmem: status movi stepsize 8 forward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
vcmem: status movi stepsize 8 backward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
... [elided] ...
vcmem: status movi stepsize 4096 forward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
vcmem: status movi stepsize 4096 backward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
vcmem: status movi stepsize 8192 forward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000
vcmem: status movi stepsize 8192 backward
vcmem: status movi checking 0x0000 writing 0xffff
vcmem: status movi checking 0xffff writing 0x0000

And at last the test finished with no errors reported, leaving a test pattern on the display. How about that, a diagnostic that works with the hardware I have.

Not your optometrist’s eye chart…

Looking crisp, clear, and nice and straight. This monitor is working fine — what about the other one? As you might recall, I got two High-Resolution Terminals with this system and pre-emptively cleaned and replaced all the capacitors in both of them. The second of these would not display anything on the screen when powered up (unlike the first) though I was seeing evidence that it was otherwise working. Now that I’d verified that the VCMEM board was working and producing a valid video signal, I thought I’d see if I could get anything out of the second monitor.

Well, what do you know? Note the cataracts in the corners.

Lo and behold: it works! I soon discovered the reason for the difference in behavior between the two monitors: The potentiometer (aka “knob”) that controls the contrast on this display is non-functional; with it turned up on the first monitor you can see the retrace, with it turned down it disappears. Interestingly the broken contrast control doesn’t seem to have a detrimental effect on the display, as seen above.

So that’s a VCMEM board, two High-Resolution Terminals, and the keyboard tested successfully, with the CPU and Memory boards only partially covered. I have yet to test the Ethernet and Disk controllers. The 3com test runs:

SDU Monitor version 102
>> /tar/3com -v
3com: status Reading station address rom start addr=0xff030600
3com: status Reading station address ram start addr=0xff030400
3com: status Transmit buffer: 0xff030800 to 0xff030fff.
3com: status Receive A buffer: 0xff031000 to 0xff0317ff.
3com: status Receive B buffer: 0xff031800 to 0xff031fff.
3com: status Receive buffer A - 0x1000 to 0x17ff.
3com: status Receive buffer B - 0x1800 to 0x1fff.
>>
Hex editors to the rescue!

No errors reported and the test exits without complaining so it looks like things are OK here. Now onto the disk controller. I don’t have a disk hooked up at the moment, but after a bit of digging into the test’s binary, it looks like the “-C” option should run controller-only tests:

SDU Monitor version 102
>>/tar/2181 -C
Initializing controller
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 10 test 0 no completion (either ok or error) from iopb status
iopb: cyl=0 head=0 sector=0 (TRACK 0)
87 11 00 00 00 00 00 00 00 00 00 00 10 00 c5 62 00 40 00 00 00 00 c5 3a
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 10 test 0 no completion (either ok or error) from iopb status
iopb: cyl=0 head=0 sector=0 (TRACK 0)
87 11 00 00 00 00 00 00 00 00 00 00 10 00 c5 62 00 40 00 00 00 00 c5 3a
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion
2181: error 10 test 0 no completion (either ok or error) from iopb status
iopb: cyl=0 head=0 sector=0 (TRACK 0)
87 11 00 00 00 00 00 00 00 00 00 00 10 00 c5 62 00 40 00 00 00 00 c5 3a
2181: error 3 test 0 Alarm went off - gave up waiting for IO completion

This portends a problem. The output seems to indicate that the test is asking the controller to do something and then report a status (either “OK” or “Error”) and the controller isn’t responding at all within the allotted time, so the diagnostic gives up and reports a problem.

This could be caused by the lack of a disk, perhaps the “-C” option isn’t really doing what it seems like it should, but my hacker sense wass tingling, and my thought was that there was a real problem here.

Compounding this problem is a lack of any technical information on the Interphase SMD 2181 controller. Not even a user’s manual. The Lambda came with a huge stack of (very moldy) documentation, including binders covering the hardware: “Hardware 1” and “Hardware 3.” There’s supposed to be a “Hardware 2” binder but it’s missing… and guess which binder contains the 2181 manual? Sigh.

There are two LEDs on the controller itself and at power-up they both come on, one solid, one dim. In many cases LEDs such as these are used to indicate self-test status — but lacking documentation I have no way to interpret this pattern. I put out a call on the Interwebs to see if I could scare up anything, but to no avail.

Looks like my diagnostic pass at the system was a mixed bag: Outdated diagnostics, meager documentation, and what looks like a bad disk controller combined with the success of the consoles and at least a basic verification of most of the Lambda’s hardware.

In my next installment, I’ll hook up a disk and see if I can’t suss out the problem with the Interphase 2181. Until then, keep on chooglin’.

SUBLEQ is Hard!

It all started so innocently! The LCM+L Employee Engagement Committee wanted something for employees who had gone beyond the call of duty, to be able to wear that would get people to ask why only they they had this thing. We thought maybe it might blink to catch folks attention. Sure, I can do that, no problem!

When I asked my boss: Stephen Jones about it, he said “Sure, but it has to be a fully programmable 9 bit computer with a complete front panel!”. Oh, he says, it should be able to make sounds too. Before it was “falling off a log”, for me, but now the log is 8 feet in the air, one must fall off carefully, and with forethought!

I made something the size of a business card, and it had a few problems: 1) I had messed up the multiplexing, and whenever you pushed on one of the data bit buttons both the data LED, and the address LED above it came on, whether I wanted them to or not. 2) The bit ordering on the LEDs and switches was backwards, requiring bit fiddling in the software. 3) It was almost all surface mount. This shouldn’t be a problem, right? Somewhere in here, someone asked, could we sell it as a kit? Could we teach a soldering class with it? A lot of the parts on this one were 0805 SMT parts, that means they are 0.08″ x 0.05″. That is the outline of the part, which is pretty small! Working with them requires tweezers and, at least for me, magnification.

The board grew a little bit to 2.5″ x 4.25″, and the LED’s and buttons went to through hole, to be big enough for beginners to solder. That is what the photo above shows, but I hadn’t mounted the speaker in that upper right circle when I took the picture. I did fix the multiplexing, and bit ordering problems in the process.

Some about the Bit Token: The processor on the board is the ATMEGA328, that is the heart of the Arduino Uno. I was a little short on pins, and brain cycles to use the ATMEGA32U2 or a CP2102 usb to serial chip to make it actually Arduino compatible. One can get a programmer widget for under $10 on Amazon.

The first thing I had to do, starting with the original board, was test the hardware. I started by writing code to blink the lights. Remember blinking the lights? That used to be the target! Pressing the “run” button while holding one or more of the data buttons will cause it to blink its lights in 10 different patterns. Pressing the “run” button again will stop that pattern, so you can try another.

Any computer needs to communicate, with more than just the lights, so the Bit Token has a TTL serial port. There is yet another widget to hook USB to a TTL serial port, and holding a couple of data buttons and poking the “run” button will cause it to send “Hello World!” to your terminal.

Now is where the log leaps off the ground, new target: A fully programmable 9 bit processor, with a complete working front panel! Everything till now has been playing with the ATMEGA, now I have to teach the ATMEGA to pretend to be a 9 bit computer. All the software I have written has been in “C”, using AtmelStudio 7, available from the Microchip website. ( Microchip bought Atmel a few years back.)

What kind of a processor do I create? I could do a 9 bit PDP-8, but HP already did a 16 bit PDP-8 called the HP-1000 series. I poked around a bit, and eventually decided to roll my own, which I call “NineBitter”, kind of a mish-mash of the German word for No, and the taste. Anyway a bunch of coding and debugging ensued, but eventually I have a computer with:

  • 4 registers
  • 512 9 bit words of memory
  • 38 instructions
  • 3 register instructions
  • 4 I/O ports

I guess I need some programs to demonstrate that it works. I did some programming in binary, but that got pretty tedious fast, so I implemented an assembler in PERL. It has warts, but it mostly works. Its weakest point is error reporting. The first serious program I wrote with the assembler is the boot loader, and I found a way for it to be there in RAM whenever you power up or reset the board! To test it, I had it load another copy of itself, and used the front panel to verify that it was correct.

The next program I wrote was a game: Kill the Bit! That was one of the earliest games folks played on the Altair 8800. The idea is there is a bit rotating through the data lights, and your goal is to press the button below the bit to stop it. If you miss, you get another bit lit, chasing the first. Here is the program:

;*****************************************************
; Kill The Bit:
; Author: Bruce Sherry
; Date: 3/31/2020 2:41:19 PM
;
; Stomp the bit with the buttons as it rotates around!
;*****************************************************
start:  org     0120
        xor     R0,R0,R0        ; clear register R0
        addi    R0,R1           ; make a Bit
loop:   out     R0,leds         ; display the results
        addi    R2,19           ; load a delay count
deloop: subi    R2,1            ; wait a while
        jnz     deloop
        in      R1,switches     ; Read the switches into register 1
        xor     R0,R0,R1        ; flip the bits
        rl      R0              ; rotate the bit
        jnz     loop            ; go back for more
        halt                    ; you win!

A theme we have at LCM+L is “Hello World!” which comes from the “C Programming Language” manual by Kernighan and Ritchie. The first program in the book merely types “Hello World!” when you run it, so that program had to be written too:

;***********************************************************
; Hello World!:
; Author: Bruce Sherry
; Date: 4/21/2020 10:59:24 AM
;
; Say hi to the world on the other side of the serial port.
;***********************************************************
NL:     equ     012
CR:     equ     015
LF:     equ     012
start:  org     0120
        xor     R0,R0,R0            ; clear out the reg
        xor     R3,R3,R3
        addi    R3,message          ; get the start of the message
charloop:
        loadn   R1,R3,0             ; use reg 1 as temp
        addi    R3,1                ; step to the next location
        or      R2,R1,R1            ; set the flags, and save it in 2
        jz      done
        subi    R2,NL
        jz      newline
        out     R1,serial           ; send the character
        jmp     charloop
newline:
        sub     R2,R2,R2            ; clear 2
        ori     R2,CR
        out     R2,serial           ; send a carriage return
        out     R1,serial           ; and the line feed
        jmp     charloop
done:   halt
        jmp     start
message: string "Hello World!\n"

Most people (at least the ones that have seen “2001, A Space Odessy”) know that the first song a computer learns is “Daisy, Daisy”, so I taught NineBitter to play that.

You are probably wondering where the title of this blog entry came from. Another of the targets for this project to hit came from the idea that we could teach some computer science with this little thing. Besides having the ATMEGA and NineBitter, we needed Another architecture to play with.

A friend of mine, Bryan, sent me a link to an article called “Surprisingly Turing-Complete”, which sent me down a very deep rabbit hole, where I found a few pages about something called an OISC, or a One Instruction Set Computer. A computer with ONLY ONE instruction. It seems to be possible to make a computer without any instructions, but we will leave that as an excersise for the reader.

On one of the OISC pages I found http://mazonka.com/subleq/ which is about SUBLEQ, which stands for SUBtract and Branch if Less than or EQual to Zero. There is a Challenge! A SUBLEQ instruction uses 3 memory locations: <address1> <address2> <jump target>. The contents of the location pointed to by <address1>, is subtracted from the location pointed to by <address2>. The result is stored in <address2>, then if the result is less than or equal to zero the next instruction is taken from the <jump target> otherwise it is just the next location.

Here is the first program the papers describe:

0: 3 4 6
3: 7 7 7
6: 3 4 0

The number on the left, with the colon, is the address in memory, then we get the contents of 3 locations. The first instruction takes the contents of location 3, which is a 7, subtracts it from the contents of location 4, which is also a 7, leaving 0, which gets stored in location 4. Since the result is zero, it jumps to location 6.

Location 6 does the same thing, but instead jumps back to location 0, but this result is -7. Location 0 does it again with the result being -14, then location 6 changes it to -21. As long as the result is negative, everything is fine.

Most folks seem to implement a 32 bit version of SUBLEQ, so they don’t run into what happens when numbers change sign. In order to get it to run properly on my SUBLEQ processor, called “NinetyOne”, I had to change it to:

In SUBLEQ assembly:

# Loop Minus 7 Fixed
# had to add the two z z instructions to handle positive numbers
begin:  x y again
        z z again
        x:7 y:7 z:0
again:  x y begin
        z z begin

Which assembles to:

 0: 6 7 9
 3: 8 8 9
 6: 7 7 0
 9: 6 7 0
12: 8 8 0

I will let you peruse the various InterNet pages about SUBLEQ to understand what all that gobbledegook means. Another good page to look at is: https://esolangs.org/wiki/Subleq

Now the HARD part: I want to write KillTheBit, HelloWorld, and DaisyDaisy in SUBLEQ. For KillTheBit, I need an XOR operation, and getting boolean operations out of only Subtract is hurting my head! I need the I/O ports too, and I probably don’t have the emulation correct for that. I guess I just have to put my nose to the grindstone, my shoulder to the wheel, my eye on the ball, and try to figure this out in that position.

Let us know if you think owning your own Bit Token computer would be fun. It is looking like the kit might cost around $54.95.

If you have any suggestions or solutions, please let me know,

Bruce Sherry

At Home With Josh Part 5: Tape Drives and EPROMS And Whiskers on Kittens

After working on the Lambda’s monitors as described in my last writeup, my next plan of action was to see if I could get diagnostics loaded into the SDU via 9-track tape.

ROM Upgrade Time

Monitor version 8

But first, I wanted to upgrade the SDU’s Monitor ROM set. The SDU Monitor is a program that runs on the SDU’s 8088 processor. It provides the user’s interface to the SDU where it provides commands for loading and executing files, and booting the system. It also communicates with devices on the Multibus and the NuBus. As received, my Lambda has Version 8 of the monitor which is as far as I know the last version released to the public at large. However, the Lambdas that Daniel Seagraves owns came with an internal-only Monitor in their SDUs, designated Version 102. This version adds a few convenient features: it can deal more gracefully with loss of CMOS RAM (important since I don’t have a backup battery anymore) and adds a few commands for defining custom hard drive types.

One of the 27128A EPROMs

A week or so prior, Daniel had sent me a copy of the Version 102 ROMs so all I had to do was write (“burn”) the copy onto real EPROMs and install them in the SDU. I had spare EPROMs (four Intel 27128A’s) at the ready but the thing about EPROMs is that they need to be erased before they can be programmed with new data. To do that, you need an EPROM eraser — a little box with a UV lamp in it and a timer — and after searching the house for mine, I came to the realization that I’d taken it to work a few months back for and I’d never brought it back home. And due to present circumstances, it was going to be stuck there for awhile.

Bah.

So with much wailing and gnashing of teeth I ordered a replacement off the Internet and began waiting patiently for it to arrive in 5-7 days. Meanwhile I decided to start documenting this entire process for some kind of blog thing and so I went off, took pictures, and started writing long-winded prose about Lisp Machines and restorations.

Also during this time I decided to test the old wives’ tale about using sunlight to erase EPROMs. At that time Seattle was experiencing an extremely lovely bout of sunny weather, so I took four 27128’s outside, put them on the windowsill so as to gather as much sun as possible, and left them there for the next four days.

[Four Days Pass…]

They’re still not erased. So much for that idea. Only a few more days until my real EPROM eraser arrives anyway…

[A Few More Days Pass…]

At last my dirt-cheap EPROM ERASER arrived on my doorstep bearing dire warnings of UV exposure and also of the overheating of this fine precision instrument. Ignoring the 15 minute time-limit warning, I put four EPROMs into the drawer, cranked the timer up to 30 minutes and turned it on. And once again I found myself waiting.

[Thirty Minutes Pass…]

The Faithful DATA I/O 280 Gang Programmer

I pulled out my trusty DATA I/O 280 programmer and ran its “blank check” routine to ensure that the EPROMs were indeed as blank as they ought to be, and the programmer said “BLANK CHECK OK.”

It’s then a simple matter to hook the programmer up to my PC and program the new ROMs and soon enough all four were ready to get installed in the SDU. But before I did that I wanted to double-check that the Lambda was still operating — it’d been a couple of weeks since I had last powered it up and things can go wrong sometimes. Best not to introduce a new variable (i.e. new ROMs) into the equation before I can verify the current state.

Uh Oh

And so I hooked things back up to the Lambda and turned it on. And… nothing. No SDU prompt on the terminal and all three LEDs on the front panel are stuck on solid. (As we learned in my second post in this series, this indicates that the SDU is failing its self tests.) I pressed the Reset button a couple of times. Nothing. Power cycled the system just for luck. NIL.

“Well, fiddle-dee-dee!” I said. (I may have used slightly more colorful language than this, but this is a family-friendly blog). “Gosh darn it all to heck.”

I retraced my steps — had I changed anything since the last time I’d powered it on? Yes — I’d installed an Ethernet board that Daniel had graciously sent me (my system apparently never had an Ethernet interface, which is an odd choice for a Lisp Machine). Maybe the Ethernet board was causing some problem here? Pulling the board made no difference in behavior. I checked the power supply voltages at the power supply and at the backplane and everything was dead on. I pulled the SDU out and inspected it, and double-checked socket connections and everything looked OK.

Well, at this point I’m frustrated and my tendency in situations like this is to obsess about whether I broke something and so I run in circles for a bit when what I really need to do is take a step back: OK — it’s broken. How is it broken? How do I go about answering that question? Think, man, think!

Well, I know that the three LEDs are on solid — this would indicate that the SDU’s self-test code either wasn’t running or wasn’t getting very far before finding a failure. So: let’s assume for now that the self-test code isn’t running — how do I confirm that this is the case?

The SDU uses an Intel 8088 16-bit microprocessor to do its business, and it’s a relatively simple matter to take a look at various pins on the chip to see if there’s activity, or lack thereof. The most vital things to any processor (and thus good first investigations while debugging a microprocessor-based system) are power, clock, and reset signals. Power obviously makes the CPU actually, you know, do things. A clock signal is what drives the CPU’s internal logic, one cycle at a time, and the reset signal is what tells the CPU to clear its state and restart execution from step 0. A lack of the first two or an abundance of the latter could cause the symptoms I was seeing.

i8088 pinout, from the datasheet.

Time to get out the oscilloscope; this will let me see the signals on the pins I’m probing. Looking at the Intel 8088 pinout (at right) the pins I want to look at are pin 40 (Vcc), Pin 21 (RESET) and pin 19 (CLK). Probing reveals immediately that Vcc and CLK are ok. Vcc is a nice solid 5 volts and CLK shows a 5Mhz clock signal. RESET however, is at 3.5V — a logic “1” meaning that the CPU is being held in a Reset state, preventing it from running!

So that’s one question answered: The SDU is catatonic because for some reason RESET is being held high. Typically, RESET gets raised at power-up (to initialize the CPU among other things) and might also be attached to a Reset button or other affordance. In the SDU, there is also power monitoring signal attached to the RESET line designated as DCOT (DC Out of Tolerance) — if the +5 voltage goes out of range the CPU is reset:

Power supply status signals, from the “SDU General Description” manual.

It seemed possible (though unlikely) that the Lambda’s Reset switch or the cabling associated with it had failed, causing the symptoms I was seeing, but as expected the cabling tested out OK.

SDU Paddlecard. The cable carrying the DCOT signal is the bundle 2nd from the right.

I then checked the DCOT signal and even though the power supply voltages were measuring OK, I was reading 8V on the DCOT pin at the paddleboard. 8V is high for a normal TTL signal (which are normally between 0 and 5V) and this started me wondering. When I disconnected the DCOT wire from the paddleboard, the DCOT signal measured at the power supply was 0V while the signal at the paddleboard remained at 8V… suggesting some sort of failure between the power supply and the SDU for this signal. It also explains the the odd 8V reading– it’s likely derived from a 12V source with a pull-up resistor; the expectation being that the DCOT signal from the power supply would normally pull the signal down further into valid TTL range.

But what could have failed here? Clearly the power supply itself thinks things are OK (hence the 0V reading there). The difference in reading at one end versus the other can really only point to a problem in the wiring between the power supply and the SDU paddleboard.

Connectors just above the power supply. Connector on the left carries actual power, connector on the right contains the power supply status signals.

There is a small three-conductor cable that runs from the SDU paddlecard down to a connector just above the power supply (pictured at the right). A second three-conductor cable is plugged into this and runs to the power supply itself. Checking these signals for continuity revealed that none of the three wires were continuous from the SDU back to the power supplies. The cable from the connector to the power supply tested fine — so what happened to the cable that runs from the connector to the SDU?

I pulled out the power supply tray to get a look at the cabling, and one glance below the card cage revealed the answer:

Oh.

“Aw, nut bunnies,” I may have been heard to remark to myself. Those three wires had apparently been ripped from the connector (quite neatly, I might add) the last time I had pushed the power supply drawer back in. (Likely while I was taking pictures of the power supplies for my blog writeups…) Quite how it got caught on the tray I’m not sure.

This was easy enough to fix — the wires were reinserted into the pins, and the cable itself rerouted so it would hopefully never get snagged on the power supply tray again. I reconnected everything, held my breath and flipped The Switch one more time.

[Several long seconds pass…]

SDU Monitor version 8
CMOS RAM invalid
>>

greeted me on the terminal. Yay. Whew.

New SDU Monitor, At Last

OK. So at last I’m back to where I’d started this whole exercise, after an evening of panic and frenzied investigation. What was it I was going to do when I’d started out? Oh yeah, I had these new SDU ROMs all ready to go, let’s put ’em in:

SDU Monitor version 102
>>
>> help
r usage: r [-b][-w][-l] addr[,n]
w usage: w [-b][-w][-l] addr[,n] d
x usage: x [-b][-w][-l] addr[,n]
dev usage: dev
reset usage: reset [-m] [-n] [-b]
enable usage: enable [-x] [-m] [-n]
init usage: init
ttyset usage: ttyset dev
setbaud usage: setbaud portnum baudrate
disktype usage: disktype type heads sectors cyls gap1 gap2 interleave skew secsize badtrk
disksetup usage: disksetup
setdr usage: setdr name file [ptr]

>>

Ah, much better. So now the SDU was functional and upgraded, and I was ready to move onto the next phase: running system diagnostics.

9-Track Mind

The SDU has the capability to run programs off of 9-track tape. This is how an operating system is loaded onto a new disk and it’s how diagnostics are loaded into the system to test the various components. The Lambda uses a Ciprico Tapemaster controller, which is normally hooked up to a Cipher F880 tape drive mounted in the top of the Lambda’s chassis.

Qualstar 1052 9-Track Tape Drive

My Lambda’s F880 was missing when I picked it up, but the Tapemaster should in theory be able to talk to any tape drive with a Pertec interface. I’m still trying to track down an actual F880 drive, but in the meantime I have one potentially compatible drive in my collection — a Qualstar 1052. This was a low-cost, no-frills drive when it was introduced in the late 1980s but it’s simple and well documented and best of all: it has no plastic or rubber parts, so no worries about parts of the transport turning into tar or becoming brittle and breaking off.

It’s also really slow. The drive has no internal buffer so it can’t read ahead, which means that depending on how it’s accessed it may have to “shoeshine” (reverse the tape, then read forward again) the tape frequently. But speed isn’t really what I’m after here — will it work with the Lambda or won’t it?

I have a tape containing diagnostics (previously written on a modern Unix system with a SCSI 9-track drive attached) ready to go. So I cabled up the Qualstar to the Lambda’s Pertec cabling (as pictured in the above photograph) and attempted to load a program from the tape using the “tar” program:

>> /tar/load

The tape shoeshined (shoeshone?) once (yay!) and stopped (boo!), and the SDU spat back:

tape IO error 0xD
>>

Well, that’s better than nothing, but only barely. But what does IO error 0xD mean? The unfortunate reality is that there is little to no documentation available on the SDU or the associated diagnostics. But I do have the Ciprico Tapemaster manual, thanks to bitsavers.org:

Relevant snippet from the Ciprico Tapemaster manual

Error 0xD indicates a data parity error: the data being transmitted over the Pertec cabling isn’t making it from the drive to the Tapemaster intact, so the controller is signalling a problem. The SDU stops the transfer and helpfully provides the relevant error code to us.

So where are the parity errors coming from? It could be a controller fault but given this system’s history I decided to take a closer look at the cabling first. A Pertec tape drive is connected to the controller via two 50-pin ribbon cables designated “P1” and “P2.” While I’d previously checked the cables for damage, I hadn’t actually checked the edge connectors at the ends of the cables, and well, there you go:

Crusty Connectors
It’s cleaner now, trust me.

It’s a bit difficult to discern in the above picture but if you look closely at the gold contacts you can see that there’s greenish-white corrosion on many of them. Dollars to donuts that this is the problem. For cleaning out edge connectors like this, I’ll usually spray the insides with contact cleaner and then, to apply a bit of abrasion to the pins, I wipe a thin piece of cardboard soaked in isopropyl alcohol in and out of the slot. I used this technique here and pulled out a good quantity of crud and dirt, leaving the connector nice and clean. Or at least clean enough to function, I hoped. Rinse and repeat for the second Pertec cable and let’s try this again:

>> /tar/load

And the tape shoeshines once… and shoeshines again… and again… hm. Is it actually reading anything or is there some other problem and it’s just reading the same block over and over? Let’s let it run for a bit…

A graphic portrayal of tape shoeshining!
>> /tar/load
no memory in main bus
Initializing SDU
 
SDU Monitor version 102
>>

No more parity errors, and the “load” program did eventually load. It then complained about a lack of memory. It looks like the tape drive, the cable, and the controller all work! (Thanks to the Qualstar’s slowness, it took about five minutes between the “/tar/load” and the “no memory in main bus” error, so this is going to be a time-consuming diagnostic process going forward.)

The “no memory in main bus” error is not unexpected since at that moment the only boards installed in the Lambda’s backplane were the SDU and the tape controller. I have a few memory boards at my disposal, and I opted to re-install the 4mb memory board that normally resides in slot 9. Let’s run that again:

>> /tar/load
no memory in main bus
Initializing SDU

SDU Monitor version 102
>>

Well, hm. Maybe that memory board doesn’t work — let’s try the 16mb board normally in slot 12:

>> /tar/load
using 220K in slot 12
load version 307
Disk unit 0 is not ready.

/tar/loadbin exiting
Initializing SDU

SDU Monitor version 102
>>

Huzzah! The LMI has memory that works well enough to respond to the SDU, and it has a functional tape subsystem. It’s going to be awhile before I have a functioning disk, and as per the error message in the output, /tar/load expects one to be present. This is completely rational, since “load” is the program that is used to load Lisp load bands onto the disk from tape.

That’s enough for now — in the next installment, since the Lambda is now capable of loading diagnostics from tape, we’ll actually run some diagnostics! Thrills! Chills! Indecipherable hexadecimal sludge! See you next time!

The XKL Toad-1 System

The XKL Toad-1 System (hereafter “the Toad”) is an extended clone of the DECSYSTEM-20, the third generation of the PDP-10 family of computers from Digital Equipment Corporation. What does that mean? To answer that requires us to step back into the intertwined history of DEC, BBN,1 SAIL,2 and other parts of Stanford University’s computing community and environs.

It’s a long story. Get comfortable. I think it will be worth your time.

The PDP-10

The PDP-10 family (which includes the earlier PDP-6) is a typical mainframe computer of the mid-1960s. Like many science oriented computers prior to IBM’s System/360 line, the PDP-10 architecture addressed binary words which were 36 bits long, rather than individual characters as was common in business oriented systems. In instructions, memory addresses took up half of the 36 bit word; 18 bits is enough to address 262,144 locations, or 256KW, a very large memory in the days when each bit of magnetic core cost about $1.00.3 Typical installations had 64KW or 96KW attached. The KA-10 CPU in the first generation PDP-104 could not handle any more memory than that.

Another important feature of the PDP-10 was timesharing, a facility by which multiple users of the computer each was given the illusion that each was alone in interacting with the system. The PDP-6 was in fact the first commercially available system to feature interactive timesharing as a standard facility rather than as an added cost item.

TENEX

In the late 1960s, virtual memory was an important topic of research: How to make use of the much larger, less expensive capacity of direct access media such as disks and drums to extend the address space of computers, instead of the very expensive option of adding more core.5 One company which was looking at the issue was Bolt, Beranek, & Newman, who were interested in demand paged virtual memory, that is, viewing memory as made up of chunks, or pages, accessed independently, and what an operating system which had access to such facilities would look like.

To facilitate this research, BBN created a pager which they attached to a DEC PDP-10, and began writing an operating system which they called TENEX for “PDP-10 Executive”6 TENEX was very different from Tops-10, the operating system provided by DEC, but was interactive in the same way as the older OS. The big difference was that more programs could run at the same time, because only the currently executing portions of each program needed to be present in the main (non-virtual) memory of the computer.

TENEX was popular operating system, especially in university settings, so many PDP-10s had the BBN pager attached. In fact, the BBN pager was also used on a PDP-10 system which ran neither TENEX nor Tops-10, to wit, the WAITS system at SAIL.7

The DECsystem-10

The second generation of the PDP-10 underwent a name change, to the DECsystem-10, as well as gaining a faster new processor, the KI-10. This changed the way memory was handled, by adding a pager which divided memory up into 512 word blocks (“pages”). Programs were still restricted to 18 bits of address like previous generations, but the CPU could now handle 22 bits of address in the pager, so the physical memory could be up to four megawords (4MW), which is 16 times as much as the KA-10.

This pager was not compatible with, and was much less capable than, the BBN device, although DEC provided a version of TENEX modified to work with the KI pager for customers willing to pay extra. Some customers considered this to be too little, too late.

SAIL and the Super FOONLY

In the late 1960s, computer operating systems were an object of study in the broader area of artificial intelligence research. This was true of the Stanford Artificial Intelligence Laboratory, for example, where the PDP-6 timesharing monitor8 had been heavily modified to make it more useful for AI researchers. When the PDP-10 came out three years later, SAIL acquired one, attached a BBN pager, and connected it to the PDP-6, modifying the monitor (now named Tops-10) to run on both CPUs, with the 10 handing jobs off to the 6 if they called for equipment attached to the latter. By 1972, the monitor had diverged so greatly from Tops-10 that it received a new name, WAITS.

But the hardware was old and slow, and a faster system was desired. The KI-10 processor was underpowered from the perspective of the SAIL researchers, so they began designing their own PDP-10 compatible system, the Super FOONLY.9 This design featured a BBN style pager and used very fast semiconductors10 in its circuitry. It also expanded the pager address to 22 bits, like the KI-10, so was capable of addressing up to 4MW of memory. Finally, unlike the DEC systems, this system was build around the use of a fast microcoded processor which implemented the PDP-10 architecture as firmware rather than as special purpose hardware.

DECSYSTEM-20 and TOPS-20

DEC was aware of the discontent with their new system among customers; to remedy the situation, they purchased the design of the SuperFOONLY from Stanford, and hired a graduate student from SAIL to install and maintain the SUDS drawing system at DEC’s facilities in Massachusetts. The decision was made to keep the KI-10 pager design in the hardware, and implement the BBN style pager in microcode.

Because of the demand for TENEX from a large part of their customer base, DEC also decided to port the BBN operating system to the new hardware based on the SAIL design. DEC added certain features to the new operating system which had been userland code libraries in TENEX, such as command processing, so that a single style of command handling was available to all programmers.

When DEC announced the new system as the DECSYSTEM-20, with its brand new operating system called TOPS-20, they fully expected customers who wanted to use the new hardware would flock to it, and would port all of their applications from Tops-10 to TOPS-20, even though the new OS did not support many older peripherals on which the existing applications relied. The customers rebelled, and DEC was forced to port Tops-10 to the new hardware, offering different microcode to support the older OS on the new KL-10 processor.

Code Name: Jupiter

DEC focused on expanding the capabilities of their flagship minicomputer line, the PDP-11 family, for the next few years, with a planned enhancement to take the line from 16 bit mini to 32 bit supermini. The end result was an entirely new family, the VAX, which offered virtual memory like the PDP-10 mainframes in a new lower cost package.

But DEC did not forget their mainframe customer base. They began designing a new PDP-10 system, intended to include enhanced peripherals, support more memory, and run much faster than the KL-10 in the current Dec-10/DEC-20 systems. As part of the design, codenamed “Jupiter”, the limited 18 bit address space of the older systems was upgraded to 30 bits, that is, a memory size of one gigaword (1GW = 1024MW), which was nearly 2.5 times the size of the equivalent VAX memory, and far larger than the memory sizes available in the IBM offerings of the period.

Based on the promise of the Jupiter systems, customers made do with the KL-10 systems which were available, often running multiple systems to make up for the lack of horsepower. Features were added to the KL, by changes to the microcode as well as by adding new hardware. The KL-10 was enhanced with the ability to address the new 30-bit address space, although the implementation was limited to addressing 23 bits (where the hardware only handled 22); thus, although a system maxed out at 4MW, virtual memory could make it look like 8MW.

DEC also created a minicomputer sized variant of the PDP-10 family, which they called the DECSYSTEM-2020. This was intended to extend the family into department sized entities, rather than the corporation sized mainframe members of the family.11 There was also some interest in creating a desktop variant; one young engineer was well known for pushing the idea of a “10 on a desk”, although his idea was never prototyped at DEC.

DEC canceled the Jupiter project, apparently destined to be named the DECSYSTEM-40, in May 1983, with an announcement to the Large Systems customers at the semiannual DECUS symposia. Customer outrage was so great that DEC agreed to continue hardware development on the KL-10 until 1988, and software development across the family until 1993.

Stanford University Network

In 1980, there were about a dozen sites at Stanford University which housed PDP-10 systems, mostly KL-10 systems running TOPS-20 but also places like SAIL, which had attached a KL-10 to the WAITS dual processor. Three of the TOPS-20 sites were the Computer Science Department (“CSD”), the Graduate School of Business (“GSB”), and the academic computing facility called LOTS.12

At this time, local-area networking was seen as a key element in the future of computing, and the director of LOTS (whom we’ll call “R”) wrote a white paper on the future of Ethernet13 on the campus. R also envisioned a student computer, what today we would call a workstation, which featured a megabyte of memory, a million pixels on the screen, a processor capable of executing a million instructions per second, and an Ethernet connection capable of transferring a millions bits of data per second, which he called the “4M machine”.

Networking also excited the director of the CSD computer facility, whom we’ll call “L”.14 L designed an Ethernet interface for the KL-10 processors in the DEC-20s which were ubiquitous at Stanford. This was dubbed the Massbus-Ethernet Interface Subsystem, or MEIS,15 pronounced “maze“.

The director of the GSB computer facility, whom we’ll call “S”, was likewise interested in networking, as well as being a brilliant programmer herself. (Of some importance to the story is the fact that she was eventually married to L.) S assigned one of the programmers working for her to add code to the TOPS-20 operating system to support the MEIS, using the PUP protocols created at PARC for the Alto personal computer.16

The various DEC-20 systems were scattered across the Stanford campus, each one freestanding in a computer room. R, L, and S ran miles of 50ohm coaxial cable, the medium of the original Ethernet, through the so-called steam tunnels under the campus, connecting all the new MEISes together. Now, it was possible to transfer files between DEC-20s from the command line rather than by writing them to a tape and carrying them from one site to another. It was also possible to log in from one DEC-20 to another–but using one mainframe to connect to another seemed wasteful of resources on the source system, so L came up with a solution.

R’s dream of a 4M machine had borne fruit: While still at CSD, he had a graduate student create the design for the Stanford University Network processor board. L repurposed the SUN-1 board17 at the processor in a terminal interface processor (“EtherTIP”), in imitation of the TIPs used by systems connected to the ARPANET and to commercial networks like Tymnet and Telenet. Now, instead of wiring terminals directly to a single mainframe, and using the mainframe to connect from one place to another, the terminals could be wired to an EtherTIP and could freely connect to any system on the Ethernet.

A feature of the PUP protocols invented at PARC was the concept of internetworking, connecting two or more Ethernets together to make a larger network. This is done by using a computer connected to both networks to forward data from each to the other. At PARC, a dedicated Alto acted as the router for this purpose; L designated some of the SUN-1 based system as routers rather than as EtherTIPs, and the Stanford network was complete.

Stanford University also supported a number of researchers who were given access to the ARPANET as part of their government sponsored research, so several of the PDP-10s on campus were connected to the ARPANET. When the ARPANET converted to using the TCP/IP protocols which had been developed for the purpose of bring internetworking to wide area networks, our threesome were ready, and assigned programmers from CSD, GSB, and LOTS to make L’s Ethernet routers speak TCP/IP as well as PUP. TOPS-20 was also updated to use TCP/IP, by Stanford programmers as well as by DEC.

S and L saw a business opportunity in all this, and began a small company to sell the MEIS and the associated routers and TIPs to companies and universities who wanted to add Ethernet to their facilities. They saw this as a way to finance the development of L’s long-cherished dream of a desktop PDP-10. They eventually left Stanford as the company grew, as it had tapped the exploding networking market at just the right time. The company grew so large in fact that the board of directors discarded the plan to build L’s system, and so the founders left Cisco Systems to pursue other opportunities.

XKL

L moved to Redmond in 1990, where he founded XKL Systems Corporation. This company had as its business plan to build the “10 on a desk”. The product was codenamed “TOAD”, which is what L had been calling his idea for a decade and a half because “Ten On A Desktop” is a mouthful. He hired a small team of engineers, including his old friend R from Stanford, to build a system which implemented the full 30-bit address space which DEC had abandoned with the cancelled Jupiter project, and which included modern peripherals and networking capabilities.18 R was assigned as Chief Architect; his job was to insure that the TOAD was fully compatible with the entire PDP-10 family, without necessarily replicating every bug in the earlier systems.

R also oversaw the port of TOPS-20 to the new hardware, although some boards19 had a pair of engineers assigned: One handled the detailed design and implementation of the board, while the other worked on the changes to the relevant portion of the operating system. R was responsible for the changes which related to the TOAD’s new bus architecture, as well as those relating to the much larger memory which the TOAD supported and the new CPU.20

The TOAD was supposed to come to market with a boring name, the “TD-1”, but ran into trademark issues. By that time, I was working at XKL, officially doing pre- and post-sales customer advocacy, but also working on the TOPS-20 port.21 Part of my customer advocacy duties was some low-key marketing; when we lost the proposed name, I pointed out that people had been hearing about L’s TOAD for years, and we should simply go with it; S, considered the unofficial “Arbiter of Taste” at XKL, agreed with me.22 We officially introduced the XKL Toad-1 System at a DECUS trade show in the spring of 1995.